Language Model Tokenizers Introduce Unfairness Between Languages

Conference on Neural Information Processing Systems (NeurIPS) 2023

Aleksandar Petrov1,2, Emanuele La Malfa1, Philip H.S. Torr2, Adel Bibi2

1Department of Computer Science, University of Oxford 2Department of Engineering Science, University of Oxford

Modern language models can speak many languages...

It is impressive that language models can understand many different languages, even some lower-resource ones, especially considering that most of them were built targeting solely English text. However, unsurprisingly, their performance varies greatly across languages: models show a much better command of their target language.

[Figure: example text in different languages]

But languages are already treated drastically differently at the tokenization stage

The tokenization lengths for some languages can be more than 15 times longer than for English. As a result, some language communities face much higher costs for accessing API-based services (which often charge per token), longer processing times and latency, and less content that can fit into a model's context.
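You can check this effect on your own text with a minimal sketch like the one below, which uses the tiktoken library and its cl100k_base encoding (the one used by ChatGPT and GPT-4). The sentence pair is illustrative, and the exact counts will vary with the text:

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by ChatGPT / GPT-4

# Illustrative parallel sentence (English and a Bulgarian translation).
pairs = {
    "English":   "Language models treat languages very differently.",
    "Bulgarian": "Езиковите модели третират езиците много различно.",
}

baseline = len(enc.encode(pairs["English"]))
for lang, text in pairs.items():
    n = len(enc.encode(text))
    print(f"{lang}: {n} tokens ({n / baseline:.1f}x English)")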

See for yourself

Select the languages and models you want to compare and see how they differ in tokenization length. The tokenization length is computed over 2000 sentences from the FLORES-200 parallel corpus. You can also change which language is used to normalize the tokenization lengths.

*For the tokenization premiums for ChatGPT and GPT-4, refer to the cl100k_base tokenizer.
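A tokenization premium like the one shown here can be computed in a few lines of Python: average the token count per language over a parallel corpus and divide by the average for the normalizing language. The sketch below assumes line-aligned FLORES-200 files named flores200.devtest.<lang> (a hypothetical layout; adapt the loading to however you store the corpus):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def avg_tokens(sentences):
    # Mean token count over a list of sentences.
    return sum(len(enc.encode(s)) for s in sentences) / len(sentences)

# Hypothetical file layout: one file per language, line-aligned so that
# line i is the same sentence in every file.
corpus = {
    lang: open(f"flores200.devtest.{lang}").read().splitlines()
    for lang in ["eng_Latn", "bul_Cyrl", "mya_Mymr"]
}

norm = avg_tokens(corpus["eng_Latn"])  # normalize to English
for lang, sents in corpus.items():
    print(f"{lang}: premium {avg_tokens(sents) / norm:.2f}x")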

Missing a tokenizer? Add it with a pull request here.

Compare tokenization of sentences

You can compare the tokenization of the same sentences across languages and tokenizers. The sentences are selected from the FLORES-200 parallel corpus.
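Outside of this page, the same comparison can be reproduced with any tokenizer from the Hugging Face Hub; the two models below are arbitrary examples, and the unknown-token percentage mirrors the widget's "Y% Unknown" field:

from transformers import AutoTokenizer  # pip install transformers

# Two example tokenizers; any models on the Hub work the same way.
tokenizers = {
    "XLM-R": AutoTokenizer.from_pretrained("xlm-roberta-base"),
    "BERT":  AutoTokenizer.from_pretrained("bert-base-cased"),
}

sentence = "Езиковите модели третират езиците много различно."

for name, tok in tokenizers.items():
    ids = tok.encode(sentence, add_special_tokens=False)
    unk = sum(i == tok.unk_token_id for i in ids) if tok.unk_token_id is not None else 0
    print(f"{name}: {len(ids)} tokens, {100 * unk / len(ids):.0f}% unknown")
    print("  ", tok.convert_ids_to_tokens(ids))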


For more details, read our paper:

Language Model Tokenizers Introduce Unfairness Between Languages

Aleksandar Petrov, Emanuele La Malfa, Philip H.S. Torr, Adel Bibi

Recent language models have shown impressive multilingual performance, even when not explicitly trained for it. Despite this, concerns have been raised about the quality of their outputs across different languages. In this paper, we show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked. The same text translated into different languages can have drastically different tokenization lengths, with differences up to 15 times in some cases. These disparities persist across the 17 tokenizers we evaluate, even if they are intentionally trained for multilingual support. Character-level and byte-level models also exhibit over 4 times the difference in the encoding length for some language pairs. This induces unfair treatment for some language communities in regard to the cost of accessing commercial language services, the processing time and latency, as well as the amount of content that can be provided as context to the models. Therefore, we make the case that we should train future language models using multilingually fair tokenizers.
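The byte-level point is easy to sanity-check: UTF-8 uses one byte per character for Latin script, but two for Cyrillic and three for most Indic scripts, so a byte-level model pays a per-character premium before any learning happens. The greetings below are illustrative, not drawn from the paper's corpus:

# Byte-level models consume one token per UTF-8 byte, so the same
# greeting costs very different amounts across scripts.
for text in ["Hello", "Здравствуйте", "வணக்கம்"]:
    n_bytes = len(text.encode("utf-8"))
    print(f"{text!r}: {len(text)} characters, {n_bytes} UTF-8 bytes")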

Cite as:
@inproceedings{petrov2023token_unfairness,
    title = {Language Model Tokenizers Introduce Unfairness Between Languages},
    author = {Petrov, Aleksandar and La Malfa, Emanuele and Torr, Philip H. S. and Bibi, Adel},
    booktitle = {Advances in Neural Information Processing Systems},
    url = {https://arxiv.org/abs/2305.15425},
    year = {2023}
}