Linguistic Hegemony and Large Language Models

Matthew McKeever
10 min read · Jul 4, 2023

Some large language models like GPT-4 do best in English. Others, like Baidu’s Ernie Bot or SenseTime’s SenseChat 2.0 do best in Chinese. What follows from these facts?

The topic reveals interesting and surprising interconnections between recent history, fine details of technology, and more speculative questions about intelligence and language.

First I’ll present some interesting facts, then some speculations.

The first and most basic thing to note is that even though it can seem as if something like ChatGPT or Bing is able to hang in many languages, in general big language models do better at some languages than others, and show their best performance in a language when optimized for that language. Thus GPT-4:

This isn’t the only case. Recent Chinese language models show better performance in Chinese than they do in English. Thus we read of Baidu’s Ernie Bot:

One test cited by the Chinese science journal was based on standard admission and qualification exams such as those required to get into college or qualify as a lawyer. Ernie 3.5 surpassed ChatGPT and GPT 4 in Chinese.

However, it was behind GPT 4 in English even though it was ahead of ChatGPT. …

Another test focused on Chinese and contains over 13,000 multiple-choice questions covering more than 50 different subjects. Ernie 3.5 was ahead of ChatGPT and GPT 4 in this evaluation.

And consider:

This figure, I believe, shows that SenseChat 2.0 performs better than GPT-4 in Chinese; it does not show that it performs better in Chinese than GPT-4 performs in English. (Taken from here; as far as I can tell, there is no English-language write-up of this.)

Note that we should already be at least a bit surprised by this, and it's suggestive of conclusions about big-picture questions on intelligence. One might have thought that question answering is, in a sense, language-independent: that one uses intelligence to answer a question, creating a pellet of pure thought which one then elaborates into whatever language one happens to be using. If that were so, then provided one could articulate oneself equally well in different languages, one should manifest the same intelligence in different languages. Since this seems not to be so, while the claim about articulation does seem on the right lines, we can tentatively conclude that intelligence, or at least some aspect of its articulation, is language-dependent. This will come back later.

The second thing to note helps explain the first thing. The best model we have is GPT-4 in English. There are things to say about why GPT-4 in particular — why that particular variation of transformers with that (I think?) somewhat proprietary training data and that (presumably) unique reinforcement learning with human feedback — is the best, but the thing we’re interested in is English language supremacy.

I don’t have the whole answer but a quick and powerful one is just below:

That depicts the representation of languages in Common Crawl, which, per my limited understanding, is basically a massive archive of the internet that is updated regularly and in turn makes up a very large part of LLMs' training data. English is vastly overrepresented relative to other languages. Surely a big part of the supremacy of GPT-4 is the ubiquity of English on the internet.

I have one more fact that I want to consider before turning to analysis. The first fact tells us English is currently the language of our most intelligent machines, with Chinese close behind; the second provides a partial explanation for why this might be. The third, and perhaps the most interesting, can help us see how this situation might change. It is that, as of 2023 and models like GPT-4 (notably not the primarily Chinese-language models), it is considerably cheaper to use GPT in English. This economic fact could be of great consequence.

The reason for this is tokenization. The set of inputs of a language model is infinite, because the set of sentences of a natural language is infinite. We can't program a computer to deal with infinitely many cases, but we don't have to, because the set of primitive vocabulary items of a language is finite. By combining finitely many items we can produce an infinite set of sentences. If we accordingly represent incoming sentences in terms of the finite primitives that compose them, the computer can deal with them. So we don't treat

  • John talks
  • John walks

As two indivisible units, but rather as entities composed of parts. If we assign John the integer 1, talks 10, and walks 11, we'd have {1, 10}, {1, 11} as our inputs.
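That toy scheme can be written out directly; the integer assignments are just the made-up ones from the example:

```python
# A toy word-level tokenizer: each known word gets a fixed integer id.
VOCAB = {"John": 1, "talks": 10, "walks": 11}

def tokenize(sentence: str) -> list[int]:
    """Split on whitespace and look each word up in the finite vocabulary."""
    return [VOCAB[word] for word in sentence.split()]

print(tokenize("John talks"))  # [1, 10]
print(tokenize("John walks"))  # [1, 11]
```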

But tokenization is more complicated than that. The problem, a fairly active one in contemporary philosophy of language (and, I think, some branches of linguistics), is getting clear on what a linguistic unit is. Consider the variants:

  • John talked
  • John walked

What we'd like to do is capture the fact that, despite not really being a word, '-ed' at the end of a verb shares properties with bona fide words: it bears a standing semantic significance across the sentences in which it occurs (of course I could have made the same point with our original sentences and '-s'). So we might assign it its own number, let's say 30, so that 'John talked' becomes {1, 10, 30}.
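Continuing the toy example (the ids and the suffix-stripping rule are made up for illustration), we can treat '-ed' as a token in its own right:

```python
# Extend the toy vocabulary with ids for verb stems and the '-ed' suffix.
VOCAB = {"John": 1, "talk": 10, "walk": 11, "-ed": 30}

def tokenize(sentence: str) -> list[int]:
    """Emit the stem's id followed by the suffix id for past-tense forms."""
    ids = []
    for word in sentence.split():
        if word in VOCAB:
            ids.append(VOCAB[word])
        elif word.endswith("ed") and word[:-2] in VOCAB:
            ids += [VOCAB[word[:-2]], VOCAB["-ed"]]
        else:
            raise KeyError(f"unknown word: {word}")
    return ids

print(tokenize("John talked"))  # [1, 10, 30]
print(tokenize("John walked"))  # [1, 11, 30]
```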

And so it goes. We’ve already dipped our toes into grammar — the past tense. We can really see the complexity of the issue when we realize the complexity of grammar. Consider the incredibly simple phrase ‘in the house’, and some of its translations (into Hungarian, Russian, Mandarin, Latin):

  • a házban
  • v dome
  • zài fángzi lǐ
  • domi

In Hungarian, we express in-ness by putting a suffix, '-ban', on the end of the word (it could equally be a kertben, in the garden, or a boltban, in the store; the suffix harmonizes with the vowels of the noun). In Russian, we use a preposition (roughly like English 'in') as well as adding an ending to the word (here, 'e' to the stem dom). In Chinese, we use something like a preposition and a postposition together to express the in-ness (so I think it's something like at and inside the house). In Latin, we add an ending to the stem.

So think about it: how should we represent in-ness? Depending on the language, it can be a suffix on the noun, a preposition plus a case ending, a preposition-postposition pair, or a bare case ending. Or, simplifying, it might be a question of a mere preposition ('in'), an ending to a stem ('-i'), or an ending to a word ('-ban').

Without getting into too many details, languages differ massively, and so the nature of a token is going to be something language-relative. And that opens up the possibility that different ways of generating tokenizers work better or worse for different languages.

(We don’t get into the details of how this chunking is done, but it’s pretty interesting.)
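A sketch of the dominant approach is short enough to show. Byte-pair encoding (BPE), the family of methods GPT-style tokenizers use, starts from individual characters (or bytes) and repeatedly merges the most frequent adjacent pair into a new token; the following is a minimal character-level version of the merge-learning loop:

```python
from collections import Counter

def learn_bpe_merges(text: str, num_merges: int) -> list[tuple[str, str]]:
    """Byte-pair encoding in miniature: repeatedly fuse the most frequent
    adjacent pair of symbols into a single new symbol."""
    symbols = list(text)  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        fused, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                fused.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                fused.append(symbols[i])
                i += 1
        symbols = fused
    return merges

print(learn_bpe_merges("aaaa", 2))  # [('a', 'a'), ('aa', 'aa')]
```

Because the merges are learned from frequency statistics, a tokenizer trained on mostly English text ends up with long, efficient tokens for English and fragmentary ones for everything else.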

So, two facts: we need to tokenize; and tokenizations for different languages might have different properties, in light of the fact that languages differ greatly in their grammatical structure.

And then the third fact: the token is the unit of account, literally: in using large language models like ChatGPT, you pay in proportion to the tokens used in asking or prompting the model and receiving the answer from it.
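The billing arithmetic is simple; as a sketch (the per-token prices below are hypothetical, purely for illustration, since real API pricing varies by model and changes over time), using the English and Russian token counts from the little experiment later in this post:

```python
# Hypothetical prices, for illustration only; real API pricing varies.
PROMPT_PRICE_PER_1K = 0.03      # dollars per 1,000 prompt tokens
COMPLETION_PRICE_PER_1K = 0.06  # dollars per 1,000 completion tokens

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """You pay for the tokens you send and the tokens you get back."""
    return (prompt_tokens / 1000 * PROMPT_PRICE_PER_1K
            + completion_tokens / 1000 * COMPLETION_PRICE_PER_1K)

# The same paragraph costs more in a language that tokenizes into more tokens:
print(request_cost(67, 100))   # the paragraph's English token count
print(request_cost(157, 100))  # the same paragraph's Russian token count
```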

These facts conspire. Tokenizers vary in their performance for different languages, and if the leading tokenizers do badly for your particular language, you're out of luck. Consider the following, taken from this paper, which goes into great detail about these issues:

These show how many more tokens it takes to tokenize a text in languages other than English. Although the multipliers over 10 are eye-popping, note that even for massively well-represented and common languages the figure is still substantial. It costs more to talk to GPT-4 in any language other than English.

I did a little experiment in the same vein, which the reader can check for themselves. I pasted an arbitrary paragraph from a newspaper article, Google-translated it into various languages, and ran each version through a tokenizing tool. This is what I got (token counts per language):

English 67
Chinese 94
Russian 157
Polish 114
Irish 135
Norwegian 96
Persian 225
German 97
French 104

(It's interesting to note how unpatterned the data seems to be: Polish and Russian are grammatically similar; Persian, people tell me, is an easy language; Norwegian is often noted as one of the simplest languages for English speakers, while Chinese is one of the hardest.)
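Part of the pattern, or lack of it, has a mundane component the reader can check without any API: GPT's tokenizer operates on UTF-8 bytes, and non-Latin scripts need more bytes per character, so they start at a disadvantage before any merge statistics come into play. A quick sketch:

```python
def utf8_bytes_per_char(text: str) -> float:
    """How many UTF-8 bytes, on average, each character of `text` occupies."""
    return len(text.encode("utf-8")) / len(text)

print(utf8_bytes_per_char("house"))  # 1.0 (ASCII: one byte per character)
print(utf8_bytes_per_char("дом"))    # 2.0 (Cyrillic: two bytes per character)
print(utf8_bytes_per_char("房子"))    # 3.0 (CJK: three bytes per character)
```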

This isn't nothing. Last week I used my university access to ChatGPT to do something like sentiment analysis on social media; when I tried the same thing with Russian-language posts, the context I could feed the model was small enough that something already painful in English became unworkable. As of right now, there are use cases for LLMs that depend on the language one happens to be trying to analyse.

Let's finish up by considering some consequences. What is intriguing is that this fairly small technical detail could have interesting big-picture ramifications, both for how we got here and for where we're going. I consider them in turn.

Past

I was in Budapest and visited the house where John von Neumann lived. At 29, in 1932, he wrote Mathematische Grundlagen der Quantenmechanik. 12 years later, with Oskar Morgenstern, he wrote Theory of Games and Economic Behavior. These two works, among the most important and consequential of the twentieth century, illustrate something important: English's preeminent role, especially in academia but also in real life, is pretty new. Read about the career of Bertrand Russell, for example, and you'll find him trying to get his hands on Peano in Italian and corresponding with Frege in German; a French foundations-of-logic conference was one of the intellectual impetuses for his work. People who say we could all be speaking German now or we will all be speaking Chinese are seldom people one wants to listen to, but they are right about the contingency of English's role as lingua franca.

Here’s the obvious point: assume that GPT is something akin to an intelligent machine. A partial and flawed one, but something greater than we’ve ever seen so far and something capable already of doing enough. The fact that GPT is smartest in English isn’t a reflection on English or anglophones. Rather, it’s a reflection on the vagaries of the 20th century, of whatever facts on a microlevel that made von Neumann switch languages in the 40s (obviously it’s tempting to monocausally say that those facts are the Nazis but presumably it’s more complicated than that, even if for von Neumann himself the story fits pretty nicely).

Another consequence is that had English not been hegemonic, not enough training data would have accrued to enable GPT-4. Imagine Common Crawl represented texts in proportion to how much their languages are spoken. Would that have given enough of any one language to enable the systems we see today? Certainly I don't know, but if one takes seriously the idea that increases in performance have been spurred by increases in data, then with less data there'd be fewer such increases.

People talk about the importance of the net and big data for LLMs; perhaps, on a bigger view, the dominance of English belongs on that list too. (There is obviously and very interestingly a different story to be told about Chinese as well, but I can't even begin to gesture at it.) That's point one to take from this post: English as lingua franca should figure in our history of AI.

The Future

Let's look forward. Imagine this possibility: you're a Bulgarian businessperson or leader. You try to digitalize your country to make it competitive, producing documentation and FAQs that can be queried. But starting from a disadvantage — something like $12k GDP per capita (non-PPP, which seems like the right statistic for this point) — you can't afford it. Your system remains undigitized.

Across the border, you’re a Romanian businessperson or leader. Your Romance language has an easier time being tokenized, and so, although you’re roughly as well off as your neighbour, you can avail of our new intelligent machines, digitize your state, make money, afford more, make more money, and so on, in a virtuous cycle of development.

Finally, consider a crushingly mediocre Anglophone (one living, say, on the Ards peninsula of Northern Ireland). The same situation, but you digitize everything easily and cheaply, pulling ahead of your smarter rivals merely by the exorbitant privilege of English being your birth language. The world becomes worse as a result.

These are just-so stories, of course, but a possible future is one in which economic inequality ripples out according to the incredibly strange measure of how difficult your language is for an English-centric tokenizer to tokenize. And that possibility is the second thing to take from this post as something to be aware of.

Let's end with one massively sci-fi scenario, just because it's fun. Many are worried about running out of training data — we've already more than taken the low-hanging fruit, and good data is going to be increasingly hard to come by. Now recall one speculative possibility bruited earlier: that intelligence somehow inheres in the language in which that intelligence is expressed, and isn't instead pure pellets of brainium that are then clothed in words. And remember it seems like intelligence scales with the quantity of language available. Then there's a simple (!) solution: invent a megalanguage comprising both Chinese and English. Its corpus would be the union of the English and the Chinese web, and so its intelligence would be greater than either language-bound intelligence taken individually. In such a language, fángzi and house would receive very similar embeddings, and so the probable continuations of an occurrence of that token (once we add the necessary attentional magic) would include continuations like you san ge comfortable de fangjian, but more importantly could manage to include everything sinophones and anglophones know about the topic in question.

Of course, this is unbelievably weird, to the extent that it’s not clear it’s coherent. What is a superlanguage? As far as I know, they don’t exist. Would they even give the advantage I claim? I don’t know. It’s fun to think about, though.

History unfolds weirdly: small things make big differences. This post has tried to make the case that the small thing of tokenization, something that normally gets relatively short shrift in tutorials about language models, is worthy of our attention as a possible big difference maker.
