What can Baidu’s AI do?
Last week was another year in deep learning. GPT-4 stole the show, Google seemed to whiff it, and right now I’m in a library trying to get LLaMA onto my $450 PC. China’s Baidu also released, or at least introduced, its own model. Called ERNIE Bot in English, or 文心一言 in Chinese, it’s another transformer that aims to do for Chinese netizens and businesses what Bing and ChatGPT are doing for Westerners.
If you’ve followed this story, you’ll know that there’s been a bit of expectation management. At the Two Sessions last week, a tech minister said ChatGPT-style performance is ‘hard to achieve’. Perhaps to avoid a repeat of the Google unpleasantness, last week’s press conference showed a demo that wasn’t immediately accessible for people to try out.
Nevertheless, people have gotten access, and some stories have been coming out about what ERNIE can do. My aim here is simply to present the findings from one article published yesterday. As far as I can tell, there aren’t many such articles in English, and so though one can’t really conclude much from n=1 (or maybe n=0.5, since I’m reporting what one person did under conditions I don’t know), it’s better than nothing, and gives me something to do while my downloads keep on breaking. (If anyone got Alpaca 7B running on Windows this weekend, hints appreciated.)
A big but instructive caveat lector: my linguistic skills are nowhere near good enough to do this well. In fact, an interesting conclusion will be that potential AI arms races have this feature: assessing a given product requires deep cultural knowledge. Presented with a rival’s nuclear sub, an engineer of any nationality will be able to assess it. This isn’t so for deep learning systems, whose impressiveness often turns, as we’ll see, on extremely subtle cultural and linguistic knowledge.
In the article I read, ERNIE was tested on five relatively standard tasks: literary creation; advertising copywriting; basic mathematics; Chinese language comprehension; and multimodal generation. Let’s look at each in turn.
The first task was to summarize the ‘core contents’ of Liu Cixin’s Three-Body trilogy, and then to make suggestions for how to continue it. Here’s the Google Translate version of the answer about the first book:
1. The first part of “The Three-Body Problem” tells the story of the astronomer Ye Wenjie who got in touch with the Three-Body Civilization in the universe, betrayed human beings and sent the coordinates of the earth without authorization, resulting in the exposure of the earth. Nano scientist Wang Miao came into contact with the secret organization (ETO) formed on earth to deal with the arrival of Trisolarans through the online game “Three-Body Problem”. Man decides to invade Earth. The trisolarans have locked down the basic science of mankind by using technology far superior to that of human beings, and the fleet has begun to sail to the earth, and the end of mankind is approaching.
Here are the ways ERNIE suggested continuing, based on the first answer:
1. Character shaping: It can describe the character, background and experience of the main characters in a deeper way, making them more real and believable.
2. Plot design: More plots and turning points can be designed to enhance the tension and suspense of the story.
3. Cultural background: Different cultural and social backgrounds can be explored to increase the depth and breadth of the story.
4. Philosophical thinking: You can explore deeper philosophical issues, such as the meaning of life, the relationship between human beings and the universe, etc.
The author asks for ways in particular to continue writing so as to further explore the philosophical aspect, and here’s some of the answer. One could write about:
1. The meaning of life: How do human beings, as organisms, exist in the universe and realize their own value? This may involve issues such as human consciousness, morality, and emotion.
2. The relationship between humans and the universe: How do humans affect the operation and evolution of the universe? This may involve issues such as the origin, structure, and evolution of the universe.
3. Morality and ethics: How are human moral and ethical concepts formed and evolved? This may involve human beings’ awareness of morality, justice, rights, etc.
These answers are notably generic. They are not particularly Three Body specific. The author and other commentators note this:
Some internal test users also said that when the same question was put to ERNIE and GPT-4, there was a certain gap between the two answers. For example, in the continuation of “The Three-Body Problem”, ERNIE’s answers are more abstract, such as the meaning of life and the relationship between human beings and the universe, while GPT-4’s answers are more concrete and conflict-driven, such as the rise of humanoid robots, the challenge of the dark forest law, etc.
In support of this, I asked Bing to answer basically the same question (“I’d like to write a work of philosophical fiction loosely based on the Lord Of The Rings. Do you have any ideas for what I could write about?”) and it produced:
This is maybe a bit better — while the bullet points are still generic, they are perhaps more clearly on theme than some of the ERNIE ones, and the paragraph that follows is interesting.
The second task was to produce a business idea, name, slogan, and press release for some invented companies. I won’t discuss this, because this sort of thing — branding — depends on extremely fine-grained linguistic knowledge that I don’t have. (For example, Macademia might be a good name for a company that makes Apple-like products (‘Macademia’ sounds nice), whereas Conkers wouldn’t be a good name for a fictitious Python package called Contrastive Keras, because ‘conkers’ very slightly suggests testicles in British English, a fact most speakers of English might not know. The point is, I wouldn’t even get close to understanding the Chinese version of a Macademia vs Conkers distinction.)
The third is a (I guess?) famous algebraic word puzzle that goes like so:
There are several chickens and rabbits in the same cage, counting from the top, there are m heads, and counting from the bottom, there are n feet. Ask how many chickens and rabbits are in the cage?
ERNIE gets the answer, and indeed successfully corrects the questioner when they post a mistakenly insoluble version. And it shows its working.
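For concreteness, the puzzle resolves with two equations: chickens x plus rabbits y give x + y = m heads and 2x + 4y = n feet, so y = (n − 2m) / 2. A minimal Python sketch (the function name and the classic 35-heads/94-feet numbers are my own illustration, not from the article):

```python
def chickens_and_rabbits(m, n):
    """Solve the cage puzzle: m heads, n feet, chickens (2 feet) and rabbits (4 feet).

    Returns (chickens, rabbits), or None if the given counts are insoluble.
    """
    # From x + y = m and 2x + 4y = n, the rabbit count is y = (n - 2m) / 2.
    rabbits, remainder = divmod(n - 2 * m, 2)
    # Insoluble if the feet count is odd, or implies a negative animal count --
    # the trap in the trick version ERNIE is said to have caught.
    if remainder != 0 or rabbits < 0 or rabbits > m:
        return None
    return m - rabbits, rabbits

print(chickens_and_rabbits(35, 94))  # classic version: (23, 12)
print(chickens_and_rabbits(10, 21))  # odd number of feet: None
```

Checking the classic case: 23 chickens and 12 rabbits give 35 heads and 23·2 + 12·4 = 94 feet.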
The fourth is a test of Chinese linguistic and cultural knowledge. Since this is again beyond me, I’ll just quote:
It is worth mentioning that ERNIE is a large language model rooted in the Chinese market, so it has advanced natural language processing capabilities in the Chinese domain, and performs better on Chinese language and Chinese culture. In the on-site demonstration, ERNIE correctly explained the meaning of the idiom “Luoyang Zhigui” (洛阳纸贵, ‘paper is expensive in Luoyang’) and the economic theory it corresponds to, and composed an acrostic poem (藏头诗) on the four characters of “Luoyang Zhigui”.
The mentioned idiom perhaps plays something like the role a classical or Shakespearean allusion might play in educated English. I thought I’d see how Bing does with something in that ballpark, the sort of thing one knows if one has studied a great books course:
The fifth and final thing is that ERNIE can generate pictures, audio, and video. The author asks it to make a poster and blurb for a conference, and has it speak the blurb in a Sichuan accent. It succeeds, apart from the video generation, which hasn’t yet been rolled out to users owing to computational expense.
The author summarises the views of people with access:
Based on hands-on tests of ERNIE by several media outlets, the conclusions are: common functions perform normally, but there are still many flaws. During the trial period it was found that ERNIE’s conversation still has many flaws, with frequent inconsistencies. In addition, there is little coherence between turns of a conversation, which reads more like one-off question-and-answer.
It goes on to point out some smallish advantages ERNIE has over ChatGPT such as being able to get accurate weather information and avoiding at least one factual mistake ChatGPT makes.
This is nowhere near enough information to benchmark ERNIE. The important thing I want to end on is the in-principle difficulty of regular people, including regular people who work for software companies, being able to grok one another’s systems. Apart from coding, which seems to come out as a sort of new mathesis universalis, assessing deep learning systems will require a lot of cultural knowledge.
We know this, of course. Many of the most impressive (again, excluding coding) feats are artistic, like parodies, and are most likely culture-bound. My favourite, which I regularly re-read:
It’s very unclear to me — I could be wrong — whether this sort of absurdist humour will travel well. If it doesn’t, then someone not pretty deeply connected to Anglophone culture might fail to appreciate its impressiveness, and thus be led to underestimate the system in question. The difference between this and a failed parody might be essentially imperceptible to someone for whom English is a second language.
The same applies here, mutatis mutandis. Some Chinese benchmarks involve producing classical Chinese poetry. But surely the difference between a good such poem and a bad one is something only a highly select group of Sinophone people will be able to appreciate. We could imagine someone in Silicon Valley to whom the newest of new LLMs leaks, but who can’t understand the first thing about it, because they don’t know that its outputs are truly, and for the first time, indistinguishable from Li Bai (of whom they’ve never heard).
This matters, and suggests various rather extreme scenarios. Imagine a unipolar environment: one LLM to rule them all. We all use the best, whatever its base language is (again, if we consider coding, it’s not at all impossible that the best model could be one that excels at Chinese and Python but is only so-so at English creativity). But the best is based, at least in part, on a property that no monoglot Anglophones (/Chinese-speakers) can judge, namely creativity in Chinese (/English). Deep learning is often castigated as opaque — we could imagine it much more opaque, where one just has to take it on trust that a system is the best, in part because it excels at a task you can’t judge.
(One more crowded thought: one might imagine this differing cultural relativity of deep learning outputs — poems very relative, code not — could exert pressure on what such systems are optimized for. In particular, one could imagine a fine-tuner faced with getting an LLM to produce better code or better poetry. If the whole world can appreciate the code but only some the poetry, maybe the poetry will get weeded out. And then if the prognosticators are right that deep learning systems become the world’s economic dynamo, then given that dynamo will be a code-producer, the economic foundation of the world might become code (or, more generally, culture-agnostic products, perhaps like some images or some games). These sorts of potential socio-econo-culturo-technological consequences are unbelievably interesting and worth exploring at length.)
If there comes to be an AI arms race, then, it might be an arms race with peculiar characteristics. At the risk of being annoying: it might be a novel sort of culture war, in the sense that the fought-over devices are devices that depend intimately on cultural knowledge for assessment, in a way that nukes and chips don’t.