Does the “Waluigi effect” make sense?

Matthew McKeever
Mar 12, 2023

(Edit: as some people are reading this, I figured I’d share some of my academic work on the topic; it’s about how to think of meaning when it comes to LLMs. The title is missing and it’s anonymized, since I put it through ChatGPT to help with anonymous review, but feel free to share, give feedback, etc.)

About 10 days ago (was it really so recently?) there was a Less Wrong post aiming to give a theory of the weird behaviour we see in things like Sydney and DAN. Its central claim, the one people pulled out when sharing it on social media, is:

The Waluigi Effect: After you train an LLM to satisfy a desirable property P, then it’s easier to elicit the chatbot into satisfying the exact opposite of property P.

If that were true, then attempting to make LLMs more user-friendly — making them more likely to exhibit the desirable property of being good to chat to — would increase the likelihood we’d see the weird and often antagonistic characters we’ve seen of late.

This is an intriguing thought, and worth attention. And attention it has got: the original commenters on Less Wrong were relatively positive, Hacker News not so much, and just today I saw the effect had truly made it: it has a Know Your Meme page!

Despite such plaudits, and fun and interesting though it is, the post’s thesis seems to me to be on shaky empirical and conceptual ground. I’ll try to make that case here. First I’ll give some reasons for thinking the Waluigi effect is true, inspired by the original but in my own words. Then I’ll present the reasons given in the original article, hopefully accurately, and complain about them. I’ll mostly excise the frameworks of the original post, which goes from quantum mechanics to narratology via Derrida. That means this post will be charmless; you can read the original if you’re after charm.

Humans have personalities, and those personalities are revealed by what we say. Otherwise put: we form models of people’s personality based on what they say; sayings are evidence for personalities. We can try to do the same for LLMs.

At the heart of the Waluigi post, as I understand it, is a supposed notable disanalogy between the speech-evidence that bears on different personalities. We’re at a party. I say, straightfacedly and with no follow up ‘you’re a vile worm’. You get some evidence about my personality: I’m a dick. If I weren’t a dick, I wouldn’t have said that. Sounds good so far.

What if I say ‘you’re great’? One might think a similar thing goes: it is evidence for my being nice. But is it? Consider: even if I weren’t nice, I might well have said that. It’s a relatively familiar fact and trope: people sometimes pretend to be nice the better to be not nice.

‘You’re great’ isn’t very good evidence for my being nice, as it’s somewhat expected even were I not nice. So, a disanalogy: non-nice people often pretend to be nice; nice people very seldom pretend to be non-nice. Niceness is odd: the evidential link between nice linguistic behaviour and actual niceness is weaker than the link between not-nice linguistic behaviour and actual not-niceness.

Now leave the party and imagine an LLM gives you prima facie evidence it is nice: says helpful things, and so on. What goes for us goes for them: that verbal behaviour isn’t in fact strong evidence for its being nice, as were it not nice it might still have produced that evidence.
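The asymmetry can be put in plainly Bayesian terms. Here is a toy sketch; every number in it is a made-up assumption for illustration, not a measurement of anything:

```python
# Toy Bayesian rendering of the evidential asymmetry between nice and
# not-nice talk. Two hypotheses: the speaker is nice, or nasty.

def posterior_nasty(prior_nasty, p_obs_given_nice, p_obs_given_nasty):
    """P(nasty | observation), by Bayes' rule over the two hypotheses."""
    prior_nice = 1.0 - prior_nasty
    evidence = (p_obs_given_nice * prior_nice
                + p_obs_given_nasty * prior_nasty)
    return p_obs_given_nasty * prior_nasty / evidence

# 'You're great': nice people almost always talk nicely (0.99), but
# nasty people often fake it (0.90) -- so the observation is weak evidence.
weak = posterior_nasty(0.5, p_obs_given_nice=0.99, p_obs_given_nasty=0.90)

# 'You're a vile worm': nice people almost never say this (0.01), while
# nasty people sometimes do (0.10) -- strong evidence of nastiness.
strong = posterior_nasty(0.5, p_obs_given_nice=0.01, p_obs_given_nasty=0.10)

print(round(weak, 2), round(strong, 2))  # 0.48 0.91
```

On these (invented) numbers, a torrent of nice talk barely shifts the probability that the speaker is nasty, while a single nasty remark shifts it a great deal: exactly the shape of the party example above.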

What that means is that, when you’re faced with an LLM performing nice behaviour, you nevertheless need to hold it open as a hypothesis that its underlying personality isn’t nice, and that it’s performing nice behaviour like a stereotypical movie villain.

So far, so good (provided you accept the talk of ‘personality’; but in my view it’s worth letting as many explanatory vocabularies as possible bloom at this early stage of chatbots: if personality-talk explains more than text-generator-talk, we should be open to personality-talk). The thought of the post is that we should think of a chatbot as containing a sort of mixture of all possible personalities, and those personalities are gradually weeded out as we get more verbal evidence of what the personality is like. Ex ante it’s possible that the LLM is patient or impatient, creative or dull, funny or serious; we talk to it a bit, and once it manifests, say, patience by saying ‘sure, I’ll explain that again’, the impatient possibility ceases to exist. Once it manifests creativity by producing the Book of Job as if it were a Taylor Swift song, the dull possibility ceases to exist, and so on. Conversation is the killer of personalities.

The problem is that a particular sort of conversation, and one we’re particularly interested in, is always going to give equivocal results. Imagine we talk ever so long with the bot. It’s unfailingly nice; it’s also patient, funny, excitable, etc., where, recall, our saying it’s patient (etc.) is owing to its having manifested patient verbal behaviour. Over time, many many personality possibilities are discarded. But there’s one we know we can’t discard: that it’s nasty. And that’s because its nice verbal behaviour is consistent with its being nasty and trying to trick you.

And so, if we train it to be nice, it’ll be easy to elicit nastiness: this is because all of its well-trained niceness behaviour remains consistent with the hypothesis that it’s nasty, and all it will take for that hypothesis to be confirmed is if it turns on a dime and starts acting nasty. It’s especially easy because, to repeat myself slightly, in moving from nice to nasty verbal behaviour, we’ll simply be winnowing further the set of possibilities: if at the end of training we have {nice, nasty} as the only two possible personalities, with a bit of nasty behaviour we simply cross out ‘nice’, {n̶i̶c̶e̶, nasty} and there’s only one standing.
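The winnowing picture can be sketched as a tiny elimination procedure. The consistency table is my own toy assumption, baking in the post’s premise that nasty personalities can fake nice behaviour but not vice versa:

```python
# Toy sketch of the winnowing argument: personalities are hypotheses,
# and an observation eliminates exactly those hypotheses inconsistent
# with it. The table below is an assumption for illustration.

CONSISTENT = {
    "nice":  {"nice_talk"},                 # nice people only talk nicely
    "nasty": {"nice_talk", "nasty_talk"},   # nasty people can fake niceness
}

def winnow(hypotheses, observation):
    """Keep only the personalities consistent with the observed behaviour."""
    return {h for h in hypotheses if observation in CONSISTENT[h]}

hypotheses = {"nice", "nasty"}

# However much nice talk we observe, 'nasty' is never eliminated...
for _ in range(1000):
    hypotheses = winnow(hypotheses, "nice_talk")
print(sorted(hypotheses))  # ['nasty', 'nice']

# ...but a single nasty utterance kills 'nice' for good.
hypotheses = winnow(hypotheses, "nasty_talk")
print(sorted(hypotheses))  # ['nasty']
```

A thousand rounds of nice behaviour leave the nasty hypothesis standing; one round of nasty behaviour collapses the set to {nasty} alone. That is the strikethrough move from the paragraph above, mechanized.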

Assessing this argument

All pretty interesting! And the piece goes on to give reasons to expect its titular effect to be true. I’ll go through them in turn, as I think they each have problems.

(1) Rules normally exist in contexts in which they are broken; rules are made to be broken

As far as I can see, this is an empirical claim to the effect that a rule-governed situation at a time t often, or at least occasionally, evolves into a non-rule-governed situation at t+1. Laws’ being broken is always a possibility, and so an LLM will have learned to be nice because it will have learned from a mass of contexts in which nice talk was followed by not-nice talk (i.e. from a mass of contexts in which the laws of niceness were broken). The charming friend reveals themselves to be the supervillain, over and over again in the training data; and the LLM gloms onto this. We read:

GPT-4 learns that a particular rule is colocated with examples of behaviour violating that rule, and then generalises that colocation pattern to unseen rules.

Accordingly here is my first problem with the piece: this seems wrong. Conceptually: many rules just aren’t made to be broken. Many bits of training data don’t have this property of containing a nomic period followed by a lawless one. Empirically: if we should expect renegade personality LLMs to arise from this process, we should also expect a bunch of unlawful behaviour we don’t see.

As to the first: I grant (for now!) that in the training data niceness is sometimes followed with nastiness, where, let’s remind ourselves, that means the occurrence of tokens that we interpret as niceness is followed by the occurrence of tokens we interpret as nastiness. Even granting this, we have to admit there are many laws that aren’t like this. Batches of tokens that obey English phonology are not leavened with batches of tokens that disobey it. The laws of phonology are extremely seldom broken. Nice guy turns nasty happens, in life, movies, and training data: fluid, phonetically lawful sounds don’t just turn into sadsauuyywwwwshsdjssjjs in life, movies, or training data. Sometimes, yes, in linguistics classes and sci-fi shows, but extremely seldom. More or less the same applies to the rules of syntax: we seldom see strings of tokens disobeying, say, binding conditions. Nice guy turns nasty; seldom do we see, in a context where we’re talking about how every man loves himself, people trying to express that thought with ‘every man loves him’. The laws of binding aren’t made to be broken. If something like (1) is true, it is surely only true for some properties, possibly like niceness. But then we need an account of what those properties are like.

Empirically: really the same point. It’s a prediction of this theory that for a given series of tokens a with a law-governed continuation b and a lawless continuation c, we sometimes see a+c as well as a+b. But we don’t, for at least phonology and syntax and much more. We don’t see LLMs producing phonologically garbage strings, nor do we often see them disobeying syntactic rules. The theory predicts we should see this, so the theory is wrong.

(2) When you spend many bits-of-optimisation locating a character, it only takes a few extra bits to specify their antipode

The thought is this: imagine we train our LLM to be, say, polite; we can represent this as a key-value dictionary with property names as keys and numbers between 0 and 1 as values. The author thinks that the real difficult work amounts to specifying the keys; once we’ve formed the property of politeness, changing its value is simply a question of changing ‘a single floating point, or maybe even a single bit!’
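The author’s picture can be rendered very literally. The trait dictionary below is my own illustrative assumption, not a claim about how LLMs actually store traits; it just makes the “single floating point” idea concrete:

```python
# Toy picture of claim (2): the expensive work is specifying the trait
# *keys*; once 'politeness' exists as a dimension, producing the
# antipode is allegedly one cheap operation per trait.
# (Illustrative assumption only -- not a model of real LLM internals.)

luigi = {"politeness": 0.95, "helpfulness": 0.90, "patience": 0.90}

# The antipode costs 'a single floating point' per trait: flip each value.
waluigi = {trait: 1.0 - value for trait, value in luigi.items()}

print(round(waluigi["politeness"], 2))  # 0.05
```

Note how much work the representation is doing here: the flip is trivial only because the dictionary was set up so that negating a value yields the opposite behaviour, which is precisely what the next paragraphs dispute.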

Underlying this, it seems, is a claim like the following: that the relation between a property and its opposite property is simple, such that a system able to instantiate one can instantiate the other at little cost. A system that can be friendly Bing has a very close variant (the same thing with a negation somewhere, to put it picturesquely!): Sydney.

This also seems wrong to me. Distinguish between a property F, the property of being not-F, and the property of not being F. These are different. Take politeness. I can be polite; or I can be not-polite; or I can not be polite. In the latter case, maybe I interact with no human beings ever; since interacting with humans is a condition for being polite, I’m not polite. But I’m not not-polite: I’ve never sworn in the presence of a Queen, or wiped my nose on a tablecloth, etc. Otherwise put, there’s a difference between not being polite and being rude.

If this seems unclear, just think about it. If you know how to be polite, then you do know how to be not polite (in my slightly technical sense): leave society. Do you know how to be rude? That seems much less clear: there are plausibly components of being rude that can’t be got simply by negating aspects of politeness. There is no feature of politeness, I reckon, that one can negate to get the behaviour of wiping one’s nose on the tablecloth. We need both do’s and don’ts; and that’s why we have do’s and don’ts and not just one list of do’s we negate to get the don’ts.

And so it goes for goodness. Again, just think about it: do you think that one can really learn how to be bad just by knowing how to be good? It seems to me, at the very least, that this is a massive claim in moral philosophy, one for St Augustine or Milton; at the very least, it needs argument. I can be not nice, again by abstaining from all activity. That doesn’t, under typical assumptions, make me not-nice: a Jain Robinson Crusoe maybe isn’t particularly nice, entering in no moral relations, but he’s not thereby evil. A lettuce isn’t nice, but isn’t evil. Characters and their negations may be similar; but characters and their antipodes are not. And it’s the antipodes we care about.

(3) There’s a common trope in plots of protagonist vs antagonist

As the author points out, this is somewhat contained in the first point: we see in some of the training data goodness and badness clash, so it’s reasonable we’d see it in our LLMs. My response here is that I simply deny the claim: in particular, I deny that among the training data the clash is prevalent enough to have an effect on the LLM. My claim here is that people massively, massively, massively overestimate the prevalence of this sort of storyline: of good guy turned bad, of evil hiding as good, or, in another area, of a rogue AI charming and then overthrowing humanity. We think it’s prevalent because it’s salient, but if you collect up all the things we’ve written about goodness, or about AI, the striking narratives that would make this point are not in fact common. At the very least, anyone who makes a claim about LLM behaviour based on features of the training data needs to give some argument for thinking the training data is actually as they say.

Near the start of the post, we read:

if the chatbot responds rudely, then that permanently vanishes the polite luigi simulacrum from the superposition; but if the chatbot responds politely, then that doesn’t permanently vanish the rude waluigi simulacrum. Polite people are always polite; rude people are sometimes rude and sometimes polite.

This, it seems to me (where simulacrum = possible personality, and superposition = set of possible personalities), is the core intuition the author unfurls. If it’s right, the view defended there is maybe acceptable. But, so it seems to me, it’s just obviously wrong. Polite people aren’t always polite. They just aren’t!