Why the war? A quantitative answer

Matthew McKeever
Jun 5, 2022

As the Russian invasion of Ukraine passes its hundredth day, the question of why it started in the first place hasn’t really been settled. Putin gave two potential and not particularly consistent reasons in his speeches in February: he wanted to ‘denazify’ eastern Ukraine, the official line oft-repeated in Russian media; and he was concerned with NATO enlargement (a big theme of his 24th February speech in particular). Does one reason, with the benefit of some hindsight, seem more compelling now?

Maybe there’s no definitive answer: it’s probably some of both. Nevertheless I want to take a look to see if the data sets I’ve been looking at in the past few posts shed light on the question. Let me briefly recap what I did before getting on to the question of my title.

Last week I had a look at some data from the Telegram archives of Russia’s TASS and Ukraine’s Ukrinform feeds. I presented some suggestive-ish data about the attention they give to various important topics, and wondered whether we could learn anything from it.

I noted in that post that there might be methodological reasons to be wary of that approach. Here I want to mention them and show how I tried to overcome them this weekend. I also want to say what I was looking for, what my default hypothesis was about how the two news agencies mentioned above would discuss the war, and why I’m tentatively calling that hypothesis wrong. In a word: I thought that, given the war was a topic of shared concern for both agencies, there would be at least a correlation between TASS’s talking about, say, the US or Mariupol or Zelensky or Putin or Macron and Ukrinform’s talking about it. That is, I thought there might be some equation: for many topics x, Ukrainian mentions of x = Russian mentions of x × some factor. It seemed perfectly reasonable to me, for example, that if Macron gave a press conference, both sides would report it, perhaps proportionally to how they report foreign countries or how often they post in general, and that that would be discernible as a pattern in the data.

Then I thought there would be some topics for which there was no such equation, and I thought, having established a baseline correlation with the above x’s, one could use the topics for which there was no such correlation to squirrel out differences in perspective or propaganda, thus using this sort of quantitative approach to shed light on political rhetoric, a topic I work on in my day job.

The problem is that no data I could find and process bore out the above hypothesis. It’s very possible I didn’t look in the best places — I’m an amateur, doing this on weekends and nights, and the fancy statistical stuff is beyond me.

But disregarding that, at least tentatively, I’m tempted by the following new hypothesis: the two sides’ stories of the war are massively different, and the idea that there is some one topic — namely, the action on the ground and in the offices and barracks in Ukraine and in Moscow (or the Urals or wherever) — that both sides are concerned primarily with tracing is wrong. The data I’ll present this week supports that hypothesis.

Here’s what I did last time. Having preprocessed some of the downloadable archives to make them searchable and visualizable, I took a bunch of pairs of synonymous Russian and Ukrainian words that I thought would be saliently correlated, as objects of interest to both sides, and looked to see if there was any interesting correlation. Maybe there was — judge for yourself here.

But there’s a problem with that. My intro statistics book (Thinking Clearly With Data — recommended!) talks about “selecting on the dependent variable”. That’s — roughly — when you attempt to discern relationships between phenomena by picking examples where you think the relationship is likely to manifest. The example the book uses is trying to see whether there’s a relation between elite performance and spending 10,000 hours doing something by concentrating only on elite performers and not on others who have spent 10,000 hours on something (as most of us have, for at least one thing) without thereby becoming elite. Plausibly, I selected on the dependent variable by focusing on salient word-pairs.

Now, I had a reason for doing this. It’s a bit recondite and I don’t want to get into the details, but basically Ukrainian and Russian are highly inflected languages, and a given word can occur in a bunch of forms. Moreover, the libraries I experimented with for overcoming that problem either gave so-so results or worked slowly or both. A consequence is that you can’t say: computer, find every name, and count up its occurrences, because a name is ambiguous: ‘Ukraine’ in Ukrainian occurs frequently as any of України, Україні, Україну, and that’s an easy case. The computer won’t know they are each variations of the same underlying word, so they will be classified and counted separately.
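A toy example makes the problem vivid. The sentence below is invented, but it shows a naive count treating three case forms of ‘Україна’ as three unrelated words:

```python
# A naive word count can't see that України, Україні and Україну
# are all inflections of the same name.
from collections import Counter

text = "Слава Україні! Новини України. Вони люблять Україну."
words = text.replace("!", "").replace(".", "").split()
counts = Counter(words)

# Three forms of one underlying name, each counted separately:
print(counts["Україні"], counts["України"], counts["Україну"])  # 1 1 1
```

Proper lemmatization would collapse these, but as noted, the libraries I tried were slow or unreliable.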

There’s another problem with concentrating on a limited number of salient cases. A term such as Україна or Путин is going to turn up an awful lot. By concentrating on such terms, you’ll be looking, essentially by design, exclusively at outliers. But, again like concentrating only on elites, concentrating on outliers might give dodgy results.

This week, I attempted to overcome these problems.

What I Did

Working with the same basic Telegram archives easily downloadable by anyone, I concentrated, as a hopefully representative sample, on the month of April (only one month because, as noted below, a substantial portion of this approach can’t be automated). I retrieved every capitalized word, since capitalization is a good indication of being a name. Then (and this is a highly unscientific move) I attempted to merge the various forms of a given name by looping through the potential names and, whenever I found two words with the same initial four letters, assuming they were the same word and merging them. That will have led to some mischaracterisations, but as far as I can tell they should be relatively few. If anyone wants to see it I’ll tidy up the Python script that does this, but it’s mostly what you’d expect.
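Here is a minimal sketch of that merging heuristic. Grouping words by their first four letters is a slight simplification of looping and merging pairwise, but it has the same effect for exact four-letter prefix matches; the counts are invented:

```python
# Merge capitalized words that share their first four letters,
# on the (unscientific) assumption that they are inflections of
# one underlying name.
from collections import Counter

def merge_by_prefix(counts, prefix_len=4):
    merged = Counter()
    for word, n in counts.items():
        merged[word[:prefix_len]] += n
    return merged

# Hypothetical raw counts of capitalized words:
raw = Counter({"України": 3, "Україні": 2, "Україну": 1,
               "Путин": 4, "Путина": 2})
merged = merge_by_prefix(raw)
print(merged["Укра"], merged["Пути"])  # 6 6
```

The obvious failure mode is unrelated names sharing a prefix (or related names differing in their first four letters), which is where the mischaracterisations come from.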

This was my attempt at getting round the problem that caused me to select on the dependent variable last week: I tried to get data for every name-like word rather than just a few of obvious interest. To get around the problem of concentrating solely on outliers, I then chucked out from the list those words or word-parts that happened to occur a great deal, which, unsurprisingly, happened to be the sort of very salient words one would expect — Russia, Ukraine, Putin, Zelensky, etc. — as well as words that happened to get mischaracterised as names (such as prepositions occurring capitalized at the start of a sentence).

(The whole list, which is incomplete, is: “Росс”, “Моск”, “РФ”, “Путин”, “Владимир”, “Це”, “Росі”, “Кие”, “Киї”, “Киє”, “Украи”, “Зелен”, “Про”, “Видео”, “В”, “З”, “За”, “У”, “На”, “Об”, “По”, “І”, “Президент”, “Україну”, “Україна”, “Укрінформ”, “ЗСУ”, “Укріформу”, “Киев”, “Минобороны”. If you look at the linked-to docs and can read the languages, you’ll see there are other words that should have made the list, which means the end data set is itself rough-and-ready.)
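Applying that stoplist can be sketched like so: drop any word that begins with a listed prefix. The candidate words below are hypothetical:

```python
# Filter out merged names whose prefix is on the hand-built
# stoplist of over-frequent or mischaracterized words.
stoplist = {"Росс", "Моск", "РФ", "Путин", "Владимир", "Це", "Росі",
            "Кие", "Киї", "Киє", "Украи", "Зелен", "Про", "Видео",
            "В", "З", "За", "У", "На", "Об", "По", "І", "Президент",
            "Україну", "Україна", "Укрінформ", "ЗСУ", "Укріформу",
            "Киев", "Минобороны"}

def keep(word):
    # Reject any word beginning with a listed prefix (exact matches included).
    return not any(word.startswith(s) for s in stoplist)

candidates = ["Мариуполь", "Путин", "НАТО", "Видеообращение"]
print([w for w in candidates if keep(w)])  # ['Мариуполь', 'НАТО']
```

Note that prefix matching is case-sensitive, so all-caps acronyms like НАТО survive single-letter entries like “На”.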

Then, I chucked out all but the top ten entries. So at this point, we have a data set that looks like so: date: {word1: occurrences, word2: occurrences, word3: occurrences}. The dates are each day in April; among the words is any word that occurred more than five times, a number picked arbitrarily to hopefully weed out irrelevancies. And there’s a separate set for Ukraine and for Russia.
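In code, that final data shape looks something like this (all dates, names, and counts here are invented):

```python
# Build the per-date data set: for each day, keep only merged
# names that occurred more than the arbitrary threshold.
from collections import Counter

MIN_COUNT = 5  # the arbitrary threshold from the text

# Hypothetical merged counts for two April days:
daily_counts = {
    "2022-04-01": Counter({"Мари": 12, "НАТО": 7, "Харк": 3}),
    "2022-04-02": Counter({"НАТО": 9, "Одес": 6, "Льві": 2}),
}

dataset = {
    date: {w: n for w, n in counts.items() if n > MIN_COUNT}
    for date, counts in daily_counts.items()
}
print(dataset)
```

One such structure is built per feed, giving the separate Ukrainian and Russian sets mentioned above.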

Results

Merely scanning the results, certain things immediately jumped out. The overlap between what TASS talked about and what Ukrinform talked about is extremely small. You can confirm this by looking at the linked-to docs (below), which are a dump of the two data sets.

But it’s not like there’s no rhyme or reason to what the agencies say. There are clear patterns in both the Ukrainian and the Russian data sets. In a word: a preponderance of the most talked-about things in Ukrinform are places in Ukraine; by contrast, for Russia, places in Ukraine are talked about roughly as much as The West. By The West — the caps are meant to reflect the way that, for Russian media, ‘West’ and its cognates serve as a unified and important object of attention — I mean the USA, the UN, Europe, the European Union, and so on.

The easiest way to see this is just to look. Here, first, is the first page (of two; the second is similar) of the results from TASS, where yellow indicates that the word stands for a place in Ukraine (or close to it: I count Belgorod, on the Russian side north of the Ukrainian border, as close enough) and the bluish colour stands for a Western entity. (Grokking this, incidentally, it should be immediately apparent how ubiquitous the US (США) is in Russian discussion.)

Most occurring name-like words in TASS for April 2022, p1

And now let’s look at Ukrinform, where again I just include the first page, and the colour scheme is the same:

Most occurring name-like words, Ukrinform, April 2022, p1

Significance

I think this has meaning. Recall the two interpretations, with which we began, of why the war started: either it’s about protecting the eastern Ukrainian breakaways, or it’s about NATO enlargement. I think the data presented here supports the latter view: in light of the large amount of space devoted to discussing places other than Ukraine, most notably the US and Europe and its political bodies, the newswire data suggests that Russian attention is devoted more to the supposed threat Ukraine poses in terms of NATO enlargement than to the liberation of the Ukrainian east. As for the Ukrainian data, I think there’s less to be gotten from it: their country is under attack and they’re documenting it.

Let me end with some shortcomings to address in possible future work. It would be good to move beyond the coloured words and do some counting: how much, in percentage terms, do Ukrainian place names make up of the Ukrainian data, or does the USA make up of the Russian data? Are there any cross-temporal patterns the current approach might have missed?
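That counting could be sketched like so, assuming each frequent name has been hand-labelled with a category; all names, labels, and counts here are hypothetical:

```python
# Given hand-assigned category labels for the frequent names,
# compute each category's percentage share of total mentions.
labels = {"Мари": "ua_place", "Харк": "ua_place",
          "США": "west", "НАТО": "west"}
counts = {"Мари": 40, "Харк": 25, "США": 30, "НАТО": 20}

total = sum(counts.values())
shares = {}
for word, n in counts.items():
    cat = labels[word]
    shares[cat] = shares.get(cat, 0) + n

percentages = {cat: round(100 * n / total, 1) for cat, n in shares.items()}
print(percentages)  # {'ua_place': 56.5, 'west': 43.5}
```

The hand-labelling is the part that, as noted above, can’t easily be automated.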

In addition, it would be good to consider the whole three months; it would be especially good to get a third-party channel, say for France or Germany or the US, to see what their data looks like. (The problems with this are that (i) western news agencies don’t use Telegram for the most part, and (ii) given the war is only one among many topics of interest for not-directly-involved countries, there would be a lot of irrelevant results (about primaries in the US or the recent presidential election in France, say). Together these facts mean you wouldn’t be comparing like with like.)

And it would also be good to check that I haven’t messed up either my coding (data retrieval) or my coding (colouring in words), or rather that I haven’t messed them up catastrophically.
