An OS tool for storing, searching, and displaying news wires

Matthew McKeever
May 16, 2022

(Update, a few days later: don’t use this. I claim three benefits for it, but i) the first is moot, as Telegram — unbeknownst to me at the time of writing! — lets you export in a computer-readable format; ii) the lemmatizer, as I note in the text, is imperfect, especially concerning the most important items for this project, so not needed. Together, iii) that means the Python part is mostly pointless. The HTML part remains useful, but a better workflow is: user sends JSON data to a web app ==> the graphing library runs in the user’s browser. This is a no-code solution that is acceptable — I think — from a copyright point of view, so wins overall. So I’ll do it now and upload under a new — and more riparianly appropriate — name.)

My aim here is to introduce a couple of tools for monitoring media channels. I do so inspired by two pretty different sorts of role model. On the one hand, and as with many of my geriatric millennial generation, I’m inspired by Noam Chomsky and his often repeated thought that one can learn about propaganda simply by reading the means by which it is propagated, such as newspapers and government reports. The second is the OSINT movement and groups such as Bellingcat that use tools like Google Maps to geolocate objects, or the darkweb to buy flight manifests, enabling stories about people and places to be tested, with sometimes spectacular results (for which see e.g. the recent documentary Navalny, currently on BBC iPlayer). I admire both a lot, and want to try and emulate them.

In more detail, what I present here is a way of storing, comparing, and nicely presenting the output of the state news agencies of both Russia and Ukraine. It should be said that the case study I’ll be considering is pretty niche: you need to be able to read both Russian and Ukrainian to benefit. However, one of the main components (the nicely presenting one) requires only one language, and — more importantly — there’s nothing to prevent this framework from working for other languages (though i) I haven’t tried at all, and ii) the use of Telegram (see below) as a media channel is somewhat limited outside Russia and nearby places).

While discussing caveats: it’s in progress, buggy as hell, and won’t fail informatively. I would very much like to post a version (including parsed data; see below) so no-code people could use it, but I assume copyright law prevents me from hosting others’ (the agencies’) data, even in the RSS-sized chunks that presumably count as fair use. If anyone can correct me on this, please do! But absent that, you’ll need to be able to deal with setting up a Python environment and making a couple of (tiny but non-optional) changes to the code in places. So it unfortunately requires a bit of computer literacy. I’m very happy to give time to good causes that want to use this but maybe don’t want to deal with my lousy set-up.

I’ll treat the three components separately and in turn.

Storing (comparing, nicely presenting)

As to storing: the first thing we want is a nice way of capturing historical stories easily. It’s difficult — in my experience — to scrape the web to do so. It’s easy, however, to export the data from the messaging service Telegram. As it happens, both the agencies I’ll be using (TASS for Russia and Ukrinform for Ukraine) have Telegram accounts.

It’s trivial to download history from Telegram: open the app on your desktop, go to channel>export history, select the dates (uncheck everything, at least for our purposes — we only want text), then wait. If you’re doing it for the first time, you’ll have to wait 24 hours and sign in across devices.

While it’s easy to download the last n months’ stories from a news agency (or paper, or politician, etc.) on Telegram, getting the data into a workable format takes a bit of work.

There are a couple of reasons for this. One basic one is that what Telegram gives you is a collection of html files in a zip, as opposed to a more pliable format like a spreadsheet, or something a programming language can understand like JSON. The first, relatively trivial, thing that Volga does — which is what I call my tools, both for the big river and because it kind of sounds like a word in the title of a work by Dante about language — is generate, from the html files, a Python dict / JavaScript object, where the keys are the date/time stamps the stories were posted at and the values the stories themselves. We can then deal with the data computationally more readily.
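As a rough illustration of that first step, here is a toy sketch (not Volga’s actual code) of turning export html into a timestamp-keyed dict; the attribute and class names are assumptions about how a Telegram export is laid out, and real exports may differ:

```python
import re

def parse_export(html):
    # Toy parser: grab each message's title="..." timestamp and the body of
    # the following <div class="text">, then strip any remaining tags.
    pattern = re.compile(r'title="([^"]+)".*?<div class="text">(.*?)</div>', re.S)
    return {ts: re.sub(r"<[^>]+>", " ", body).strip()
            for ts, body in pattern.findall(html)}

sample = """
<div class="message"><div class="date" title="16.05.2022 09:00:00">09:00</div>
<div class="text">First story text</div></div>
<div class="message"><div class="date" title="16.05.2022 10:30:00">10:30</div>
<div class="text">Second story text</div></div>
"""
stories = parse_export(sample)
```

A regex is fragile for real html, of course; a proper parser would be the right tool, but the shape of the output — dates to stories — is the point here.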

(Update: I embarrassingly misnamed it. I was looking for the river running through both Russia and Ukraine, so should have called it Dnieper, though that lacks the Dante pun. Maybe I should just call it quits and call it ‘Kenneth’.)

The less basic reason why it’s not easy to readily search the history is more involved. Russian and Ukrainian are (without getting too much into technical linguistic speak, where I’m on shaky ground) languages in which the same word, considered as a meaningful item, can appear in very many different forms.

We can see this with an English example. Imagine you were super interested in stories about the singing activities of a given people. You wanted to know all about it. Then you’d be interested in stories with these headlines:

  • A philosopher sang ‘Let It Be’ in Trafalgar Square last night
  • Leading entertainers will sing ‘Let It Be’ tomorrow night
  • Singing drunkenly, a passenger caused a delay at rush hour
  • The queen has sung the anthem

Here’s the problem: what do you search for to capture each of the headlines? It can’t be the word ‘sang’, ‘sing’, ‘singing’, or ‘sung’. In each case, that’ll only get one of the results. Maybe you could use disjunctions: sang or sing or singing or sung. But that’s a bit of a pain.

The problem is much more drastic for languages like Russian and Ukrainian. A given Russian or Ukrainian noun (verbs are a bit easier, ish, kind of) can have roughly 15 forms. Unless you want to disjoin 15 words every time you search a term, we need something better.

Luckily there is something better, and this is the second thing Volga does. We’ll use what’s called a lemmatizer to generate basic forms of a word. The basic form might be — excuse the slight jargon — the first person singular indicative, or the infinitive (for verbs), or the nominative masculine singular (for nouns). The lemmatizer I use (this one) is quick and imperfect, as we’ll see below. In fact, to get around the lemmatization problem, especially with names, you’ll have to resort to using stem forms (such as Мариупол and Маріупол, for Russian and Ukrainian respectively). The lemmatizer should work reasonably for most non-name items, and stems are mostly easy to guess for names, so I don’t think this is a fatal problem, but it’s definitely a bummer.
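To make the idea concrete, here is a toy English version of lemma-based search; the lemma table is handmade for illustration and stands in for a real lemmatizer such as simplemma:

```python
# Handmade lemma table: every inflected form maps to one basic form.
LEMMAS = {"sang": "sing", "sung": "sing", "singing": "sing", "sings": "sing"}

def lemmatize(token):
    # Fall back to the lowercased token when we have no lemma for it.
    return LEMMAS.get(token.lower(), token.lower())

headlines = [
    "A philosopher sang Let It Be in Trafalgar Square last night",
    "Leading entertainers will sing Let It Be tomorrow night",
    "Singing drunkenly a passenger caused a delay at rush hour",
    "The queen has sung the anthem",
]
# One search term now matches every inflected form: no disjunctions needed.
hits = [h for h in headlines if "sing" in {lemmatize(w) for w in h.split()}]
```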

The function that turns the html files into a computer-readable format is batchwrite() in volga.py. It takes an array of html files that you want to read, and an array of files you’ll write the data to. You have to specify these manually. Take a look below at a test file, included in the repo as ‘test.py’:

import volga

# This script reads and lemmatizes one month's worth of Telegram messages.
# It does that in part 1. You specify the input files as the files array;
# you also indicate some files to write intermediate outputs to.
# Strictly speaking, this part of the process could be avoided and will be changed later.
# You also indicate the language you're working with, which will be Russian or Ukrainian.
# In theory, though, this should work for *any* language the Python library simplemma
# works for, which is quite a lot.
# Presupposing you've already read and lemmatized one other month's worth, we move on
# to part 2. Here you compare an array of words from one language with those of another,
# hopefully to discern patterns in how the differing news wires talk. I haven't done
# much experimentation with this yet, but I hope it will be useful.
# The included index.html gives you access to the second functionality of volga.

# part 1
files = ["messages.html", "messages2.html", "messages3.html", "messages4.html",
         "messages5.html", "messages6.html", "messages7.html", "messages8.html",
         "messages9.html"]
files2 = ["m.txt", "m2.txt", "m3.txt", "m4.txt", "m5.txt", "m6.txt", "m7.txt",
          "m8.txt", "m9.txt"]
volga.lemmatize(volga.batchwrite(files, files2, "513_raw.txt"), "513_lemmatized.txt", "ru")

# part 2
volga.places(["Мариуполь"], ["Маріуполь"], "513_lemmatized.txt", "513_lemmatized_ua.txt")

The thing to note is that, once you’ve got your archive, you’ll need to list its html files in the files array. The above presupposes they’re in the same directory as test.py; otherwise, the names in the files array should include the path. You also need to make another array called files2 which stores output files. (You won’t actually use them, so I should probably delete this part, but if I do everything I should do I’ll never post this.)

You don’t necessarily need to use batchwrite(). If you look at test.py, you’ll see that in calling the lemmatizer (volga.lemmatize()), we feed it what batchwrite() returns, which is a dict of all the stories. The lemmatizer, well, lemmatizes, and takes three arguments: a dict of unlemmatized stories, an output file for the lemmatized dict, and a flag for the language that you set manually, either as ‘ru’ or ‘uk’.

(Storing,) Comparing(, Nicely Presenting)

The second main thing one can do is compare occurrences of words across the two data sets. You might be curious, for example, whether TASS talks about, say, Mariupol more than Ukrinform does. You can easily check this by getting the Russian and Ukrainian words, and the lemmatized texts (the ones lemmatize() writes) for each language, and calling places() as again you see in the example. places() takes, respectively, an array of Russian words, an array of Ukrainian words, and the two lemmatized files of the Russian and Ukrainian news, which will have been created by two uses of lemmatize(). (It works best with just word-vs-word comparison — too many words and the graph gets hard to read.)
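In spirit, the counting side of this amounts to something like the following hypothetical sketch (not the repo’s code; the “DD.MM.YYYY HH:MM” timestamp format is an assumption, and the real places() also draws the graph):

```python
from collections import Counter

def count_stem(stem, stories):
    # Count daily occurrences of a stem in a {timestamp: lemmatized story}
    # dict, using substring matching so declined forms still count.
    daily = Counter()
    for ts, text in stories.items():
        day = ts.split()[0]  # "24.03.2022 09:00" -> "24.03.2022"
        daily[day] += sum(stem in word for word in text.split())
    return dict(daily)

ru = {"24.03.2022 09:00": "войска вошли в мариуполь",
      "25.03.2022 10:00": "бои за мариуполь продолжаются"}
counts = count_stem("мариупол", ru)
```

Two such per-day tallies, one per wire, are then what gets plotted against each other.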

Let’s try it, then, as per the above, with the Russian and Ukrainian words for Mariupol, and the data set, in each case, March 2022:

Blue is Russian, dotted red Ukrainian

Note that this seems super weird! Russian media really didn’t mention Mariupol after the 25th? That seems wrong! And it is: it’s the problem with the lemmatizer mentioned above, namely that it doesn’t recognize declined forms of Mariupol as the same word.

So, we need to cheat a bit. The way things are set up, we can search for substrings, so if we search for the stem of the word (Мариупол, Маріупол) we get:

That looks more right. Let me take the opportunity to note that the vertical axis simply counts occurrences of a word. This means we’re comparing apples with oranges, as TASS publishes roughly twice as much as Ukrinform. But that’s fixable, and anyway I don’t think it’s a massive problem, provided you’re aware of it.
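One simple fix, for what it’s worth, would be to normalize by each wire’s output — say, occurrences per 100 published stories. A hypothetical helper (not in the repo):

```python
def per_hundred_stories(counts, total_stories):
    # Turn raw daily counts into occurrences per 100 published stories,
    # so wires of different volumes can be compared on one axis.
    return {day: 100 * n / total_stories for day, n in counts.items()}

tass_counts = {"24.03.2022": 8, "25.03.2022": 4}
normalized = per_hundred_stories(tass_counts, total_stories=400)
```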

Lemmatizing issues notwithstanding, once one has recourse to such tricks as searching stems, this is pretty useful, and for a range of terms (Putin, Donbas(s), Zelenskyy, etc.) should yield helpful comparisons. I say ‘should’ because I haven’t really explored much so far.

(Storing, Comparing,) Nicely Presenting

It’s perhaps in this third bit that we get the most advantage. In a separate html file called volga.html, you can make interactive graphs of single word/stem occurrences. What that will do is give you a graph with the date along the bottom and the number of occurrences of the word on that date on the vertical. Then — and this is the useful bit — by clicking on the graph, you can bring up all the stories (in readable, not lemmatized, form) containing the stem from that day. This lets you read, in context, the narrative a given media agency was presenting about the topic on the day in question. For example:

On the left, you have a graph recording the frequency of the keyword, here Mariupol (in Russian). On the right, having clicked a point in the graph, you get a list of all the stories (in Russian), in readable, not lemmatized, format, that contain the term.

That’s useful! You can, with a click, jump back in time to see what the Russian state news was saying about a topic. And you can compare that with what Ukraine was saying. It’s noteworthy, for example, that the Ukrainian wire seems more active, despite its smaller volume:

The graph for Mariupol in Ukrainian, with articles in Ukrainian
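Under the hood, the click-to-stories step is just a filter over the same timestamp-keyed dict. The real logic lives in volga.html’s JavaScript; this is merely the idea sketched in Python:

```python
def stories_on(day, stories):
    # Collect every raw (unlemmatized) story whose timestamp starts with
    # the chosen day -- i.e. what clicking a point on the graph shows.
    return [text for ts, text in stories.items() if ts.startswith(day)]

ru = {"24.03.2022 09:00": "story one",
      "24.03.2022 18:00": "story two",
      "25.03.2022 10:00": "story three"}
clicked = stories_on("24.03.2022", ru)
```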

Chomsky says we can learn about propaganda simply by reading it. As someone interested in political rhetoric both as an academic and a human, I hope that this tool will help people ‘read’ the vast quantities of media we’re flooded with every day, and hope it will help visualize things we might otherwise miss.

The github repo is here: https://github.com/mittmattmutt/volga. The md file contains some important info you need to get it to work. There’s a 100% chance there are problems — I hope they don’t stop it working. If so, please let me know!
