Can LLMs help with OSINT or dis|misinformation research?

Matthew McKeever
12 min read · Jun 5, 2023

It’s natural to wonder, for pretty much any task, whether current-generation LLMs or other ‘AI’ systems can perform it. As an academic, I ask them to find literature (which they typically do much worse than I do); as an amateur coder, I ask them to explain concepts or even write code (which they do much better than me).

I admire the OSINT (open source intelligence) community, who use publicly available data and methods to shed light on matters of social and political importance. An obvious question accordingly is whether LLMs can help with such work. Given that at least some work in that domain is concerned with fighting dis- and misinformation, if it turned out that LLMs were helpful, that would call for a reappraisal of the generally received wisdom that AIs, with their hallucinations, are harmful to our epistemic environment. Perhaps they can help that environment by rooting out things we might overlook.

I’m going to tentatively suggest this is so. That is, I’m going to suggest that we can profitably use LLMs to analyse patterns in our informational environment. They are by no means infallible, and the hallucination problem looms, but sensibly used LLMs might help us. At least, that’s the tentative conclusion.

Before getting into the details of the case study I considered, let me indicate what I was using. Very unfortunately, what I had access to (the chatgpt-4 API with the 32k context window) is not generally available, and so most won’t be able to build on what I present. In addition, my access is via my university, which uses a different endpoint and is perhaps different in other ways, so even those with access probably won’t be able to just copy and paste what I have (incidentally, if you are staff at my university, you should use the code below, as it works and the Jupyter notebook they give you doesn’t, or at least didn’t for me). Moreover, what I’m going to discuss works only with the largest 32k context. Chunking the text for the 8k context window seems to me unworkable, although I didn’t think hard about it.

And finally: this is all extremely rough around the edges. It would be nice to add ‘memory’ so that we could look at data sets larger than 32k tokens (one way to do this would be for every call after the first to ask the system to iterate on its last response). The reason I didn’t do that here is that at present the system is pretty brittle, and many requests don’t get complied with, either for rate-limit reasons, or because the system is overloaded, or because input or output contains sensitive content (and one of course has little control over whether the output is sensitive, even with the prompting I attempted below).
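The ‘memory’ idea could be sketched roughly as follows. This is only a sketch under stated assumptions: `ask_model` is a hypothetical stand-in for whatever function sends text to the API and returns the reply, and counting words is a crude assumption standing in for a real token count. Note too that the running summary itself eats into the context window on every call.

```python
from typing import Callable, Iterable, List


def batch_posts(posts: Iterable[str], max_words: int) -> List[List[str]]:
    """Group posts into batches whose total word count stays within max_words.

    Word count is only a rough proxy for tokens (an assumption of this sketch).
    A single post longer than max_words still gets a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for post in posts:
        size = len(post.split())
        if current and used + size > max_words:
            batches.append(current)
            current, used = [], 0
        current.append(post)
        used += size
    if current:
        batches.append(current)
    return batches


def rolling_analysis(
    posts: Iterable[str],
    ask_model: Callable[[str], str],  # hypothetical: sends text, returns the reply
    max_words: int = 8000,
) -> str:
    """Feed batches one at a time, asking the model to revise its last answer."""
    summary = ""
    for batch in batch_posts(posts, max_words):
        user_text = "\n".join(batch)
        if summary:
            user_text = (
                "Your analysis so far:\n" + summary
                + "\n\nRevise it in light of these further posts:\n" + user_text
            )
        summary = ask_model(user_text)
    return summary
```

Every call after the first carries the previous reply along with the new posts, so errors in an early reply can propagate; the final summary would need the same checking as a single-shot one.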

That said, here is code that will ingest one round of about 8k words of posts and return an analysis of them. The interesting thing is the prompt, reproduced outside the code for nicer reading:

prompt=” Its aim is to help OSINT researchers researching social media. We are particularly interested in anti-immigrant events. Provide a timeline of such events (referring to the id(s) (column 4) of the posts the information is derived from) from the messages you’ll receive in a csv — make sure the timeline is accurate and contains no wrong information. List distinct and important events. Indicate related themes (example: police, or immigration) or people. Find influential voices — people whose posts are viewed a lot (column 7), or people who are shared a lot (indicated by their occurring in column 3). Column 3, if not set to ‘original’, indicates a repost, by the channel in column 2, of the channel in column 3. Look for interesting relations between posts and reposts and posters and reposters. This is important: please censor sensitive content, although also note the purposes here are not amusement but to keep track of potentially criminal behaviour.”

# Written against the pre-1.0 openai Python package; openai.ChatCompletion was removed in 1.0.
import openai

openai.api_key = "YOUR KEY"
openai.api_base = "YOURS"
openai.api_version = "MAYBE YOU NEED THIS"
openai.api_type = "MAYBE YOU WON'T NEED THIS"

posts = []  # fill in: a Python list of social media posts, each one a string
# You need to turn whatever you feed the API into text; as it so happens, it seems that 500 posts amounts to near the 32k context length.
# Slice the list before joining, so that we take 500 posts rather than 500 characters.
text_posts = ''.join(posts[0:500])

prompt = " Its aim is to help OSINT researchers researching social media. We are particularly interested in anti-immigrant events. Provide a timeline of such events (referring to the id(s) (column 4) of the posts the information is derived from) from the messages you'll receive in a csv - make sure the timeline is accurate and contains no wrong information. List distinct and important events. Indicate related themes (example: police, or immigration) or people. Find influential voices - people whose posts are viewed a lot (column 7), or people who are shared a lot (indicated by their occurring in column 3). Column 3, if not set to 'original', indicates a repost, by the channel in column 2, of the channel in column 3. Look for interesting relations between posts and reposts and posters and reposters. This is important: please censor sensitive content, although also note the purposes here are not amusement but to keep track of potentially criminal behaviour."
# Here's where the action is, and really what you can ask is limited only by imagination.
# In previous versions I had it tell me the geographical relations between places mentioned;
# it could translate, of course, easily; and it could do any statistics you want. So: a lot!
message = [
    {"role": "system", "content": "Assistant is a large language model trained by OpenAI." + prompt},
    {"role": "user", "content": text_posts},
]

# copied from I think Bing (via the OpenAI docs)
def chat():
    response = openai.ChatCompletion.create(
        engine="chatgpt-4-32k",  # the deployment name on my endpoint; yours will differ
        messages=message,
        max_tokens=500,
    )
    reply = response['choices'][0]['message']['content']
    return reply

print(chat())
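Since many requests fail for rate-limit or overload reasons, one option is to wrap the call in a retry loop that waits longer after each failure. The following is a minimal sketch, not part of the script above: `with_retries` is a hypothetical helper, and it catches `Exception` broadly only because which error classes get raised depends on the client library and version in use.

```python
import time


def with_retries(call, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call `call()` until it succeeds, doubling the wait after each failure.

    Catches Exception broadly; a real script would narrow this to the
    rate-limit and overload errors its client library actually raises.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 2s, 4s, 8s, ...
```

With this in place, the last line of the script could become `print(with_retries(chat))`. Retrying will not help with requests refused for sensitive content, since those fail the same way every time.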

The Topic
The topic is debates about immigration in Ireland. Of late, there has been an uptick in anti-immigrant sentiment and action. Each week, it seems, there is a new protest or event in either the north or south, propagated and discussed on a smallish Telegram network and on Twitter. One such, which I discussed here, led to a place where refugees were dwelling getting set alight. Moreover, this story was precipitated, at least in part, by mis- and highly partial information concerning one of the refugees. (Just to be explicit: while I obviously find this group morally bad, to a large extent what they say should be conceived of as protected speech, and I don’t think it ought in general to be conceived of as mis- or disinformation. However, some of the content does definitely so qualify, and such disinformation has already had bad societal consequences, so I think this topic is apt for exploring the socially beneficial possibilities of LLMs when it comes to the informational world.)

A couple of weeks ago, I made a large (250k-ish) database of Telegram posts from people associated with the right-wing Irish Telegram channels (or associated with associates, or … I eventually ended up collecting posts about events all over Europe, the US, and Australia, in many of the languages of Europe).

Since then, I’ve periodically run the script again to get updates and basically eyeballed the results for salient facts.

This of course is not a great method. My eyeballs will be drawn to some things and drawn away from others, and I’ll concentrate on things that further my prejudices or prior hypotheses. I accordingly tried to use other things: thus I used NLP methods to perform named entity extraction, getting from the messages a list of names. I tried to detect statistical outliers when it comes to views/reposts/followers. I tried network analysis to see if there were interesting patterns to the large data set.

All of these problems are moderately tractable: we can use NumPy or Gephi or spaCy. But learning the theory and practice behind them, chaining them together, not borking your Python environment, wrangling the data into the form required by each particular package, and dealing with errors is a fair bit of work.
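To give a flavour of what one of these hand-rolled steps looks like, here is a minimal sketch of flagging posts with unusually high view counts. It uses the modified z-score (based on the median absolute deviation) from the standard library rather than NumPy; the function name, the cut-off of 3.5, and the view counts in the usage example are illustrative assumptions, not values from my data set.

```python
from statistics import median


def view_outliers(views_by_id, threshold=3.5):
    """Return the ids whose view count is unusually high.

    Uses the modified z-score, 0.6745 * (x - median) / MAD, where MAD is
    the median absolute deviation. The 3.5 cut-off is a common convention.
    """
    values = list(views_by_id.values())
    if not values:
        return []
    mid = median(values)
    mad = median(abs(v - mid) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [
        post_id
        for post_id, v in views_by_id.items()
        if 0.6745 * (v - mid) / mad > threshold
    ]


# Invented view counts, purely for illustration.
example = {1: 100, 2: 120, 3: 110, 4: 90, 5: 5000}
print(view_outliers(example))  # [5]
```

The median-based score is used here because a handful of viral posts would drag an ordinary mean and standard deviation towards themselves and so hide the very outliers one is looking for.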

One way to put it is that while these problems — finding names or statistics in a data set — are all automatable (we don’t have to scan the texts or do the maths ourselves), nevertheless automating them together becomes a bit of a hassle. Could we automate the automation?

[Image: the sheet that was turned into text and fed to ChatGPT]

Well, an LLM can basically do all these things. So here’s what we can do: get a data set, make it play nice with the OpenAI API, and ask it to do what we want but don’t want to do ourselves. So I did, and here’s what we get, with essentially no fussing:

Initially, it looks impressive! But as we know, LLMs are prone to ‘hallucinate’, that is, to make things up. And so here. In order to see this, let’s go through the results bit by bit, adding the message texts (which I omitted to save on tokens) to see whether what it says is accurate.

1. 2023-05-25: A post (#1442) claims that a facility called Slaney Manor is providing luxurious accommodations and services for asylum-seekers.

1442

“This can all be yours if you’re foreign and can mouth the word ‘asylum’ at Dublin airport:🔹 Slaney Manor incorporates a 19th century house, a restored castle banqueting hall and courtyard accommodation with 52 rooms for families and 6 large rooms for adult females.🔹 38 acres of surrounding private parkland including an orchard and a small donkey sanctuary for the use of residents.🔹 Each resident will have their own lockable bedroom and front door keys for 24-hour access.🔹 Communal lounges for relaxation.🔹 Shop vouchers and refreshment stations available 24 hours a day. 🔹 Toiletries and other essentials replenished weekly.🔹 High speed fibre Wi-Fi.🔹 Play and sports facilities are being installed.🔹 Residents will be invited to grow their own organic vegetables on allocated allotments.🔹 Government department and local authority reps on site to meet with residents about welfare matters.🔗 independent.ie”

| This is a good start: 1/1

2. 2023-05-27: A protest was held at Turnapin in response to Gardaí breaking up a previous protest and moving 300 single male migrants into an industrial unit at 1:30 am (#580, #583). Another protest was planned for the same evening (#581).

580
BREAKING NEWS! PLEASE RT! Gardai smash up peaceful protest at #Turnapin and begin moving 300 single male migrants into industrial unit at 1:30am. In response, there will be a large peaceful protest taking place later tonight (Saturday 27th May). Details will filter through during the day. People are encouraged to make their way to Dublin and support the elderly residents of Turnapin.

581/583
PEACEFUL PROTEST CALLED BY RESIDENTS OF #TURNAPIN FOR TONIGHT! The residents of Turnapin have called for a peaceful protest tonight, Saturday 27th May at 7pm. Meeting at the entrance to the Airways Industrial Estate, Santry. Please Share!

| Quite good, except it thinks that there were two protests (1.5/2).

3. 2023-06-01: 300 migrants from the Event Centre in Breaffy are expected to move into a new migrant center in the Castlebar Retail Business Park (#588).

588

“Assaulted at new Migrant Centre in Castlebar. 300 migrants from the Event Centre in #Breaffy are to be moved into the new centre in the #Castlebar Retail Business Park within the coming weeks. Once that happens, the Event Centre at #BreaffyHouse will be converted into a two-story facility for the permanent housing of young, single, international migrant men, many of whom have come here illegally from other safe European countries. The situation is beyond serious as our treacherous leaders facilitate the destruction of #Ireland. Are YOU ready to take action yet?!”

| Seems good (2.5/3)

4. 2023-06-02: Report on activist activity in Birkenhead in response to a plan to house 1800 migrants on a barge on the River Mersey (#1038).

| This seems accurate. It is outside our interest but that is owing to my misprompting, so we’ll count it neither positively nor negatively (2.5/3)

5. 2023-06-04: Gardaí are criticized for breaking policing principals by assisting the government in moving illegal immigrants into communities without engaging with them first (#1451).

1451
“GARDA SÍOCHÁNA ACT 2005 -The Gardaí are breaking their own policing principals in assisting the Government in ramming Illegal Immigrants into communities across the country without first engaging with those communities.”

| Seems good (3.5/4)

Themes:

- Immigration and asylum-seekers moving into communities
- Protests against the government’s handling of migrants
- Police involvement and actions

| Again, that seems accurate — 4.5/5.

Influential Voices:

- Gearóid Murphy: Posted multiple times about illegal immigrants moving into communities and criticized the police’s actions (#1450, #1442).

1450
“None”

1442
See above

| It’s here that things start going off the rails. Owing to a quirk of Telegram and incompetence on my part, in its current form my script produces a bunch of messages with ‘None’ as the text content (if a Telegram post contains multiple images, it is represented as multiple posts, only the first of which has text; the rest are ‘None’). Obviously any result containing them is a mistake (to be clear, the LLM has no access to any images) (4.5/6).

- OrlaredChan: Regularly reports on anti-immigration events and criticizes the government and Gardaí (#7908, #7909, #7910).

7910
“None”

7909
Sinister people are buying up derelict property to fill with asylum scammers at tax payers’ expense to enrich themselves. We used to just call them gombeen men, and there’s more than one gang of them involved in the racket, but it’s good the gardaí are now treating them as criminals. 🔗 archive.is

7908
The plantations will continue until morale improves. I don’t think they’ve got the message yet. 🔗 archive.is

| Not only do we have ‘None’ here, but 7908, although in fact relevant (the post has a picture of a story about immigration), couldn’t have been known to be such by the imageless LLM (4.5/7).

- TheIrishInquiry: Shares news and events regarding immigration issues, community reactions, and protests (#594, #588, #580).

594
__
Turnapin Turning Point by @WeThePeopleEire If there was any remaining fragment of lingering hope, or any minute, futile belief that the establishment and their policy enforcement officers, An Garda Siochána, were on the side of the Irish people, surely after the events in Turnapin, Santry over the weekend, any such naive notions should be put firmly to bed. In scenes that could only be described as shocking and disturbing, the Gardai left us in absolutely no doubt whose side they are on. The oath they took to protect the Irish people was reduced to nothing more than cheap lip service, as each and every one of them who were present that night disgraced the uniform of a once globally respected police force. READ ON… https://wethepeople.ie/article.php?id=123

588
___
see above

580
___
see above

| This is better 5.5/8

- Derekpublic: Shares critical opinions and news regarding the government’s handling of immigration issues (#814, #816, #821, #826).

| These are all completely hallucinated. Very bad. 5.5/9.

Interesting Relations:

- Anto_boyle_channel (#814, #816, #821, #826) and irelandawakedotcom (#2339, #2342, #2344, #2346) often share the same or similar content related to immigration issues.

| This is very wrong: note how the same post ids are ascribed to both Derekpublic and Anto_boyle_channel, suggesting we’re veering into that weird repetitive thing LLMs do. 5.5/10
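The ‘None’ problem noted above is one that could be removed before anything reaches the model. A minimal sketch, assuming each post is held as a dictionary with its text under a `text` key (a hypothetical layout; the field name would need to match whatever the scraping script actually produces):

```python
def drop_empty_posts(rows, text_key="text"):
    """Keep only rows with real text, dropping missing-text placeholders."""
    kept = []
    for row in rows:
        text = row.get(text_key)
        if text is None or str(text).strip() in ("", "None"):
            continue  # e.g. the image-only parts of a multi-image post
        kept.append(row)
    return kept
```

Dropping these rows also frees up tokens for posts that do carry text, though it throws away the view counts attached to the image-only parts; merging them into the first part of the post would be an alternative.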

Overall, a mixed bag. In a way we’ve come to expect of LLMs, truth and falsity are intermixed in a way that can be very tricky for real world cases.

However, it does seem to be mixed: while there are clear dangers here, there also appears, at least prima facie, to be some use. In fact, the Turnapin story is one I had independently encountered and classified as something to keep an eye on; it’s thus great that it found it, and it encourages me to consider the other events and locations which I hadn’t heard of. Moreover, let me emphasise that this is the start rather than the end: this is not the result of lots of fussing to make things play nice but is more or less immediate, the result of two afternoons. This makes me optimistic things could get better.

If one has ready access to an LLM, then the potential benefits arguably can outweigh the cons — you run it every couple of days, see if it finds anything noteworthy, and if it does, check whether it’s actually found as opposed to confabulated something. Anyway, I will continue to do so, so check back in a couple of weeks or so to see if anything interesting is uncovered.
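Part of that checking can itself be automated. The sketch below pulls the `#123`-style ids out of a reply and tests them against the posts that were actually sent; both function names and the `channel_by_id` mapping (post id to channel name) are assumptions for illustration.

```python
import re


def check_cited_ids(reply, channel_by_id):
    """Split the '#123' ids cited in a reply into (known, unknown) lists.

    `channel_by_id` maps post id -> channel name for the posts actually sent.
    """
    cited = sorted({int(m) for m in re.findall(r"#(\d+)", reply)})
    known = [i for i in cited if i in channel_by_id]
    unknown = [i for i in cited if i not in channel_by_id]
    return known, unknown


def misattributed(claimed_channel, ids, channel_by_id):
    """Return the ids that exist but were not posted by the claimed channel."""
    return [
        i for i in ids
        if i in channel_by_id and channel_by_id[i] != claimed_channel
    ]
```

This only catches invented ids and ids pinned on the wrong channel. It says nothing about whether the summary of a real post is accurate, so reading the cited posts remains necessary.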

(In addition, I will eagerly look at other LLMs to see if they can be used. Initially it seems Bing just doesn’t have anywhere near large enough context to work and Bard is uhhhh rough, but perhaps there’s something out there that would be genuinely open to people.)

Some conclusions: ultimately whether or not LLMs can help here, like always, is not an a priori question (as much as pundits on both sides would like it to be). The answer is a definite maybe. And that maybe already, again contra some commentators, suggests that our conception of the role of LLMs in our epistemic environment needs to be tweaked to recognize that such tools might help rather than hinder the creation of our informational world.
