Written by Dave Aitel | Acknowledgment: William Grayson Hilliard - AppGate Data Engineering on August 08, 2018
Data Carving the Internet Research Agency Tweets
Recently FiveThirtyEight released on GitHub a set of CSV files which contained the downloaded tweets of all of the now suspended accounts of the Russian Internet Research Agency Information Operations team.
This data was originally gathered and analyzed by two Clemson researchers, Darren L. Linvill and Patrick L. Warren. Their methodology can be summarized as follows: gather the tweet data from Salesforce's Social Studio app (which has access to the Twitter firehose), manually label the roughly 3,800 accounts into various buckets, and then check with a friend to see whether those buckets made sense. They also filtered out data from accounts that did not seem to be connected to the Internet Research Agency (IREA).
They also unfortunately removed accounts which did not tweet predominantly in English.
This left ~1.8M tweets from 1311 accounts. They then loaded this data into STATA to do some basic statistical analysis. This data set spans from 2012 to 2018, with the vast majority of tweets from 2015 to 2017.
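For readers who want to poke at the raw CSVs themselves, a minimal sketch of this kind of basic per-account, per-year counting is below. The column names (`author`, `content`, `publish_date`) match the published FiveThirtyEight files; the sample rows here are synthetic stand-ins, not real data.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Synthetic stand-in for the FiveThirtyEight CSVs; the real files include
# (among other columns) `author`, `content`, and `publish_date`.
sample = io.StringIO(
    "author,content,publish_date\n"
    "TEN_GOP,Some tweet text,10/6/2016 14:00\n"
    "TEN_GOP,Another tweet,10/7/2016 9:30\n"
    "WOKELUISA,A third tweet,3/1/2017 12:00\n"
)

def tweets_per_account_year(csv_file):
    """Count tweets per (account, year) from a troll-tweet CSV."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        year = datetime.strptime(row["publish_date"], "%m/%d/%Y %H:%M").year
        counts[(row["author"], year)] += 1
    return counts

counts = tweets_per_account_year(sample)
print(counts[("TEN_GOP", 2016)])  # 2
```

The same loop scales to the full ~1.8M-tweet set, since it streams rows rather than loading everything into memory.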
More information about their work is available here:
Their stated goal with this research was to answer the following questions.
- Can IREA Twitter handles employed between June 19, 2015 and December 31, 2017 be categorized by their content into discrete types, and if so, what characterizes those types?
- If so, are differing types of Twitter handles employed by the IREA in ways that are different from one another?
Issues with their Methodology
With this many tweets it is nearly impossible to manually go through and perform filtering. The bucketing system they used for manual labeling is highly subjective, to say the least. Likewise, the tools used (STATA especially) are not tuned for this sort of natural language data, and ignore some of its more interesting facets.
In particular, the authors of the original paper by necessity are mapping their own preconditioned understanding of the US political dimensions to the Russian IO effort. While these may be valid (in that they are essentially "Right" and "Left"), a better analysis needs to come from more natural groupings of the data, ideally without subjective taint. Otherwise we are perilously close to auto-ethnography.
There's no easy way to get “clean” data from a large data set like this. Removing tweets can tune the data in unexpected ways, and some of the more interesting questions can only be answered by looking at tweets that are targeted TO the IREA trolls, since those indicate either uncaptured trolls or members of the public who have started to join their community. An analyst is constantly wondering whether what they are seeing is a “trend” or a natural result of how the data was shaped and filtered before it reached the visualization and analysis phase.
A major confounding factor with this analysis is that we don’t know how the original handles were discovered. The data in the set may lean strongly towards pushing a certain political agenda, or have a set of hashtags or themes, but that may also be how it was found by Twitter engineers in the first place! It’s possible there is an equally large set of tweets that are Pro-Hillary that simply did not end up tagged as part of the Russian IREA IO effort and hence are not in this data set.
Other Recent Work
The Duo Security R&D team put together a paper on using statistical classifiers to automatically find botnets:
They used mostly statistical features to identify bot-related accounts, including a lack of bots “sleeping” and the connections between bots and other accounts. They did this on 88 Million account profiles, gathering over 570M tweets from 3.1M accounts (using the Twitter API, which offers a richer set of data).
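One of the Duo-style statistical features — real users "sleep" while naive bots post around the clock — is easy to sketch. The heuristic below (longest quiet window across hours-of-day, with an illustrative four-hour threshold) is our own simplified version, not Duo's actual classifier.

```python
from datetime import datetime

def longest_quiet_window(timestamps):
    """Longest run of consecutive hours-of-day (with wrap-around) that an
    account shows no activity in, given "YYYY-MM-DD HH:MM" timestamps."""
    active = {datetime.strptime(t, "%Y-%m-%d %H:%M").hour for t in timestamps}
    best = run = 0
    for hour in list(range(24)) * 2:  # doubled list handles midnight wrap
        run = run + 1 if hour not in active else 0
        best = max(best, min(run, 24))
    return best

def looks_bot_like(timestamps, min_sleep_hours=4):
    """Flag accounts that never seem to sleep (threshold is illustrative)."""
    return longest_quiet_window(timestamps) < min_sleep_hours

human = ["2017-01-01 09:10", "2017-01-01 13:45", "2017-01-02 21:30"]
bot = ["2017-01-01 %02d:00" % h for h in range(24)]
print(looks_bot_like(human), looks_bot_like(bot))  # False True
```

On real data you would compute this over weeks of activity per account; a single day's sample, as here, is only for illustration.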
To put it into context, the Clemson report looked at content to classify a small number of accounts, whereas the Duo team did large-scale processing of network structure and statistical classification of account metadata. The larger, unselected Duo dataset better handles bias, although statistical classification can generate unsatisfying fuzzy results.
This paper (and Immunity's previous work in the area, which automated the twitter-handle-finding problem Duo had by using a modified PageRank-style approach) does a bit of both: it uses the Clemson data set, but looks at structure and content with machine learning and other automated statistical techniques, as made possible by already-in-place Brainspace algorithms.
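The general flavor of a PageRank-style pass over a mention graph can be sketched in a few lines. This is textbook PageRank via power iteration, not Immunity's actual modified algorithm, and the handles are made up:

```python
def pagerank(edges, damping=0.85, iters=50):
    """Plain PageRank over a directed mention graph.
    edges: list of (mentioner, mentioned) pairs."""
    nodes = {n for e in edges for n in e}
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in out.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new[dst] += share
            else:  # dangling node: spread its rank evenly
                for n in nodes:
                    new[n] += damping * rank[src] / len(nodes)
        rank = new
    return rank

# Three trolls all mention one hub account; the hub mentions one back.
edges = [("troll_a", "hub"), ("troll_b", "hub"), ("troll_c", "hub"),
         ("hub", "troll_a")]
ranks = pagerank(edges)
print(max(ranks, key=ranks.get))  # "hub" accumulates the most rank
```

The intuition for handle-finding is that accounts heavily mentioned by known trolls accumulate rank, surfacing candidate troll or target accounts for review.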
Some other interesting work includes:
- (machine learning using QUID)
- (nice relationship mapping)
- (nice search interface)
- (interesting visualization)
Brainspace Data Import
While this data was ostensibly gathered via the Twitter API, it did not have many of the fields that the standard API offers (such as their parsed hashtag and mentions list), which potentially were stripped off by the Salesforce database. It did contain the subjective analysis the original researchers did to label each account (aka, "Right" or "Left" and the subgroups "RightTroll", etc.)
Initially, the Unicode in the provided data appeared to be multi-encoded (though this could be an artifact of our import process). We used a specialized Python script to reformat the data into a Brainspace DAT, and also parsed the text to re-generate a list of mentions and hashtags, which we could use directly as search terms and to provide a communications destination, as Brainspace is best used when you have a From and a To.
We also manually regenerated fake IDs for each tweet since the originals were stripped and Brainspace needs an ID value per data line. Likewise, we conjoined all the provided CSV files into one large DAT file for import. Any mentions were used as the "To" field in Brainspace, and we enabled entity extraction for easier searching on various defined subjects (such as Phone numbers). This slowed the import but came in highly useful during analysis later. Figuring out what metadata to use as the “To” and “CC” fields for a Brainspace import is often quite important when doing this kind of work as it will define network structure for the visualizations.
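A minimal sketch of the preprocessing steps described above. The field names and the specific mojibake-repair heuristic are illustrative only — this is not the actual Brainspace DAT schema or our production import script:

```python
import re

MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")

def fix_mojibake(text):
    """Undo one layer of UTF-8-read-as-Latin-1 double encoding, if present."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # already clean, or not repairable this way

def to_record(row_id, author, text):
    """Build one import row: synthetic ID, From/To fields, re-parsed tags."""
    text = fix_mojibake(text)
    return {
        "id": row_id,                 # synthetic: original IDs were stripped
        "from": author,
        "to": MENTION.findall(text),  # mentions become the "To" field
        "hashtags": HASHTAG.findall(text),
        "body": text,
    }

# One synthetic row with double-encoded Cyrillic ("пр" mangled to "Ð¿Ñ\x80").
rows = [("TEN_GOP", "RT @realDonaldTrump #MAGA \u00d0\u00bf\u00d1\u0080")]
records = [to_record(i, a, t) for i, (a, t) in enumerate(rows)]
print(records[0]["to"], records[0]["hashtags"])
```

The try/except makes the repair safe to run blindly: text that is genuinely Latin-1 (or already clean UTF-8 with non-Latin-1 characters) fails one of the two conversions and passes through unchanged.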
Our import process resulted in a Brainspace repository of 2.9M documents with 53k unable to be parsed and roughly 2.3M "original" documents.
Initial Findings - Replicating the Clemson Report
As a first step, we wanted to make sure we could replicate their data and analysis, which we can do using the Brainspace advanced search based on this little quiz in their twitter feed:
Brainspace gets 473 original documents mentioning ("To") @jaketapper.
We get roughly twice as many for @seanhannity, as the tweet would suggest.
Sean Hannity traffic is driven heavily by two large information pushers (Covfefenationus and Ameliebaldwin). This is characteristic of troll networks, which often divide work into generators and amplifiers. It’s difficult to see with this particular data set, since every account is “bot connected” but there are often these starburst patterns in Brainspace connectivity graphs when looking at bot traffic on a large scale and they stand out quite clearly.
One thing that immediately sticks out in this sub-dataset (which we made a Brainspace “Focus”) is the hostility towards Keurig the coffee company (seen below as the second most used hashtag in the set after #maga). It turns out there's an article on this, relating to an incident where Sean Hannity supported a controversial candidate and Keurig pulled their ad campaign from his show.
This kind of data carving’s goal is to find unknown unknowns. It’s an iterative exploratory process in which you pose questions to the dataset, and conduct searches to help answer those questions if possible. This process is illustrated by many of the screenshots below, which unfortunately you have to zoom in on individually to truly follow, as many of the connectivity graph fonts are small and the clusters are of tiny dots on a dark background. However, it’s worth doing to follow the process.
Attempts to Obtain Direct Physical Action
There are several types of easily identifiable attempts to coopt US persons to take action within this data set.
The first one is quickly found by using the Entity Search and filtering on 202 numbers to get phone campaigns for DC.
This gets a number of tweets urging people to call their congressmen in favor of various subjects. This is an interesting way to directly move US policy based on IO, assuming they can get US people to call in (which is impossible to verify).
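An illustrative version of this entity filter: pull US phone numbers with a 202 (Washington, DC) area code out of tweet text. The regex is a sketch, not Brainspace's actual entity extractor, and the sample tweets are invented (the two numbers shown are the public Capitol and White House switchboard lines).

```python
import re

# Matches optional +1/1 prefix, optional parentheses, and -, ., or space
# separators around a 202 area code.
DC_PHONE = re.compile(r"\b(?:\+?1[-. ]?)?\(?202\)?[-. ]?\d{3}[-. ]?\d{4}\b")

tweets = [
    "Call your senator NOW at (202) 224-3121 and say NO!",
    "Great show tonight #maga",
    "Flood the switchboard: 202-456-1111",
]
call_campaigns = [t for t in tweets if DC_PHONE.search(t)]
print(len(call_campaigns))  # 2
```

Running the same filter per area code (212, 512, etc.) would show whether the phone campaigns targeted other cities as well.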
The second attempt is a fomented rally for Tamir Rice. The IREA set up a Facebook event and invited people and spread the word, since they are active on both social networks (and almost certainly have a number of other front organizations doing propaganda publishing – see our previous work on the subject). A sample tweet is below.
Facebook has taken this flyer down so it’s impossible to determine if it happened or how large it was.
More on this attempt is here:
Another, more disturbing IO campaign is the effort to get Americans to enlist in a “Patriot Army”.
This website seems to still be active, and the campaign (which started with user “heyits_toby”) appears to now have a life of its own. You can read the many essays on their web page, written in the casual style of traditional Russian propaganda sites highlighted in our previous presentation. They even have uniforms to buy (they suggest camouflage).
These are innovative and interesting methodologies to turn IO into direct physical action and are worth looking into further to determine how effective they were.
Another example of this kind of site found via link analysis of this Tweet set is:
It’s also amazing how rarely terms you would expect to find in a set of tweets this large actually appear: “Harold Martin”, “Kaspersky”, “Shadow Brokers”, “Gay”, etc. These terms are much rarer than any normal model of conversation would predict. Normally, in any set of three million posts from random people interested in politics at any level, you would expect more than 265 of them (which is how many are in this set) to mention Harry Reid. More work could be done to find the topics these tweets avoid.
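This kind of absence check is easy to run outside Brainspace as well. A quick sketch using case-insensitive substring matching (the sample tweets are synthetic):

```python
from collections import Counter

def term_counts(tweets, terms):
    """Count how many tweets mention each term (case-insensitive substring)."""
    counts = Counter({term: 0 for term in terms})
    for text in tweets:
        low = text.lower()
        for term in terms:
            if term.lower() in low:
                counts[term] += 1
    return counts

tweets = [
    "Shadow Brokers dump is fake news",
    "Nothing to see here, just #maga",
    "kaspersky says hi",
]
print(term_counts(tweets, ["Harold Martin", "Kaspersky", "Shadow Brokers"]))
```

Comparing these counts against a baseline corpus of ordinary political tweets is what would actually quantify the "suspiciously avoided topics" effect.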
One issue with this dataset is that it defines accounts by their current name, and not by their ID. Account names are changeable over time and it’s possible that accounts now named things like “BlackLivesMatter” were originally “FREEMUSIC123”. Tracking account name changes over time would be a useful thing to do, but difficult unless you are Twitter itself.
The below image is a high-level view of every tweet sent in 2017 in this data set. It has a strange shape if you’re used to looking at these sorts of things.
On the right there is an entire cluster of highly connected accounts which focus primarily on promoting Soundcloud.
Interestingly, the top cluster is mostly Russian (Ukrainian?) text, the bottom cluster is Turkish and other languages, the left cluster is mostly right-wing US tweets, and the right cluster is mostly commercial. (This is just 2017, to cut down the dataset a bit.) Letting the NLP and communications tracking of Brainspace do the grouping avoids any pre-conditions the analyst has about different potential classifications. That said, the original paper largely classified accounts purely by content, whereas in our efforts we tend to use both content and communications paths (as naturally demonstrated by the colors in the below graphic).
Looking at accounts marked as Left produces mostly accounts doing Soundcloud or other music promotion. This is useful in the sense that while we agree with the authors that there was not as much focus on promoting the Democratic party (Bernie in particular) it’s worth noting that the accounts which did so were largely recycled from other commercial efforts of the team.
Looking just at the soundcloud keyword in 2017 demonstrates a drop off in this kind of activity from these accounts, which could indicate a lower priority on that contract after a certain time period.
Searching on all the hashtags for those accounts could tell us if any of them were re-used for political activity.
Just top senders about Soundcloud:
Yes, looks like they re-used accounts for both Soundcloud spam (aka, boost your hits on Soundcloud for $$$) and for political activity:
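The reuse check amounts to a set intersection over each account's hashtag history. A minimal sketch — all handles and tag lists here are made up for illustration:

```python
# An account looks "recycled" if its hashtags span both the commercial
# (Soundcloud-promo) and political vocabularies. Tag sets are illustrative.
COMMERCIAL = {"soundcloud", "np", "nowplaying"}
POLITICAL = {"maga", "tcot", "blacklivesmatter"}

account_tags = {
    "beat_pusher":   {"soundcloud", "np", "hiphop"},
    "angry_patriot": {"maga", "tcot"},
    "recycled_bot":  {"soundcloud", "maga", "np"},
}

reused = [acct for acct, tags in account_tags.items()
          if tags & COMMERCIAL and tags & POLITICAL]
print(reused)  # ['recycled_bot']
```

On the real dataset you would build `account_tags` from the re-parsed hashtag field and likely weight by frequency, since a single stray political hashtag on a spam account is weak evidence of reuse.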
You can see this duality visually by making a focus on one of the users and seeing the combination of soundcloud (top) and political activity (bottom):
We can find out who purchased their promotion services by the characteristic starburst shape.
Other Statistical Analysis
We can replicate the Clemson paper’s statistical results by looking at October 2016. Interestingly, #foke is the top phrase, as highlighted by the following screenshot. One major problem with almost all papers in this area is that they tend to take a graph like the one below, attach an external event to it, and assume the spike of activity and the external event are correlated in some way, without doing any semantic analysis of the content! This is bizarre because, for all we know, a spike in data means someone was utilizing the botnet to sell a particularly excellent track from a Russian R&B artist that just happened to drop a few days before the DNC emails.
Taking a quick five second look at October 2016’s most used keywords can provide some additional verification:
As classified automatically with Brainspace there are four main groups in October 2016 (we have both IN and OUT selected here, which is not usual for us, since “IN” tends to be targets of the botnets, aka political figures, news sites, or other notables the IREA wants to influence).
What I don’t see is a natural grouping into five different types based on their communication network (as opposed to some semantic grouping of their Tweet content):
This is the shape of the data the day of the Wikileaks dump on 10/6/16. Often you see a double-star in this kind of data, where one person is controlling two accounts, which Brainspace identifies by having similar color (content) and connectivity and grouping them close together automatically.
The Mueller search is quite fun:
And the way this was done is fascinating with two clusters connected by only a few nodes.
Unfortunately, a lot of these Twitter accounts specialize in posting images, which would need to be loaded, compared, and OCRed to do true analysis on them. Future work could be done in this area using the Google APIs for such things. Likewise, this dataset stripped off the URL resolver information, so we would need to manually go to Twitter and annotate the dataset with all the original URLs – many of them are just RT.com, but it would be interesting to get detailed statistics on this.
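A sketch of how that re-annotation might start: pull shortened links out of tweet text, then follow redirects to recover the final destinations. The regex and shortener list are illustrative; `resolve()` makes a live network request, so it is shown here but not called.

```python
import re
from urllib.request import urlopen  # used by resolve(); requires network

# Illustrative shortener list; extend as needed.
SHORT_URL = re.compile(r"https?://(?:t\.co|bit\.ly|goo\.gl)/\w+")

def extract_short_urls(text):
    """Return all shortened URLs found in a tweet body."""
    return SHORT_URL.findall(text)

def resolve(url):
    """Follow redirects to the final URL (live network call; no error
    handling here -- dead shorteners will raise)."""
    return urlopen(url).geturl()

tweet = "BREAKING https://t.co/abc123 also see https://bit.ly/xYz9"
print(extract_short_urls(tweet))
```

Aggregating the resolved domains (how many point at RT.com, at Facebook events, at the "Patriot Army" sites, and so on) would give the detailed statistics mentioned above.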
There’s surprisingly little in this dataset on Syria, a common Russian topic with other bot networks. They’ve named the accounts that push this topic appropriately as well. The same is true for Crimea, which leads us to believe we are examining a bit of selection bias with the original account set.
There have been two major attempts to push campaigns around the NSA:
There are a lot of different languages, according to Brainspace’s automatic classification. Analyzing the activity in each of these languages can be done with the aid of automatic Google translation when importing, if necessary, but it’s best to have a linguist on staff, which we have for only some of these languages. That said, this is a fruitful and interesting effort, although it requires slightly more time.
In Dutch there is a lot of effort to push the #islamkills hashtag, especially after certain events happen in the real world to help amplify this approach.
Japanese language time histogram is quite interesting:
This is August:
Hmm. These look like client-side exploit attempts:
Following the link:
Hmm. Most of these accounts look like other Russian bots. Some sort of C2 perhaps?
Assange is both “rare” in this dataset and interestingly clustered (here we only selected “IN” to find targets of influence as opposed to senders; Jeff Sessions and Donald Trump/POTUS are the main efforts):
Using the CMML to Find Misclassified Accounts
You can quite quickly and easily use Brainspace to find accounts which were marked “Left” by the Clemson team but which in fact tweeted things on both ends of the political spectrum.
Training the CMML classifier on known accounts with a few thousand PRO- and ANTI-Trump messages takes a few minutes (you select clusters of documents in the concept wheel and tag them all at once).
Here we find all the people who are PROTRUMP (from the training with CMML) but are marked as Account:Left by the Clemson authors. Note the use of “classfier_4_score of .9 or above” in the advanced search.
This could probably be automated to produce clearer classifications, as a future work, and track changes in classifications over time.
The Clemson data set is fascinating and this effort (probably in total under eight hours after import) has led us to some understanding of how these systems work and how broadly they are applied, although it has not told us how effective they are.
If you have a Brainspace license and would like the import Python script let us know.