Last year, Michael Phillips, a data science intern at Cambridge Analytica, posted the following scripts in a set of “work samples” on his personal GitHub account.
The GitHub profile, MichaelPhillipsData, is still around. It contains a selection of Phillips’ coding projects. Two of the commits, still online today, appear to be scripts that were used by Cambridge Analytica around the election. One of them even includes his email address. The rest of his current work, Phillips notes on his GitHub profile, he unfortunately “cannot share.”
The first of Phillips’ two election data processing GitHub scripts is titled GeoLocation.py, a list-completing and enrichment tool that can be used to:
MichaelPhillipsData/GitSampleCode #Geolocation.py

“complete an array of addresses with accurate latitudes and longitudes using the completeAddress function. Includes another function compareAPItoSource for testing APIs with source latitude longitudes.”

Phillips describes the geolocation list completion tool as performing the following tasks. It appears to enrich the clients’ personal information files:
“Essentially what it does is: For each address in the addresses file, try to get an accurate lng/lat quickly (comparing available data from Aristotle/IG to the zip code file data to determine accuracy), but if we can’t, we fetch it from ArcGIS.”

Don’t miss the line item called “Voter_ID.”
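The flow Phillips describes (a fast local lookup first, with the slower ArcGIS geocoder only as a fallback) can be sketched as follows. This is a minimal illustration, not the repo’s code: the zip-centroid table and function names are invented stand-ins, and the ArcGIS call is stubbed out rather than hitting the real geocoding REST API.

```python
# Hypothetical sketch of the enrichment flow: try fast local data first,
# fall back to a (stubbed) ArcGIS geocoding call only when needed.

zip_centroids = {
    "10001": (40.7506, -73.9972),   # toy stand-in for the zip code file data
    "94105": (37.7898, -122.3942),
}

def fetch_from_arcgis(address):
    """Stub for the slower ArcGIS geocoding call (network omitted here)."""
    raise NotImplementedError("ArcGIS REST call omitted in this sketch")

def complete_address(address, zip_code):
    """Return (lat, lng): use the local zip-centroid data when available,
    otherwise fall back to the geocoding API."""
    if zip_code in zip_centroids:
        return zip_centroids[zip_code]
    return fetch_from_arcgis(address)
```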
The second work-related script in Phillips’ GitHub repo is called Twitteranalysis.py.
MichaelPhillipsData/GitSampleCode #Twitteranalysis.py

Phillips offers a quick starter for how the Twitter mining code works:
“For starters, we will just get sentiment from textBlob for tweets containing keywords like ‘Trump’, ‘Carson’, ‘Cruz’, ‘Bern’, ‘Bernie’, ‘guns’, ‘immigration’, ‘immigrants’, etc.”

Twitteranalysis.py also finds the Twitter user IDs amongst the sample it collects in order to “retrieve all the user’s recent tweets and favorites.”
Looking in more detail, it then:
- Separates users’ tweets into [control] groups containing each keyword
- Produces a “sentiment graph” of the whole group using textBlob and matplotlib

As a real-time social media mining tool built on common libraries like tweepy and matplotlib, this doesn’t appear to be science fiction or especially complex. However, that is not what makes the code interesting as a key piece of research, political evidence, and cultural object.
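To make the grouping step concrete, here is a hedged sketch of the keyword-bucketing logic, assuming a trivial word-count scorer in place of TextBlob’s `TextBlob(text).sentiment.polarity`. The tuple shape mirrors the script’s `(sentiment, tweetID)` lists, but the code and names below are illustrative, not Phillips’ own.

```python
def toy_polarity(text):
    """Placeholder for TextBlob polarity: approving minus disapproving words."""
    positive, negative = {"great", "love", "support"}, {"bad", "hate", "awful"}
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

def bucket_tweet(tweet_id, text, keyword_groups):
    """File a (sentiment, tweetID) tuple under every topic whose keyword
    appears in the tweet text, mirroring the script's sentiments lists."""
    lowered = text.lower()
    for topic, (sentiments, keywords) in keyword_groups.items():
        if any(k in lowered for k in keywords):
            sentiments.append((toy_polarity(text), tweet_id))

# fabricated example data
groups = {
    "guns": ([], ["guns", "gun", "nra"]),
    "immigration": ([], ["immigration", "immigrants"]),
}
bucket_tweet(42, "I love the NRA", groups)
```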
The most fascinating part of the Twitter sentiment miner that Phillips posted is how it appears to pull users’ IDs and find their “recent tweets” and even favorites to expand the company’s corpus of keywords around specific objects of election sentiment (i.e., immigration, border control, etc.).
Looking below, nearly all “sentiments” within the lines of code involve “hot-button” 2016 election topics such as abortion, citizenship, naturalization, guns, the NRA, liberals, Obama, and Planned Parenthood.
See for yourself, here’s the actual code:
```python
# each sentiments list will have tuples: (sentiment, tweetID)
# note: could include many more keywords like "feelthebern" for example, but need
# neutral keywords to get true sentiments. feelthebern would be a biased term.
```

In any case, here are the “sentiments” the script was set to look for via Twitter’s API:
```python
hillarySentiments = []
hillaryKeywords = ['hillary', 'clinton', 'hillaryclinton']
trumpSentiments = []
trumpKeywords = ['trump', 'realdonaldtrump']
cruzSentiments = []
cruzKeywords = ['cruz', 'tedcruz']
bernieSentiments = []
bernieKeywords = ['bern', 'bernie', 'sanders', 'sensanders']
obamaSentiments = []
obamaKeywords = ['obama', 'barack', 'barackobama']
republicanSentiments = []
republicanKeywords = ['republican', 'conservative']
democratSentiments = []
democratKeywords = ['democrat', 'dems', 'liberal']
gunsSentiments = []
gunsKeywords = ['guns', 'gun', 'nra', 'pistol', 'firearm', 'shooting']
immigrationSentiments = []
immigrationKeywords = ['immigration', 'immigrants', 'citizenship', 'naturalization', 'visas']
employmentSentiments = []
emplyomentKeywords = ['jobs', 'employment', 'unemployment', 'job']
inflationSentiments = []
inflationKeywords = ['inflate', 'inflation', 'price hike', 'price increase', 'prices rais']
minimumwageupSentiments = []
minimumwageupKeywords = ['raise minimum wage', 'wage increase', 'raise wage', 'wage hike']
abortionSentiments = []
abortionKeywords = ['abortion', 'pro-choice', 'planned parenthood']
governmentspendingSentiments = []
governmentspendingKeywords = ['gov spending', 'government spending', 'gov. spending', 'expenditure']
taxesupSentiments = []
taxesupKeywords = ['raise tax', 'tax hike', 'taxes up', 'tax up', 'increase taxes', 'taxes increase', 'tax increase']
taxesdownSentiments = []
taxesdownKeywords = ['lower tax', 'tax cut', 'tax slash', 'taxes down', 'tax down', 'decrease taxes', 'taxes decrease', 'tax decrease']
```

Drilling down to the list of terms linked to each election sentiment keyword (in the code as `#(nameOfTuple, sentimentList, keywordList)`), we can see:
```python
personSentimentList = [('hillary', hillarySentiments, hillaryKeywords),
                       ('trump', trumpSentiments, trumpKeywords),
                       ('cruz', cruzSentiments, cruzKeywords),
                       ('bernie', bernieSentiments, bernieKeywords),
                       ('obama', obamaSentiments, obamaKeywords)]
issueSentimentList = [('guns', gunsSentiments, gunsKeywords),
                      ('immigration', immigrationSentiments, immigrationKeywords),
                      ('employment', employmentSentiments, emplyomentKeywords),
                      ('inflation', inflationSentiments, inflationKeywords),
                      ('minimum wage up', minimumwageupSentiments, minimumwageupKeywords),
                      ('abortion', abortionSentiments, abortionKeywords),
                      ('government spending', governmentspendingSentiments, governmentspendingKeywords),
                      ('taxes up', taxesupSentiments, taxesupKeywords),
                      ('taxes down', taxesdownSentiments, taxesdownKeywords)]
```

Phillips also provides a snippet of code “for taking random twitter IDs” to create a Twitter “control group.” This part of the code appears to “skim the most recent tweets that have mentioned one of our [Cambridge Analytica’s pre-defined] keywords.”
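The control-group idea can be sketched like this. Everything here is illustrative: the ID space and seed are made up, and a real version would sample IDs coming off Twitter’s streaming API rather than a bare random-number generator.

```python
import random

def random_control_ids(n, id_space=10**9, seed=0):
    """Draw n reproducible pseudo-random IDs as an unbiased baseline sample."""
    rng = random.Random(seed)
    return [rng.randrange(id_space) for _ in range(n)]

def mentions_keyword(text, keywords):
    """The 'skim' test: does this tweet mention any tracked keyword?"""
    lowered = text.lower()
    return any(k in lowered for k in keywords)

control_ids = random_control_ids(5)
```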
In the notes in his code, Phillips explains a practicality of sentiment mining: it was not big data (all the tweets) that was being sought out:
“it turned out that skimming all of the tweets found very very few occurances of keywords since twitter is such a global/multilingual platform.”

Next, Phillips provides a snippet to parse *any* text that CA was “looking for through non-tweets (like transcripts of some sort),” noting that the tool is set up to “find sentiment and adds [it] to the respective keywords’ data list”:
Interesting functionality, indeed. The code ends with a final function that, Phillips states:
“goes through tweets of each user, looks for keywords, and if the keyword is there, we find the sentiment for that tweet and add it to the sentiment data list”

Next, the code compiles the collected and refined data. Phillips describes:
“compiles the sentiment data for each keyword group into an easier to work with format (dataframe) … it is only meaningful if compared with a control group, since keyword selection is impossible to employ neutrally.”

The final output of Twitteranalysis.py is a list of tweets and Twitter users (via IDs) located within a pre-defined set of keywords (abortion, NRA, Hillary, Obama, lower taxes, guns, immigration, liberals, etc.), all relating to #Election2016 campaign issues. The code also appears to be extensible to mine text from focus groups and survey respondents.
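The compile-and-compare step can be sketched without pandas; here the dataframe aggregation is reduced to a mean, and the polarity scores are fabricated examples. The point of Phillips’ comment is that a keyword group’s sentiment only means something relative to the control baseline.

```python
from statistics import mean

def relative_sentiment(group_scores, control_scores):
    """A keyword group's signal: its mean polarity minus the control-group mean."""
    return mean(group_scores) - mean(control_scores)

# fabricated example: the keyword group runs hotter than the random baseline
signal = relative_sentiment([0.4, 0.2, 0.6], [0.1, 0.0, 0.2])  # approx. 0.3
```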
These scripts normally wouldn’t be that interesting. But given that both were added by a Cambridge Analytica data intern (at least at the time), contain a running dialog of what the tools do, how they work, and why they were built, and are *still* available on GitHub, I thought I’d share.
Wait, there’s one more thing. When Phillips committed his original Twitteranalysis.py script, he accidentally left the working Twitter API keys in the code (the consumer key and consumer “secret”). These are the alphanumeric strings that let a developer account access data from Twitter’s API.
Interestingly, on Feb 23, 2017 (yes, 2017), Phillips removed the API keys:
Two days later, another GitHub user added a comment about Phillips’ mistake:
Was the API key Cambridge Analytica’s? Or SCL’s? While both scripts (the first including Phillips’ @cambridgeanalytica.org email address) are clearly voter data and election sentiment related, from the commentary in the script it’s not clear who the API key belonged to. It could have been Phillips’ own.
Regardless, this shows the inner workings of targeted voter file geo-data “enrichment” and presumably automated voter file processing for clients by Cambridge Analytica.
This code also shows, once and for all, how Twitter users’ emotional reactions and real-time discussions (even users’ favorites were being pulled from the API) are mined in real time and used to create test phrases, establish control groups, and apparently even to generate sets of future wording around keywords related to political campaign issues.
The fact that Cambridge Analytica was using this kind of code to mine emotional responses that surfaced from “recent tweets” referencing a defined set of 2016 presidential campaign “trigger words” is interesting.
Related: “What’s Missing From The Trump Election Equation? Let’s Start With Military-Grade PsyOps” (medium.com)

I’m confident Phillips provided this in earnest, as he gives an excellent working description of the purposes and uses of these scripts. He was a CA intern who wanted to show his work to land a future job. Yet this is part of the arsenal of tools used by Cambridge Analytica to geolocate American voters and harness Americans’ real-time emotional sentiment.
I’d argue the question of the ownership of Cambridge Analytica, a foreign business previously registered in the United States as a foreign corporation (SCL Elections), just became a bit more relevant.
Foreign…sound familiar?
CA Election Data Processing Scripts, a dataset by d1gi (data.world)

And the fact that a working Twitter developer API key, possibly one of Cambridge Analytica’s own, was left sitting on GitHub by a data intern for anyone to use is, well, another story. The code will likely be removed soon, so it’s available here:
📌 #Election2016 #FakeNews Compilation: “Something like Mr Robot meets House of Cards meets academic hackathon deep-data journalism.” (medium.com)