Datamining Contest Puts Roberts on Panel at 2013 NICAR Conference

CAR 2013

Brandon Roberts will be on a panel on March 2 at the 2013 National Institute for Computer-Assisted Reporting (NICAR) conference in Louisville, KY. You can read about the conference here, about the presentation here, and more about the data mining contest that led to this panel here.

Roberts was a finalist in the Kaggle "Follow the money" data science contest, put on by the Center for Investigative Reporting and Investigative Reporters & Editors, and will be presenting with a panel of other finalists. You can find his entry here:

With more money being spent on elections across the US than ever before, some important questions arise: who are these contributors and what do they expect to get in return? My proposal and related tools aim to help journalists and data miners answer those questions and explore the idea of powerful contributors influencing a politician’s actions. I accomplish this by taking occupation/employer word frequencies and sources of data that represent actions (Senate votes, council minutes, etc., see below) and using a decision tree learner to uncover patterns.

The ideas and tools presented here are as flexible as possible so that they can be used not only at the Federal level, but especially at the local level, where all kinds of corruption go on and sometimes go unnoticed until it’s too late.

To achieve this goal, I wrote a series of tools that make it easy to see who is contributing, grouped by occupation or employer, once the contributor data is loaded into a database. The tools are Python 2.7 scripts and Linux shell scripts: breakdown.py, transform.sh, combiner.py, pdfs2cleantxt.sh, txt2arff.sh, senatevotes2arff.py, and a few others; you can find them at the end of this text. With these scripts you can convert this data into the ARFF format used by the WEKA data mining software. I’ve also included some tools that convert data between formats (PDF-to-ARFF, for example) so that loading data into WEKA for analysis is easier.

First, I looked into which industries, as a whole, had the biggest impact on the last presidential election. To do this, I got the 2008 FEC Campaign Contribution data and loaded it into a MySQL database. Using the tools, I grouped each candidate’s contributors by their stated occupation (breakdown.py). Then I converted the occupation strings into word frequencies using WEKA’s StringToWordVector filter (transform.sh, or you can use WEKA explorer). This left us with a giant database of every occupation (hundreds) for every candidate. So I combined all the word frequencies into a single instance per candidate with a classification of whether the candidate won or lost. This was done using combiner.py. Then everything was ready for machine learning.
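The combining step can be pictured with a small sketch. This is not the actual combiner.py; the function names, the row layout, and the `word_` attribute prefix are assumptions made for illustration. It merges every occupation string for a candidate into one bag of word counts and writes a single ARFF instance per candidate, with a won/lost class as the last attribute.

```python
import re
from collections import Counter


def combine_by_candidate(rows):
    """Merge (candidate, outcome, occupation) rows into one word-count bag per candidate."""
    combined = {}
    for candidate, outcome, occupation in rows:
        counts, _ = combined.setdefault(candidate, (Counter(), outcome))
        # Keep only runs of letters so every token is a safe ARFF attribute name.
        counts.update(re.findall(r"[a-z]+", occupation.lower()))
    return combined


def to_arff(combined, relation="contributors"):
    """Render the combined counts as ARFF text with a won/lost class attribute."""
    vocab = sorted({word for counts, _ in combined.values() for word in counts})
    lines = ["@relation " + relation, ""]
    lines += ["@attribute word_%s numeric" % word for word in vocab]
    lines += ["@attribute outcome {won,lost}", "", "@data"]
    for counts, outcome in combined.values():
        # A Counter returns 0 for words this candidate's contributors never used.
        values = [str(counts[word]) for word in vocab]
        lines.append(",".join(values + [outcome]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sample rows, only to show the shape of the input.
    sample = [
        ("Candidate A", "won", "Business Owner"),
        ("Candidate A", "won", "Owner"),
        ("Candidate B", "lost", "Teacher"),
    ]
    print(to_arff(combine_by_candidate(sample)))
```

The resulting file has one row per candidate, which is the shape a decision tree learner needs in order to treat each candidate as a single training example.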

Using WEKA’s J48 decision-tree learner, I built a decision tree that found strong correlations between high frequencies of contributors who labeled themselves “owners,” “presidents,” “doctors” and “brokers” and candidates who won in 2008. With models built from the decision tree, it is possible to predict the outcome of some of the 2012 races based on these influential industries. The Python script breakdown.py can also group contributors by employer, which might not be as meaningful at the Federal level but could be a good predictor of powerful contributors at the local level. (Candidates: now you know who to ask for money!)

2008 Election – Winners Losers by Occupation of Contributors

With these tools you can also look at a competitive race between a winner and a loser to see how two candidates differ from each other and who is likely to support them financially. In my example it was SNL’s Stuart-Smalley-gone-political Al Franken vs. Republican incumbent Norm Coleman; Franken won by a margin of about 300 votes.

First, I pulled employer, occupation, and dollar amount data on each contributor for each candidate. I normalized the inverse word frequencies from both the employer and contributor fields. Then, using a decision tree learner (I used REPTree because it wasn’t as CPU-intensive on this huge amount of data), I found what types of employers and occupations were most likely to contribute to whom. This decision tree was based on words related to occupation/employer and contribution amount.
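For readers unfamiliar with inverse word frequencies, here is a minimal sketch of one common formulation: each word count is multiplied by log(N / number of documents containing the word), and each vector is then scaled to unit length. The exact options used in WEKA for this analysis are not stated above, so treat the weighting and the function name `tfidf` as assumptions.

```python
import math
import re
from collections import Counter


def tfidf(docs):
    """Return one {word: weight} dict per document, IDF-weighted and length-normalized."""
    tokenized = [Counter(re.findall(r"[a-z]+", doc.lower())) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for counts in tokenized:
        doc_freq.update(counts.keys())

    vectors = []
    for counts in tokenized:
        weights = {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        # Scale to unit Euclidean length so long and short fields are comparable.
        norm = math.sqrt(sum(v * v for v in weights.values()))
        if norm > 0:
            weights = {w: v / norm for w, v in weights.items()}
        vectors.append(weights)
    return vectors
```

Words that appear in every record get a weight of zero, so terms that do not help tell one contributor from another drop out, while rarer terms such as a specific firm name carry more weight.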

In the 2008 Franken / Coleman race I found:

Franken vs Coleman 2008 – Coleman Contributors by Employer and Occupation

1) Law firms heavily supported Coleman. The terms “LLP,” “DLA” (a MN-based law firm), “NATH” (another law firm) correlated with a strong vote for Coleman.

Franken vs Coleman 2008 – Franken Contributors by Employer and Occupation

2) Teachers supported Franken. The words “teacher” and “classroom” meant a strong Franken vote. Also, “homemakers” voted Franken 2:1, and a longshoremen’s PAC was a major contributor.

You can run with this idea and, using the same tools, find relationships between candidates and number of PACs vs. independent contributors, raw amount of contribution dollars, etc. This method isn’t restricted to any level of government. I took Campaign Finance Reports from my city’s local mayoral election and loaded them into the database the same way I did with the FEC data (see examples bundled with the tool). I ran this through the J48 decision tree learner and found which employers were powerful contributors. (Hint: real estate lawyers!)

Matching contributors with election outcomes and candidates might be fun, but it doesn’t answer the more important question facing journalists: whether or not a politician’s actions are influenced by those contributors.

To explore this, I wrote an example tool (senatevotes2arff.py) that automatically downloads any Senator’s voting record and converts it to the data-mining-friendly ARFF format. The file contains a list of every vote the Senator made and the description of each bill. Using this tool, I got the first two sessions of Al Franken’s first Senate voting record. I turned the descriptions into word vectors, using the inverse frequency, and built a J48 decision tree that looked at what words correlated with a “Yea” or “Nay” vote in the Senate.
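The conversion half of that idea might look something like the sketch below. It is not the real senatevotes2arff.py: the download step is left out, and the input (a list of description/vote pairs) and the function names are hypothetical. Each bill description becomes an ARFF string attribute, and the vote becomes the nominal class that a decision tree learner would predict.

```python
def quote(text):
    """Quote a value as an ARFF string: collapse whitespace, escape backslashes and quotes."""
    collapsed = " ".join(text.split())
    return "'" + collapsed.replace("\\", "\\\\").replace("'", "\\'") + "'"


def votes_to_arff(votes, relation="senate_votes"):
    """Turn (description, vote) pairs into ARFF text with a Yea/Nay class attribute."""
    lines = [
        "@relation " + relation,
        "",
        "@attribute description string",
        "@attribute vote {Yea,Nay}",
        "",
        "@data",
    ]
    for description, vote in votes:
        # Skip anything that is not a plain Yea or Nay (absences, for example).
        if vote not in ("Yea", "Nay"):
            continue
        lines.append(quote(description) + "," + vote)
    return "\n".join(lines) + "\n"
```

Once the file is loaded into WEKA, the description attribute can be run through the StringToWordVector filter to get the word vectors described above.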

Franken Senate Votes – 111th Senate, Session 1

Franken Senate Votes – 111th Senate, Session 2

There is nothing that unusual about Al Franken’s voting record in relation to his contributors. Also notice how the decision trees change over time (note ACORN in the first session). As a liberal candidate, he opposed bills containing the word “forces” (related to war) and voted for “funding” bills. Using this raw ARFF file and WEKA, you could explore the same data in a lot of other ways, such as parsing the words using a dictionary that leaves only certain kinds of nouns and adjectives, or taking the frequent words found in the contributor decision tree, pulling Senate voting records related to those terms, and building a decision tree to see how the politician voted on those bills. Correlations between a candidate’s voting habits and his or her contributors would become obvious to journalists using these methods.

I wanted journalists to be able to run with this idea at the local levels of government, too, so I came up with a few ideas on how to check a council member’s or mayor’s contributors:

1) Use text classification on council agendas or minutes (downloaded automatically with a tool like httrack) and compare them with contributor decision trees.

2) Take closed caption logs from council meetings, break them up by agenda item, use WEKA’s StringToWordVector filter to transform what was said into word frequencies, and then use data mining to learn a decision tree based on sentiment (for more about this, see the work done with Twitter sentiment). Matching this information could be useful in finding patterns between contributors and a politician’s actions.

3) Parse government contracts awarded to businesses, use text classification (awarded or not) and word frequencies to find correlations between contributors (or industries) and politicians in charge of the contracts.

To help journalists who want to experiment with this, I wrote a couple of scripts that transform PDF documents (the most common format governments use to digitize everything) into an ARFF file: pdf2cleantxt.sh (this converts a directory full of PDFs to a clean text format) and txt2arff.sh (a program to turn those text files into a single, data-mineable ARFF file).
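As a rough illustration of the second step, the sketch below gathers a directory of already-cleaned .txt files into a single ARFF file with one instance per document. It is written in Python rather than shell and is not the bundled txt2arff.sh; the attribute names and function names are assumptions.

```python
from pathlib import Path


def quote(text):
    """Quote a value as an ARFF string: collapse whitespace, escape backslashes and quotes."""
    collapsed = " ".join(text.split())
    return "'" + collapsed.replace("\\", "\\\\").replace("'", "\\'") + "'"


def texts_to_arff(named_texts, relation="documents"):
    """Turn (filename, text) pairs into ARFF text, one instance per document."""
    lines = [
        "@relation " + relation,
        "",
        "@attribute filename string",
        "@attribute text string",
        "",
        "@data",
    ]
    for name, text in named_texts:
        lines.append(quote(name) + "," + quote(text))
    return "\n".join(lines) + "\n"


def read_text_dir(directory):
    """Yield (filename, contents) for every .txt file in a directory, in sorted order."""
    for path in sorted(Path(directory).glob("*.txt")):
        yield path.name, path.read_text(encoding="utf-8", errors="replace")
```

A call such as `texts_to_arff(read_text_dir("minutes"))` would then produce the ARFF text for a folder named `minutes`, where the folder name is only an example. A class attribute (awarded or not, for instance) would still need to be added before training a classifier.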

It’s not always easy to see the connection between a contributor and a politician working in their interest. But by uncovering the structures behind who is contributing in relation to a politician’s actions, patterns about corruption or good public service work will reveal themselves. The question of who a politician really serves, their contributors or their constituency, is as crucial as ever, especially when you consider the insane amount of money being spent on elections at every level of government these days. Luckily for journalists, as much as we like to deny it, the things humans do usually result in obvious patterns. Politicians aren’t any different. Using these methods and tools, data mining can help journalists uncover these patterns.

You can download all the tools & example data here: Political Data Mining Tools (ZIP)
