by Brandon Roberts
Last year I was working on an entry to a data-mining contest and decided to include an analysis of our mayor’s campaign contributions. I got hit with a harsh truth about the way most local and state governments publish their campaign finance records: they're kept in a format that makes them extremely difficult to examine.
PDF, the format that most campaign finance reports are published in, displays structured rows of information, but not in a way that anyone can easily export for analysis. Copying and pasting a whole page gives us a jumbled mess of sentence fragments instead of something we can use. In campaign finance journalism, this is a huge problem.
Consider this: most political candidates are required to publish campaign finance reports (CFRs) during an election. A CFR is mostly made up of a form listing contributor information. In Texas, for example, each entry on a CFR has fields for name, date, address, employer, position, and the type and amount of the contribution. There are five contributors per page. Each CFR has anywhere from 50 to 100 of these pages. In Austin, each candidate publishes three (or more, depending on the circumstances) of these CFRs per election cycle.
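To make that structure concrete, here is a minimal Python sketch of what one contribution row looks like as data and how a list of rows becomes a spreadsheet-ready CSV. It is not Campaign Vulture's code. It assumes the hard part, pulling clean field values out of the PDF in form order, has already been done, and the names `Contribution`, `rows_from_values`, and `to_csv`, along with the sample values, are made up for illustration.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class Contribution:
    """One contributor entry, using the fields found on a Texas-style CFR."""
    name: str
    date: str
    address: str
    employer: str
    position: str
    contribution_type: str
    amount: str


FIELD_NAMES = [f.name for f in fields(Contribution)]
FIELDS_PER_ROW = len(FIELD_NAMES)  # seven fields per contributor


def rows_from_values(values):
    """Group a flat list of extracted values into Contribution rows.

    Assumes the values arrive in form order, seven per contributor.
    """
    if len(values) % FIELDS_PER_ROW != 0:
        raise ValueError("value count is not a multiple of the field count")
    return [
        Contribution(*values[i:i + FIELDS_PER_ROW])
        for i in range(0, len(values), FIELDS_PER_ROW)
    ]


def to_csv(rows):
    """Render rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELD_NAMES)
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buf.getvalue()


# Hypothetical sample data: two made-up contributors.
sample_values = [
    "Jane Doe", "2013-01-15", "123 Example St, Austin, TX",
    "Example Co", "Engineer", "Monetary", "100.00",
    "John Roe", "2013-01-20", "456 Sample Ave, Austin, TX",
    "Sample LLC", "Manager", "Monetary", "250.00",
]
rows = rows_from_values(sample_values)
csv_text = to_csv(rows)
print(csv_text)
```

With five contributors per page, a 100-page report holds up to 500 of these rows, which is why grouping and exporting them by hand is not realistic.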
To get contribution data into a spreadsheet, most people would be forced to copy and paste each value one by one. Doing this manually is a painfully tedious job that nobody should consider willingly. This is 2013. Can’t we do better?
Yes. With a tool we’re calling Campaign Vulture. It gives you the power to painlessly convert a CFR to a spreadsheet in less time than it would take someone to copy and paste one page's values. With Campaign Vulture, you'll bypass the tedious tasks and jump right into doing actual journalism.
Right now the software is experimental and not user friendly at all. It’s also not free or open source. But in our tests, it took us 97.43 seconds to extract contributor information from a 300-page Texas-style CFR with 99.47% accuracy (6 errors out of 1,135 rows).
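For anyone who wants to check the arithmetic, the figures above can be reproduced in a few lines of Python. Only the four input numbers come from our test; the throughput values are derived from them, and the variable names are just for illustration.

```python
# Figures reported from our test run.
pages = 300
seconds = 97.43
rows = 1135
errors = 6

# Accuracy is the share of rows extracted without an error.
accuracy = (rows - errors) / rows

# Derived throughput, not separately measured.
pages_per_second = pages / seconds
rows_per_second = rows / seconds

print(f"accuracy: {accuracy:.2%}")
print(f"pages per second: {pages_per_second:.2f}")
print(f"rows per second: {rows_per_second:.2f}")
```

The first line printed is `accuracy: 99.47%`, matching the number quoted above.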
Help make Campaign Vulture free, open source, and easy to use by funding our work.
We've set up a page for individual donations:
Any organizations interested in funding Campaign Vulture should contact Brandon Roberts directly by calling the Austin Cut office at 512-221-2136 or by e-mail:
Campaign Vulture is currently being developed from an experimental parser for Texas-style campaign finance reports, written by Brandon Roberts. It is written in Clojure, a Lisp that runs on the JVM. This means that any modern system capable of running Java will be able to run Campaign Vulture. You can find this experimental demo code on GitHub at https://github.com/brandonrobertz/parse-tx-cfr. More information will be made available as things develop.