Data-Jitsu: Dawn of the data journalist


Here’s an unedited version of my story from June’s World Conference of Science Journalists in Seoul this year, published June 9 2015 on :

By Natalie Heng

At the World Conference of Science Journalists in Seoul, Korea, a roomful of science communicators is hot on the trail of how many National Security Agency-funded research papers can be found on Google Scholar. “This is not public information,” our workshop instructor, John Bohannon, informs us.

He would know, because he’s tried asking the agency for it. Even classified information can leave a paper trail, however. Bohannon is teaching us how to use Google search terms and common sense to follow digital breadcrumbs.

It isn’t long before most of us figure out that “MDA904” is an NSA grant code prefix, and that putting the search term in quotation marks makes all the difference. His point is that there is much journalists can do in today’s digital age, and that tricks like “web scraping”, which basically means mining the internet for useful information, are gateways to immense amounts of data.

Once we complete the workshop exercise and find 241, 262 and 260 hits for 2013, 2012 and 2011 respectively with the exact term “MDA904” on Google Scholar, Bohannon goes a step further and pulls up a bar chart.

It plots the number of papers on Google Scholar that include any of a variety of NSA grant prefixes, year by year, from the 1980s to recent times.

The pattern was obvious: hits for one particular code declined while hits for another increased, a pattern consistent, it seems, with changes in US foreign policy. The chart hinted at the beginnings of a story, and the data that formed the bones of that story came from Google Scholar, not some government-issued press release.
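For readers who want to try something similar, here is a minimal sketch of a bar chart drawn in plain text. It uses only the three “MDA904” hit counts quoted above; the function name and the bar width are illustrative choices, not anything shown in the workshop.

```python
# Hit counts for the exact term "MDA904", as quoted in the article.
hits_by_year = {2011: 260, 2012: 262, 2013: 241}


def text_bar_chart(counts, width=40):
    """Return one line per year, with bar length scaled to the largest count."""
    largest = max(counts.values())
    lines = []
    for year in sorted(counts):
        bar = "#" * round(counts[year] / largest * width)
        lines.append(f"{year} {bar} {counts[year]}")
    return lines


for line in text_bar_chart(hits_by_year):
    print(line)
```

With more years and more grant prefixes, the same idea scales up to the kind of chart that makes a rise or fall visible at a glance.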

As technology moves forward, the amount of data available to us is increasing at a phenomenal rate. Scientists and businesses have been busy figuring out how to harness the potential of this data. Bohannon’s message is that journalists should too.

People like Bohannon are at the forefront of a movement that aims to equip journalists with the skills to make sense of this overwhelming access to data. A regular contributor to publications like Science and Wired, Bohannon is known for writing investigative pieces that make use of large-scale data analysis.

Lesson number one is: don’t underestimate your Excel spreadsheet. In our class he gave us the passenger list for the Titanic and got us to work out the survival rate for passengers with first-, second- or third-class tickets, and how those odds rose or fell depending on whether you were male or female. It would have made a great data-based news article, one that could have been turned around in about an hour had anyone had access to Excel at the time. And the story is all the easier to write if you pick up some simple Excel tricks, like how to use pivot tables to organise, analyse and visualise your data.
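The same grouping a pivot table does can be sketched in a few lines of Python. The rows below are invented toy data, not real Titanic records, and the function name is hypothetical; the point is only to show the method of counting survivors within each class-and-sex group.

```python
from collections import defaultdict

# Invented toy rows in the shape (ticket_class, sex, survived).
# These are NOT real Titanic records; they only illustrate the method.
passengers = [
    (1, "female", True),
    (1, "male", False),
    (1, "male", True),
    (2, "female", True),
    (2, "male", False),
    (3, "female", False),
    (3, "female", True),
    (3, "male", False),
]


def survival_rates(rows):
    """Group rows by (class, sex) and return the share that survived in each group."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for ticket_class, sex, survived in rows:
        key = (ticket_class, sex)
        totals[key] += 1
        if survived:
            survivors[key] += 1
    return {key: survivors[key] / totals[key] for key in totals}


rates = survival_rates(passengers)
for (ticket_class, sex), rate in sorted(rates.items()):
    print(f"class {ticket_class} {sex:<6} {rate:.0%}")
```

Swap in the real passenger list and the printed table becomes the backbone of the story: one survival rate per ticket class and sex.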

Some people think web scraping has to involve complicated code. “It doesn’t,” says Bohannon, though there are programs with sophisticated filter systems that can make the job a lot easier, and these are key to wrangling large reams of data. A neat tip for the modern journalist would be to learn how to use free, open-source software such as IPython.


Can’t be bothered to manually plug tedious lists into Excel? With IPython, you can capture the source code of web pages and feed uncategorized data (which, if copied and pasted, would appear as a mess in Excel) into a program that converts it into tables for analysis.

More importantly, it makes the processing and analysis of large volumes of data possible, revealing important and telling trends that it would simply be impractical for a journalist to find without a bit of computer science on their side.
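To give a flavour of what turning page source into a table can look like, here is a minimal sketch using only Python’s standard library. The HTML snippet is a made-up stand-in for a downloaded page (its numbers are the hit counts quoted earlier), and the class name is hypothetical; it is one possible approach, not the workshop’s own code.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up page source; a real script would download the page first.
page_source = """
<table>
  <tr><th>Year</th><th>Hits</th></tr>
  <tr><td>2011</td><td>260</td></tr>
  <tr><td>2012</td><td>262</td></tr>
  <tr><td>2013</td><td>241</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(page_source)
for row in parser.rows:
    print(row)
```

Each row comes out as a clean list of cell values, ready to be written to a spreadsheet or analysed directly.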

It’s a revolution in how the layperson makes use of information, and some are calling it “data-jitsu”.

And the number of journalists making use of “data-jitsu” is growing. Jonathan Stray, one of the facilitators at Bohannon’s workshop, is a freelance journalist and computer scientist who teaches Computational Journalism at Columbia University. He says that although the data journalism community numbered in the hundreds ten years ago, last year’s National Institute for Computer-Assisted Reporting conference attracted almost 2,000.

Stray thinks data literacy is, and should be, the future of journalism, especially investigative journalism. “Journalists have a long tradition of disdain for numbers. It’s traditionally a literary tradition, right? You went into journalism if you wanted to write a novel. But that’s changing, and it has to change, because the way I think about it, our job isn’t really writing, it’s analysis and communication, and getting information. And that requires data literacy,” he says.

Making sense of data means being able to tackle more complicated things. In fact, Stray thinks doing investigative journalism today would be difficult without data work, because too many questions today are quantitative.

At any rate, interpreting and communicating data is a basic skill that’s valuable to researchers, policymakers and businesses. “Data journalists are just the group of people that want to do that in the public’s interest.”

For more information on data journalism, check out these links: