In the wake of the leaks that revealed the National Security Agency’s (NSA’s) PRISM surveillance program, several recent articles have responded with criticism of “big data.” “The advantages of big data could prove to be ephemeral,” author Andre Mouton writes in USA Today, but “the costs…will probably be sticking around.” And Andrew Leonard at Salon directly blames the technology, writing, “By making it economically feasible to extract meaning from the massive streams of data that increasingly define our online existence, [distributed processing platform] Hadoop effectively enabled the surveillance state.”
Pictured: Michael Flowers, civic data icon and Analytics Director of the City of New York’s Office of Policy and Strategic Planning. Photo: DataGotham
But criticizing “big data” itself is a curious thing. In its original form, “big data” was just a catchall term for technologies, borrowed mostly from statistics and computer science, built to tackle data analysis problems that would overload a typical processor. The connotation of “big” as in “big tobacco” was added retroactively. Many practitioners prefer the broader term “data science” for this very reason: they aren’t members of some kind of shadowy syndicate. They aren’t even in the same industry. They just use the same tools.
Unfortunately, blaming tools for societal problems is all too common, whether it’s Erik Brynjolfsson claiming greater productivity is causing unemployment or Nicholas Carr saying that Google is making us stupid. But where does this line of reasoning leave us?
Simple machine learning algorithms, which form the basis of many data mining operations like the ones PRISM enables, can be implemented by a skilled programmer in minutes. If the criticism is that “big data” itself is to blame, what could policymakers possibly do to change that? Criminalize machine learning? Outlaw a few dozen lines of code? No, the tool itself is not to blame. That argument is, at best, counterproductive to efforts to stop the misuse of data. At worst, it will inhibit positive uses of data science that can save and improve lives—not to mention grow the economy.
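To make the “few dozen lines” claim concrete, here is a minimal sketch of one of the oldest machine learning algorithms, the perceptron, written in plain Python with no libraries. The toy data and function names are illustrative, not drawn from any real system; the point is only how little code a basic learning algorithm requires.

```python
def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """Learn weights w and bias b so that sign(w.x + b) predicts the label."""
    n = len(samples[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):  # y is +1 or -1
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:        # misclassified: nudge the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy data: points above the line y = x are labeled +1, points below are -1.
samples = [(0, 1), (1, 2), (2, 3), (1, 0), (2, 1), (3, 2)]
labels  = [1, 1, 1, -1, -1, -1]
w, b = train_perceptron(samples, labels)
print(predict(w, b, (0, 5)), predict(w, b, (5, 0)))  # prints: 1 -1
```

Twenty-odd lines suffice to learn a decision boundary from examples, which is exactly why regulating the code itself, rather than its uses, is impractical.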
While data science, like any other technology, can be misused, it’s important for policymakers not to discourage its use in general. Fears about science and technology can have serious consequences: consider the children who have died because their parents incorrectly feared a link between vaccines and autism and did not get them vaccinated. When commentators vilify data science itself rather than those who abuse it, they unfairly stigmatize a family of technologies that lend themselves to beneficial (and often economically stimulating) applications in a host of fields, from health care to disaster response. In many cases, efforts to delay these programs on principle will have negative consequences in the real world.
To take one example, the National Oceanic and Atmospheric Administration (NOAA) has been using increasingly advanced sensing technology to model the growth of tornadoes, as well as behavioral insights to inform the language used in tornado warnings. This means that government agencies can not only warn residents of weather-related threats sooner, but also issue more effective warnings. Just this year, residents of Moore, Oklahoma received 36 minutes’ notice that a tornado was heading their way during a major storm, up from an average lead time of 14 minutes. The difference undoubtedly saved lives, and can be attributed, in part, to better collection and analysis of data.
And the use of data science is not limited to the federal government; state and local governments benefit from it too. The City of New York’s Office of Policy and Strategic Planning (OPSP) began an effort to centralize and analyze data from various city agencies in 2009. The OPSP has doubled the city’s success rate in finding stores selling bootleg cigarettes, accelerated cleanup efforts following Hurricane Sandy, and helped housing inspectors pinpoint buildings at high risk for fires.
Citizens are right to be angry about the lack of transparency in projects like PRISM. Many data scientists are angry too. But we should direct our anger at such misuses of data, not stigmatize tools that are being used for so much good. If policymakers feel obligated to act, one good step would be to become more “data literate” so that they can better analyze data-related policies. Policymakers can achieve this, at least in part, through open dialogue with technologists and engagement in community-oriented open data initiatives.
We should continue to raise concerns about the misuse of data, but it’s important not to throw the baby out with the bathwater. A campaign against “big data” as a concept will not only fail, it will waste effort that could have been spent preventing unsavory uses and using data to build a better world for us all.