There have been a number of attempts to explain exactly what “big data” is and why anyone should care. Last year’s The Human Face of Big Data by Rick Smolan and Jennifer Erwitt focused on telling the personal stories behind big data (and accompanied these stories with some great photographs). The year before, James Gleick wrote The Information: A History, A Theory, A Flood, which chronicled how information (and not just big data) has changed our world. The latest entrant is Big Data: A Revolution That Will Transform How We Live, Work and Think by Viktor Mayer-Schönberger and Kenneth Cukier, which focuses heavily on explaining some of the more interesting impacts of living in a big data world. (Personally, I’m still not a fan of the term big data because 1) it scares off people who think it is equivalent to “Big Oil” and 2) it underrepresents the innovation happening around “small” data. But since this is the term used in the book, I’ll stick with it for this review.)
The first part of this book provides a fairly compelling vision of how big data is changing how we use data. Unlike some technology proponents who simply ignore the past, Mayer-Schönberger and Cukier make a point of highlighting that the use of data itself is not new, but that information technology (IT) has made it possible to collect and analyze data on a scale not seen in the past. The authors explore three main changes they see arising from big data. First, we will have significantly more data available than in the past. This means that we will be able to approach N = all for some datasets rather than relying on samples of the population. Second, as we increasingly quantify the world, we will have more measurement error in our data, but that is okay because with much larger datasets the messiness of data becomes less important. Third, we will focus much less on understanding causation (“why”) and more on understanding correlation (“what”). (For a detailed look at this last point, see Chris Anderson’s essay “The End of Theory.”)
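To make the second point concrete, here is a minimal sketch in Python of why random measurement error tends to matter less as a dataset grows. It is my own illustration, not something from the book: the function names, the true value of 50 and the noise level of 10 are all assumptions, and the argument only covers independent, zero-mean noise. Systematic bias does not average away no matter how much data is collected.

```python
import math
import random


def standard_error(noise_sd: float, n: int) -> float:
    """Standard error of the mean of n readings with independent, zero-mean noise."""
    return noise_sd / math.sqrt(n)


def noisy_mean(true_value: float, noise_sd: float, n: int, seed: int = 0) -> float:
    """Average n simulated readings of true_value, each with Gaussian noise added."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    total = 0.0
    for _ in range(n):
        total += rng.gauss(true_value, noise_sd)
    return total / n


if __name__ == "__main__":
    # Hypothetical numbers: a true value of 50 measured with a noise level of 10.
    for n in (100, 10_000, 1_000_000):
        estimate = noisy_mean(true_value=50.0, noise_sd=10.0, n=n)
        print(n, round(estimate, 3), standard_error(10.0, n))
```

Under these assumptions the expected error of the average shrinks with the square root of the number of readings, so a hundred times more data buys roughly ten times less noise in the estimate.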
While these chapters are interesting, Mayer-Schönberger and Cukier are at their best later in the book when they describe the economic consequences of big data, both in terms of how data is creating economic value and how data is disrupting many industries. Unlike other economic resources, the value from data is not exhausted after its initial use. Instead, data can be reused an unlimited number of times, either directly or by combining it with additional information. In addition, “data exhaust” that would have been discarded in the past can now be put to practical use, such as Google using typos entered by users in its search engine to create a better spell check program.
This is a crucial point. It is not always possible to know how data will be used when it is collected, and even if some uses are identified, the value of big data comes from its reuse. Policymakers stuck in the old way of thinking want to impose data minimization requirements, which would effectively create a “use once” policy for data. Instead, to take advantage of data-driven economic value, we need policies that allow and encourage responsible reuse of data.
Mayer-Schönberger and Cukier offer one of the best metaphors for the new type of thinking that we need around data. Using a normal camera, a photographer must decide where to focus the lens at the moment the photo is taken. In contrast, plenoptic cameras, like the new Lytro camera, capture light field information and allow photographers to change the focus of a picture after it has been taken. Like photographers, most data users have been stuck having to decide how to use data at the outset. But with steadily falling costs for collection, storage and processing, users are now free to explore possible uses after collecting the data.
The authors also discuss the new value chain created by companies involved in big data. They identify three primary types of participants: those providing the data, those providing the skills, such as the technology and the analytics, and those providing the business opportunities. One of their more interesting insights is that new business models are being created to take advantage of data opportunities that do not fit into existing organizations. For example, health insurers formed the non-profit Health Care Cost Institute to combine data sets for research that they could not perform individually. Similarly, UPS spun off its internal data analytics unit because the unit could provide substantially more value if it had access to data from UPS’s competitors, and that would never happen if it remained part of the parent company. The authors argue that most of the value will eventually be in the data part of the value chain, even though it isn’t there now. Unfortunately, such an assertion is impossible to prove or disprove. We are still in the early stages of assigning value to data, both at the macroeconomic level and the firm level. Government statistics agencies need to include more than just goods and services if they want to accurately measure the data economy (Mike Mandel has written a thoughtful piece on this exact point).
While the authors also carve out a chapter to explore the “dark side” of big data, including privacy and misuse, they mostly avoid the overwrought handwringing that typically characterizes writing on this subject. And they recognize that much of the big data revolution does not involve personal data. With regard to personal data, my primary criticism is that they unfairly dismiss de-identification techniques, mostly relying on the critiques leveled by Paul Ohm, while ignoring the shortcomings of his work described by individuals such as Jane Yakowitz and the continued advancement of differential privacy research. They also get wrapped up in a surprisingly lengthy discussion of the risk of criminal profiling similar to what was seen in the movie Minority Report, where individuals were arrested for crimes before the crimes were actually committed. While perhaps an interesting thought experiment, the authors provide little evidence that this is anything but a far-fetched science-fiction nightmare.
Overall, the book is an enjoyable read, if for no other reason than the great nuggets of big data trivia that show just how much data has changed. For example, Mayer-Schönberger and Cukier report that the Sloan Digital Sky Survey generated 140 terabytes of information in about 10 years; its successor, the Large Synoptic Survey Telescope in Chile, will generate as much every 5 days. In addition, the way they handle the risks section of their book bodes well for the future of data: it seems the more people come to understand it, the fewer concerns they have.
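For a sense of scale, the telescope comparison above can be worked out in a few lines of Python. This is only back-of-the-envelope arithmetic using the figures quoted in the review. I treat “about 10 years” as exactly 10 years of 365.25 days, which is an assumption, and the function and constant names are my own.

```python
DAYS_PER_YEAR = 365.25  # assumption: average year length, used to turn "about 10 years" into days


def terabytes_per_day(total_tb: float, days: float) -> float:
    """Average data rate in terabytes per day over the given period."""
    return total_tb / days


# Figures as quoted in the review: 140 TB in about 10 years versus 140 TB every 5 days.
sloan_rate = terabytes_per_day(140.0, 10 * DAYS_PER_YEAR)
lsst_rate = terabytes_per_day(140.0, 5.0)
speedup = lsst_rate / sloan_rate

if __name__ == "__main__":
    print(f"Sloan: {sloan_rate:.3f} TB/day")
    print(f"LSST:  {lsst_rate:.1f} TB/day")
    print(f"Ratio: roughly {speedup:.0f}x")
```

On these assumptions the newer telescope collects data at roughly 730 times the rate of its predecessor.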
Photo credit: Chatham House