The Center for Data Innovation spoke with Chris Mentzel, program director for the Data-Driven Discovery Initiative at the Gordon and Betty Moore Foundation in Palo Alto, California. Mentzel discussed how data-driven discovery is changing research and the challenges in emphasizing data science in academia.
Joshua New: Why is your foundation focused on “data-driven discovery”? Doesn’t all research rely on data and information?
Chris Mentzel: Absolutely, all scientists use and analyze data in their research. Some researchers are theorists who use very little data, some are experimentalists who generate data, some are computational scientists who simulate data, and some are data-driven researchers who principally collate and aggregate data. The term data-driven discovery is meant to emphasize the kind of research that begins and ends with collating, aggregating, analyzing, and publishing scientific data in pursuit of new discoveries. A good primer on the distinction between data-driven research and other modalities is the book The Fourth Paradigm, by Tony Hey et al., from his time at Microsoft. Another way to think about data-driven discovery is as the application of data science to domains of inquiry like the life, physical, earth, or social sciences. But then you’d have to ask me what data science is.
New: The Data-Driven Discovery Initiative gave its first award in April 2013. How has the field changed since then as data technologies like AI and the Internet of Things have matured?
Mentzel: Quite a lot, but first let me correct the notion that data-driven discovery is just one field. A better way to think about it is as a paradigm shift in all of research; the amount of new knowledge we can generate has exploded due to the increases in the volume, variety, and velocity of modern science data mixed with the incredible power of modern scientific computing capabilities.
So, all fields of science have increased their use of data technologies, from databases to machine learning, or at the very least have begun to put these terms in their proposals! One important change we have been tracking is that university departments have begun to realize that the skills they need to look for in new professors include data science. In 2012, when we began, researchers with a mixed skill set of data science plus domain expertise were often passed over for people with a more traditional background. We now see explicit calls for ecologists, life scientists, and physicists with both the traditional background and “data science” skills.
That said, there are still some problems that stubbornly persist, like the way research software engineers don’t have stable, long-term career paths in academia, or the challenges of sustaining open source research software. So, while technology has advanced quite a bit, and there have even been some important cultural shifts in academia, we are a long way from “solving” all data challenges for all researchers.
New: The Moore Foundation evaluated the impact of its awards and found that investing in software tools has proven to be very effective. Why do you think this is? Do the tools researchers need simply not exist?
Mentzel: The simple answer is that one well-thought-out, well-crafted software tool can impact the research of hundreds, if not thousands, of researchers. In terms of where to place limited resources, software tools can be an efficient bet. Often researchers are leading the way in defining new use cases for data software, so in a way, they are in the business of creating the next generation of tools that don’t exist today. A great example is Jupyter Notebooks, which came out of research efforts, driven by research needs, and now provide data science infrastructure for commercial entities and academia alike.
There is a misconception that new data science tools always come from industry and then flow to science. The ecosystem of data tools is much more complex. New tools come from both academia and industry, and sometimes partnerships between the two sectors.
New: What has been the biggest challenge you’ve faced in encouraging more data-driven research in academic environments? Are there just cultural challenges, or more tangible obstacles?
Mentzel: The biggest challenge we faced in advancing data-driven research in academia was the lack of recognition of the value data-driven researchers bring to the pursuit of new discoveries in the natural sciences. A data-driven researcher’s profile typically features more data science experience than traditional disciplinary training, which can make such researchers look less serious about the core discipline to the traditional community.
We had some of our earlier researchers report to us that they had to hide their data science skills to get past the first hiring hurdles, and then bring them out later as a sort of “special sauce.” These days researchers can lead with data science, as academia, at least for new faculty, seems to be looking more explicitly for these skills.
New: Could you describe some of the data-driven research that was made possible by the Moore Foundation? What are some interesting case studies?
Mentzel: All the post-docs and fellows at the Moore-Sloan Data Science Environments at New York University, University of California, Berkeley, and University of Washington in Seattle are driving new discoveries using data science. They are all worth checking out.
Also, any of the research by our Moore Investigators in Data-Driven Discovery can be attributed to a focus on applying data science in the pursuit of new discoveries in the natural sciences. For example, new computational imaging techniques from Dr. Laura Waller’s lab at UC Berkeley bring computation, data analysis, and hardware instrumentation together in novel ways to produce large-scale, high-resolution images in real time. Dr. Ethan White’s lab at the University of Florida is pioneering data science techniques for ecological forecasting, harnessing large-scale data aggregation and new forecasting models in open-source tool chains to drive new insights into how ecosystems change and evolve. And Dr. Kim Reynolds, at UT Southwestern, is using data-driven techniques to predict gene expression, and then using those predictions to design new experiments. This back and forth, where new data analysis techniques point to possible new hypotheses that can then be tested experimentally, is at the heart of what data-driven discovery enables.