The Center for Data Innovation spoke with Brett Hurt and Matt Laessig, co-founders of the data science collaboration platform data.world. Hurt and Laessig discussed the need for a platform for open data that can accomplish what GitHub did for the open source movement, as well as the value of corporate data philanthropy.
This interview has been edited.
Joshua New: data.world is described as a social network for data scientists that emphasizes sharing insights and collaboration. Why is this kind of platform so important in the field of data science?
Brett Hurt: It is true that data.world is a social network for anybody interested in collaborating on and sharing data, but it goes further than just data scientists. Data scientists are obviously a very important audience for us, but there is a large group of other people that data.world caters to as well, such as business analysts.
Unlike with code, looking at historical data from the 1980s to see how it compares to recent data can actually have a lot of value. Data is one of those raw materials that unfortunately hasn’t been networked. If you look at data about climate change or poverty, for example, that data has never really been linked together before. Because of this, even though this data could help people solve so many important problems, network effects have never really been brought to bear. This is the problem we’re trying to solve: we want to democratize what’s called the “Semantic Web” and technology that’s previously only been available to the wealthy few, which will dramatically improve people’s ability to access and work with data.
If you work with data, you probably spend about 80 percent of your time on what’s known as data janitorial work—the work involved in finding useful data and cleaning it up to make it suitable for analysis. We want to reduce the amount of data janitorial work necessary to get to the analysis part, which is what’s important. This will be useful for people solving important problems like climate change, but also for people working on more trivial and fun things like sports analysis. data.world can make both of those kinds of problems easier—our mission is to build the most meaningful, collaborative, and abundant data resource in the world.
Matt Laessig: To build on Brett’s point about audience, the data scientist segment is key, but our platform is definitely intended for a much broader audience. If you look at the world at large and see how many fields of study are becoming data-driven, there are a lot of people that want to become data-driven, but don’t have the technical resources or skills to become data scientists. We want to help these people access and work with the data they need and collaborate with each other to help make more informed decisions.
New: What makes data.world different from other online collaboration platforms focused on computer science, like GitHub?
Hurt: We are entirely focused on solving problems for data professionals. That’s different from the issues other platforms are focused on, such as GitHub with open source code. GitHub built a great community and an incredible user interface around Git, which was a very esoteric language that only a few programmers used to know how to use, and I would argue GitHub moved the bell curve of programming skill worldwide to the right. It essentially gives every programmer access to the “Library of Alexandria” of code and the toolset to use this code. There was just no equivalent for this in the world of data until we came along.
There are already 18 million open datasets in the world, buried in all kinds of places. These datasets cost many billions of dollars to develop and publish, but unless people can actually find this data, it’s not particularly useful. People say that data is the new oil. If that’s true, then what we have now is very crude oil in need of refinement, and oftentimes we don’t even know where the oil is buried in the ground. The people who are most interested in using this data to solve problems need a way to find and use this data collaboratively.
New: In October 2016, the U.S. Department of Commerce partnered with data.world to improve how federal agencies share data with the public. How has this partnership improved upon standard open data practices in the government?
Laessig: We describe what we’re currently in as “open data 1.0.” President Obama’s executive order on his first day in office to require agencies to publish open data kicked off Data.gov, which is a great portal, but it has limitations. A lot of the data on the portal uses different formats, much of it is not machine-readable, and many datasets are missing useful context. A lot of the janitorial work in data can simply be finding out basic information about a dataset. The Department of Commerce and other agencies are interested in evolving into “open data 2.0,” in which data is available to a much broader audience, is in a usable format, and is on a live platform that allows for collaboration. A dataset on Data.gov is simply a “flat” file that you need to download to your machine so you can work on it by yourself. That same file can live on data.world and allow people to see how the data has been used by other users, add notes, and collaborate, which can eliminate a lot of the janitorial work agencies and open data users might need to do.
Hurt: The benefit of having all the data live on data.world is that it can all be linked together to let you see how different data is used. Census data, for example, which is immensely valuable for a lot of businesses, can be linked with other useful datasets to increase this value. In the long term, we want data.world to be able to automatically discover these linkages.
New: Can you talk about your recent partnership with the Anti-Defamation League (ADL) to help leverage data to combat hate crimes?
Hurt: I had a conversation a few days ago with the CEO of ADL, and he told me that they have all this data on hate crimes, which have been increasing over the past year and particularly since the election, but they don’t have a way to get this data into the hands of people that can use it effectively. By using data.world as a platform for this hate crime data, we can bring a lot more attention to it to help citizens and governments fight back against this increase, and even help drive policy. This kind of application is the exact reason we created data.world as a public benefit corporation.
This data is already available on data.world, and this is just the beginning of our work with them. This is indicative of how we plan to work with nonprofits in the future and solve some incredibly important problems.
New: Both of you have previously discussed data philanthropy—corporations donating data, expertise, and technologies to solve important social challenges. Do you think enough companies are aware of this practice? How do you incentivize more data philanthropy?
Hurt: I believe that what happened with corporate code in response to the open source movement is going to happen with corporate data. I founded a business called Bazaarvoice and we had initially used GitHub, but just internally. After about eight or nine months, however, we became one of the leading contributors on GitHub to open source projects because we found that we got a lot of value from the GitHub community. I think data.world will have the same effect. Companies will likely start by using data.world just internally for their own data so that their different divisions can collaborate more effectively. Eventually, as data.world brings together more open data and builds a bigger community, many of these companies will likely realize that a lot of the data they’re sitting on isn’t necessarily proprietary, that keeping it private doesn’t give them a competitive advantage, and that it could be incredibly useful in helping solve real problems in the world. I think we’re just at the beginning of this movement.
Laessig: On data.world, we already have examples of corporate data philanthropy and data donations. Some of these are really inspiring. For example, an agricultural company called Syngenta donated a dataset of agricultural efficiency indicators collected from thousands of farms in 41 different countries. Syngenta realized that this data could be leveraged by all sorts of farming and agricultural interests to improve their own production. This data can broadly help the entire agricultural ecosystem and help grow more food, which means fewer people going hungry.
Data philanthropy is a movement that could be greatly accelerated by the right platform, just like open source was accelerated by GitHub. We believe that by showing the value of opening and linking data, data.world will convince businesses to participate in a more meaningful way.