Although the federal data repository Data.gov contains over 155,000 data sets, open data users have no easy way to share the code they use to transform the data, such as removing a decimal place or converting words from uppercase to lowercase, with others who work with the same data sets. For popular data sets, this results in a lot of duplicated effort. “dat,” a new open source software project released in alpha earlier this month, seeks to remedy problems like this and ultimately make open data sets more useful. The software is a collection of tools for storing, transforming, and, most importantly, syncing data, which will allow users to collaborate seamlessly and rapidly on the same data set.
Open data software development is slower than it should be for at least three reasons. First, developers hoping to make changes to multiple copies of the same data set must run their code on each copy individually, introducing the potential for human error. Second, when multiple collaborators are involved, undoing erroneous changes can be a complicated process. Third, there is no standard way for developers to determine whether a data set is accurate and from a trusted source.
The nonprofit U.S. Open Data Institute created dat to overcome these problems and allow developers to write code that collaborators can then use with their own copies of the data. This lets developers build on one another’s work, reusing code instead of writing it anew to solve problems others have already addressed. The software will also make it possible to revert data sets to earlier versions, transparently create parallel versions of the same data set for different projects, and help assign credit (and blame) to collaborators on a project.
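To make the ideas of reverting and assigning credit concrete, here is a minimal sketch in Python. It is not dat's actual interface; the names `VersionedDataSet`, `Version`, `commit`, `apply`, and `revert` are hypothetical, and the sketch assumes a data set small enough to store a full snapshot for every change.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, str]
Transform = Callable[[Row], Row]


@dataclass
class Version:
    author: str       # who made the change (credit or blame)
    description: str  # short note on what the change did
    rows: List[Row]   # full snapshot of the data after the change


@dataclass
class VersionedDataSet:
    """Hypothetical illustration only; not how dat stores data."""
    history: List[Version] = field(default_factory=list)

    def commit(self, author: str, description: str, rows: List[Row]) -> None:
        # Copy each row so later edits cannot alter stored history.
        snapshot = [dict(r) for r in rows]
        self.history.append(Version(author, description, snapshot))

    def current(self) -> List[Row]:
        if not self.history:
            raise ValueError("data set has no versions yet")
        return [dict(r) for r in self.history[-1].rows]

    def apply(self, author: str, description: str, transform: Transform) -> None:
        # Run a shareable transform over every row and record the result.
        self.commit(author, description, [transform(r) for r in self.current()])

    def revert(self, author: str) -> None:
        # Undo the latest change by recording the previous snapshot again,
        # so the revert itself stays visible in the history.
        if len(self.history) < 2:
            raise ValueError("nothing to revert")
        undone = self.history[-1]
        self.commit(author, f"revert: {undone.description}", self.history[-2].rows)


def lowercase_city(row: Row) -> Row:
    # Example transform: convert one field from uppercase to lowercase.
    row["city"] = row["city"].lower()
    return row
```

Because each version records an author and a description, the history doubles as a log of who changed what, and a revert appears as one more entry instead of erasing the mistake.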
Max Ogden (pictured above), a veteran developer who led the dat project, says he hopes one application for the software will be streamlining the process by which open data users work with frequently updated data sets. For example, a city might update its municipal crime data set with new crimes each night. Using dat, the city could synchronize its data set so that anyone who has downloaded an old version could automatically update to the new one while preserving their modifications to the data. In addition, users could subscribe to alterations to the data set made by other users and automatically receive updates with those modifications. By linking information about modifications—additions, deletions, mathematical operations, format conversions, etc.—to the data set itself, data providers and users can share not just original data, but also changes they make to that data.
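One way to picture the nightly update scenario is to keep a user's modifications as a list of small functions, separate from the provider's rows, and replay them whenever fresh data arrives. The Python sketch below works under that assumption. It is not dat's API; `LocalCopy`, `pull`, `add_modification`, and `view` are hypothetical names, and rows are assumed to carry a unique `id` field.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Modification = Callable[[Row], Row]


class LocalCopy:
    """Hypothetical local copy: upstream rows plus replayable modifications."""

    def __init__(self, upstream_rows: List[Row]) -> None:
        # Upstream rows are stored untouched, keyed by their "id" field.
        self.upstream: Dict[str, Row] = {r["id"]: dict(r) for r in upstream_rows}
        self.modifications: List[Modification] = []

    def add_modification(self, mod: Modification) -> None:
        # A modification can be the user's own or one shared by someone else.
        self.modifications.append(mod)

    def pull(self, new_upstream_rows: List[Row]) -> int:
        # Merge the provider's latest rows; return how many were new or changed.
        changed = 0
        for r in new_upstream_rows:
            if self.upstream.get(r["id"]) != r:
                self.upstream[r["id"]] = dict(r)
                changed += 1
        return changed

    def view(self) -> List[Row]:
        # Replay every modification, in order, over the current upstream rows.
        result = []
        for r in self.upstream.values():
            row = dict(r)
            for mod in self.modifications:
                row = mod(row)
            result.append(row)
        return result


def lowercase_offense(row: Row) -> Row:
    row["offense"] = row["offense"].lower()
    return row


def drop_decimal(row: Row) -> Row:
    # Keep only the digits before the decimal point.
    row["count"] = row["count"].split(".")[0]
    return row
```

In this sketch, a user who has registered `lowercase_offense` calls `pull` with the city's latest rows and sees the new records already lowercased in `view`. Subscribing to another user's changes amounts to adding that user's function, such as `drop_decimal`, to the same list.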
The creators of dat have high hopes for future applications built on the foundation their software provides, including a distributed computing framework to share extremely large data sets and a data project repository inspired by the software collaboration platform GitHub. GitHub, which is built on a synchronization protocol similar to dat’s, has been a major force driving innovation in software development and has become a central resource for development teams working together, individual programmers looking for code to reuse, and companies evaluating potential hires on the basis of programming skill. Eventually, dat’s creators hope to replicate, or at least draw inspiration from, GitHub’s model to spur innovative data applications and strengthen the data science community.
Although the software is only in its very early stages, it may eventually address several of the major challenges open data users face when working on the same data sets. Websites like Data.gov have helped make open data discoverable, and open formats have made it cheaper to work with; software like dat can help make it collaborative.