10 Bits: The Data News Hot List
This week’s list of data news highlights covers October 5-11 and includes articles on GE’s ‘industrial internet’ offerings and a new method for remotely detecting human rights violations.
General Electric announced this week that it would expand its push into the industrial internet, with 14 more products that will be equipped with internet-linked sensors and performance management software. The products span the fields of aviation, energy, transportation and healthcare; one offering is a power generation product designed to optimize turbine usage. GE has already deployed 10 industrial devices equipped with sensors and performance tracking tools; the company hopes these enhancements will help it better predict product failures and design future products more cheaply.
Craig Mundie, Microsoft’s senior advisor to the CEO, argued this week that it is impossible to control the collection and retention of data in our current environment. Instead, he suggested that additional metadata and cryptographic wrappers could be added to personal data to limit its uses. Combining technological efforts with strong legal penalties could deter misuse.
The U.S. Holocaust Memorial Museum (USHMM) and the U.S. Department of State have proposed a new approach to detecting mass human rights violations with satellite imagery. Using publicly available NASA imagery data, researchers tested the hypothesis that smoke from burning villages would change the surrounding area’s reflectivity in a manner that could be automatically detected by satellite. The methodology was applied to a database of destroyed villages around Darfur, Sudan, and offered the most accurate picture to date of when these villages were destroyed.
Advances in genomics, biomedical imaging and other technologies have signaled a shift in the biological data landscape, and biologists are sprinting to keep up with the large amounts of newly available data. Big Science initiatives like the Human Genome Project and the Human Microbiome Project (an effort to map the influence of bacteria on human growth, development and disease) have forced biologists to use new hardware and learn new skills for storing, wrangling and analyzing large datasets. Yet funding for these technological advances is often overlooked. Although this shift creates costs for the biomedical community in the short term, it also presents considerable opportunity: understanding the microbial environment in human bodies could lead to new insights into treating obesity, allergies, Crohn’s disease and other disorders.
Earlier this year, the data scientists at OpenSignal, an open mapping project that documents global cell phone signal coverage, launched an initiative called WeatherSignal to collect weather data from smartphones. The creators hope that smartphones will be a major source of weather forecasting data in the future, and their initiative has the potential to decrease forecasting costs through the use of pre-existing sensors. The initiative repurposes sensors built into Android devices that collect data on humidity, barometric pressure, temperature and light intensity. The project, while still in the proof-of-concept stage, has already spurred the creation of a number of new applied data science techniques, such as converting battery temperature readings to ambient temperature.
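One way to picture the battery-to-ambient conversion is as a calibration fit: pair averaged battery readings with reference air temperatures, fit a line, and use it to estimate new values. The sketch below is only an illustration of that idea, not WeatherSignal’s actual method; the function names `fit_line` and `estimate_ambient` are hypothetical and the calibration numbers are made up.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_ambient(battery_c, slope, intercept):
    """Apply the fitted line to an averaged battery temperature (Celsius)."""
    return slope * battery_c + intercept


# Made-up calibration pairs: battery temperatures averaged over many phones,
# alongside reference air temperatures for the same place and time.
battery_avg_c = [30.0, 32.0, 34.0, 36.0]
ambient_ref_c = [10.0, 14.0, 18.0, 22.0]

slope, intercept = fit_line(battery_avg_c, ambient_ref_c)
estimate = estimate_ambient(33.0, slope, intercept)
print(slope, intercept, estimate)
```

Averaging across many devices before fitting matters in a scheme like this, since any single phone’s battery temperature depends heavily on how the phone is being used.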
Nonprofit organizations could increase their effectiveness considerably by making better use of data science, but that will require a meaningful, sustained effort. Rayid Ghani, former chief scientist at Obama for America, is helping lead the charge with the University of Chicago’s Data Science for Social Good fellowship and a startup called Edgeflip that helps nonprofits conduct better donor targeting. One way nonprofits can get started, Ghani says, is by sharing data among organizations working on similar issues. A large foundation or a consortium of large nonprofits could help accelerate this process by pooling resources on a common data-sharing platform.
With all the technology that has emerged in recent years to help analyze big data, it is easy to overlook the fundamental work mathematicians are doing to simplify large datasets before they reach the processor. Stanford researchers are developing two techniques in particular that help simplify big data so that it can be analyzed with less computing power. One takes a geometric approach, attempting to represent large datasets using networks and then reduce them to smaller and more tractable networks that preserve most of the same geometric properties. Another takes a similar approach but treats the data as a signal, attempting to compress the data like a digital audio file so that most of the information can still be recovered while taking up considerably less space.
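The signal-style approach can be illustrated with a toy example: transform the data, keep only the few largest coefficients, and reconstruct from those. The sketch below uses a plain Fourier transform as a stand-in; it is not the Stanford researchers’ technique, the function names are hypothetical, and the input is a synthetic signal chosen so that a handful of coefficients capture it.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def inverse_dft(coeffs):
    """Naive O(n^2) inverse discrete Fourier transform."""
    n = len(coeffs)
    return [
        sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(coeffs)) / n
        for t in range(n)
    ]


def compress(signal, keep):
    """Zero out all but the `keep` largest-magnitude Fourier coefficients."""
    coeffs = dft(signal)
    order = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    kept = set(order[:keep])
    return [c if k in kept else 0 for k, c in enumerate(coeffs)]


def reconstruct(coeffs):
    """Rebuild a real-valued signal from (possibly truncated) coefficients."""
    return [value.real for value in inverse_dft(coeffs)]


# Synthetic signal: two sine waves sampled at 64 points.
n = 64
signal = [
    math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.sin(2 * math.pi * 7 * t / n)
    for t in range(n)
]

compressed = compress(signal, keep=4)
restored = reconstruct(compressed)

nonzero = sum(1 for c in compressed if c != 0)
max_error = max(abs(a - b) for a, b in zip(signal, restored))
print(nonzero, "of", n, "coefficients kept; max error", max_error)
```

Here four stored coefficients are enough because the signal was built from two pure tones. Real data is rarely this tidy, so in practice the number of coefficients kept is a trade-off between space and how much of the original can be recovered.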
A study published in the open access journal PeerJ last week showed that papers with publicly available data receive about 9 percent more citations than papers that do not make their data available. The study’s authors looked at more than 10,000 genomics papers, and controlled for such factors as publication date, journal impact factor and open access status. The researchers also found that around 20 percent of datasets released between 2003 and 2007 had been reused at least once by other researchers.
Vermont’s director of web services announced this week that his state would begin pursuing an open data initiative, beginning with the release of 10 data sets on a web-based portal. The announcement came during the Vermont Open Data Summit, and marks a welcome change for the state; Vermont earned a D+ for public access to information in a 2012 nationwide survey of state transparency. Another presenter at the summit noted that the State of Massachusetts estimated that it saved $2 million in 2012 by allowing the public to access information without needing to submit a FOIA request; although Vermont is only one-tenth as populous as Massachusetts, advocates expect open data efforts to drive savings in Vermont as well.
Designers at the Polytechnic University of Milan have created a user-friendly web tool for creating attractive, scalable data visualizations. The utility democratizes visualization, allowing users to import data from popular platforms such as Microsoft Excel and create charts and cluster analyses with minimal setup. The project, which is still an alpha release, is open-source, and the creators invite modifications and extensions.