10 Bits: The Data News Hotlist
This week’s list of data news highlights covers August 17–23 and includes articles on postsecondary education ratings and data analytics for soccer player recruitment.
In response to a New York Times op-ed that downplayed the economic impacts of “big data” technologies, several commentators offered reasons why it is too early in the technologies’ life cycle to speak in broad economic terms. One reason is that data per se, not just “big data,” has already had significant impact in a range of industries, from utility companies to consumer services like Amazon. Another important point is that large-scale data analysis has made much of its initial impact in areas outside online business; scientific data innovation, including climate modeling and high-throughput biological and genomic analyses, has improved markedly in recent years.
On Thursday, President Obama announced a plan to create a new college rating system and eventually tie student aid to the schools’ performance. The ratings will be based on metrics such as graduation rates, workforce outcomes and the amount of debt accrued by the average student. Students at high-value schools would then be eligible for larger federal grants and more affordable student loans.
The elaborate statistical significance tests that help scientists identify meaningful evidence for their hypotheses often produce misleading results. Numerous research results in fields from neuroscience to pharmacology have been shown to disappear when applied to different data; epidemiologist John P.A. Ioannidis noted that “false findings may be the majority or even the vast majority of published research claims.” This may stem from an incorrect interpretation of the statistical concept of p-values, which measure how likely a given result would be if chance alone were at work. When researchers blindly rely on p-values to determine the validity of their experiments, both false positive and false negative results can abound.
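The false-positive problem is easy to see in simulation. The sketch below (a stdlib-only illustration; the coin-flip setup and the 0.05 threshold are conventional examples, not drawn from the article) runs thousands of experiments on perfectly fair coins, yet about one in twenty still looks “statistically significant”:

```python
import math
import random

def coin_flip_p_value(n_flips, n_heads):
    """Two-sided p-value for 'is this coin biased?' using a normal
    approximation to the binomial (reasonable for large n_flips)."""
    mean = n_flips * 0.5
    sd = math.sqrt(n_flips * 0.25)
    z = abs(n_heads - mean) / sd
    return math.erfc(z / math.sqrt(2))  # P(|Z| >= z) under the null

random.seed(42)
n_experiments = 2000
false_positives = sum(
    1
    for _ in range(n_experiments)
    if coin_flip_p_value(500, sum(random.random() < 0.5 for _ in range(500))) < 0.05
)
# Every coin is fair, yet roughly 5% of experiments cross the p < 0.05 bar.
print(false_positives / n_experiments)
```

Run enough experiments, or test enough hypotheses on one dataset, and some results will clear the significance bar by chance alone — which is why findings that rest on a single p-value so often vanish on replication.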
Individual behavior data could be used to redesign municipal systems from healthcare to transport. Biometric data is beginning to be applied in healthcare, from portable sleep sensors to heart rate monitors, but there is a great deal of potential elsewhere. GPS data could be used to model travel patterns and optimize transit networks, and real-time biomedical data could be used to predict and explain disease outbreaks. These data sources, however, are not widely interoperable and are underutilized.
The White House has released a supplemental guide to its Open Data Policy executive order, providing clarification and greater detail on some of the requirements put forth in that document. These requirements include creating a data inventory and public data listing, and creating a process to engage with the public to help prioritize future data releases. The new guide, posted on the open source code repository GitHub, focuses on steps agencies must take in order to comply with initial requirements due November 1, 2013.
Open access publisher BioMed Central, a subsidiary of Springer, announced that it would waive all copyright on datasets it publishes. The “no rights reserved” license adopted by the publisher will enable data miners to reproduce research results and adapt datasets for new projects. The announcement, which followed an overwhelming show of public support for the initiative last year, includes an assurance that data miners will be encouraged in some cases and required in others to acknowledge the original source of their data.
“Series Finder,” a machine learning method developed jointly by MIT computer scientists and the Cambridge, Massachusetts police department, can help law enforcement discover patterns known as crime series in police report data. The algorithm incorporates factors like means of entry, day of the week and property characteristics to develop a model that can then be used to narrow down the range of suspects in a case. Focusing their efforts on burglaries, the team used the algorithm to recover “most” of the crimes in nine known crime series starting with only a handful of data points.
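The general idea behind this kind of pattern matching can be sketched in a few lines. The example below is a hypothetical illustration, not the actual Series Finder model: the attribute names, weights, and sample reports are invented, and the real system learns its weighting from data rather than hard-coding it. It simply scores a new report against a known series by weighted attribute overlap:

```python
# Hypothetical sketch of attribute-based crime-series matching.
# Attributes, weights, and reports are invented for illustration only.
KNOWN_SERIES = [
    {"entry": "rear window", "weekday": "Tue", "property": "apartment"},
    {"entry": "rear window", "weekday": "Wed", "property": "apartment"},
]

# Relative importance of each attribute (would be learned in practice).
WEIGHTS = {"entry": 0.5, "weekday": 0.2, "property": 0.3}

def similarity(report, series):
    """Best weighted attribute overlap between a report and any
    incident already assigned to the series (0.0 to 1.0)."""
    return max(
        sum(w for attr, w in WEIGHTS.items() if report[attr] == known[attr])
        for known in series
    )

new_report = {"entry": "rear window", "weekday": "Tue", "property": "apartment"}
unrelated = {"entry": "front door", "weekday": "Sat", "property": "office"}
print(similarity(new_report, KNOWN_SERIES))  # 1.0 — strong candidate for the series
print(similarity(unrelated, KNOWN_SERIES))   # 0.0 — unlikely to belong
```

Reports scoring above some threshold become candidates for the series, shrinking the pool investigators have to examine — the “narrowing down” the article describes.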
A report from government IT networking group MeriTalk estimates that government agencies could save 14% of their budgets by successfully analyzing “big data.” According to the findings, which are based on a survey of 150 federal IT professionals, nearly 16% of respondents’ annual IT budgets (nearly $13 billion) will be spent on “big data” initiatives in five years. Only 31% of respondents believe their agency’s strategy in this area is sufficiently comprehensive to deliver on the technologies’ potential. The report stresses the importance of implementing and managing good metadata regimes, but notes that the impact will differ across agencies.
The new Federal Trade Commission chief, Edith Ramirez, said this week that her agency will take a more active role in policing companies that collect large quantities of data. Ramirez stressed the FTC’s role as “lifeguard” to ensure that consumer privacy not be engulfed by the wave of data-driven innovation. She mentioned the importance of data de-identification and said that companies need to ensure that they aren’t “accidentally classifying people based on categories…such as race, ethnic background, gender and sexual orientation.”
Soccer’s English Premier League (EPL) is the world’s richest sports league by revenue, and the billions of dollars moving through the organization have attracted several analytics firms. Companies such as Opta and Prozone collect hyper-granular play-by-play data and sell it to teams and media organizations. Since player salaries account for the majority of teams’ budgets, smarter scouting is in high demand, and at least one club has data on all players in 15 leagues around the world.