10 Bits: The Data News Hot List
This week’s list of data news highlights covers December 28-January 3 and includes articles on a group using data science to bolster human rights law cases and a startup that is tackling K-12 student data consolidation.
The Human Rights Data Analysis Group uses data science to investigate mass human rights abuses, and in 2013, its findings helped secure a conviction in a trial against a Guatemalan general who was accused of genocide during his 1982-1983 presidency. The group’s executive director, Patrick Ball, explains that his group focused on determining whether the general’s crimes had been targeted at a particular group of people, which would meet the definition of genocide in Guatemalan law. Comparing the reported victims against a contemporary census, the group found that indigenous Guatemalans were eight times more likely than non-indigenous citizens to be killed under the general.
A major barrier to adopting new technology in schools is the lack of interoperability with legacy student information systems. The San Francisco education technology startup Clever is trying to eliminate this challenge by creating a software platform that provides developers a common interface for accessing student data regardless of the underlying legacy system. It has gained traction quickly: in less than two years, the startup has put its software into 15,000 schools.
Netflix uses over 75,000 different hyper-specific “personalized genres” to describe movies in its vast collection, and an analysis this week revealed some interesting characteristics about the company’s genre dataset. For instance, the analysis found that the subject of marriage appears in the largest number of subgenres, with the subject of royalty a distant second. Among adjectives used to describe the subgenres, “romantic” is the most common, followed closely by “classic” and “dark.”
Erik Zachte is the data scientist and developer behind Wikistats, Wikimedia’s data analysis and reporting hub. Over the past decade, he has worked to quantify the quality of Wikimedia’s wikis, including Wikipedia, by tracking variables such as article count, number of editors and edits per article. One notable finding from his analysis is that the number of active editors on Wikipedia peaked at 90,000 in 2007 and has since shrunk to around 70,000. This finding has helped spur internal efforts to boost editor engagement and increase language and content diversity across the site.
As part of the federal push for open data, the U.S. Department of Veterans Affairs (VA) has rolled out new tools and resources to offer developers access to its data. The VA has also released new data, including information on VA medical facilities and services available to homeless veterans. The VA hopes that making these datasets publicly available will help outside organizations build apps and services for veterans.
Researchers have demonstrated a novel data mining technique for tracking the evolution of research fields, known as Reference Publication Year Spectroscopy (RPYS), by using it to investigate the origins of the term “Darwin’s finches.” Contrary to popular belief, the term was popularized not by Charles Darwin but rather by evolutionary biologist David Lack, who wrote about the birds over 100 years after Darwin famously visited the Galapagos. Although this particular discovery had already been made by historians, the technique could be applied to other so-called “scientific legends.”
A paper published recently on the scientific preprint repository arXiv describes an attempt to mine text data from the Internet to hunt for messages that could be evidence of time travelers. The whimsical study looked at appearances of the terms “Comet ISON” and “Pope Francis” on search engines, social media and popular websites on dates before these topics entered broad public awareness. Perhaps unsurprisingly, the researchers did not find any evidence of time travelers.
The Brazilian legislature passed a resolution last month to create a permanent “hacker lab” in the chamber, offering a freely available space for citizens to work collaboratively with public data. The idea arose from a week-long hackathon held by the legislature in November, which was widely attended by government officials, including the legislature’s president. The resolution comes amid a broad push for open data in the South American nation, which also recently launched the open data visualization tool DataViva.
The U.S. Department of Agriculture (USDA) is moving toward a data-driven approach to combating pests and disease in agricultural products. The USDA’s Animal and Plant Health Inspection Service manages and promotes access to the massive amounts of historical and newly collected data from agricultural inspections. Officials working on the initiative hope to save billions in agricultural revenue otherwise lost to pests, such as the prolific Asian longhorned beetle.
The legal profession has been disrupted by the global financial crisis and waning demand for recent law school graduates, but lawyers can use automation, quantification and data analysis in a variety of ways to make their jobs easier. Text mining and semantic analysis, for example, can accelerate the process of e-discovery. Moreover, a number of automated web scraping tools can help lawyers extract data from websites and visualize it.