10 Bits: The Data News Hot List
This week’s list of data news highlights covers July 13-19 and includes articles on a redesign of Data.gov and initiatives to detect suicidal behavior in veterans and track endangered species.
DARPA’s Durkheim Project, named for pioneering sociologist Emile Durkheim, proposes to mine Twitter and Facebook data of veterans in an effort to link self-reported suicidal thoughts with social network usage patterns in order to better predict such behavior in the future. The project, which will use machine learning and natural language processing to detect telltale language, is only a pilot, and so far is being conducted on volunteers.
The Securities and Exchange Commission (SEC) launched two data-driven task forces earlier this month. The Financial Reporting and Audit Task Force will analyze data from different units across the SEC to detect financial reporting fraud. The Center for Risk and Quantitative Analytics will profile high-risk behaviors and transactions to detect corporate misconduct. The task forces are looking into novel metrics and advanced modeling, as well as applying pre-existing tools such as the Accounting Quality Model financial analytics suite in an attempt to efficiently detect previously undiscovered misconduct.
Cambridge, Mass.-based startup Ovuline collects extensive data from disparate sources (such as wearable devices) and integrates machine learning into its analytics to help women maximize their chances of getting pregnant. Ovuline CEO Paris Wallace reported that users of the app who got pregnant did so about three times faster than the national average. Wallace hopes the company’s data-intensive approach can eventually be abstracted and applied in other health monitoring contexts.
Computer vision, which is widely used in facial recognition, optical character recognition and elsewhere, is notorious for being one of the most difficult problems in applied computer science. One academic, however, thinks ordinary people might be able to help the algorithms along. James Davis, an associate professor of computer science at UC Santa Cruz, recently won a $50,000 grant to explore computer vision systems that integrate both CPUs and “Human Processing Units” into the work of image recognition. In a task to find the same person in two separate photos, for example, an algorithm might propose a set of potential matches and submit them to a human for the final judgment.
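The division of labor described here, where an algorithm narrows the field and a person makes the final call, can be illustrated with a minimal Python sketch. It is not Davis's system: the feature vectors, the distance measure, and the names `propose_candidates`, `match_with_human` and `ask_human` are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def propose_candidates(query: Vector, gallery: List[Vector], k: int = 3) -> List[int]:
    """Machine step: return the indices of the k gallery photos closest to the query."""
    ranked = sorted(range(len(gallery)),
                    key=lambda i: squared_distance(query, gallery[i]))
    return ranked[:k]


def match_with_human(query: Vector,
                     gallery: List[Vector],
                     ask_human: Callable[[List[int]], Optional[int]],
                     k: int = 3) -> Optional[int]:
    """Human step: show the shortlist to a person and accept only a choice from it."""
    candidates = propose_candidates(query, gallery, k)
    choice = ask_human(candidates)  # hypothetical callback standing in for a person
    return choice if choice in candidates else None
```

In this sketch the person only ever sees `k` photos rather than the whole gallery, which is the efficiency argument behind pairing the two kinds of “processor.”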
A group of scientists at the National Cancer Institute has developed a complex algorithm to predict the efficacy of certain cancer treatments. After building up massive databases of cancerous cells’ genetic mutations and reactions to various drugs, the scientists modeled connections between the datasets to produce their so-called “Super Learner” algorithm. The scientists, who published findings this week in the journal Cancer Research, hope their methods may eventually lead to data-driven cancer treatment recommendations.
UC Berkeley is partnering with interactive learning platform provider 2U to offer a new online master’s degree in data science. The first students will enroll in the $60,000 course in January 2014 and will learn in classes with no more than 20 participants.
Online question-and-answer website Quora will hold a machine-learning contest on July 20 to improve the company’s core service. The exact nature of the three contest problems has not yet been released, but in the past the company has held a similar contest focused on separating better answers from worse ones on its question pages.
This week, the Office of Science and Technology Policy (OSTP) announced the launch of the redesigned federal open data portal data.gov, christened next.data.gov. The new site was implemented by a collaboration involving the General Services Administration, OSTP staff and a number of presidential innovation fellows. The old data.gov site will remain operational while the team continues work on next.data.gov.
Most electronic storage degrades after a few years, but a new storage medium could preserve data for millions of years, according to a team of European scientists. The medium, nanostructured glass known as fused quartz, is highly durable, and its three-dimensional structure lends itself to high data storage capacity. The team showed that volumes of data larger than the capacity of the U.S. Library of Congress could theoretically be stored in this medium on a disc about the size of a CD.
ARBIMON, which stands for automated remote biodiversity monitoring network, is a new endangered animal monitoring project from researchers at the University of Puerto Rico. Recordings of various animals’ calls are captured automatically in habitats that are costly for humans to monitor, using inexpensive microphones attached to solar-powered iPods, and are transmitted wirelessly to base stations up to 40 km away. Once the recordings reach the base stations, they are analyzed with machine learning algorithms to detect the species, and added to a database that tracks population levels over time. The team hopes its “biodiversity weather stations” will one day be deployed around the globe.