10 Bits: The Data News Hotlist
This week’s list of data news highlights covers July 4-10, 2015 and includes articles about the coming explosion of genomic data and how Google uses artificial intelligence to fight spam.
A researcher at the Israel Institute of Technology has developed a machine learning system capable of distinguishing between jokes and serious comments in electronic communications. Public officials could use this technology to separate genuine security or terrorism threats, or indicators of cyberbullying, depression, and suicide, from sarcastic or tongue-in-cheek social media posts, text messages, and emails.
Genomics researchers called on industry to tackle the problem of storing the oncoming influx of human genomics data, which they expect to be the single largest source of data in the world. Though the entirety of existing genomic data totals just 250 petabytes, the researchers point out that the genomics data produced per day is growing exponentially, potentially reaching over 40 exabytes of new data per year by 2025. The researchers say that figuring out how to store and make use of all this data will be one of the greatest hurdles for precision medicine initiatives.
The Department of Veterans Affairs (VA) has begun to tap its new Million Veteran Program (MVP) database to use health data from veterans to study heart disease, kidney disease, and substance abuse. MVP, which so far has 390,000 enrolled veterans, links clinical, genetic, and lifestyle data, as well as military exposure information, with the goal of improving the VA healthcare system. The new studies will target African American and Hispanic veterans, historically understudied populations, to further understanding of common chronic illnesses.
The Federal Election Commission (FEC) has released an application programming interface (API) to help the public make better use of campaign finance data, which the FEC publishes in bulk. The FEC is already a frontrunner among agencies in bulk data publishing, though the sheer volume of data can make it difficult for the average user to find specific information. The new API allows users to search FEC data easily, as well as extract summaries and detailed financial reports to keep better tabs on who might be influencing an election.
Oakland, California has developed an interactive map that displays crime data in real time to show the public how police officers respond to and manage local crime. The web-based tool was developed in response to mounting citizen requests for crime data, and automatically populates whenever law enforcement officers issue tickets or respond to 911 calls. Oakland police staff also have access to an internal version of the tool, which they can use to increase efficiency and reduce response times.
Fitness tracker apps MyFitnessPal and MapMyFitness, which allow users to log their diet, sleep, and physical activities, have combined their data to rank the fitness habits of entire states. According to the data, contributed by the tens of millions of users of each app, California, Colorado, and Washington have the most active residents, while South Carolina, Delaware, and North Dakota are among the least active states.
Google and Carnegie Mellon University have partnered to outfit the university’s Pittsburgh, Pennsylvania campus with connected sensors to turn the campus into a “living laboratory” for the Internet of Things. Google will network a wide array of regular objects ranging from coffee pots to campus bus stops to give university faculty a test bed for smart city applications, which the university hopes could eventually be expanded to all of Pittsburgh.
Google has developed machine learning technologies capable of blocking over 99.9 percent of all spam messages from reaching its users. The algorithms analyze massive volumes of email to learn the difference between genuine messages and junk mail or phishing messages, with a false positive rate of under 0.05 percent. Because these algorithms learn as they analyze new messages, they can continuously improve at detecting spam as new trends emerge.
British Petroleum (BP) has partnered with General Electric to pilot a new platform to monitor and analyze data on 650 of its oil wells around the world, which generate millions of data points per minute. BP’s wells have twenty to thirty sensors that gather data on everything from underwater pressure and temperature to equipment performance, and analyzing this data should allow BP to predict well flows and better manage extraction. If the pilot is successful, BP will expand the platform to 4,000 wells by the end of 2016 to support a real-time understanding of its worldwide operations.
A team of data scientists has developed HowLoud, a tool that maps the noise levels around Los Angeles and Orange counties in California. HowLoud scores individual addresses on the type, time, and intensity of noise they experience, such as noise from airports, cars, and restaurants, based on factors like air traffic, proximity to bars, and vehicle flow. The team hopes this tool could influence how city planners handle certain environmental issues, and help them address the fact that low-income neighborhoods are more likely to experience high levels of noise pollution, which can pose serious health risks.
Image: Agência Brasil.