10 Bits: The Data News Hot List
This week’s list of data news highlights covers September 28-October 4 and includes articles on Monsanto’s analytics acquisition and the need for leadership and management in “big science” data projects.
A team of researchers in Israel is using advanced analytics to pull the most important information out of email messages and summarize them for rapid digestion. The work, which was initially English-only due to the availability of English natural language processing tools, has now expanded to include Hebrew, Arabic and other languages. Methodologically, however, their work is language-independent, and the team has applied for a U.S. patent on their technologies.
The SEC’s newly created Office of Analytics and Research is trying to get a more accurate view of high-frequency traders’ activities. Having purchased a $2.5 million data collection system from a speed-trading firm, the SEC will have access to the same market data traders get, and it hopes to deploy real-time analytics to identify patterns and flag suspicious transactions. The SEC will also release summary statistics of the data on its website, along with its research on the effects high-speed trading has on markets.
Deriving insights from the flood of data produced today certainly presents numerous technical challenges, but the most acute difficulty may be one of management. Large-scale scientific data initiatives increasingly require the input of researchers from widely different disciplines, and coordinating such efforts is unprecedented in the research community. Groups like the Research Data Alliance may help promote interoperability among research organizations, but the question of effective leadership in “big science” projects remains open.
Agricultural giant Monsanto announced this week that it had bought the Climate Corporation, which conducts sophisticated data analysis on environmental variables to help farmers predict crop yields. Monsanto, which already collects large amounts of research data, hopes to sell the Climate Corporation’s crop insurance products to farmers internationally. Last year, the conglomerate bought Precision Planting, an analytics company that helped farmers plant different parts of their fields at different depths and seed densities for a more efficient overall allocation.
The government shutdown has forced the suspension of many “nonessential” services, leaving a number of standard government data products unavailable. In particular, the Bureau of Labor Statistics’ monthly jobs report will not be issued for the month of October, despite its considerable relevance to current policy debates. Other organizations, such as ADP, the Conference Board and the National Federation of Independent Business, release their own labor statistics.
In recent years, online game developers have had to contend with the increasingly sophisticated methods players use to steal and fraudulently create in-game currency. Systems to combat such fraud have typically been based on shoring up network weaknesses, often at significant cost, but data could promise a more cost-effective solution. By tracking the progress of a large number of players through the game, a developer could define an acceptable rate of play; if a player appeared to be moving too quickly, he or she might be flagged for potential currency fraud.
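The rate-of-play idea can be illustrated with a short sketch. This is not the method any particular developer uses: the function name `flag_fast_players`, the "progress per hour" measure and the cutoff of five median absolute deviations above the median are all assumptions chosen for illustration.

```python
from statistics import median


def flag_fast_players(rates, k=5.0):
    """Return the IDs of players whose progress rate is unusually high.

    rates: dict mapping a player ID to progress per hour (hypothetical unit).
    k: how many median absolute deviations above the median counts as
       "too fast" (an assumed cutoff, not an industry standard).
    """
    values = list(rates.values())
    typical = median(values)
    # Median absolute deviation: a spread measure that a few extreme
    # players cannot distort the way they would distort a mean.
    spread = median([abs(v - typical) for v in values])
    limit = typical + k * spread
    return {player for player, rate in rates.items() if rate > limit}


# Made-up sample data: four ordinary players and one moving far faster.
sample = {"a": 1.0, "b": 1.2, "c": 0.9, "d": 1.1, "e": 9.0}
flagged = flag_fast_players(sample)
```

A flag of this kind would only mark an account for review; a fast but legitimate player would also exceed the limit, so the cutoff would need tuning against real play data.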
IBM is partnering with several top universities to conduct research into “cognitive computing” using the company’s Watson AI engine. The research, which is set to include text analysis, computer vision and human-computer interaction, will be split up among the four participating schools: the Massachusetts Institute of Technology, Rensselaer Polytechnic Institute, Carnegie Mellon University and New York University.
Facebook’s efforts to make its server designs freely available through the Open Compute Project may put pressure on enterprise hardware companies in the near future. Instead of relying on big U.S. vendors to produce ready-made systems, Facebook has opted to design its servers internally and outsource their construction to low-cost manufacturers in Asia. U.S. hardware vendors, however, have been skeptical about the long-term effects of the Open Compute Project.
The Dutch government is undertaking an effort to consolidate and manage its water data. Due to the pervasive threat of floods in the low-lying Netherlands, the government maintains a sophisticated water management system, complete with extensive monitoring and predictive modeling for major flood events. The amount of data collected runs in the petabytes annually, but due to a lack of interoperability and structure, the data is currently difficult to analyze, and the government hopes a centralized effort will be able to turn the tide.
A group of Stanford researchers plans to open-source a high-performing text analysis model that can correctly determine the positive-or-negative sentiment of a given sentence 85 percent of the time. The researchers trained their model on nearly 11,000 sentences from Rotten Tomatoes’ movie review database, which is well-suited for this application because it pairs the sentences with numerical ratings for each movie.
Photo: Creative Commons / Flickr: Reallyboring