Hadoop: A Government Primer
What Is Hadoop?
Hadoop is open-source software that data scientists can use for large-scale distributed computing. It has two main components: Hadoop Distributed File System (HDFS) which is used to store data across multiple computers, and MapReduce, which is used to divide labor among those computers to perform large computations quickly.
Hadoop was created in 2005 by computer scientists Mike Cafarella and Doug Cutting (then at Yahoo), and is based in part on two Google technologies: the famous 2004 MapReduce algorithm and the distributed Google File System. Since its creation, it has grown to over 2 million lines of code written by 70 contributors. It is maintained by the Apache Software Foundation, a nonprofit organization that helps coordinate the activities of a large community of developers who work on Hadoop and other open-source projects. Its free availability and astonishing performance on certain types of large-scale data analysis problems have made it the single most important tool in coping with and generating knowledge from the increasingly large datasets today’s computing infrastructure produces.
When a user stores data with Hadoop, the system automatically breaks the data up into blocks and apportions them out to the different computers. It makes multiple copies of the data and stores them in different places to make sure that even if individual computers fail, the system as a whole will keep functioning. These copies are necessary because the system is designed to run on inexpensive hardware to keep costs down; it turns out to be significantly more cost effective to start with cheap computers and plan for their failure than to run the system on expensive computers with lower failure rates. HDFS keeps track of what data is being stored where, including where to find each copy of a particular data block.
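The block-splitting and replication behavior described above can be sketched in a few lines. This is a toy single-machine illustration, not HDFS itself (whose APIs are in Java); the node names, block size, and `store` function are all hypothetical, and real HDFS uses 128 MB blocks and rack-aware placement rather than simple round-robin assignment.

```python
import itertools

BLOCK_SIZE = 8       # bytes per block in this toy; real HDFS defaults to 128 MB
REPLICATION = 3      # copies of each block; matches the HDFS default
NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical cluster

def store(data: bytes):
    """Split data into fixed-size blocks and record which nodes hold each copy."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    # The placement table plays the role of HDFS's bookkeeping: it tracks
    # where to find every copy of every block.
    node_cycle = itertools.cycle(NODES)
    placement = {block_id: [next(node_cycle) for _ in range(REPLICATION)]
                 for block_id in range(len(blocks))}
    return blocks, placement

blocks, placement = store(b"a small file standing in for a large dataset")
print(placement[0])  # three distinct nodes hold copies of block 0
```

Because each block lives on three different machines, the loss of any single machine leaves at least two copies of every block intact, which is the failure-tolerance property the paragraph above describes.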
Once the data is stored, it becomes convenient to think about conducting large-scale data analysis and manipulation by doing much smaller-scale operations on each of the different computers simultaneously.
It’s similar—although not quite analogous—to an assembly line in a factory, insofar as it’s a system in which individual “workers” are carefully coordinated to do tasks much smaller than the final product. Where Hadoop differs from an assembly line is that the workers (the different computers housing the data) are all doing the same task—and simultaneously, not sequentially—which means they need an additional worker to assemble all of their output together when they’re finished.
The procedure that converts the large task into smaller tasks and assigns them to individual computers is called the “mapper”; the additional worker that pulls everything together at the end is called the “reducer”; and the whole division-of-labor paradigm itself is called MapReduce. Part of the Hadoop engineer’s job is to specify what task the mapper should assign and how the reducer should combine the results to get back something useful.
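To make the paradigm concrete, here is a minimal single-machine sketch of MapReduce applied to the classic word-count task. It is not Hadoop’s actual API (which is Java), and the function names are our own; the point is only to show the division of labor: the mapper emits key–value pairs from each input independently, and the reducer combines all the values that share a key.

```python
from collections import defaultdict

def mapper(document):
    """Map step: emit (word, 1) for every word in one document.
    On a real cluster, each document's mapper could run on a different machine."""
    for word in document.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Reduce step: combine every count emitted for one word."""
    return word, sum(counts)

def map_reduce(documents):
    grouped = defaultdict(list)
    for doc in documents:                 # map phase (parallel in real Hadoop)
        for key, value in mapper(doc):
            grouped[key].append(value)    # shuffle: group emitted values by key
    # reduce phase: one reducer call per distinct key
    return dict(reducer(k, v) for k, v in grouped.items())

print(map_reduce(["the map step", "the reduce step"]))
# {'the': 2, 'map': 1, 'step': 2, 'reduce': 1}
```

The engineer writes only `mapper` and `reducer`; the framework handles splitting the input, moving intermediate results around, and running the pieces in parallel.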
For many applications, computing in this way instead of on a single computer leads to substantial speed-ups and cost-savings; for others, it enables computations that no single computer in the world could perform on its own. Limited by processor speeds as well as memory reserves, even the most advanced supercomputers have upper bounds on their computational capabilities; but clusters of computers running Hadoop can scale into the thousands of units while maintaining good performance.
What Types of Problems Can Government Agencies Use Hadoop to Solve?
In many cases, when a question can be answered without looking at all the data at once, Hadoop can help. Text mining, which can be undertaken using Hadoop, can be applied to financial fraud detection, research paper classification, student sentiment analysis, and smarter search engines for all manner of government records. Machine learning, likewise, can power decision support systems for healthcare, model generation for climate science, and speech recognition for security and mobile data entry across agencies.
For many applications, Hadoop’s HDFS data storage system can be implemented securely enough on cloud-based servers, which keeps costs minimal. Moreover, popular pay-per-use services such as Amazon Web Services’ Elastic MapReduce require little up-front commitment and lend themselves to experimentation.
Example: Hadoop Makes It Easier to Find Government Documents
Many government agencies need to provide a means for officials and the public to find documents (e.g., patent applications, research reports, etc.), but there are far too many for an individual (or even a thousand individuals) to catalog, and even the task of manually creating indices of documents in various topic areas is daunting. One solution is to create a search engine that can automatically find documents relevant to a particular set of query terms; in order to build such a search engine, it is necessary to create a list of documents for each term, indicating the documents where the term can be found. The collection of all these terms and the documents that contain them is known as an “inverted index.” With millions of documents, creating an inverted index could be an enormous task, but it can be generated quickly and straightforwardly with Hadoop.
Here’s how it would work.
When the collection of documents to be indexed is uploaded with Hadoop, it gets divided into pieces, backed up, and assigned to different computers using HDFS (the storage component). Then, a Hadoop engineer can specify the task for the mapper (scan through a document, keeping track of each word that appears in that document) and a way for the reducer to combine the results from different computers (for each word, make a list of all the different documents that the mapper associated with it). The result is an inverted index, containing all the words from all the documents on all the different computers, generated automatically and in a fraction of the time it would have taken to conduct all the operations sequentially.
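The mapper and reducer just described can be sketched as follows. Again, this is a hypothetical single-machine illustration with made-up document IDs, not Hadoop’s Java API: the mapper emits (word, document-ID) pairs, and the reducer merges the IDs emitted for each word into one list.

```python
from collections import defaultdict

# Hypothetical corpus: document ID -> text.
DOCS = {
    "patent-001": "solar panel efficiency",
    "report-007": "solar energy survey",
    "report-012": "water quality survey",
}

def mapper(doc_id, text):
    """Map step: emit (word, doc_id) for each distinct word in one document."""
    for word in set(text.split()):
        yield word, doc_id

def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():   # each document's mapper could run on its own machine
        for word, d in mapper(doc_id, text):
            index[word].add(d)          # reduce step: merge document lists per word
    return {word: sorted(ids) for word, ids in index.items()}

index = build_inverted_index(DOCS)
print(index["solar"])   # ['patent-001', 'report-007']
print(index["survey"])  # ['report-007', 'report-012']
```

A search engine can then answer a query by looking up each query term in the index, rather than scanning millions of documents at query time.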
That’s it. (Almost. MapReduce has a couple of intermediate steps that involve shuffling and sorting, but we can ignore these without loss of the bigger picture.)
What Are Hadoop’s Limitations?
The major drawback of Hadoop is the rigidity of MapReduce. The paradigm does not lend itself to all computational problems; it is difficult, for example, to apply to the analysis of networks, which cannot always be easily separated into independent blocks. Network analysis has broad applications in the sciences as well as public health and national security, and government agencies that wish to answer network-related questions will not always be able to rely on Hadoop.
Fortunately, this limitation may soon disappear with the introduction of YARN, another Apache Software Foundation project that has been dubbed “MapReduce 2.0.” YARN was designed to enable the development of other parallel computing techniques besides MapReduce, while operating on the same fundamental HDFS storage system. YARN, which is still in beta, has been eagerly awaited by the Hadoop community, although the talent shortage that plagues Hadoop in general will likely be even more deeply felt during the early years of YARN.