GIT – Summary: Every day people rely on search engines to find specific content in the many terabytes of data that exist on the Internet, but have you ever wondered how this search is actually performed? One approach is Apache’s Hadoop, which is a software framework that enables distributed manipulation of vast amounts of data. One application of Hadoop is parallel indexing of Internet Web pages. Hadoop is an Apache project with support from Yahoo!, Google, IBM, and others. This article introduces the Hadoop framework and shows you why it’s one of the most important Linux®-based distributed computing frameworks.
Hadoop was introduced to the world in the fall of 2005 as part of a Nutch subproject of Lucene by the Apache Software Foundation. It was inspired by the MapReduce and Google File System originally developed by Google Labs. In March of 2006, the MapReduce and Nutch Distributed File System (NDFS) were separated into their own project called Hadoop.
Hadoop is most popular as a means to classify content on the Internet (for search keywords), but it can be used for a large number of problems that require massive scalability. For example, what would happen if you wanted to
grep a 10TB file? On a traditional system, this would take a terribly long time. But Hadoop was designed with these problems in mind and can make the task quite efficient.
Hadoop is a software framework that enables distributed manipulation of large amounts of data. But Hadoop does this in a way that makes it reliable, efficient, and scalable. Hadoop is reliable because it assumes that computing elements and storage will fail and, therefore, it maintains several copies of working data to ensure that processing can be redistributed around failed nodes. Hadoop is efficient because it works on the principle of parallelization, allowing data to process in parallel to increase the processing speed. Hadoop is also scalable, permitting operations on petabytes of data. In addition, Hadoop relies on commodity servers, making it inexpensive and available for use by anyone.
As you would expect, Hadoop is ideal on Linux as a production platform, with the framework written in the Java™ language. Applications on Hadoop may be developed using other languages such as C++.
Hadoop is made up of a number of elements. At the bottom is the Hadoop Distributed File System (HDFS), which stores files across storage nodes in a Hadoop cluster. Above the HDFS (for the purposes of this article) is the MapReduce engine, which consists of JobTrackers and TaskTrackers.
To an external client, the HDFS appears as a traditional hierarchical file system. Files can be created, deleted, moved, renamed, and so on. But due to the special characteristics of HDFS, its architecture is built from a collection of special nodes (see Figure 1). These are the NameNode (there is only one), which provides metadata services within HDFS , and the DataNode, which serves storage blocks for HDFS. As only one NameNode may exist, this represents an issue with HDFS (a single point of failure).
Figure 1. Simplified view of a Hadoop cluster
Files stored in HDFS are divided into blocks, and those blocks are replicated to multiple computers (DataNodes). This is quite different from traditional RAID architectures. The block size (typically 64MB) and the amount of block replication are determined by the client when the file is created. All file operations are controlled by the NameNode. All communication within HDFS is layered on the standard TCP/IP protocol.
The NameNode is a piece of software that is typically run on a distinct machine in an HDFS instance. It is responsible for managing the file system namespace and controlling access by external clients. The NameNode determines the mapping of files to replicated blocks on DataNodes. For the common replication factor of three, one replica block is stored on a different node in the same rack, and the last copy is stored in a node on a different rack. Note that this requires knowledge of the cluster architecture.
Actual I/O transactions do not pass through the NameNode, only the metadata that indicates the file mapping of DataNodes and blocks. When an external client sends a request to create a file, the NameNode responds with the block identification and DataNode IP address for the first copy of that block. The NameNode also informs the other specific DataNodes that will be receiving copies of that block.
The NameNode stores all information about the file system namespace in a file called FsImage. This file, along with a record of all transactions (referred to as the EditLog), is stored on the local file system of the NameNode. The FsImage and EditLog files are also replicated to protect against file corruption or loss of the NameNode system itself.
A DataNode is also a piece of software that is typically run on a distinct machine within an HDFS instance. Hadoop clusters contain a single NameNode and hundreds to thousands of DataNodes. DataNodes are typically organized into racks where all the systems are connected to a switch. An assumption of Hadoop is that network bandwidth between nodes within a rack is faster than between racks.
DataNodes respond to read and write requests from HDFS clients. They also respond to commands to create, delete, and replicate blocks received from the NameNode. The NameNode relies on periodic heartbeat messages from each DataNode. Each of these messages contains a block report that the NameNode can validate against its block mapping and other file system metadata. When a DataNode fails to send its heartbeat message, the NameNode may take the remedial action to re-replicate the blocks that were lost on that node.
It’s probably clear by now that HDFS is not a general-purpose file system. Instead, it is designed to support streaming access to large files that are written once. For a client seeking to write a file to HDFS, the process begins with caching the file to temporary storage local to the client. When the cached data exceeds the desired HDFS block size, a file creation request is sent to the NameNode. The NameNode responds to the client with the DataNode identity and the destination block. The DataNodes that will host file block replicas are also notified. When the client starts sending its temporary file to the first DataNode, the block contents are relayed immediately to the replica DataNodes in a pipelined fashion. Clients are also responsible for the creation of checksum files that are also saved in the same HDFS namespace. After the last file block is sent, the NameNode commits the file creation to its persistent meta data storage (in the EditLog and FsImage files).
The Hadoop framework can be used on a single Linux platform (for development and debug situations), but its true power is realized using racks of commodity-class servers. These racks collectively make up a Hadoop cluster. It uses knowledge of the cluster topology to make decisions about how jobs and files are distributed throughout a cluster. Hadoop assumes that nodes can fail and, therefore, employs native methods to cope with the failures of individual computers and even entire racks.
One of the most common uses for Hadoop is in Web search. While not the only application of the software framework, it succinctly identifies its strengths as a parallel data processing engine. One of the most interesting aspects of this is called the Map and Reduce process, which was inspired by Google’s development. This process, called indexing, takes textual Web pages retrieved by a Web crawler as input and reports the frequency of words found in those pages as the result. This can then be used through Web search to identify content from defined search parameters.
At its simplest, a MapReduce application contains at least three pieces: a map function, a reduce function, and a main function that combines job control and file input/output. In this regard, Hadoop provides a large number of interfaces and abstract-classes to provide the developer of a Hadoop application a large number of tools from debug to performance measurements.
MapReduce is itself a software framework for the parallel processing of large data sets. MapReduce has its roots in functional programming, formally from the
reduce functions found in functional languages. It consists of two operations that may consist of many instances (many maps, many reduces). The Map function takes a set of data and transforms it into a list of key/value pairs, one per element of the input domain. The Reduce function takes the list that resulted from the Map function and reduces the list of key/value pairs based on their key (a single key/value pair results for each key).
Here’s an example to help you understand what it all means. Say your input domain is
one small step for man, one giant leap for mankind. Running the Map function on this domain results in the following list of key/value pairs:
(one, 1) (small, 1) (step, 1) (for, 1) (man, 1) (one, 1) (giant, 1) (leap, 1) (for, 1) (mankind, 1)
If you now apply this list of key/value pairs to the Reduce function, you get the following set of key/value pairs:
(one, 2) (small, 1) (step, 1) (for, 2) (man, 1) (giant, 1) (leap, 1) (mankind, 1)
The result is the count of words within the input domain, which is obviously useful in the process of indexing. But now imagine that your input domain is actually two input domains, the first
one small step for man and the second
one giant leap for mankind. You can execute the Map function on each, and also the Reduce function, and then finally apply the two lists of key/value pairs to another Reduce function and arrive at the same result. In other words, you can parallelize the operations on the input domain and arrive at the same answer, albeit much faster. That’s the power of MapReduce; it’s inherently parallelizable over any number of systems. Figure 2 illustrates this idea in the form of segmentation and iteration.
Figure 2. Conceptual flow of the MapReduce process
Returning to Hadoop, how does it implement this functionality? A MapReduce application is started or launched on behalf of a client on a single master system referred to as a JobTracker. Similar to the NameNode, it is the only system in the Hadoop cluster devoted to its job of controlling MapReduce applications. When an application is submitted, input and output directories contained in the HDFS are provided. The JobTracker uses knowledge of the file blocks (physical quantity and where they are located) to decide how many TaskTracker subordinate tasks will be created. The MapReduce application is copied to every node where input file blocks are present. For each file block on a given node, a unique subordinate task is created. Each TaskTracker reports status and completion back to the JobTracker. Figure 3 shows the work distribution in an example cluster.
Figure 3. Example Hadoop cluster showing physical distribution of processing and storage
This aspect of Hadoop is important because instead of moving storage to the location for processing, Hadoop moves the processing to the storage. This supports efficient processing of the data by scaling processing with the number of nodes in the cluster.
Hadoop is a surprisingly versatile framework for development of distributed applications; all that’s necessary to take advantage of Hadoop is a different way of viewing problems. Recall from Figure 2 that processing occurs as step functions where the work of components is leveraged by others. It’s certainly not a panacea for development, but if your problem can be viewed through this lens, then Hadoop should be an option.
Hadoop has been used to help solve a variety of problems, including sorts of extremely large data sets and greps of particularly large files. It’s also used as the core of a variety of search engines, such as Amazon’s A9 and Able Grape’s vertical search engine for wine information. The Hadoop Wiki provides a great list of applications and companies that use Hadoop in a variety of different ways (see Resources).
Yahoo! currently has the largest Hadoop Linux production architecture, which consists of 10,000 cores with over five petabytes of storage distributed among the DataNodes. Within their Web index, there are roughly one trillion links. But your problem may not require a system of that scale, and, if not, you could use the Amazon Elastic Compute Cloud (EC2) to build a virtual 20-node cluster. In fact, the New York Times used Hadoop and EC2 to convert 4TB of TIFF images—including 405K large TIFF images, 3.3M SGML articles, and 405K XML files—into 800K Web-friendly PNG images in 36 hours. This process, known as cloud computing, is a unique way to demonstrate the power of Hadoop.
Hadoop is certainly going strong, and by the looks of applications that are making use of it, it has a bright future. You can learn more about Hadoop and its applications in the Resources section, including advice on setting up your own Hadoop cluster.
- The Hadoop core Web site is the best resource for learning about Hadoop. Here you’ll find the latest documentation, quickstart guides, details for how to set up cluster configurations, tutorials, and more. You’ll also find detailed application program interface (API) documentation for developing on the Hadoop framework.
- The Hadoop DFS User Guide introduces HDFS and its associated components.
- Yahoo! launched what is believed to be the largest Hadoop cluster for their search engine in early 2008. This Hadoop cluster consists of over 10,000 processing cores and provides over five petabytes (500,000 gigabytes) of raw disk storage.
- “Hadoop: Funny Name, Powerful Software” (LinuxInsider, November 2008) is a great piece on Hadoop that includes an interview with its creator, Doug Cutting. This article also discusses the New York Times‘ use of Hadoop with Amazon’s EC2 for mass image transformation.
- Hadoop has found a home in cloud computing environments. To learn more about cloud computing, check out “Cloud computing with Linux” (developerWorks, September 2008).
- See a full list of applications powered by Hadoop on the Hadoop Wiki PoweredBy page. Hadoop is finding a home in many problem domains outside of search engines.
- “Running Hadoop on Ubuntu Linux (Multi-Node Cluster),” a tutorial by Michael Noll, shows you how to set up a Hadoop cluster. This tutorial also references an earlier tutorial about setting up for a single node.
- Read more of Tim’s articles on developerWorks.
- In the developerWorks Linux zone, find more resources for Linux developers (including developers who are new to Linux), and scan our most popular articles and tutorials.
- See all Linux tips and Linux tutorials on developerWorks.
- Stay current with developerWorks technical events and Webcasts.
Get products and technologies
- The MapReduce concept, first introduced in functional languages many decades ago, can also be found in the form of a plug-in. IBM has created a plug-in for Eclipse that simplifies the creation and deployment of MapReduce programs.
- Order the SEK for Linux, a two-DVD set containing the latest IBM trial software for Linux from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
- With IBM trial software, available for download directly from developerWorks, build your next development project on Linux.