Wednesday, May 18, 2011

Paper Review: MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat

This is my first review of a published paper. Over the next few days, I am going to concentrate on papers that changed the way people write software for distributed systems.

This paper is available at Google Research. The authors are Jeffrey Dean and Sanjay Ghemawat.

In this paper, the authors discuss a programming model called MapReduce and its implementation for processing and generating large data sets. Programs written in this model are automatically parallelized; the library takes care of partitioning the input data, scheduling the program's execution on a set of machines, handling machine failures, and managing the required inter-machine communication. The authors then introduce the programming model, in which the programmer has to write just two functions:

  • Map: It takes as input a key/value pair and emits a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function.
  • Reduce: It takes an intermediate key and the set of values for that key, merges them together, and typically produces zero or one output value per key.
The authors provide a sample implementation that counts the number of occurrences of each word in a large collection of documents (a rough sketch of it follows the list below). The authors then mention some more examples, such as:
  • Distributed grep
  • Count of URL access frequency
  • Inverted index
  • Distributed sort
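
As a rough sketch of what the word-count example looks like in this model: the paper presents it in C++-style pseudocode against Google's library, so the Python below and the emit callbacks are my own illustrative rendering, not the actual API.

```python
# Word count expressed as Map and Reduce functions.
# `emit_intermediate` and `emit` are hypothetical stand-ins for the
# library calls that collect intermediate and final key/value pairs.

def word_count_map(doc_name, contents, emit_intermediate):
    # Input: a document name and its contents; output: (word, 1) pairs.
    for word in contents.split():
        emit_intermediate(word, 1)

def word_count_reduce(word, counts, emit):
    # Input: a word and all the 1s emitted for it; output: the total count.
    emit(word, sum(counts))
```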
The authors explain the overall execution flow of MapReduce as the following steps (a toy single-machine simulation of this flow appears after the list):
  1. The MapReduce library splits the input files into $M$ pieces and starts many copies of the program on a cluster of machines.
  2. One copy of the program becomes the master and the rest are workers. There are $M$ map tasks and $R$ reduce tasks, and the master picks idle workers and assigns each of them either a map or a reduce task.
  3. A worker assigned a map task reads the contents of the corresponding input split, parses key/value pairs out of the input, and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
  4. The buffered intermediate key/value pairs are periodically written to local disk, partitioned into $R$ regions, and their locations are passed back to the master, who forwards them to the reduce workers.
  5. A reduce worker reads the intermediate data from the locations provided by the master and sorts it by the intermediate keys, so that all occurrences of the same key are grouped together.
  6. The reduce worker iterates over the sorted intermediate data and passes each unique intermediate key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to the final output file for this reduce partition.
  7. When all map and reduce tasks have completed, the MapReduce call in the user program returns.
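
To make these steps concrete, here is a toy single-process simulation of that flow. The in-memory lists, the use of Python's built-in hash for the default partitioning, and the tiny driver are my own simplifying assumptions; the real library spreads this work over a cluster and writes intermediate data to local disk.

```python
# Toy single-process simulation of the execution flow described above:
# each input split is handled by one "map task", its intermediate pairs
# are partitioned into R regions, and each of the R "reduce tasks" then
# sorts its region by key and applies the Reduce function.
from collections import defaultdict

R = 2  # number of reduce tasks / intermediate partitions (illustrative)

def map_fn(doc_name, contents):
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    return word, sum(counts)

def run(input_splits):
    # Steps 3-4: run the map tasks and partition their output into R regions.
    regions = [defaultdict(list) for _ in range(R)]
    for doc_name, contents in input_splits:  # one iteration ~ one map task
        for key, value in map_fn(doc_name, contents):
            regions[hash(key) % R][key].append(value)

    # Steps 5-6: each reduce task sorts its region and reduces every key.
    return [[reduce_fn(k, region[k]) for k in sorted(region)] for region in regions]

splits = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog jumps over the fox")]
print(run(splits))  # one list of (word, count) pairs per reduce partition
```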
The authors then describe strategies for handling worker failures and a failure of the master. They describe how the use of GFS (the Google File System) reduces network bandwidth consumption: the master tries to schedule a map task on, or close to, a machine that already holds a replica of the corresponding input data. They also give the typical values of $M$ and $R$ used in their own environment. Finally, the authors discuss the problem of stragglers, machines that take unusually long to complete the last few map or reduce tasks, and their solution of scheduling backup (duplicate) executions of the remaining in-progress tasks.

The authors then discuss the following extensions to the model:
  • Partitioning: Instead of the default partitioning function, which divides the data across the $R$ reduce tasks by hashing the key, users are allowed to specify their own partitioning function, e.g. hash(Hostname(urlkey)) mod R so that all URLs from the same host end up in the same output file (a sketch of this and of a combiner follows this list).
  • Ordering Guarantee: The library guarantees that the intermediate key/value pairs within a given partition are processed in increasing key order.
  • Combiner Function: The Combiner function is used to process the output of map tasks on the local machine before the result is sent over the network for reduce tasks to process.
  • Input and Output types: The library supports reading input data in several standard formats, but users can add new formats by implementing a simple reader interface.
  • Side-effects: Users are allowed to produce auxiliary files as additional outputs of their map/reduce operators.
  • Skipping Bad Records: When bugs in user code cause the Map or Reduce function to crash deterministically on particular records, the library keeps track of these failures and skips the offending records on re-execution.
  • Local Execution: Debugging problems in map or reduce functions is difficult when the computation is spread across a distributed system, so the library provides a way to execute all the work sequentially on the local machine, where the user can attach a debugger.
  • Status Information: The master runs an internal HTTP server and exports status pages showing the progress of all map/reduce tasks, along with links to their standard error and standard output files.
  • Counters: The library provides a facility for creating named counters, which can be used to count occurrences of particular events, e.g. the number of German documents indexed or the number of capitalized words in a file.
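To illustrate the partitioning and combiner extensions, here is a minimal sketch of what a user-supplied partitioning function and a combiner for the word-count example might look like. The function names and signatures are my own assumptions for illustration, not the library's actual interface; only the hash(Hostname(urlkey)) mod R idea comes from the paper.

```python
# Illustrative sketches of a user-defined partitioner and a combiner;
# the signatures below are hypothetical, not Google's actual API.
from urllib.parse import urlparse

R = 4  # number of reduce tasks (illustrative)

def default_partition(key, num_reduce_tasks):
    # Default behavior: hash(key) mod R.
    return hash(key) % num_reduce_tasks

def url_by_host_partition(url_key, num_reduce_tasks):
    # Custom partitioning from the paper's example: all URLs from the
    # same host go to the same reduce task, and hence the same output file.
    return hash(urlparse(url_key).hostname) % num_reduce_tasks

def word_count_combiner(word, counts):
    # Combiner: pre-aggregate counts on the map worker before the data
    # crosses the network; valid because addition is associative and commutative.
    yield word, sum(counts)
```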
The authors then measure the performance of their library on a particular hardware configuration for two programs: grep and sort. They report the sort performance to be comparable to the best result reported for the TeraSort benchmark.

The authors also provide performance data showing the effect of backup tasks and the behavior under a large number of machine failures (200 out of 1745). They conclude the paper by describing some of the applications written at Google using the MapReduce library.

Outside of Google, the MapReduce model has been implemented as open source in the Apache Hadoop project. Amazon offers Elastic MapReduce, which uses Apache Hadoop to make it easier to build applications that process large data sets.

Some recent comments by an ex-Googler on this topic are available at Ex-Google Engineer Says the Company's Software Infrastructure is Obsolete.
