This is my first review of a published paper. Over the next few days, I am going to concentrate on papers that changed the way people write software for distributed systems.
This paper is available at Google Research. The authors are Jeffrey Dean and Sanjay Ghemawat.
In this paper, the authors present MapReduce, a programming model and its accompanying implementation for processing and generating large data sets. Programs written in this model are automatically parallelized: the library takes care of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. The authors then introduce the programming model, in which the programmer has to write just two functions:
- Map: It takes as input a key/value pair and emits a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function.
- Reduce: It takes an intermediate key and the set of values for that key, merges those values, and typically produces zero or one output value per key.
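To make this concrete, here is a minimal word-count sketch in Python (the paper presents the same example in C-like pseudocode); the names `map_fn` and `reduce_fn` and the emit-via-`yield` style are my illustrative choices, not the library's actual API:

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name (unused here); value: document contents.
    for word in value.split():
        yield (word, 1)            # one intermediate pair per word occurrence

def reduce_fn(key, values):
    # key: a word; values: every count emitted for that word.
    yield (key, sum(values))       # typically one output value per key

# Tiny sequential stand-in for the library's grouping step.
intermediate = defaultdict(list)
for k, v in map_fn("doc1", "the quick the lazy the"):
    intermediate[k].append(v)
for word in sorted(intermediate):
    for out in reduce_fn(word, intermediate[word]):
        print(out)                 # ('lazy', 1) ('quick', 1) ('the', 3)
```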
The authors list several computations that fit this model:
- Distributed grep
- Count of URL access frequency
- Inverted index
- Distributed sort
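As a sketch of one of these, an inverted index maps cleanly onto the same two functions: the map emits a (word, document ID) pair for each word, and the reduce collects all IDs for a word into its posting list. The function names below are illustrative, not from the paper:

```python
def index_map(doc_id, contents):
    # Emit (word, doc_id) once per distinct word in the document.
    for word in set(contents.split()):
        yield (word, doc_id)

def index_reduce(word, doc_ids):
    # The posting list for this word: every document that contains it.
    yield (word, sorted(doc_ids))
```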
The authors then walk through the execution of a MapReduce operation:
- The MapReduce library divides the input files into pieces and starts many copies of the program on a cluster of machines.
- One copy of the program is the master and the rest are workers. There are $M$ map tasks and $R$ reduce tasks; the master picks idle workers and assigns each one either a map or a reduce task.
- A worker assigned a map task reads the contents of the corresponding input split, parses key/value pairs out of it, and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
- The buffered intermediate key/value pairs are partitioned into $R$ regions, and their locations are passed back to the master, who forwards them to the reduce workers.
- A reduce worker reads the intermediate data from the locations provided by the master and sorts it by intermediate key, so that all occurrences of the same key are grouped together.
- The reduce worker iterates over the sorted intermediate data and passes each unique intermediate key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for that reduce partition.
- When all map and reduce tasks have completed, the MapReduce call in the user program returns.
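The whole flow can be condensed into a single-machine sketch that imitates the split/partition/sort steps sequentially. This is purely illustrative: real map and reduce tasks run in parallel on different machines, and only the hash partitioning below follows the library's default as described in the paper.

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn, R=4):
    # One region of buffered intermediate pairs per reduce task.
    regions = [defaultdict(list) for _ in range(R)]
    # "Map phase": parse each split into (key, value) pairs and
    # partition the Map output into R regions by hashing the key.
    for key, value in splits:
        for ik, iv in map_fn(key, value):
            regions[hash(ik) % R][ik].append(iv)
    # "Reduce phase": each region is sorted by intermediate key, then
    # the user's Reduce runs once per unique key in that region.
    outputs = []
    for region in regions:
        for ik in sorted(region):
            outputs.extend(reduce_fn(ik, region[ik]))
    return outputs
```

With the word-count `map_fn`/`reduce_fn` from earlier, `run_mapreduce([("doc1", "a b a")], map_fn, reduce_fn)` returns `[('a', 2), ('b', 1)]` (output order across regions depends on the hash).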
The authors then discuss the following extensions to the model:
- Partitioning: Instead of the default partitioning function (hash(key) mod $R$), which divides the data across the $R$ reduce tasks, users are allowed to specify their own partitioning function (see the sketch after this list).
- Ordering Guarantee: The library guarantees that the intermediate key/value pairs within a given partition are processed in increasing key order.
- Combiner Function: A combiner function can partially merge the output of each map task on the local machine before the result is sent over the network to the reduce tasks (a sketch follows this list).
- Input and Output Types: The library supports reading input data in several standard formats, but users can add support for new types by implementing a simple reader interface.
- Side-effects: Users are allowed to produce auxiliary files as additional outputs of their map/reduce operators.
- Skipping Bad Records: When bugs in user code cause the Map or Reduce function to crash deterministically on certain records, the library keeps track of the repeated failures and skips processing of those records.
- Local Execution: Debugging problems in Map or Reduce functions is hard in a distributed setting, so the library provides a way to execute a computation sequentially on a single machine, where the user can attach a debugger.
- Status Information: The master runs an internal HTTP server and exports status pages showing the progress of all map/reduce tasks, with links to their standard error and standard output files.
- Counters: The library provides a facility for creating named counters that track occurrences of particular events, e.g. the number of German documents indexed or the number of capitalized words processed (a sketch follows this list).
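For the partitioning extension above, the paper's own example is hash(Hostname(urlkey)) mod $R$, which makes all URLs from one host end up in the same output file. A minimal Python sketch, using the standard library in place of the paper's Hostname helper:

```python
from urllib.parse import urlparse

def host_partition(url_key, R):
    # Assumes url_key is a full URL (with scheme). All URLs from the
    # same host hash to the same reduce partition, so one output file
    # holds that host's URLs together.
    return hash(urlparse(url_key).hostname) % R
```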
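For the combiner extension, word count is the paper's motivating case: a map task may emit thousands of ("the", 1) pairs, and because counting is associative and commutative, a combiner with the same logic as the Reduce function can collapse them before the network transfer. A sketch:

```python
def combine_fn(word, counts):
    # Runs on the map worker after the Map phase; same logic as the
    # word-count reduce because addition is associative and commutative.
    yield (word, sum(counts))
```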
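Counters, as described above, are updated inside user code, piggybacked on task status messages, and aggregated at the master. The class below is a hypothetical stand-in for the library's counter facility, not its real API:

```python
class Counter:
    # Hypothetical named counter; the real library propagates these
    # values from workers to the master for global aggregation.
    def __init__(self, name):
        self.name, self.value = name, 0
    def increment(self, n=1):
        self.value += n

uppercase = Counter("uppercase-words")

def counting_map(key, value):
    for word in value.split():
        if word.isupper():
            uppercase.increment()   # tracked alongside task progress
        yield (word, 1)
```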
The authors also provide performance measurements showing the effect of backup tasks, as well as behavior under a large number of machine failures (200 out of 1745 machines). They conclude the paper with some of the applications built at Google on top of the MapReduce library.
Outside of Google, the MapReduce model has been implemented as open source in the Apache Hadoop project. Amazon offers Elastic MapReduce, which uses Apache Hadoop to let developers build applications that process large data sets.
Some recent comments by an ex-Googler on this are available at Ex-Google Engineer Says the Company's Software Infrastructure is Obsolete.