AlgoDaily - What is MapReduce and How Does It Work?

One Pager Cheat Sheet

MapReduce was created to make processing very large amounts of data efficient by taking over the hard part: splitting the data, and the work done on it, across multiple physical machines.
MapReduce is a programming paradigm that splits and maps large data sets into smaller chunks, processes them in parallel across commodity servers, and aggregates the data to return a consolidated output, providing benefits such as scalability, flexibility, speed and simplicity.
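
The split, parallel-process and aggregate flow described above can be sketched in a few lines of Python. This is a single-machine illustration only: a thread pool stands in for the commodity servers, the sample records are made up, and `split_into_chunks`, `map_chunk` and `map_reduce` are hypothetical helper names, not part of Hadoop or any library.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Split the input into fixed-size chunks (the 'input splits')."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Map step for one chunk: build a partial total per key.

    Each record is an illustrative (key, amount) pair.
    """
    partial = Counter()
    for key, amount in chunk:
        partial[key] += amount
    return partial


def map_reduce(records, chunk_size=2, workers=4):
    """Process chunks in parallel, then aggregate the partial results."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    total = Counter()
    for partial in partials:  # reduce step: merge the per-chunk results
        total.update(partial)
    return dict(total)


# Made-up sample data, for illustration only.
sales = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 1)]
print(map_reduce(sales))
```

Because each chunk is processed independently, adding workers (or, in a real cluster, machines) does not change the result, only how much of the work runs at once.
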
The Mapper step does not shuffle the data. It splits the input and maps each record into key-value pairs; shuffling, which groups those pairs by key, happens afterwards and feeds the Reduce function.
Apache Hadoop is an open-source ecosystem for storing and processing massive data sets across a cluster, and is a popular choice in public cloud services such as Amazon EMR (Elastic MapReduce), Azure HDInsight and Google Cloud Dataproc.
Using MapReduce, you can count words: tokenize an input file into words, emit each word as a key with the value 1, shuffle and sort the pairs by key, and reduce them to create a new output file of key-value pairs holding each unique word and its number of occurrences.
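
A minimal sketch of that word count, assuming an in-memory string in place of the input and output files. The phase names `map_phase`, `shuffle_and_sort` and `reduce_phase` are illustrative, and the tokenizer (lowercase letters and apostrophes only) is a simplification chosen for this example.

```python
import re
from collections import defaultdict


def map_phase(lines):
    """Tokenize each line and emit a (word, 1) pair per token."""
    for line in lines:
        for word in re.findall(r"[a-z']+", line.lower()):
            yield word, 1


def shuffle_and_sort(pairs):
    """Group values by key, then return the groups ordered by key."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Sum the list of ones collected for each unique word."""
    return [(word, sum(counts)) for word, counts in grouped]


def word_count(text):
    """Run the three phases in order on a block of text."""
    return reduce_phase(shuffle_and_sort(map_phase(text.splitlines())))


sample = "the quick brown fox\nthe lazy dog\nthe fox"
for word, count in word_count(sample):
    print(f"{word}\t{count}")  # one tab-separated key-value pair per line
```
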
A MapReduce program can be written in many languages (natively in Java for Hadoop) and is divided into three main parts: Mapper Code, Reducer Code, and Driver Code, which sets the configuration of the job. The packaged program is then run in Hadoop using the command hadoop jar hadoop-mapreduce-example.jar WordCount /sample/input /sample/output.
The last step of the MapReduce process is Reducing, where the data is aggregated and the output is computed by iterating over all values associated with a single key.
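
To make "iterating over all values associated with a single key" concrete, here is a small sketch of a reducer that averages the values for each key. It assumes the intermediate pairs have already been shuffled and sorted by key; `reduce_average` and `run_reducers` are hypothetical names, and the pairs are made-up sample data.

```python
from itertools import groupby
from operator import itemgetter


def reduce_average(key, values):
    """Reducer: iterate once over every value for `key` and average them."""
    total = 0.0
    count = 0
    for value in values:
        total += value
        count += 1
    return key, total / count


def run_reducers(sorted_pairs):
    """Call the reducer once per key; input must already be sorted by key."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield reduce_average(key, (value for _, value in group))


# Made-up intermediate pairs, already grouped by key.
pairs = [("a", 2), ("a", 4), ("b", 10), ("c", 1), ("c", 2), ("c", 6)]
print(list(run_reducers(pairs)))
```

The reducer only ever sees one key at a time, which is what lets different keys be reduced on different machines.
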
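Hadoop Streaming allows MapReduce programs to be written in various languages such as Python, C++ and Ruby: the mapper and reducer are ordinary executables that read input records from standard input and write output records to standard output, by default as lines containing a tab-separated key and value.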
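
A sketch of what a streaming-style mapper and reducer might look like in Python, assuming the default line-oriented format with a tab between key and value. The functions take any iterable of lines so they can be tried locally; `main` shows one possible way to wire them to standard input and output, and `sorted()` stands in for the shuffle and sort that the framework performs between the two steps. All function names here are illustrative.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per token read from the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; expects 'word<TAB>count' lines sorted by word."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def main(role, stdin=sys.stdin, stdout=sys.stdout):
    """Wire one role to stdin/stdout, as a streaming job would invoke it."""
    step = mapper if role == "map" else reducer
    for out_line in step(stdin):
        stdout.write(out_line + "\n")


# Local simulation of map -> sort -> reduce on two made-up input lines.
mapped = list(mapper(["b a b\n", "a c\n"]))
print(list(reducer(sorted(mapped))))
```

In a real job the mapper and reducer would typically live in separate scripts, each calling its own step on `sys.stdin`.
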
The Driver Code defines the configuration Hadoop needs to execute a MapReduce job, such as the job name, the Mapper and Reducer to use, and the input and output paths; the framework then assigns the resulting tasks to the workers in the cluster.
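
The division of labour between driver and framework can be illustrated with a toy, in-process model. Everything here is hypothetical: `JobConfig` is a stand-in for the settings a driver supplies, `run_local` plays the role of the framework, and `partition` uses a CRC32 hash purely as an assumed way of choosing a reduce task for each key. It is not Hadoop's API.

```python
import zlib
from dataclasses import dataclass
from typing import Callable


@dataclass
class JobConfig:
    """Hypothetical stand-in for the settings a driver supplies."""
    job_name: str
    mapper: Callable
    reducer: Callable
    input_path: str
    output_path: str
    num_reducers: int = 2


def partition(key, num_reducers):
    """Pick a reduce task for a key (stable hash, so runs are repeatable)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_local(config, lines):
    """Toy 'framework': runs the configured job in-process on given lines."""
    buckets = [dict() for _ in range(config.num_reducers)]
    for line in lines:
        for key, value in config.mapper(line):
            bucket = buckets[partition(key, config.num_reducers)]
            bucket.setdefault(key, []).append(value)
    results = {}
    for bucket in buckets:  # each bucket plays the role of one reduce task
        for key, values in bucket.items():
            results[key] = config.reducer(key, values)
    return results


# The "driver": it only declares what to run, not where it runs.
config = JobConfig(
    job_name="WordCount",
    mapper=lambda line: ((word, 1) for word in line.split()),
    reducer=lambda key, values: sum(values),
    input_path="/sample/input",    # sample paths; unused by this toy,
    output_path="/sample/output",  # which reads lines passed in directly
)
print(run_local(config, ["a b a", "b c"]))
```
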
