One Pager Cheat Sheet
- To enable the efficient processing of large amounts of data,
MapReduce
was created to help alleviate the difficulty of splitting data across multiple physical machines. - MapReduce is a programming paradigm that
splits
andmaps
large data sets into smaller chunks, processes them in parallel across commodity servers, and aggregates the data to return a consolidated output, providing benefits such as scalability, flexibility, speed and simplicity. - No, the Mapper step does not
shuffle
the data, but instead it splits and maps it into key-value pairs before being further processed by the Map and Reduce functions. - Apache Hadoop is an ecosystem that enhances massive data processing, and is a popular choice in public cloud services such as
Amazon Elastic MapReduce
,Microsoft HDInsight
andGoogle Cloud Dataproc
. - Using MapReduce, you can
tokenize
,sort
,shuffle
, andreduce
the unique words in an input file with a given value to create a new output file ofkey-value
pairs of the unique words and their occurrences. The MapReduce program
can be written in any language and can be divided into three main parts: a Mapper Code, Reducer Code, and a Driver Code, which sets the configuration of the program to be run in Hadoop using the commandhadoop jar hadoop-mapreduce-example.jar WordCount /sample/input /sample/output
.- The last step of the MapReduce process is Reducing, where the data is aggregated and the output is computed by iterating over all values associated with a single key.
Hadoop Streaming
allows MapReduce programs to be written in various languages such asPython
,C++
andRuby
, and to parse theinput/output data format
as specified by the protocol.- The
Driver Code
defines the necessaryconfiguration settings
for Hadoop to execute a MapReduce job efficiently, by assigning tasks to the workers in the cluster.