One Pager Cheat Sheet
- To enable the efficient processing of large amounts of data,
MapReducewas created to help alleviate the difficulty of splitting data across multiple physical machines. - MapReduce is a programming paradigm that
splitsandmapslarge data sets into smaller chunks, processes them in parallel across commodity servers, and aggregates the data to return a consolidated output, providing benefits such as scalability, flexibility, speed and simplicity. - No, the Mapper step does not
shufflethe data, but instead it splits and maps it into key-value pairs before being further processed by the Map and Reduce functions. - Apache Hadoop is an ecosystem that enhances massive data processing, and is a popular choice in public cloud services such as
Amazon Elastic MapReduce,Microsoft HDInsightandGoogle Cloud Dataproc. - Using MapReduce, you can
tokenize,sort,shuffle, andreducethe unique words in an input file with a given value to create a new output file ofkey-valuepairs of the unique words and their occurrences. The MapReduce programcan be written in any language and can be divided into three main parts: a Mapper Code, Reducer Code, and a Driver Code, which sets the configuration of the program to be run in Hadoop using the commandhadoop jar hadoop-mapreduce-example.jar WordCount /sample/input /sample/output.- The last step of the MapReduce process is Reducing, where the data is aggregated and the output is computed by iterating over all values associated with a single key.
Hadoop Streamingallows MapReduce programs to be written in various languages such asPython,C++andRuby, and to parse theinput/output data formatas specified by the protocol.- The
Driver Codedefines the necessaryconfiguration settingsfor Hadoop to execute a MapReduce job efficiently, by assigning tasks to the workers in the cluster.

