One Pager Cheat Sheet

  • MapReduce was created to process large amounts of data efficiently by easing the difficulty of splitting data across multiple physical machines.
  • MapReduce is a programming paradigm that splits large data sets into smaller chunks, processes them in parallel across commodity servers, and aggregates the results into a consolidated output, providing benefits such as scalability, flexibility, speed and simplicity.
  • The Mapper step does not shuffle the data. It converts each input split into intermediate key-value pairs, which are then shuffled and sorted by key before being passed to the Reduce function.
  • Apache Hadoop is an open-source ecosystem for storing and processing massive data sets, and it is offered as a managed service by public clouds, for example Amazon EMR (Elastic MapReduce), Microsoft Azure HDInsight and Google Cloud Dataproc.
  • Using MapReduce, you can count the words in an input file: tokenize each line into words, emit each word with a value of 1, shuffle and sort the pairs by word, and reduce them to create a new output file of key-value pairs holding each unique word and its number of occurrences.
  • A MapReduce program can be written natively in Java, or in other languages through Hadoop Streaming, and can be divided into three main parts: the Mapper Code, the Reducer Code, and the Driver Code, which sets the configuration of the job. The packaged program is run in Hadoop using a command such as hadoop jar hadoop-mapreduce-example.jar WordCount /sample/input /sample/output.
  • The last step of the MapReduce process is Reducing, where the data is aggregated and the output is computed by iterating over all values associated with a single key.
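  • Hadoop Streaming allows MapReduce programs to be written in various languages such as Python, C++ and Ruby. The mapper and reducer are ordinary executables that read input records from standard input and write key-value pairs to standard output, one per line, with the key and value separated by a tab by default.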
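  • The Driver Code defines the configuration settings Hadoop needs to execute a MapReduce job, such as the Mapper and Reducer classes, the output key and value types, and the input and output paths. Hadoop itself then assigns the resulting tasks to the workers in the cluster.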
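
The word-count flow summarized above can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not Hadoop code: the function names (map_words, shuffle_and_sort, reduce_counts, word_count) and the sample sentences are assumptions made for this example, and a real cluster would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for every token in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle step: group all values by key, then order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_counts(key, values):
    """Reduce step: iterate over all values of one key and sum them."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle/sort, and reduce in sequence on in-memory lines."""
    mapped = chain.from_iterable(map_words(line) for line in lines)
    return [reduce_counts(key, values) for key, values in shuffle_and_sort(mapped)]


if __name__ == "__main__":
    # Hypothetical sample input; each string stands in for one line of a file.
    sample = ["Dear Bear River", "Car Car River", "Deer Car Bear"]
    for word, count in word_count(sample):
        # Tab-separated key and value, one pair per line.
        print(f"{word}\t{count}")
```

Run on the sample lines, this prints bear 2, car 3, dear 1, deer 1 and river 2. The grouping in shuffle_and_sort stands in for the work the framework does between the Mapper and the Reducer, which is why the reducer only ever sees one key together with all of its values.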