In this blog, I will show you how to run a MapReduce program. MapReduce is one of the core components of Apache Hadoop; it is its processing layer. Before I show you how to run a MapReduce program, let me briefly explain MapReduce.
MapReduce is a framework for processing large data sets in parallel: it processes the data and reduces it to a summarized result. A MapReduce program has two parts, a mapper and a reducer. The reducers start only after the mappers have finished their work.
Mapper : It maps input key/value pairs to a set of intermediate key/value pairs.
Reducer : It reduces a set of intermediate values which share a key to a smaller set of values.
Basically, in the wordcount MapReduce program, we provide one or more text files as input. When the program runs, it goes through the phases below (a short code sketch of the mapper and reducer follows the list):
Splitting : Each line of the input file is split into words.
Mapping : Each word is turned into a key/value pair, where the word is the key and 1 is the value.
Shuffling : Pairs with the same key are grouped together.
Reducing : The values for each key are added together, giving the final count per word.
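To make these phases concrete, here is a minimal sketch of a wordcount mapper and reducer using the Hadoop Java API. The class names TokenizerMapper and IntSumReducer are illustrative and may differ from the ones used in the project exported later in this post.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// In the classic example these are static nested classes of the WordCount driver,
// but they can also live in their own files.

// Mapper: splits each input line into words and emits a (word, 1) pair per word.
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // Mapping phase: emit (word, 1)
        }
    }
}

// Reducer: receives all values for one word (after shuffling) and sums them.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();           // Reducing phase: add up the 1s for this word
        }
        result.set(sum);
        context.write(key, result);     // Final output: (word, count)
    }
}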
Running the MapReduce Program
MapReduce programs are typically written in Java, and most developers use the Eclipse IDE for development. So in this blog, I will show you how to export a MapReduce program from the Eclipse IDE as a jar file and run it on a Hadoop cluster.
My MapReduce wordcount program is ready in my Eclipse IDE.
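Besides the mapper and reducer, the project contains a driver class that configures and submits the job. A typical driver looks roughly like the sketch below; the class name WordCount and the mapper/reducer classes it references are assumptions for illustration and may differ in your own project.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);

        // Wire up the mapper and reducer shown earlier.
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths come from the command line (/input and /output below).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}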
Now, to run this MapReduce program on a Hadoop cluster, we export the project as a jar file. In the Eclipse IDE, open the File menu and click Export. Under the Java category, select JAR file and click Next.
Select the Wordcount project and provide a path and name for the jar file; I am naming it wordcount.jar. Click Next twice.
Now click Browse, select the main class, and finally click Finish to create the jar file. If you get a warning like the one below, just click OK.
Check whether your Hadoop cluster is up and running.
Command: jps
hadoop@hadoop-VirtualBox:~$ jps
3008 NodeManager
3924 Jps
2885 ResourceManager
2505 DataNode
3082 JobHistoryServer
2716 SecondaryNameNode
2383 NameNode
hadoop@hadoop-VirtualBox:~$
Next, put the input file for the wordcount program into HDFS and verify its contents.
hadoop@hadoop-VirtualBox:~$ hdfs dfs -put input /
hadoop@hadoop-VirtualBox:~$ hdfs dfs -cat /input
This is my first mapreduce test
This is wordcount program
hadoop@hadoop-VirtualBox:~$
Now run the wordcount.jar file using the command below.
Note: Since we selected the main class while exporting wordcount.jar, there is no need to mention the main class in the command.
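If the main class had not been set in the jar's manifest, you would pass the fully qualified name of the driver class explicitly; assuming the driver class is named WordCount, the command would look like this:
Command: hadoop jar wordcount.jar WordCount /input /output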
Command: hadoop jar wordcount.jar /input /output
hadoop@hadoop-VirtualBox:~$ hadoop jar wordcount.jar /input /output
16/11/27 22:52:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/11/27 22:52:22 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/11/27 22:52:27 INFO input.FileInputFormat: Total input paths to process : 1
16/11/27 22:52:28 INFO mapreduce.JobSubmitter: number of splits:1
16/11/27 22:52:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1480267251741_0001
16/11/27 22:52:32 INFO impl.YarnClientImpl: Submitted application application_1480267251741_0001
16/11/27 22:52:33 INFO mapreduce.Job: The url to track the job: http://hadoop-VirtualBox:8088/proxy/application_1480267251741_0001/
16/11/27 22:52:33 INFO mapreduce.Job: Running job: job_1480267251741_0001
16/11/27 22:53:20 INFO mapreduce.Job: Job job_1480267251741_0001 running in uber mode : false
16/11/27 22:53:20 INFO mapreduce.Job:  map 0% reduce 0%
16/11/27 22:53:45 INFO mapreduce.Job:  map 100% reduce 0%
16/11/27 22:54:13 INFO mapreduce.Job:  map 100% reduce 100%
16/11/27 22:54:15 INFO mapreduce.Job: Job job_1480267251741_0001 completed successfully
16/11/27 22:54:16 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=124
		FILE: Number of bytes written=237911
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=150
		HDFS: Number of bytes written=66
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=21062
		Total time spent by all reduces in occupied slots (ms)=25271
		Total time spent by all map tasks (ms)=21062
		Total time spent by all reduce tasks (ms)=25271
		Total vcore-milliseconds taken by all map tasks=21062
		Total vcore-milliseconds taken by all reduce tasks=25271
		Total megabyte-milliseconds taken by all map tasks=21567488
		Total megabyte-milliseconds taken by all reduce tasks=25877504
	Map-Reduce Framework
		Map input records=2
		Map output records=10
		Map output bytes=98
		Map output materialized bytes=124
		Input split bytes=92
		Combine input records=0
		Combine output records=0
		Reduce input groups=8
		Reduce shuffle bytes=124
		Reduce input records=10
		Reduce output records=8
		Spilled Records=20
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=564
		CPU time spent (ms)=4300
		Physical memory (bytes) snapshot=330784768
		Virtual memory (bytes) snapshot=3804205056
		Total committed heap usage (bytes)=211812352
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=58
	File Output Format Counters
		Bytes Written=66
hadoop@hadoop-VirtualBox:~$
After the program runs successfully, go to HDFS and check the part file inside the output directory.
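You can first list the output directory to see what the job wrote; a successful job typically produces a _SUCCESS marker file and one part-r-xxxxx file per reducer.
Command: hdfs dfs -ls /output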
Below is the output of the wordcount program.
hadoop@hadoop-VirtualBox:~$ hdfs dfs -cat /output/part-r-00000
This	2
first	1
is	2
mapreduce	1
my	1
program	1
test	1
wordcount	1
hadoop@hadoop-VirtualBox:~$
Conclusion
This example is in Java, but you can write a MapReduce program in Python as well. We successfully ran a Hadoop MapReduce program on a Hadoop cluster on Ubuntu 16.04. The steps to run a MapReduce program on other Linux environments remain the same. Before running the program, make sure that your Hadoop cluster is up and running and that your input file is present in HDFS.