In this article, I will show you how to set up a single-node Hadoop cluster using Docker. Before I start with the setup, let me briefly remind you what Docker and Hadoop are.
Docker is a software containerization platform that lets you package an application together with its libraries, dependencies, and environment in a container, called a Docker container. With Docker, you can build, ship, and run an application on the fly.
For example, if you want to test an application on an Ubuntu system, you do not need to install a complete operating system on your laptop or desktop, or start a virtual machine running Ubuntu, both of which take a lot of time and disk space. You can simply start an Ubuntu Docker container that already has the environment and libraries you need to test your application.
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers. These days it is one of the most important technologies in the industry. To use Hadoop to store and analyse huge amounts of data, you need to set up a Hadoop cluster. If you have set up a Hadoop cluster before, you know it is not an easy task.
What if I say that setting up a Hadoop cluster is hardly a 5-10 minute job? Will you believe me? I guess not!
This is where Docker comes into the picture: using Docker, you can set up a Hadoop cluster in no time.
Benefits of using Docker for setting up a Hadoop cluster
- Installs and runs Hadoop in no time.
- Uses resources only as needed, so nothing is wasted.
- Easily scalable, which suits Hadoop test environments.
- No worrying about Hadoop dependencies, libraries, etc.; the Docker image takes care of them.
Set Up a Single-Node Hadoop Cluster Using Docker
Let us now see how to set up a single-node Hadoop cluster using Docker. I am using an Ubuntu 16.04 system, and Docker is already installed and configured on it.
Before I set up the cluster, let me run a simple example to confirm that Docker is working correctly on my system.
Let me check whether I have any Docker containers running as of now.
hadoop@hadoop-VirtualBox:~$ docker ps
CONTAINER ID   IMAGE   COMMAND   CREATED   STATUS   PORTS   NAMES
I don't have any running containers as of now. Let me run a simple hello-world Docker example.
hadoop@hadoop-VirtualBox:~$ docker run hello-world
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
- The Docker client contacted the Docker daemon.
- The Docker daemon pulled the "hello-world" image from the Docker Hub.
- The Docker daemon created a new container from that image which runs the executable that produces the output you are currently reading.
- The Docker daemon streamed that output to the Docker client, which sent it to your terminal.
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker Hub account:
https://hub.docker.com
For more examples and ideas, visit:
https://docs.docker.com/engine/userguide/
So now you know that Docker is working properly. Let us go ahead and install Hadoop in a Docker container. To do so, we need a Hadoop Docker image. The command below pulls a Hadoop 2.7.1 Docker image.
hadoop@hadoop-VirtualBox:~$ sudo docker pull sequenceiq/hadoop-docker:2.7.1
[sudo] password for hadoop:
2.7.1: Pulling from sequenceiq/hadoop-docker
b253335dcf03: Pull complete
a3ed95caeb02: Pull complete
11c8cd810974: Pull complete
49d8575280f2: Pull complete
2240837237fc: Pull complete
e727168a1e18: Pull complete
ede4c89e7b84: Pull complete
a14c58904e3e: Pull complete
8d72113f79e9: Pull complete
44bc7aa001db: Pull complete
f1af80e588d1: Pull complete
54a0f749c9e0: Pull complete
f620e24d35d5: Pull complete
ff68d052eb73: Pull complete
d2f5cd8249bc: Pull complete
5d3c1e2c16b1: Pull complete
6e1d5d78f75c: Pull complete
a0d5160b2efd: Pull complete
b5c5006d9017: Pull complete
6a8c6da42d5b: Pull complete
13d1ee497861: Pull complete
e3be4bdd7a5c: Pull complete
391fb9240903: Pull complete
Digest: sha256:0ae1419989844ca8b655dea261b92554740ec3c133e0826866c49319af7359db
Status: Downloaded newer image for sequenceiq/hadoop-docker:2.7.1
Run the command below to check whether the Hadoop Docker image was downloaded correctly.
hadoop@hadoop-VirtualBox:~$ docker images
REPOSITORY                 TAG      IMAGE ID       CREATED        SIZE
hello-world                latest   c54a2cc56cbb   5 months ago   1.848 kB
sequenceiq/hadoop-docker   2.7.1    e3c6e05ab051   2 years ago    1.516 GB
hadoop@hadoop-VirtualBox:~$
Now run this Docker image, which creates a Docker container in which Hadoop 2.7.1 runs.
hadoop@hadoop-VirtualBox:~$ docker run -it sequenceiq/hadoop-docker:2.7.1 /etc/bootstrap.sh -bash
/
Starting sshd: [ OK ]
Starting namenodes on [e34a63e1dcf8]
e34a63e1dcf8: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-e34a63e1dcf8.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-e34a63e1dcf8.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-e34a63e1dcf8.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-e34a63e1dcf8.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-e34a63e1dcf8.out
Now that the Docker container has started, run the jps command to see whether the Hadoop services are up and running.
bash-4.1# jps
291 SecondaryNameNode
560 NodeManager
856 Jps
107 NameNode
483 ResourceManager
180 DataNode
bash-4.1#
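If you would rather not eyeball the jps listing, the check can be scripted. The sketch below is only an illustration: the function name check_hadoop_daemons is made up for this article, and the list of daemons is simply the five services that appear in the jps output above.

```shell
# Hypothetical helper: reads `jps` output on stdin and reports any of the
# five Hadoop daemons listed above that does not appear in it.
check_hadoop_daemons() {
  local jps_output daemon missing=0
  jps_output=$(cat)
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # -w matches whole words only, so "NameNode" is not satisfied by
    # "SecondaryNameNode".
    if ! printf '%s\n' "$jps_output" | grep -qw "$daemon"; then
      echo "missing: $daemon"
      missing=1
    fi
  done
  if [ "$missing" -eq 0 ]; then
    echo "all daemons running"
  fi
  return "$missing"
}
```

Inside the container you would pipe the real output into it, for example: jps | check_hadoop_daemons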
Open a new terminal and run the command below to list the running containers and their details.
hadoop@hadoop-VirtualBox:~$ docker ps
CONTAINER ID   IMAGE                            COMMAND                  CREATED          STATUS          PORTS                                                                                                                    NAMES
e34a63e1dcf8   sequenceiq/hadoop-docker:2.7.1   "/etc/bootstrap.sh -b"   44 minutes ago   Up 44 minutes   22/tcp, 8030-8033/tcp, 8040/tcp, 8042/tcp, 8088/tcp, 49707/tcp, 50010/tcp, 50020/tcp, 50070/tcp, 50075/tcp, 50090/tcp   condescending_poincare
Go back to your Docker container terminal and run the command below to get the IP address of the container.
bash-4.1# ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:AC:11:00:02
          inet addr:172.17.0.2  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:acff:fe11:2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:56 errors:0 dropped:0 overruns:0 frame:0
          TX packets:31 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:6803 (6.6 KiB)  TX bytes:2298 (2.2 KiB)
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:28648 errors:0 dropped:0 overruns:0 frame:0
          TX packets:28648 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:4079499 (3.8 MiB)  TX bytes:4079499 (3.8 MiB)
bash-4.1#
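The address we need is the "inet addr" value of eth0. If you want to pull it out without reading the whole listing, something like the following would do. This is a rough sketch: container_ip is a name invented here, and it assumes the older ifconfig layout shown above ("inet addr:172.17.0.2"); newer ifconfig versions print the address differently.

```shell
# Hypothetical helper: reads `ifconfig` output on stdin and prints the IPv4
# address of eth0. Assumes the "inet addr:<ip>" format shown above.
container_ip() {
  awk '
    # A line starting in column one begins a new interface stanza.
    /^[^ \t]/ { in_eth0 = ($1 == "eth0") }
    in_eth0 && /inet addr:/ {
      sub(/.*inet addr:/, "")   # drop everything up to the address
      print $1                  # the address is now the first field
      exit
    }
  '
}

# From the host, Docker can report the same address without entering the
# container (replace <container> with the ID or name from `docker ps`):
#   docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <container>
```

Inside the container, run it as: ifconfig | container_ip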
The jps output already showed that all the services are running, so let us now check the NameNode UI in the browser. Go to 172.17.0.2:50070 in the browser, and there you go: the NameNode UI of a Hadoop cluster running in a Docker container.
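Browsing to 172.17.0.2 works here because the browser runs on the same Linux host as Docker. On setups where the container's address is not reachable from the host (Docker Desktop, for instance), a common alternative is to publish the ports with -p when starting the container and then browse to localhost:50070 instead. The sketch below only prints the command so you can review it before running it; build_run_command is a made-up name, and the image and bootstrap script are the ones used earlier in this article.

```shell
# Hypothetical helper: prints (does not run) a `docker run` command that
# publishes each given container port on the same host port.
build_run_command() {
  local cmd="docker run -it" port
  for port in "$@"; do
    cmd="$cmd -p ${port}:${port}"
  done
  echo "$cmd sequenceiq/hadoop-docker:2.7.1 /etc/bootstrap.sh -bash"
}

# Example: publish the NameNode UI (50070) and the ResourceManager UI (8088).
build_run_command 50070 8088
```

Copy the printed command into your terminal once it looks right.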
Just to make sure that the Hadoop cluster is working fine, let us run a Hadoop MapReduce example in the Docker container.
bash-4.1# cd $HADOOP_PREFIX
bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'
16/11/29 13:07:02 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/11/29 13:07:07 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
16/11/29 13:07:08 INFO input.FileInputFormat: Total input paths to process : 27
16/11/29 13:07:10 INFO mapreduce.JobSubmitter: number of splits:27
16/11/29 13:07:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1480434980067_0001
16/11/29 13:07:14 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
16/11/29 13:07:15 INFO impl.YarnClientImpl: Submitted application application_1480434980067_0001
16/11/29 13:07:16 INFO mapreduce.Job: The url to track the job: http://e34a63e1dcf8:8088/proxy/application_1480434980067_0001/
16/11/29 13:07:16 INFO mapreduce.Job: Running job: job_1480434980067_0001
16/11/29 13:07:58 INFO mapreduce.Job: Job job_1480434980067_0001 running in uber mode : false
16/11/29 13:07:58 INFO mapreduce.Job: map 0% reduce 0%
16/11/29 13:10:44 INFO mapreduce.Job: map 22% reduce 0%
16/11/29 13:13:40 INFO mapreduce.Job: map 22% reduce 7%
16/11/29 13:13:41 INFO mapreduce.Job: map 26% reduce 7%
16/11/29 13:20:30 INFO mapreduce.Job: map 96% reduce 32%
16/11/29 13:21:01 INFO mapreduce.Job: map 100% reduce 32%
16/11/29 13:21:04 INFO mapreduce.Job: map 100% reduce 100%
16/11/29 13:21:08 INFO mapreduce.Job: Job job_1480434980067_0001 completed successfully
16/11/29 13:21:10 INFO mapreduce.Job: Counters: 50
    File System Counters
        FILE: Number of bytes read=345
        FILE: Number of bytes written=2621664
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=64780
        HDFS: Number of bytes written=437
        HDFS: Number of read operations=84
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
        Launched map tasks=29
        Launched reduce tasks=1
        Data-local map tasks=29
    Map-Reduce Framework
        Map input records=1586
        Map output records=24
        Bytes Written=437
16/11/29 13:21:10 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/11/29 13:21:10 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
16/11/29 13:21:10 INFO input.FileInputFormat: Total input paths to process : 1
16/11/29 13:21:12 INFO mapreduce.JobSubmitter: number of splits:1
16/11/29 13:21:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1480434980067_0002
16/11/29 13:21:13 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
16/11/29 13:21:14 INFO impl.YarnClientImpl: Submitted application application_1480434980067_0002
16/11/29 13:21:14 INFO mapreduce.Job: The url to track the job: http://e34a63e1dcf8:8088/proxy/application_1480434980067_0002/
16/11/29 13:21:14 INFO mapreduce.Job: Running job: job_1480434980067_0002
16/11/29 13:21:48 INFO mapreduce.Job: Job job_1480434980067_0002 running in uber mode : false
16/11/29 13:21:48 INFO mapreduce.Job: map 0% reduce 0%
16/11/29 13:22:12 INFO mapreduce.Job: map 100% reduce 0%
16/11/29 13:22:37 INFO mapreduce.Job: map 100% reduce 100%
16/11/29 13:22:38 INFO mapreduce.Job: Job job_1480434980067_0002 completed successfully
16/11/29 13:22:38 INFO mapreduce.Job: Counters: 49
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
    Map-Reduce Framework
        Map input records=11
        Map output records=11
        Map output bytes=263
        Map output materialized bytes=291
        Input split bytes=132
        Physical memory (bytes) snapshot=334082048
        Virtual memory (bytes) snapshot=1297162240
        Total committed heap usage (bytes)=209518592
    File Input Format Counters
        Bytes Read=437
    File Output Format Counters
        Bytes Written=197
bash-4.1#
Check the output.
bash-4.1# bin/hdfs dfs -cat output/*
6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.period
2 dfs.audit.log.maxfilesize
2 dfs.audit.log.maxbackupindex
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.replication
1 dfs.file
bash-4.1#
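To make sense of this output: the grep example finds every match of the regular expression 'dfs[a-z.]+' in the input files, counts how often each distinct match occurs, and lists the matches by count. On a single machine, roughly the same result can be produced with ordinary shell tools. The sketch below is a local stand-in and not how Hadoop implements it; count_matches is a name made up for this article, and ties are ordered alphabetically here, which may differ from the order Hadoop prints.

```shell
# Hypothetical helper: local, non-distributed counterpart of the grep example.
# Usage: count_matches PATTERN FILE...
count_matches() {
  local pattern=$1
  shift
  # grep: -o prints only the matched text, -h omits file names,
  #       -E enables extended regex syntax such as "+".
  # uniq -c counts each distinct match; awk tidies its padded output;
  # the final sort orders by count (descending), then by match text.
  grep -ohE -- "$pattern" "$@" \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{ print $1, $2 }' \
    | LC_ALL=C sort -k1,1nr -k2,2
}
```

For example, count_matches 'dfs[a-z.]+' *.xml run in a directory containing a copy of the Hadoop configuration files should give counts comparable to the listing above, assuming those are the same files the job read from its input directory.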
We have successfully run a single-node Hadoop cluster using Docker. As you saw, we had to do almost nothing to set up the cluster, and within minutes it was up and running. As mentioned before, Docker is well suited to testing environments, so if you want to test a Hadoop application, setting up a Hadoop cluster in a Docker container is one of the easiest and fastest ways to do it.