In this article, I will show you how to set up a single-node Hadoop cluster using Docker. Before I start with the setup, let me briefly remind you what Docker and Hadoop are.
Docker is a software containerization platform that lets you package an application together with all its libraries, dependencies, and environment in a container, known as a Docker container. With Docker, you can build, ship, and run an application on the fly.
For example, if you want to test an application on an Ubuntu system, you do not need to set up a complete operating system on your laptop or desktop, or start a virtual machine running Ubuntu. That would take a lot of time and disk space. You can simply start an Ubuntu Docker container, which already has the environment and libraries you need to test your application.
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers. These days it is one of the most important technologies in the industry. To use Hadoop to store and analyse huge amounts of data, you need to set up a Hadoop cluster. If you have set up a Hadoop cluster before, you know it is not an easy task.
What if I told you that setting up a Hadoop cluster is a 5-10 minute job? Would you believe me? I guess not!
This is where Docker comes into the picture: with Docker you can set up a Hadoop cluster in minutes.
Benefits of using Docker for setting up a Hadoop cluster
- Installs and runs Hadoop in minutes.
- Uses resources only as needed, so nothing is wasted.
- Scales easily, which makes it well suited to testing environments for a Hadoop cluster.
- No need to worry about Hadoop dependencies and libraries; the Docker image takes care of them.
Set Up a Single Node Hadoop Cluster Using Docker
Let us now see how to set up a single-node Hadoop cluster using Docker. I am using an Ubuntu 16.04 system, and Docker is already installed and configured on it.
Before I set up the single-node Hadoop cluster, let me run a simple example to confirm that Docker is working correctly on my system.
Let me check whether any Docker containers are running right now.
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
No containers are running at the moment. Let me run the simple hello-world Docker example.
$ docker run hello-world
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
- The Docker client contacted the Docker daemon.
- The Docker daemon pulled the "hello-world" image from the Docker Hub.
- The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
- The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker Hub account:
For more examples and ideas, visit:
So now you know that Docker is working properly. Let us go ahead and install Hadoop in a Docker container. To do so, we need a Hadoop Docker image. The command below pulls a Hadoop 2.7.1 Docker image.
$ sudo docker pull sequenceiq/hadoop-docker:2.7.1
[sudo] password for hadoop:
2.7.1: Pulling from sequenceiq/hadoop-docker
b253335dcf03: Pull complete
a3ed95caeb02: Pull complete
11c8cd810974: Pull complete
49d8575280f2: Pull complete
2240837237fc: Pull complete
e727168a1e18: Pull complete
ede4c89e7b84: Pull complete
a14c58904e3e: Pull complete
8d72113f79e9: Pull complete
44bc7aa001db: Pull complete
f1af80e588d1: Pull complete
54a0f749c9e0: Pull complete
f620e24d35d5: Pull complete
ff68d052eb73: Pull complete
d2f5cd8249bc: Pull complete
5d3c1e2c16b1: Pull complete
6e1d5d78f75c: Pull complete
a0d5160b2efd: Pull complete
b5c5006d9017: Pull complete
6a8c6da42d5b: Pull complete
13d1ee497861: Pull complete
e3be4bdd7a5c: Pull complete
391fb9240903: Pull complete
Status: Downloaded newer image for sequenceiq/hadoop-docker:2.7.1
Run the command below to check whether the Hadoop Docker image was downloaded correctly.
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
hello-world latest c54a2cc56cbb 5 months ago 1.848 kB
sequenceiq/hadoop-docker 2.7.1 e3c6e05ab051 2 years ago 1.516 GB
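If you script this step, you may want to skip the pull when the image is already on the machine. Here is a minimal sketch of that idea. The function name ensure_image is my own (hypothetical), and it relies on `docker images -q` printing an image ID only when the image exists locally.

```shell
#!/usr/bin/env bash
# ensure_image is a hypothetical helper name, not part of Docker.
# It pulls the given image only when no local copy exists.
ensure_image() {
  local image="$1"
  # `docker images -q IMAGE` prints the image ID, or nothing if there is no match.
  if [ -n "$(docker images -q "$image")" ]; then
    echo "already present: $image"
  else
    docker pull "$image"
  fi
}

# Usage (prefix the docker calls with sudo if your user is not in the docker group):
#   ensure_image sequenceiq/hadoop-docker:2.7.1
```

Calling it a second time should print the "already present" line instead of downloading the layers again.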
Now run this Docker image. This creates a Docker container in which Hadoop 2.7.1 runs.
$ docker run -it sequenceiq/hadoop-docker:2.7.1 /etc/bootstrap.sh -bash
Starting sshd: [ OK ]
Starting namenodes on [e34a63e1dcf8]
e34a63e1dcf8: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-e34a63e1dcf8.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-e34a63e1dcf8.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-e34a63e1dcf8.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-e34a63e1dcf8.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-e34a63e1dcf8.out
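Now that the Docker container has started, run the jps command in the container's shell to see whether the Hadoop services are up and running. It should list the daemons started above: NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.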
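Instead of reading the jps listing by eye, you can let the shell do the comparison. The sketch below is only an illustration: missing_daemons is a name I made up, and it assumes jps prints one "<pid> <ClassName>" pair per line, with the five daemon names taken from the startup log above.

```shell
#!/usr/bin/env bash
# missing_daemons is a hypothetical helper: it reads `jps` output on stdin
# and prints every expected Hadoop daemon that does not appear in it.
missing_daemons() {
  local listing daemon
  listing=$(cat)
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # Compare the second field exactly, so "NameNode" does not match
    # "SecondaryNameNode".
    printf '%s\n' "$listing" \
      | awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' \
      || echo "$daemon"
  done
}

# Usage inside the container:
#   jps | missing_daemons
```

If nothing is printed, all five daemons were found. Any name that is printed points to a service that did not start.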
Open a new terminal and run the command below to list the running containers and their details.
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e34a63e1dcf8 sequenceiq/hadoop-docker:2.7.1 "/etc/bootstrap.sh -b" 44 minutes ago Up 44 minutes 22/tcp, 8030-8033/tcp, 8040/tcp, 8042/tcp, 8088/tcp, 49707/tcp, 50010/tcp, 50020/tcp, 50070/tcp, 50075/tcp, 50090/tcp condescending_poincare
Go back to your Docker container terminal and run ifconfig to get the IP address of the Docker container.
bash-4.1# ifconfig
eth0 Link encap:Ethernet HWaddr 02:42:AC:11:00:02
inet addr:172.17.0.2 Bcast:0.0.0.0 Mask:255.255.0.0
inet6 addr: fe80::42:acff:fe11:2/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:56 errors:0 dropped:0 overruns:0 frame:0
TX packets:31 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:6803 (6.6 KiB) TX bytes:2298 (2.2 KiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:28648 errors:0 dropped:0 overruns:0 frame:0
TX packets:28648 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:4079499 (3.8 MiB) TX bytes:4079499 (3.8 MiB)
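You can also read the same address from the host, without going into the container. The sketch below uses docker inspect with a Go template. The function names container_ip and namenode_url are my own (hypothetical); the container ID is the one shown by docker ps above, and port 50070 is taken from its port list.

```shell
#!/usr/bin/env bash
# container_ip is a hypothetical helper: it prints the IP address(es) Docker
# assigned to the given container.
container_ip() {
  # The template walks every network the container is attached to.
  docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "$1"
}

# namenode_url is also hypothetical: it builds the NameNode web UI address
# for a container, using port 50070 from the `docker ps` output.
namenode_url() {
  echo "http://$(container_ip "$1"):50070"
}

# Usage on the host:
#   container_ip e34a63e1dcf8
#   namenode_url e34a63e1dcf8
```

With a single default bridge network this should print the same 172.17.0.2 address that ifconfig reports for eth0.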
The jps command already showed that all the services are running, so let us now check the NameNode UI in the browser. Go to 172.17.0.2:50070 in the browser, and there you go: the NameNode UI of a Hadoop cluster running in a Docker container.
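The container IP works for me because the browser runs on the same Linux host as Docker. If the 172.17.x.x address is not reachable from where your browser runs, one option is to publish the web UI ports when starting the container. This is a sketch rather than what I ran above: the function name run_hadoop_published is hypothetical, and it assumes ports 50070 (NameNode UI) and 8088 (ResourceManager UI) are free on the host.

```shell
#!/usr/bin/env bash
# run_hadoop_published is a hypothetical wrapper around the same
# `docker run` command used earlier, with two ports published.
run_hadoop_published() {
  # -p HOST_PORT:CONTAINER_PORT makes a container port reachable on the host.
  docker run -it \
    -p 50070:50070 \
    -p 8088:8088 \
    sequenceiq/hadoop-docker:2.7.1 /etc/bootstrap.sh -bash
}

# Usage on the host:
#   run_hadoop_published
# Then open localhost:50070 (NameNode) or localhost:8088 (ResourceManager).
```

Note that this starts a new container; it does not change the ports of one that is already running.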
Just to make sure that the Hadoop cluster is working fine, let us run a Hadoop MapReduce example in the Docker container.
bash-4.1# cd $HADOOP_PREFIX
bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'
16/11/29 13:07:02 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/11/29 13:07:07 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
16/11/29 13:07:08 INFO input.FileInputFormat: Total input paths to process : 27
16/11/29 13:07:10 INFO mapreduce.JobSubmitter: number of splits:27
16/11/29 13:07:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1480434980067_0001
16/11/29 13:07:14 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
16/11/29 13:07:15 INFO impl.YarnClientImpl: Submitted application application_1480434980067_0001
16/11/29 13:07:16 INFO mapreduce.Job: The url to track the job: http://e34a63e1dcf8:8088/proxy/application_1480434980067_0001/
16/11/29 13:07:16 INFO mapreduce.Job: Running job: job_1480434980067_0001
16/11/29 13:07:58 INFO mapreduce.Job: Job job_1480434980067_0001 running in uber mode : false
16/11/29 13:07:58 INFO mapreduce.Job: map 0% reduce 0%
16/11/29 13:10:44 INFO mapreduce.Job: map 22% reduce 0%
16/11/29 13:13:40 INFO mapreduce.Job: map 22% reduce 7%
16/11/29 13:13:41 INFO mapreduce.Job: map 26% reduce 7%
16/11/29 13:20:30 INFO mapreduce.Job: map 96% reduce 32%
16/11/29 13:21:01 INFO mapreduce.Job: map 100% reduce 32%
16/11/29 13:21:04 INFO mapreduce.Job: map 100% reduce 100%
16/11/29 13:21:08 INFO mapreduce.Job: Job job_1480434980067_0001 completed successfully
16/11/29 13:21:10 INFO mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=345
FILE: Number of bytes written=2621664
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=64780
HDFS: Number of bytes written=437
HDFS: Number of read operations=84
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Launched map tasks=29
Launched reduce tasks=1
Data-local map tasks=29
Map input records=1586
Map output records=24
16/11/29 13:21:10 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/11/29 13:21:10 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
16/11/29 13:21:10 INFO input.FileInputFormat: Total input paths to process : 1
16/11/29 13:21:12 INFO mapreduce.JobSubmitter: number of splits:1
16/11/29 13:21:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1480434980067_0002
16/11/29 13:21:13 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
16/11/29 13:21:14 INFO impl.YarnClientImpl: Submitted application application_1480434980067_0002
16/11/29 13:21:14 INFO mapreduce.Job: The url to track the job: http://e34a63e1dcf8:8088/proxy/application_1480434980067_0002/
16/11/29 13:21:14 INFO mapreduce.Job: Running job: job_1480434980067_0002
16/11/29 13:21:48 INFO mapreduce.Job: Job job_1480434980067_0002 running in uber mode : false
16/11/29 13:21:48 INFO mapreduce.Job: map 0% reduce 0%
16/11/29 13:22:12 INFO mapreduce.Job: map 100% reduce 0%
16/11/29 13:22:37 INFO mapreduce.Job: map 100% reduce 100%
16/11/29 13:22:38 INFO mapreduce.Job: Job job_1480434980067_0002 completed successfully
16/11/29 13:22:38 INFO mapreduce.Job: Counters: 49
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Map input records=11
Map output records=11
Map output bytes=263
Map output materialized bytes=291
Input split bytes=132
Physical memory (bytes) snapshot=334082048
Virtual memory (bytes) snapshot=1297162240
Total committed heap usage (bytes)=209518592
File Input Format Counters
File Output Format Counters
Check the output.
bash-4.1# bin/hdfs dfs -cat output/*
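One thing to keep in mind if you run the example a second time: a MapReduce job refuses to start when its output directory already exists in HDFS. The sketch below removes the old directory first and then submits the same job again. The function name rerun_grep_example is hypothetical, and it assumes $HADOOP_PREFIX is set as it is in this container.

```shell
#!/usr/bin/env bash
# rerun_grep_example is a hypothetical helper for repeating the example job.
# The body runs in a subshell, so the `cd` does not affect your current shell.
rerun_grep_example() (
  cd "$HADOOP_PREFIX" || exit 1
  # Delete the previous result, if any, because the job fails when the
  # output directory already exists.
  if bin/hdfs dfs -test -d output; then
    bin/hdfs dfs -rm -r output
  fi
  bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar \
    grep input output 'dfs[a-z.]+'
)

# Usage inside the container:
#   rerun_grep_example
#   bin/hdfs dfs -cat output/*
```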
We successfully ran a single-node Hadoop cluster using Docker. As you saw, we had to do almost nothing to set up the cluster, and within minutes we had Hadoop up and running. As mentioned before, Docker is well suited to testing environments, so if you want to test a Hadoop application, setting up a Hadoop cluster in a Docker container and testing the application there is the easiest and fastest way.