How to Set Up a Hadoop Multi-Node Cluster on Ubuntu

In this tutorial, we will learn how to set up a multi-node Hadoop cluster on Ubuntu 16.04. A Hadoop cluster with more than one DataNode is a multi-node cluster, so the goal of this tutorial is to get two DataNodes up and running.

1) Prerequisites

  • Ubuntu 16.04
  • Hadoop-2.7.3
  • Java 7
  • SSH

For this tutorial, I have two Ubuntu 16.04 systems. I call them the master and the slave system; one DataNode will run on each of them.

IP address of Master -> 192.168.1.37


IP address of Slave -> 192.168.1.38


On Master

Edit the hosts file with the master and slave IP addresses.

sudo gedit /etc/hosts

Edit the file as below; you may remove the other lines in the file. After editing, save the file and close it.
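
For reference, with the IP addresses used in this tutorial the entries would look like the lines below (adjust the IPs and hostnames to match your own machines); the same two entries go on both the master and the slave.

192.168.1.37    master
192.168.1.38    slave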


On Slave

Edit the hosts file with the master and slave IP addresses.

sudo gedit /etc/hosts

Edit the file with the same two entries as on the master; you may remove the other lines in the file. After editing, save the file and close it.


2) Java Installation

Before setting up Hadoop, you need to have Java installed on your systems. Install OpenJDK 7 on both Ubuntu machines using the below commands.

sudo add-apt-repository ppa:openjdk-r/ppa


sudo apt-get update


sudo apt-get install openjdk-7-jdk


Run the below command to verify that Java was installed on your system.

java -version


By default, Java gets installed in the /usr/lib/jvm/ directory.

ls /usr/lib/jvm


Set the Java path in the .bashrc file.

sudo gedit .bashrc

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

export PATH=$PATH:/usr/lib/jvm/java-7-openjdk-amd64/bin

Run the below command to apply the changes made in the .bashrc file.

source .bashrc
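
To confirm that the variables took effect (a quick sanity check, assuming the paths above), you can print JAVA_HOME and check which java binary is picked up from the PATH.

echo $JAVA_HOME
which java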

3) SSH

Hadoop requires SSH access to manage its nodes; therefore, we need to install SSH on both the master and slave systems.

sudo apt-get install openssh-server


Now, we have to generate an SSH key on the master machine. When it asks you to enter a file name to save the key, do not give any name; just press Enter.

ssh-keygen -t rsa -P ""


Next, you have to enable SSH access to your master machine with this newly created key.

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys


Now test the SSH setup by connecting to your local machine.

ssh localhost


Now run the below command to send the public key generated on the master to the slave (replace ubuntu with the username on your slave machine if it differs).

ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@slave


Now that both master and slave have the public key, you can connect from master to master and from master to slave as well.

ssh master


ssh slave


On Master

Edit the masters file as below. (Note: this file lives inside the extracted hadoop-2.7.3 directory, so if you have not yet downloaded and extracted Hadoop, complete the installation step in section 4 first and then return here.)

sudo gedit hadoop-2.7.3/etc/hadoop/masters


Edit the slaves file as below.

sudo gedit hadoop-2.7.3/etc/hadoop/slaves

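The masters and slaves files are plain lists of hostnames. Assuming the hostnames defined in /etc/hosts earlier, the masters file contains a single line:

master

and the slaves file lists every node that should run a DataNode, which in this tutorial is both machines:

master
slave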

On Slave

Edit the masters file as below; on the slave it contains the same single line, master.

sudo gedit hadoop-2.7.3/etc/hadoop/masters


4) Hadoop Installation

Now that we have our Java and SSH setup ready, we are good to go and install Hadoop on both systems. Use the below link to download the Hadoop package. I am using the latest stable version, Hadoop 2.7.3.

http://hadoop.apache.org/releases.html


On Master

The below command will download the hadoop-2.7.3 tar file.

wget http://mirror.fibergrid.in/apache/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz


ls

Untar the file

tar -xvf hadoop-2.7.3.tar.gz


ls


Confirm that Hadoop has been installed on your system.

cd hadoop-2.7.3/
bin/hadoop version


Before setting Hadoop configurations, we will set the below environment variables in the .bashrc file.

cd
sudo gedit .bashrc


# Set Hadoop-related environment variables
export HADOOP_HOME=$HOME/hadoop-2.7.3
export HADOOP_CONF_DIR=$HOME/hadoop-2.7.3/etc/hadoop
export HADOOP_MAPRED_HOME=$HOME/hadoop-2.7.3
export HADOOP_COMMON_HOME=$HOME/hadoop-2.7.3
export HADOOP_HDFS_HOME=$HOME/hadoop-2.7.3
export YARN_HOME=$HOME/hadoop-2.7.3

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HOME/hadoop-2.7.3/bin


Put the above lines at the end of your .bashrc file, save the file and close it. Then run the below command to apply the changes.

source .bashrc
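
With the Hadoop bin/ directory now on the PATH, you can verify the installation (a quick check, assuming the variables above have been sourced) by printing the Hadoop version.

hadoop version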

Configure JAVA_HOME in ‘hadoop-env.sh’. This file specifies environment variables that affect the JDK used by Apache Hadoop 2.7.3 daemons started by the Hadoop start-up scripts:

cd hadoop-2.7.3/etc/hadoop/

sudo gedit hadoop-env.sh


export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 


Set the java path as shown above, save the file and close it.

Now we will create NameNode and DataNode directories.

cd

mkdir -p $HADOOP_HOME/hadoop2_data/hdfs/namenode

mkdir -p $HADOOP_HOME/hadoop2_data/hdfs/datanode


Hadoop has many configuration files, which need to be configured as per the requirements of your Hadoop infrastructure. Let us configure the Hadoop configuration files one by one.

cd hadoop-2.7.3/etc/hadoop/

sudo gedit core-site.xml

core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

sudo gedit hdfs-site.xml

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/ubuntu/hadoop-2.7.3/hadoop2_data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/ubuntu/hadoop-2.7.3/hadoop2_data/hdfs/datanode</value>
  </property>
</configuration>

sudo gedit yarn-site.xml

yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

cp mapred-site.xml.template mapred-site.xml

sudo gedit mapred-site.xml

mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Now follow the same Hadoop installation and configuration steps on the slave machine as well. After you have Hadoop installed and configured on both systems, the first step in starting up your Hadoop cluster is formatting the Hadoop file system, which is implemented on top of the local file systems of your cluster. This is required only the first time you set up the cluster. Do not format a running Hadoop file system, as this will erase all your HDFS data.

On Master

cd

cd hadoop-2.7.3/bin

hadoop namenode -format
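
Note that in Hadoop 2.x the hadoop namenode command is deprecated in favour of the hdfs command; if you prefer, the equivalent form is:

hdfs namenode -format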


We are now ready to start the Hadoop daemons, i.e. NameNode, DataNode, ResourceManager and NodeManager, on our Apache Hadoop cluster.

cd ..

Now run the below command to start the NameNode on the master machine and the DataNodes on master and slave.

sbin/start-dfs.sh


The below command will start the YARN daemons; the ResourceManager will run on master and the NodeManagers will run on both master and slave.

sbin/start-yarn.sh


Cross-check that all the services have started correctly using jps (the JVM Process Status tool) on both the master and slave machines.

Below are the daemons running on the master machine; you should see NameNode, SecondaryNameNode, DataNode, ResourceManager and NodeManager listed, along with Jps itself.

jps


On Slave

You will see that DataNode and NodeManager are running on the slave machine as well.

jps


Now open your browser on the master machine and go to the below URL.

Check the NameNode status:  http://master:50070/dfshealth.html


If you see '2' under live nodes, that means 2 DataNodes are up and running and you have successfully set up a multi-node Hadoop cluster.

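You can also check the number of live DataNodes from the command line (an alternative to the web UI, assuming the cluster is running and the Hadoop bin/ directory is on your PATH):

hdfs dfsadmin -report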

Conclusion

You can add more nodes to your Hadoop cluster; all you need to do is add the new slave node's IP to the slaves file on master, copy the SSH key to the new slave node, put the master IP in the masters file on the new slave node, and then restart the Hadoop services (see the sketch below). Congratulations! You have successfully set up a multi-node Hadoop cluster.
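
As a rough sketch, assuming the new node is named slave2, already appears in /etc/hosts on all machines, and has Java, SSH and the same Hadoop build and configuration as described above, the steps boil down to:

# On master: allow passwordless SSH to the new node
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@slave2

# On master: add the new node to the slaves file
echo "slave2" >> $HOME/hadoop-2.7.3/etc/hadoop/slaves

# On slave2: point the masters file at the master node
echo "master" > $HOME/hadoop-2.7.3/etc/hadoop/masters

# On master: restart the Hadoop services
cd $HOME/hadoop-2.7.3
sbin/stop-yarn.sh
sbin/stop-dfs.sh
sbin/start-dfs.sh
sbin/start-yarn.sh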


17 Comments

  1. I can't run "sbin/start-dfs.sh".
    When I run "sbin/start-dfs.sh", I get an error message saying "WARN hdfs.DFSUtil: Namenode for null remains unresolved for ID null. Check your hdfs-site.xml file to ensure namenodes are configured properly." and "master: ssh: Could not resolve hostname master: Name or service not known
    master: ssh: Could not resolve hostname master: Name or service not known", and then "ubuntu@slave's password:".
    So I entered the password, but I still couldn't run "start-dfs.sh".
    What should I do?

    1. That is my hdfs-site.xml file.
      My HADOOP_HOME is home/selab1

      dfs.replication = 2
      dfs.permissions = false
      dfs.namenode.name.dir = ~/hadoop-2.7.3/hadoop2_data/hdfs/namenode
      dfs.datanode.data.dir = ~/hadoop-2.7.3/hadoop2_data/hdfs/datanode

    2. I think it's a host issue and an SSH issue as well. Please check that your IP address has not changed and that it is the same as in the /etc/hosts file.

      If the IP has changed, update it in the /etc/hosts file as well.

  2. Hi AnHyeonSik,

    Can you please post your complete hdfs-site.xml here in xml syntax, so that I can verify.

    Also, one reason can be that your start-dfs.sh file does not have execute permission; to grant it, run this command -----> sudo chmod +x sbin/start-dfs.sh

    Regards,
    John @Linoxide

    1. Oh, I got it working.
      I have one more question.
      Does this post mean only one master connected to one slave?
      Can I connect [ master - slave1 : slave2 ]?
      Please give me a hand...

      1. Yes, you can have slave 1 and slave 2 as well. You can just add one more machine, name it slave 2, and follow the same steps as I have done for the slave machine.

        You need to add the IP of the slave 2 machine to the slaves file on master and set up SSH with slave 2.

      1. After running the jps command on master, are all the Hadoop daemons up and running? If yes, just try giving master:50070 in the browser.

        If the daemons are not running, check your configurations and start the services again.

        1. In your guide you set Core-site.xml to use hdfs://master:9000

          but then put the URL as: http://master:50070/dfshealth.html

          Maybe this is the problem?

          Also, early on you say to do: sudo gedit hadoop-2.7.3/etc/hadoop/masters but at this point in the process the hadoop directory doesn't exist because you didn't download and extract yet...

  3. I have installed everything properly on a 32-bit system, but when I try to execute the jps command it only shows the Jps process. Sometimes it shows all the processes. Could you suggest what I am doing wrong?

    Thanks in advance.

  4. When I type "jps" in the terminals (both master and slave) I can see everything but the DataNode. I have double-checked the whole process again but I don't see anything different. What could it be?
    (I am using Java 8)
    Thanks a lot.

  5. Thank you for the elaborate procedure. I have followed it and now I have this problem. I am not sure what this means, since I have checked the line numbers in the start-dfs.sh script but have no idea what is happening.

    ubuntu@ec2-54-245-219-137:~/hadoop-2.7.3/sbin$ ./start-dfs.sh
    ./start-dfs.sh: line 56: /home/ubuntu/hadoop/bin/hdfs: No such file or directory
    Starting namenodes on []
    ./start-dfs.sh: line 61: /home/ubuntu/hadoop/sbin/hadoop-daemons.sh: No such file or directory
    ./start-dfs.sh: line 74: /home/ubuntu/hadoop/sbin/hadoop-daemons.sh: No such file or directory
    ./start-dfs.sh: line 109: /home/ubuntu/hadoop/bin/hdfs: No such file or directory

    However, the start-yarn.sh was successful.
    kindly assist
    ubuntu@ec2-54-245-219-137:~/hadoop-2.7.3/sbin$ jps
    22972 Jps
    22859 NodeManager
    ubuntu@ec2-54-245-219-137:~/hadoop-2.7.3/sbin$

  6. Hello,

    Can you help me here?
    I received this output, but I don't know why...
    Thank you in advance,

    master@master-VirtualBox:~/hadoop-2.7.3$ sbin/start-dfs.sh
    Starting namenodes on [master]
    The authenticity of host 'master (192.168.0.100)' can't be established.
    ECDSA key fingerprint is SHA256:OIGRdbP7r9AdAZWCE+v00MFGSoxMFneSDXZctV40eqs.
    Are you sure you want to continue connecting (yes/no)? yes
    master: Warning: Permanently added 'master,192.168.0.100' (ECDSA) to the list of known hosts.
    master: starting namenode, logging to /home/master/hadoop-2.7.3/logs/hadoop-master-namenode-master-VirtualBox.out
    localhost: starting datanode, logging to /home/master/hadoop-2.7.3/logs/hadoop-master-datanode-master-VirtualBox.out
    Starting secondary namenodes [0.0.0.0]
    The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
    ECDSA key fingerprint is SHA256:OIGRdbP7r9AdAZWCE+v00MFGSoxMFneSDXZctV40eqs.
    Are you sure you want to continue connecting (yes/no)? yes
    0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
    0.0.0.0: starting secondarynamenode, logging to /home/master/hadoop-2.7.3/logs/hadoop-master-secondarynamenode-master-VirtualBox.out
    master@master-VirtualBox:~/hadoop-2.7.3$

  7. Hi All!
    I have followed the steps above, but I had the problem shown below:
    master@master:/etc/hadoop-2.7.3$ sbin/start-dfs.sh
    Incorrect configuration: namenode address dfs.namenode.servicerpc-address or dfs.namenode.rpc-address is not configured.
    Starting namenodes on []
    localhost: Error: JAVA_HOME is not set and could not be found.
    localhost: Error: JAVA_HOME is not set and could not be found.
    Starting secondary namenodes [0.0.0.0]
    0.0.0.0: Error: JAVA_HOME is not set and could not be found.
    master@master:/etc/hadoop-2.7.3$
    /// I was installing hadoop 2.7.3, java 7u21 (installed to /usr/local/java/jdk1.7.0_21) and configured it in /etc/profile.
    I have searched Google about this many times but with no luck.
    Please help me!