How to Install Apache Sqoop on Ubuntu 16.04

November 16, 2016 | In UBUNTU HOWTO

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases, for example MySQL, Oracle, and Microsoft SQL Server. You can import data from a relational database into Hadoop and export it back again. You can also import from and export to semi-structured data sources such as HBase and Cassandra (NoSQL databases). Sqoop 2, the 1.99.x release line used here, ships as one binary package that incorporates two separate parts: a client and a server.

  • Server: Install the server on a single node in your cluster. This node then serves as the entry point for all Sqoop clients.
  • Client: Clients can be installed on any number of machines.

Below are the steps to set up Apache Sqoop on Ubuntu 16.04. The package you need to download is sqoop-1.99.7-bin-hadoop200.tar.gz.

1) Download Sqoop using wget

Download Sqoop to your filesystem with the command below.

wget http://archive.apache.org/dist/sqoop/1.99.7/sqoop-1.99.7-bin-hadoop200.tar.gz

Check that the file downloaded correctly by listing it.

ls -lh sqoop-1.99.7-bin-hadoop200.tar.gz
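Seeing the file in a listing only shows that it exists. A stricter check is to read the archive end to end. The sketch below assumes the download is a gzip-compressed tar file, as its name suggests. The function name check_tarball is hypothetical and not part of Sqoop.

```shell
# check_tarball FILE
# Succeeds only if FILE exists, is non-empty, and can be read end to end
# as a gzip-compressed tar archive. (Hypothetical helper, not part of Sqoop.)
check_tarball() {
    [ -s "$1" ] || { echo "missing or empty: $1" >&2; return 1; }
    tar -tzf "$1" > /dev/null 2>&1 || { echo "corrupt archive: $1" >&2; return 1; }
    echo "archive looks intact: $1"
}
```

For example: check_tarball sqoop-1.99.7-bin-hadoop200.tar.gz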

2) Extract Sqoop tar file

Extract the downloaded file.

tar -xvf sqoop-1.99.7-bin-hadoop200.tar.gz

Check that the archive extracted correctly by listing the new directory.

ls sqoop-1.99.7-bin-hadoop200

3) Move the Sqoop Directory

Move the Sqoop directory to /usr/lib/.

sudo mv sqoop-1.99.7-bin-hadoop200 /usr/lib/

The Sqoop server acts as a Hadoop client, so the Hadoop libraries (YARN, MapReduce, and HDFS jar files) and configuration files (core-site.xml, mapred-site.xml, ...) must be available on this node.

4) Set Hadoop and Sqoop Environment Variables

You should already have the Hadoop environment variables set in your ~/.bashrc file.

# Set Hadoop-related environment variables
export HADOOP_HOME=$HOME/hadoop-2.7.3
export HADOOP_CONF_DIR=$HOME/hadoop-2.7.3/etc/hadoop
export HADOOP_MAPRED_HOME=$HOME/hadoop-2.7.3
export HADOOP_COMMON_HOME=$HOME/hadoop-2.7.3
export HADOOP_HDFS_HOME=$HOME/hadoop-2.7.3
export HADOOP_YARN_HOME=$HOME/hadoop-2.7.3

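Also set the Sqoop environment variables in the same file. Open it as your own user; editing it with sudo is unnecessary and can leave the file owned by root.

gedit ~/.bashrc

Add the lines below to .bashrc.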

export SQOOP_HOME=/usr/lib/sqoop-1.99.7-bin-hadoop200
export PATH=$PATH:$SQOOP_HOME/bin
export SQOOP_CONF_DIR=$SQOOP_HOME/conf
export SQOOP_CLASS_PATH=$SQOOP_CONF_DIR

Use the command below to put the changes into effect.

source ~/.bashrc
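Before moving on, it can help to confirm that the variables actually reached the current shell. This is a minimal sketch. The function name check_sqoop_env is hypothetical, and the list it checks is only a subset of the exports above.

```shell
# check_sqoop_env
# Prints every expected variable that is unset or empty and returns
# non-zero if any is missing. (Hypothetical helper, not part of Sqoop.)
check_sqoop_env() {
    missing=0
    for name in HADOOP_HOME HADOOP_CONF_DIR SQOOP_HOME SQOOP_CONF_DIR; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "not set: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running check_sqoop_env prints nothing when all four variables are set.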

5) Copy Required Jar Files to Sqoop Server lib Directory

Copy the hadoop-common, hadoop-mapreduce, hadoop-hdfs and hadoop-yarn jars to /usr/lib/sqoop-1.99.7-bin-hadoop200/server/lib (the Sqoop server lib directory). Copy all the jars found in the directories below.

/home/ubuntu/hadoop-2.7.3/share/hadoop/common
/home/ubuntu/hadoop-2.7.3/share/hadoop/common/lib
/home/ubuntu/hadoop-2.7.3/share/hadoop/hdfs
/home/ubuntu/hadoop-2.7.3/share/hadoop/hdfs/lib
/home/ubuntu/hadoop-2.7.3/share/hadoop/mapreduce
/home/ubuntu/hadoop-2.7.3/share/hadoop/mapreduce/lib
/home/ubuntu/hadoop-2.7.3/share/hadoop/yarn
/home/ubuntu/hadoop-2.7.3/share/hadoop/yarn/lib
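Copying from eight directories by hand is easy to get wrong, so here is a sketch of a loop that does it. The function name copy_hadoop_jars is hypothetical. It assumes the share/hadoop layout listed above and that you have write permission on the destination directory.

```shell
# copy_hadoop_jars HADOOP_DIR DEST_DIR
# Copies every .jar from the common, hdfs, mapreduce and yarn directories
# (and their lib/ subdirectories) under HADOOP_DIR/share/hadoop into DEST_DIR.
copy_hadoop_jars() {
    hadoop_dir=$1
    dest_dir=$2
    for component in common hdfs mapreduce yarn; do
        for dir in "$hadoop_dir/share/hadoop/$component" \
                   "$hadoop_dir/share/hadoop/$component/lib"; do
            # Skip directories that do not exist in this Hadoop build.
            [ -d "$dir" ] || continue
            for jar in "$dir"/*.jar; do
                # A glob that matches nothing stays literal, so test the path.
                if [ -f "$jar" ]; then
                    cp "$jar" "$dest_dir"/
                fi
            done
        done
    done
}
```

With the variables from step 4 set, the call would be: copy_hadoop_jars "$HADOOP_HOME" "$SQOOP_HOME/server/lib"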

6) Edit core-site.xml

The Sqoop server needs to impersonate users so that it accesses HDFS and other resources, inside or outside the cluster, as the user who started a given job rather than as the user running the server. To allow this, add the two properties below to Hadoop's core-site.xml. The user name in the property names (ubuntu here) is the user that runs the Sqoop server.

<property>
<name>hadoop.proxyuser.ubuntu.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.ubuntu.groups</name>
<value>*</value>
</property>
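A quick way to confirm the edit was saved is to search the file for both property names. The sketch below is a plain text search, not an XML parse, and the function name has_proxyuser_props is hypothetical.

```shell
# has_proxyuser_props FILE USER
# Succeeds if FILE mentions both hadoop.proxyuser.USER.hosts and
# hadoop.proxyuser.USER.groups. Being a text search, it cannot tell
# whether the entries sit inside an XML comment.
has_proxyuser_props() {
    grep -qF "<name>hadoop.proxyuser.$2.hosts</name>" "$1" &&
    grep -qF "<name>hadoop.proxyuser.$2.groups</name>" "$1"
}
```

For example: has_proxyuser_props "$HADOOP_CONF_DIR/core-site.xml" ubuntu && echo "proxy user configured"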

7) Initialize Metadata Repository

The metadata repository needs to be initialized before starting the Sqoop 2 server for the first time. Run the command below from the Sqoop directory ($SQOOP_HOME).

./bin/sqoop2-tool upgrade

8) Start Sqoop Server

Start the Sqoop server from the Sqoop directory.

./bin/sqoop2-server start

Check that the Sqoop server process has started.

jps
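If you want a script to do this check for you, the jps output can be filtered for the server process. The sketch below assumes the server appears in jps as SqoopJettyServer, which is the name I would expect for the Jetty-based 1.99.7 server. Compare it with your own jps output before relying on it. The function name sqoop_server_listed is hypothetical.

```shell
# sqoop_server_listed
# Reads `jps` output on standard input and succeeds if some line's process
# name is exactly SqoopJettyServer (an assumed name, see the note above).
sqoop_server_listed() {
    awk '$2 == "SqoopJettyServer" { found = 1 } END { exit !found }'
}
```

For example: jps | sqoop_server_listed && echo "Sqoop server is running"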

9) Start Sqoop Client

To set up a client, copy the Sqoop distribution archive to the target machine and extract it in the desired location. You can then start the client there. I am using the same machine as the client as well. Start the Sqoop client from the Sqoop directory.

./bin/sqoop2-shell
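The shell can also run commands from a script file passed as an argument, which is handy for a first connectivity check. The sketch below writes such a file. It assumes the server listens on port 12000 under the webapp name sqoop, which I believe are the defaults, so adjust them if your setup differs. The function name write_client_script and the file name used afterwards are hypothetical.

```shell
# write_client_script FILE [HOST] [PORT]
# Writes a two-line Sqoop 2 shell script to FILE: one command pointing the
# client at a server, one asking for the client and server versions.
# HOST and PORT default to localhost and 12000 (assumed defaults).
write_client_script() {
    cat > "$1" <<EOF
set server --host ${2:-localhost} --port ${3:-12000} --webapp sqoop
show version --all
EOF
}
```

After write_client_script check.sqoop, the file can be run with ./bin/sqoop2-shell check.sqoop.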

10) Download RDBMS Connectors

Download the JDBC connectors for MySQL, Oracle and SQL Server using the links below. Sqoop needs these connectors to open a connection to each RDBMS.

MySQL connector : Download
Oracle Connector : Download
Microsoft SQL Server Connector : Download

Check that all the connectors were downloaded.

ls Downloads/

11) Set an Environment Variable to use RDBMS Connectors

Move all the connectors into one directory and point an environment variable at that directory.

sudo mkdir -p /var/lib/sqoop2/
sudo chmod 777 /var/lib/sqoop2/
mv Downloads/*.jar /var/lib/sqoop2/
ls -l /var/lib/sqoop2/
export SQOOP_SERVER_EXTRA_LIB=/var/lib/sqoop2/
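An export typed at the prompt lasts only for the current session. To keep it across logins, the line can be appended to .bashrc. The sketch below does that without adding a duplicate when run twice. The function name persist_export is hypothetical.

```shell
# persist_export FILE NAME VALUE
# Appends `export NAME=VALUE` to FILE unless that exact line is already
# present, so running it twice does not duplicate the entry.
persist_export() {
    line="export $2=$3"
    touch "$1"
    grep -qxF "$line" "$1" || echo "$line" >> "$1"
}
```

For example: persist_export ~/.bashrc SQOOP_SERVER_EXTRA_LIB /var/lib/sqoop2/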

Conclusion

Voila! You have successfully set up Apache Sqoop on Ubuntu 16.04 and are ready to import and export data with it. The next step is to use one of the RDBMS connectors to move data from an RDBMS to HDFS or from HDFS to an RDBMS.
