Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases, for example MySQL, Oracle, and Microsoft SQL Server. You can import and export data between relational databases and Hadoop. You can also import from and export to semi-structured data sources such as HBase and Cassandra (NoSQL databases). Sqoop ships as one binary package that incorporates two separate parts - client and server.
- Server - You need to install the server on a single node in your cluster. This node then serves as the entry point for all Sqoop clients.
- Client - Clients can be installed on any number of machines.
Below are the steps to set up Apache Sqoop on Ubuntu 16.04.
1) Download Sqoop using wget
Download Sqoop to your filesystem using the below command.
Check if the file got downloaded correctly.
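The commands themselves were not preserved above; a minimal sketch, assuming the Apache archive mirror still hosts the 1.99.7 binary tarball, looks like this:

```shell
# Location of the Sqoop 1.99.7 binary tarball (URL assumes the Apache archive mirror)
SQOOP_URL=https://archive.apache.org/dist/sqoop/1.99.7/sqoop-1.99.7-bin-hadoop200.tar.gz

# Download into the current directory
wget "$SQOOP_URL"

# Confirm the file arrived and has a non-zero size
ls -l sqoop-1.99.7-bin-hadoop200.tar.gz
```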
2) Extract Sqoop tar file
Extract the downloaded file.
tar -xvf sqoop-1.99.7-bin-hadoop200.tar.gz
Check if the file got extracted correctly.
3) Move the Sqoop Directory
Move the sqoop directory to /usr/lib/
sudo mv sqoop-1.99.7-bin-hadoop200 /usr/lib/
The Sqoop server acts as a Hadoop client, therefore Hadoop libraries (Yarn, Mapreduce, and HDFS jar files) and configuration files (core-site.xml, mapreduce-site.xml, ...) must be available on this node.
4) Set Hadoop and Sqoop Environment Variables
You should already have the Hadoop-related environment variables set in your .bashrc file.
Also, set sqoop environment variables in .bashrc file.
gedit ~/.bashrc
Put the below lines in the .bashrc file.
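The exact lines were not preserved above; a minimal sketch, assuming Sqoop ends up under /usr/lib as in step 3, is:

```shell
# Sqoop-related environment variables (path assumes the step-3 install location)
export SQOOP_HOME=/usr/lib/sqoop-1.99.7-bin-hadoop200
export PATH=$PATH:$SQOOP_HOME/bin
```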
Use the below command to bring the changes into effect.
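The command was omitted above; re-reading the file applies the new variables to the current shell session:

```shell
# Re-read .bashrc so the new variables take effect in this shell
source ~/.bashrc
```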
5) Copy Required Jar Files to Sqoop Server lib Directory
Copy the hadoop-common, hadoop-mapreduce, hadoop-hdfs, and hadoop-yarn jars to
/usr/lib/sqoop-1.99.7-bin-hadoop200/server/lib (the Sqoop server lib directory). Below are the paths from which you need to copy all the jars into the Sqoop server lib directory.
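Spelled out as commands, assuming $HADOOP_HOME points at your Hadoop installation and Sqoop was moved to /usr/lib in step 3, the copies look like:

```shell
# Sqoop server lib directory (install path from step 3)
SQOOP_SERVER_LIB=/usr/lib/sqoop-1.99.7-bin-hadoop200/server/lib

# Copy the Hadoop common, HDFS, MapReduce, and YARN jars
# ($HADOOP_HOME is assumed to point at your Hadoop installation)
sudo cp $HADOOP_HOME/share/hadoop/common/*.jar     "$SQOOP_SERVER_LIB"
sudo cp $HADOOP_HOME/share/hadoop/common/lib/*.jar "$SQOOP_SERVER_LIB"
sudo cp $HADOOP_HOME/share/hadoop/hdfs/*.jar       "$SQOOP_SERVER_LIB"
sudo cp $HADOOP_HOME/share/hadoop/mapreduce/*.jar  "$SQOOP_SERVER_LIB"
sudo cp $HADOOP_HOME/share/hadoop/yarn/*.jar       "$SQOOP_SERVER_LIB"
```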
6) Edit core-site.xml
The Sqoop server will need to impersonate users to access HDFS and other resources in or outside of the cluster as the user who started a given job, rather than the user who is running the server. You need to configure Hadoop's core-site.xml and add the below two properties to it.
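Assuming the server is started by a user named sqoop2 (replace that name with your actual server user), the two proxy-user properties follow Hadoop's standard impersonation pattern:

```xml
<!-- Allow the sqoop2 user to impersonate other users; replace "sqoop2"
     with the OS user that starts the Sqoop server -->
<property>
  <name>hadoop.proxyuser.sqoop2.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.sqoop2.groups</name>
  <value>*</value>
</property>
```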
7) Initialize Metadata Repository
The metadata repository needs to be initialized before starting Sqoop 2 server for the first time.
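The Sqoop 2 distribution ships an admin tool for this; the upgrade tool initializes (or upgrades) the repository, and verify checks that the server can start with the current configuration:

```shell
# Sqoop install directory from step 3
SQOOP_HOME=/usr/lib/sqoop-1.99.7-bin-hadoop200

# Initialize (or upgrade) the metadata repository
"$SQOOP_HOME"/bin/sqoop2-tool upgrade

# Optionally verify the server configuration afterwards
"$SQOOP_HOME"/bin/sqoop2-tool verify
```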
8) Start Sqoop Server
Start the sqoop server.
Check if the sqoop server service has started.
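The commands are not shown above; the 1.99.7 distribution starts the server with the sqoop2-server script, and jps is one quick way to confirm a Java server process is running:

```shell
# Sqoop install directory from step 3
SQOOP_HOME=/usr/lib/sqoop-1.99.7-bin-hadoop200

# Start the Sqoop 2 server
"$SQOOP_HOME"/bin/sqoop2-server start

# List running Java processes; look for the Sqoop server entry
jps
```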
9) Start Sqoop Client
Just copy the Sqoop distribution artifact onto the target machine and unzip it in the desired location, and you can start your client there. I am using the same machine as the client as well. Start the Sqoop client.
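The 1.99.7 distribution provides an interactive shell as the client:

```shell
# Sqoop install directory from step 3
SQOOP_HOME=/usr/lib/sqoop-1.99.7-bin-hadoop200

# Launch the interactive Sqoop 2 shell
"$SQOOP_HOME"/bin/sqoop2-shell
```

Once inside the shell, a command such as `set server --host localhost --port 12000 --webapp sqoop` points the client at the server started in step 8 (port 12000 is the Sqoop 2 default).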
10) Download RDBMS Connectors
Download the connectors for MySQL, Oracle, and SQL Server using the below links. These connectors are needed to make a connection between Sqoop and the RDBMS.
Check whether all the connectors got downloaded.
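The original links are not reproduced here. As one illustration, the MySQL JDBC driver can be fetched from Maven Central (the 5.1.49 version below is an arbitrary example; the Oracle and SQL Server drivers must be downloaded from their vendors' sites):

```shell
# Example only: MySQL Connector/J from Maven Central (version is an assumption)
MYSQL_JDBC_URL=https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.49/mysql-connector-java-5.1.49.jar
wget -P ~/Downloads "$MYSQL_JDBC_URL"

# List the downloaded connector jars
ls -l ~/Downloads/*.jar
```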
11) Set an Environment Variable to use RDBMS Connectors
Move all the connectors to a directory and set that directory as an environment variable.
sudo mkdir -p /var/lib/sqoop2/
sudo chmod 777 /var/lib/sqoop2/
mv Downloads/*.jar /var/lib/sqoop2/
ls -l /var/lib/sqoop2/
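The export itself is not shown above. A common approach in Sqoop 2 setups is to point the server's extra-classpath variable at the connector directory; the variable name below is an assumption, so verify it against bin/sqoop.sh in your distribution:

```shell
# Add to .bashrc: point the Sqoop 2 server at the connector jars
# (variable name is an assumption; check bin/sqoop.sh in your distribution)
export SQOOP_SERVER_EXTRA_CLASSPATH=/var/lib/sqoop2/*.jar
```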
Voila! You have successfully set up Apache Sqoop on Ubuntu 16.04. Now you are ready to import/export data using Sqoop. The next step is to use any of the RDBMS connectors and import/export data from an RDBMS to HDFS or from HDFS to an RDBMS.