Apache Sqoop is a tool in the Hadoop ecosystem used to import and export data between an RDBMS and HDFS. The data it handles is structured and has a schema. There are many cases where you want to analyze data that lives in your RDBMS, but the data has grown so large that the RDBMS can no longer process it efficiently. In such situations you can use Sqoop to move the data to HDFS, where storing and processing big data is not a problem. In this blog, I will show you how to send data from MySQL to HDFS using Sqoop import.
MySQL to HDFS Using Sqoop
To show this example, let me create a table in MySQL, which is running on my Windows machine, and put some data in it.
Create a database named 'linoxide', then create a table named 'employee' by executing the below command:
Command: create table employee ( id int, name varchar(20), dept varchar(20) );
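If the 'linoxide' database does not already exist, you can create and select it first before running the create table statement:

Command: create database linoxide;

Command: use linoxide;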
Insert some rows into the employee table, then check the rows present in it by executing the below command:
Command: select * from employee;
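For reference, the sample rows used in this post (they match the HDFS output shown later) can be inserted with a statement along these lines:

Command: insert into employee values (1, 'John', 'sales'), (2, 'Alex', 'Development'), (3, 'Jessica', 'Marketing'), (4, 'Luke', 'Operation'), (5, 'Lisa', 'Marketing');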
Now, open Command Prompt (CMD) on Windows and check the IPv4 address of your LAN connection (in most cases VMnet1) by executing the below command:
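On Windows, ipconfig will show this; look for the 'IPv4 Address' entry under the relevant adapter:

Command: ipconfig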
Note down the IPv4 address of your system from the above step. In my case it is 192.168.1.33.
Grant all privileges to root@your_ipv4_address by executing the below command:
Command: grant all privileges on *.* to root@192.168.1.33 IDENTIFIED BY 'root' WITH GRANT OPTION;
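Optionally, you can verify that the grant works from the machine where Sqoop is installed by connecting with the MySQL client (assuming the client is installed there):

Command: mysql -h 192.168.1.33 -u root -p linoxide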
Download the MySQL connector on the machine where Sqoop is installed using the link below. This connector is needed for Sqoop to connect to the MySQL database.
MySQL connector : Download
Move the MySQL connector to the lib folder of Sqoop by executing the below command:
Command: sudo mv Downloads/mysql-connector-java-5.1.26-bin.jar /home/hadoop/sqoop/lib
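You can confirm the jar landed in the right place (the path below assumes Sqoop is installed at /home/hadoop/sqoop):

Command: ls /home/hadoop/sqoop/lib | grep mysql-connector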
Now we are ready to run the Sqoop import command.
Required items for the command:
IPv4 Address – Your IPv4 address. In my case it is 192.168.1.33
Database Name – linoxide
Table Name – employee
Username – root
Output Directory – Could be any directory; I have used sqoop_out
Command: bin/sqoop import --connect jdbc:mysql://192.168.1.33/linoxide --table employee --username root -P --target-dir /sqoop_out -m 1
Enter the password of your MySQL user when prompted (the -P flag makes Sqoop ask for it interactively).
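Optionally, before running the full import, you can confirm that Sqoop can reach the database with the same connection parameters:

Command: bin/sqoop list-tables --connect jdbc:mysql://192.168.1.33/linoxide --username root -P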
Run the command to import the employee table from the MySQL database to HDFS; the full console output of the run is shown below.
hadoop@hadoop-VirtualBox:~/sqoop$ bin/sqoop import --connect jdbc:mysql://192.168.1.33/linoxide --table employee --username root -P --target-dir /sqoop_out -m 1
16/12/25 03:03:04 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5
Enter password:
16/12/25 03:03:09 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
16/12/25 03:03:09 INFO tool.CodeGenTool: Beginning code generation
Sun Dec 25 03:03:10 IST 2016 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
16/12/25 03:03:11 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1
16/12/25 03:03:11 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1
16/12/25 03:03:11 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoop/hadoop-2.7.3
16/12/25 03:03:21 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/33bf897949f7b40e3cc1f5b5afae7cb3/employee.jar
16/12/25 03:03:21 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
16/12/25 03:03:21 INFO mapreduce.ImportJobBase: Beginning import of employee
16/12/25 03:03:23 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/12/25 03:03:28 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/12/25 03:03:28 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Sun Dec 25 03:03:37 IST 2016 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
16/12/25 03:03:37 INFO db.DBInputFormat: Using read commited transaction isolation
16/12/25 03:03:38 INFO mapreduce.JobSubmitter: number of splits:1
16/12/25 03:03:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1482614949106_0001
16/12/25 03:03:42 INFO impl.YarnClientImpl: Submitted application application_1482614949106_0001
16/12/25 03:03:43 INFO mapreduce.Job: The url to track the job: http://hadoop-VirtualBox:8088/proxy/application_1482614949106_0001/
16/12/25 03:03:43 INFO mapreduce.Job: Running job: job_1482614949106_0001
16/12/25 03:04:25 INFO mapreduce.Job: Job job_1482614949106_0001 running in uber mode : false
16/12/25 03:04:25 INFO mapreduce.Job: map 0% reduce 0%
16/12/25 03:04:51 INFO mapreduce.Job: map 100% reduce 0%
16/12/25 03:04:52 INFO mapreduce.Job: Job job_1482614949106_0001 completed successfully
16/12/25 03:04:53 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=127675
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=87
        HDFS: Number of bytes written=86
        HDFS: Number of read operations=4
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=20382
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=20382
        Total vcore-milliseconds taken by all map tasks=20382
        Total megabyte-milliseconds taken by all map tasks=20871168
    Map-Reduce Framework
        Map input records=5
        Map output records=5
        Input split bytes=87
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=210
        CPU time spent (ms)=2780
        Physical memory (bytes) snapshot=118214656
        Virtual memory (bytes) snapshot=1904357376
        Total committed heap usage (bytes)=54853632
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=86
16/12/25 03:04:53 INFO mapreduce.ImportJobBase: Transferred 86 bytes in 85.6488 seconds (1.0041 bytes/sec)
16/12/25 03:04:53 INFO mapreduce.ImportJobBase: Retrieved 5 records.
Once the job completes, check the output stored on HDFS:
hadoop@hadoop-VirtualBox:~/sqoop$ hdfs dfs -cat /sqoop_out/part-m-00000
1,John,sales
2,Alex,Development
3,Jessica,Marketing
4,Luke,Operation
5,Lisa,Marketing
hadoop@hadoop-VirtualBox:~/sqoop$
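If you just want to see which files the import produced before reading them, listing the target directory works as well:

Command: hdfs dfs -ls /sqoop_out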
Oracle to HDFS
Similarly, to send data from an Oracle database to HDFS, you will need to put the Oracle connector (JDBC driver) in Sqoop's lib directory.
Oracle Connector : Download
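As with the MySQL connector, move the downloaded jar into Sqoop's lib folder (the jar name below is only an example; use the file name you actually downloaded):

Command: sudo mv Downloads/ojdbc6.jar /home/hadoop/sqoop/lib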
Below are the required items for the command to import data from oracle to HDFS:
IPv4 Address – Your IPv4 address. In my case it is 192.168.1.33
Database (SID) – xe
Table Name – system.emp
Username – system
Output Directory – sqoop_out
Command: sudo bin/sqoop import --connect jdbc:oracle:thin:system/system@192.168.1.33:1521:xe --username system -P --table system.emp --columns "ID" --target-dir /sqoop_out -m 1
Conclusion
Congratulations! You have successfully transferred data from MySQL to HDFS using Apache Sqoop. In the same way, you can transfer structured data from any RDBMS to HDFS for processing; just make sure you put the appropriate connector in Sqoop's lib directory before running the import/export command.