In this tutorial, we will walk you through the Hadoop Distributed File System (HDFS) commands you will need to manage files on HDFS. These shell-like commands interact directly with HDFS as well as the other file systems Hadoop supports, and most of them behave like their Unix counterparts. Output is sent to stdout and error information to stderr. So, let's get started.
1) Version Check
To check the version of Hadoop.
ubuntu@ubuntu-VirtualBox:~$ hadoop version
Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff
Compiled by root on 2016-08-18T01:41Z
Compiled with protoc 2.5.0
From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
This command was run using /home/ubuntu/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar
2) ls Command
Lists the files and directories under the given HDFS path.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /
Found 3 items
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:11 /test
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:09 /tmp
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:09 /usr
3) df Command
Displays the free space available on the given HDFS file system.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -df hdfs:/
Filesystem                Size   Used  Available  Use%
hdfs://master:9000  6206062592  32768  316289024    0%
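The Use% column is simply Used divided by Size, expressed as a percentage; with only 32 KB used of roughly 6 GB it rounds down to 0%. A quick local check of that arithmetic (plain shell, no Hadoop needed):

```shell
# Use% from the -df output above: Used / Size as a percentage.
size=6206062592   # Size column, in bytes
used=32768        # Used column, in bytes
pct=$(awk -v u="$used" -v s="$size" 'BEGIN { printf "%.4f", u / s * 100 }')
echo "${pct}%"    # prints 0.0005%, which -df rounds down to 0%
```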
4) count Command
Counts the number of directories, files, and bytes under the paths that match the specified file pattern.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -count hdfs:/
           4            0                  0 hdfs:///
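The four unlabeled columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and PATHNAME. A small cluster-free shell sketch of pulling those fields out of the line shown above:

```shell
# Parse the -count output line: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME.
line="           4            0                  0 hdfs:///"
set -- $line               # word-splitting collapses the column padding
echo "dirs=$1 files=$2 bytes=$3 path=$4"
# prints: dirs=4 files=0 bytes=0 path=hdfs:///
```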
5) fsck Command
HDFS Command to check the health of the Hadoop file system.
ubuntu@ubuntu-VirtualBox:~$ hdfs fsck /
Connecting to namenode via http://master:50070/fsck?ugi=ubuntu&path=%2F
FSCK started by ubuntu (auth:SIMPLE) from /192.168.1.36 for path / at Mon Nov 07 01:23:54 GMT+05:30 2016
Status: HEALTHY
 Total size:                    0 B
 Total dirs:                    4
 Total files:                   0
 Total symlinks:                0
 Total blocks (validated):      0
 Minimally replicated blocks:   0
 Over-replicated blocks:        0
 Under-replicated blocks:       0
 Mis-replicated blocks:         0
 Default replication factor:    2
 Average block replication:     0.0
 Corrupt blocks:                0
 Missing replicas:              0
 Number of data-nodes:          1
 Number of racks:               1
FSCK ended at Mon Nov 07 01:23:54 GMT+05:30 2016 in 33 milliseconds

The filesystem under path '/' is HEALTHY
6) balancer Command
Runs the cluster balancing utility, which redistributes blocks across DataNodes until each node's usage falls within the given threshold.
ubuntu@ubuntu-VirtualBox:~$ hdfs balancer
16/11/07 01:26:29 INFO balancer.Balancer: namenodes  = [hdfs://master:9000]
16/11/07 01:26:29 INFO balancer.Balancer: parameters = Balancer.Parameters[BalancingPolicy.Node, threshold=10.0, max idle iteration = 5, number of nodes to be excluded = 0, number of nodes to be included = 0]
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
16/11/07 01:26:38 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.36:50010
16/11/07 01:26:38 INFO balancer.Balancer: 0 over-utilized: []
16/11/07 01:26:38 INFO balancer.Balancer: 0 underutilized: []
The cluster is balanced. Exiting...
7 Nov, 2016 1:26:38 AM            0                  0 B                 0 B              -1 B
7 Nov, 2016 1:26:39 AM  Balancing took 13.153 seconds
7) mkdir Command
HDFS Command to create the directory in HDFS.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -mkdir /hadoop
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /
Found 5 items
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:29 /hadoop
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:26 /system
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:11 /test
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:09 /tmp
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:09 /usr
8) put Command
File
Copies a file from a single source, or multiple sources, from the local file system to the destination file system.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -put test /hadoop
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /hadoop
Found 1 items
-rw-r--r--   2 ubuntu supergroup         16 2016-11-07 01:35 /hadoop/test
Directory
HDFS Command to copy directory from single source, or multiple sources from local file system to the destination file system.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -put hello /hadoop/
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /hadoop
Found 2 items
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:43 /hadoop/hello
-rw-r--r--   2 ubuntu supergroup         16 2016-11-07 01:35 /hadoop/test
9) du Command
Displays the sizes of the files and directories contained in the given directory, or the size of a file if it is just a file.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -du /
59  /hadoop
0   /system
0   /test
0   /tmp
0   /usr
10) rm Command
HDFS Command to remove the file from HDFS.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -rm /hadoop/test
16/11/07 01:53:29 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /hadoop/test
11) expunge Command
HDFS Command that empties the trash.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -expunge
16/11/07 01:55:54 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
12) rm -r Command
HDFS Command to remove the entire directory and all of its content from HDFS.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -rm -r /hadoop/hello
16/11/07 01:58:52 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /hadoop/hello
13) chmod Command
Change the permissions of files.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -chmod 777 /hadoop
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /
Found 5 items
drwxrwxrwx   - ubuntu supergroup          0 2016-11-07 01:58 /hadoop
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:26 /system
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:11 /test
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:09 /tmp
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:09 /usr
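HDFS uses the same octal permission notation as Unix: each digit is the sum of read (4), write (2), and execute (1) for owner, group, and others, so 777 expands to rwxrwxrwx. A purely local demonstration of the same bits with the standard chmod:

```shell
# Octal 777 = rwx for owner, group, and others; shown on a local temp
# file rather than HDFS, since the bit layout is identical.
tmpfile=$(mktemp)
chmod 777 "$tmpfile"
ls -l "$tmpfile" | cut -c1-10   # prints -rwxrwxrwx
rm -f "$tmpfile"
```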
14) get Command
HDFS Command to copy files from HDFS to the local file system.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -get /hadoop/test /home/ubuntu/Desktop/
ubuntu@ubuntu-VirtualBox:~$ ls -l /home/ubuntu/Desktop/
total 4
-rw-r--r-- 1 ubuntu ubuntu 16 Nov  8 00:47 test
15) cat Command
HDFS Command that copies source paths to stdout.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -cat /hadoop/test
This is a test.
16) touchz Command
HDFS Command to create a file in HDFS with file size 0 bytes.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -touchz /hadoop/sample
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /hadoop
Found 2 items
-rw-r--r--   2 ubuntu supergroup          0 2016-11-08 00:57 /hadoop/sample
-rw-r--r--   2 ubuntu supergroup         16 2016-11-08 00:45 /hadoop/test
17) text Command
HDFS Command that takes a source file and outputs the file in text format (the allowed input formats include zip and TextRecordInputStream).
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -text /hadoop/test
This is a test.
18) copyFromLocal Command
HDFS Command to copy the file from Local file system to HDFS.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -copyFromLocal /home/ubuntu/new /hadoop
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /hadoop
Found 3 items
-rw-r--r--   2 ubuntu supergroup         43 2016-11-08 01:08 /hadoop/new
-rw-r--r--   2 ubuntu supergroup          0 2016-11-08 00:57 /hadoop/sample
-rw-r--r--   2 ubuntu supergroup         16 2016-11-08 00:45 /hadoop/test
19) copyToLocal Command
Similar to the get command, except that the destination is restricted to a local file reference.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -copyToLocal /hadoop/sample /home/ubuntu/
ubuntu@ubuntu-VirtualBox:~$ ls -l s*
-rw-r--r-- 1 ubuntu ubuntu         0 Nov  8 01:12 sample
-rw-rw-r-- 1 ubuntu ubuntu 102436055 Jul 20 04:47 sqoop-1.99.7-bin-hadoop200.tar.gz
20) mv Command
HDFS Command to move files from source to destination. This command allows multiple sources as well, in which case the destination needs to be a directory.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -mv /hadoop/sample /tmp
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /tmp
Found 1 items
-rw-r--r--   2 ubuntu supergroup          0 2016-11-08 00:57 /tmp/sample
21) cp Command
HDFS Command to copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -cp /tmp/sample /usr
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /usr
Found 1 items
-rw-r--r--   2 ubuntu supergroup          0 2016-11-08 01:22 /usr/sample
22) tail Command
Displays the last kilobyte of the file "new" on stdout.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -tail /hadoop/new
This is a new file. Running HDFS commands.
23) chown Command
HDFS command to change the owner of files.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -chown root:root /tmp
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /
Found 5 items
drwxrwxrwx   - ubuntu supergroup          0 2016-11-08 01:17 /hadoop
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:26 /system
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:11 /test
drwxr-xr-x   - root   root                0 2016-11-08 01:17 /tmp
drwxr-xr-x   - ubuntu supergroup          0 2016-11-08 01:22 /usr
24) setrep Command
The default replication factor of a file is 3. The HDFS command below changes the replication factor of a file; the -w flag waits until replication completes.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -setrep -w 2 /usr/sample
Replication 2 set: /usr/sample
Waiting for /usr/sample ... done
25) distcp Command
Copies a directory from one cluster (or path) to another in parallel, using MapReduce to do the transfer. Note that distcp is invoked as hadoop distcp, not as an hdfs dfs subcommand.
ubuntu@ubuntu-VirtualBox:~$ hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop
26) stat Command
Prints statistics about the file/directory at <path> in the specified format. The format accepts file size in bytes (%b), type (%F), group name of owner (%g), name (%n), block size (%o), replication (%r), user name of owner (%u), and modification date (%y, %Y). %y shows the UTC date as "yyyy-MM-dd HH:mm:ss" and %Y shows milliseconds since January 1, 1970 UTC. If the format is not specified, %y is used by default.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -stat "%F %u:%g %b %y %n" /hadoop/test
regular file ubuntu:supergroup 16 2016-11-07 19:15:22 test
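The relationship between %y and %Y is a straight unit conversion: %Y is milliseconds since the Unix epoch, and %y is the same instant formatted in UTC. Using GNU date to convert the mtime shown above (the millisecond value is reconstructed from the transcript, so treat it as illustrative):

```shell
# %Y (milliseconds since 1970-01-01 UTC) -> %y ("yyyy-MM-dd HH:mm:ss" UTC).
millis=1478546122000
date -u -d "@$((millis / 1000))" '+%Y-%m-%d %H:%M:%S'
# prints: 2016-11-07 19:15:22  (matching the %y field above)
```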
27) getfacl Command
Displays the Access Control Lists (ACLs) of files and directories. If a directory has a default ACL, then getfacl also displays the default ACL.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -getfacl /hadoop
# file: /hadoop
# owner: ubuntu
# group: supergroup
28) du -s Command
Displays a summary of file lengths.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -du -s /hadoop
59  /hadoop
29) checksum Command
Returns the checksum information of a file.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -checksum /hadoop/new
/hadoop/new     MD5-of-0MD5-of-512CRC32C        000002000000000000000000639a5d8ac275be8d0c2b055d75208265
30) getmerge Command
Takes a source directory and a destination file as input and concatenates files in src into the destination local file.
ubuntu@ubuntu-VirtualBox:~$ cat test
This is a test.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -cat /hadoop/new
This is a new file. Running HDFS commands.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -getmerge /hadoop/new test
ubuntu@ubuntu-VirtualBox:~$ cat test
This is a new file. Running HDFS commands.
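Locally, getmerge amounts to concatenating every file under the source into one destination file, much like cat src/* > dest. A cluster-free sketch of that behavior (file and directory names here are made up):

```shell
# Local equivalent of getmerge: concatenate all parts into one file.
workdir=$(mktemp -d)
mkdir "$workdir/parts"
printf 'one\n' > "$workdir/parts/a"
printf 'two\n' > "$workdir/parts/b"
cat "$workdir/parts"/* > "$workdir/merged"   # the getmerge step, locally
cat "$workdir/merged"                        # prints the lines: one, two
rm -rf "$workdir"
```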
Conclusion
This is the end of the HDFS command blog; we hope it was informative and that you were able to execute all the commands. We learned to create, upload, and list the contents of our HDFS directories. We also acquired the skills to download files from HDFS to the local file system and explored a few advanced features of HDFS file management using the command line.