In this tutorial, we will walk you through the Hadoop Distributed File System (HDFS) commands you will need to manage files on HDFS. You will use these shell-like commands constantly when working with Hadoop: they interact directly with HDFS as well as the other file systems Hadoop supports, and most of them behave like their corresponding Unix commands. Error information is sent to stderr and output is sent to stdout. So, let's get started.
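Before diving in, note that the shell documents itself. Assuming Hadoop is installed and on your PATH, these two commands (both part of the standard FileSystem shell) list every dfs subcommand and the usage of any single one:

```shell
# Print help text for all hdfs dfs subcommands
hdfs dfs -help

# Print just the usage line for one command, e.g. ls
hdfs dfs -usage ls
```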
1) Version Check
To check the version of Hadoop.
ubuntu@ubuntu-VirtualBox:~$ hadoop version
Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff
Compiled by root on 2016-08-18T01:41Z
Compiled with protoc 2.5.0
From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
This command was run using /home/ubuntu/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar
2) ls Command
HDFS Command to list the files and directories at the given HDFS path.
ubuntu@ubuntu-VirtualBox:~ $ hdfs dfs -ls /
Found 3 items
drwxr-xr-x - ubuntu supergroup 0 2016-11-07 01:11 /test
drwxr-xr-x - ubuntu supergroup 0 2016-11-07 01:09 /tmp
drwxr-xr-x - ubuntu supergroup 0 2016-11-07 01:09 /usr
3) df Command
HDFS Command that displays the free space at the given HDFS destination.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -df hdfs:/
Filesystem Size Used Available Use%
hdfs://master:9000 6206062592 32768 316289024 0%
4) count Command
HDFS Command to count the number of directories, files, and bytes under the paths that match the specified file pattern. The output columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and PATHNAME.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -count hdfs:/
4 0 0 hdfs:///
5) fsck Command
HDFS Command to check the health of the Hadoop file system.
ubuntu@ubuntu-VirtualBox:~$ hdfs fsck /
Connecting to namenode via http://master:50070/fsck?ugi=ubuntu&path=%2F
FSCK started by ubuntu (auth:SIMPLE) from /192.168.1.36 for path / at Mon Nov 07 01:23:54 GMT+05:30 2016
Status: HEALTHY
Total size: 0 B
Total dirs: 4
Total files: 0
Total symlinks: 0
Total blocks (validated): 0
Minimally replicated blocks: 0
Over-replicated blocks: 0
Under-replicated blocks: 0
Mis-replicated blocks: 0
Default replication factor: 2
Average block replication: 0.0
Corrupt blocks: 0
Missing replicas: 0
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Mon Nov 07 01:23:54 GMT+05:30 2016 in 33 milliseconds
The filesystem under path '/' is HEALTHY
6) balancer Command
Run a cluster balancing utility.
ubuntu@ubuntu-VirtualBox:~$ hdfs balancer
16/11/07 01:26:29 INFO balancer.Balancer: namenodes = [hdfs://master:9000]
16/11/07 01:26:29 INFO balancer.Balancer: parameters = Balancer.Parameters[BalancingPolicy.Node, threshold=10.0, max idle iteration = 5, number of nodes to be excluded = 0, number of nodes to be included = 0]
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
16/11/07 01:26:38 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.36:50010
16/11/07 01:26:38 INFO balancer.Balancer: 0 over-utilized: []
16/11/07 01:26:38 INFO balancer.Balancer: 0 underutilized: []
The cluster is balanced. Exiting...
7 Nov, 2016 1:26:38 AM 0 0 B 0 B -1 B
7 Nov, 2016 1:26:39 AM Balancing took 13.153 seconds
7) mkdir Command
HDFS Command to create the directory in HDFS.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -mkdir /hadoop
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /
Found 5 items
drwxr-xr-x - ubuntu supergroup 0 2016-11-07 01:29 /hadoop
drwxr-xr-x - ubuntu supergroup 0 2016-11-07 01:26 /system
drwxr-xr-x - ubuntu supergroup 0 2016-11-07 01:11 /test
drwxr-xr-x - ubuntu supergroup 0 2016-11-07 01:09 /tmp
drwxr-xr-x - ubuntu supergroup 0 2016-11-07 01:09 /usr
8) put Command
File
Copies a single source file, or multiple source files, from the local file system to the destination file system.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -put test /hadoop
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /hadoop
Found 1 items
-rw-r--r-- 2 ubuntu supergroup 16 2016-11-07 01:35 /hadoop/test
Directory
HDFS Command to copy a directory (single source or multiple sources) from the local file system to the destination file system.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -put hello /hadoop/
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /hadoop
Found 2 items
drwxr-xr-x - ubuntu supergroup 0 2016-11-07 01:43 /hadoop/hello
-rw-r--r-- 2 ubuntu supergroup 16 2016-11-07 01:35 /hadoop/test
9) du Command
Displays the size of the files and directories contained in the given directory, or the size of a file if it's just a file.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -du /
59 /hadoop
0 /system
0 /test
0 /tmp
0 /usr
10) rm Command
HDFS Command to remove the file from HDFS.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -rm /hadoop/test
16/11/07 01:53:29 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /hadoop/test
11) expunge Command
HDFS Command that empties the trash.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -expunge
16/11/07 01:55:54 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
12) rm -r Command
HDFS Command to remove the entire directory and all of its content from HDFS.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -rm -r /hadoop/hello
16/11/07 01:58:52 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /hadoop/hello
13) chmod Command
Change the permissions of files.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -chmod 777 /hadoop
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /
Found 5 items
drwxrwxrwx - ubuntu supergroup 0 2016-11-07 01:58 /hadoop
drwxr-xr-x - ubuntu supergroup 0 2016-11-07 01:26 /system
drwxr-xr-x - ubuntu supergroup 0 2016-11-07 01:11 /test
drwxr-xr-x - ubuntu supergroup 0 2016-11-07 01:09 /tmp
drwxr-xr-x - ubuntu supergroup 0 2016-11-07 01:09 /usr
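The mode here is standard Unix octal: 7 = rwx, 5 = r-x, and the three digits apply to owner, group, and other in that order. A quick local illustration with plain Linux chmod and GNU stat (the /tmp path is just a hypothetical scratch directory; HDFS chmod interprets the digits the same way):

```shell
# Create a scratch directory and set owner=rwx, group=r-x, other=--- (octal 750)
mkdir -p /tmp/perm_demo
chmod 750 /tmp/perm_demo

# Show the mode back in octal and symbolic form
stat -c '%a %A' /tmp/perm_demo   # prints: 750 drwxr-x---
```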
14) get Command
HDFS Command to copy files from HDFS to the local file system.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -get /hadoop/test /home/ubuntu/Desktop/
ubuntu@ubuntu-VirtualBox:~$ ls -l /home/ubuntu/Desktop/
total 4
-rw-r--r-- 1 ubuntu ubuntu 16 Nov 8 00:47 test
15) cat Command
HDFS Command that copies source paths to stdout.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -cat /hadoop/test
This is a test.
16) touchz Command
HDFS Command to create a file in HDFS with file size 0 bytes.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -touchz /hadoop/sample
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /hadoop
Found 2 items
-rw-r--r-- 2 ubuntu supergroup 0 2016-11-08 00:57 /hadoop/sample
-rw-r--r-- 2 ubuntu supergroup 16 2016-11-08 00:45 /hadoop/test
17) text Command
HDFS Command that takes a source file and outputs the file in text format.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -text /hadoop/test
This is a test.
18) copyFromLocal Command
HDFS Command to copy the file from Local file system to HDFS.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -copyFromLocal /home/ubuntu/new /hadoop
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /hadoop
Found 3 items
-rw-r--r-- 2 ubuntu supergroup 43 2016-11-08 01:08 /hadoop/new
-rw-r--r-- 2 ubuntu supergroup 0 2016-11-08 00:57 /hadoop/sample
-rw-r--r-- 2 ubuntu supergroup 16 2016-11-08 00:45 /hadoop/test
19) copyToLocal Command
Similar to the get command, except that the destination is restricted to a local file reference.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -copyToLocal /hadoop/sample /home/ubuntu/
ubuntu@ubuntu-VirtualBox:~$ ls -l s*
-rw-r--r-- 1 ubuntu ubuntu 0 Nov 8 01:12 sample
-rw-rw-r-- 1 ubuntu ubuntu 102436055 Jul 20 04:47 sqoop-1.99.7-bin-hadoop200.tar.gz
20) mv Command
HDFS Command to move files from source to destination. This command allows multiple sources as well, in which case the destination needs to be a directory.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -mv /hadoop/sample /tmp
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /tmp
Found 1 items
-rw-r--r-- 2 ubuntu supergroup 0 2016-11-08 00:57 /tmp/sample
21) cp Command
HDFS Command to copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -cp /tmp/sample /usr
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /usr
Found 1 items
-rw-r--r-- 2 ubuntu supergroup 0 2016-11-08 01:22 /usr/sample
22) tail Command
Displays the last kilobyte of the file "new" to stdout.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -tail /hadoop/new
This is a new file.
Running HDFS commands.
23) chown Command
HDFS command to change the owner of files.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -chown root:root /tmp
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /
Found 5 items
drwxrwxrwx - ubuntu supergroup 0 2016-11-08 01:17 /hadoop
drwxr-xr-x - ubuntu supergroup 0 2016-11-07 01:26 /system
drwxr-xr-x - ubuntu supergroup 0 2016-11-07 01:11 /test
drwxr-xr-x - root root 0 2016-11-08 01:17 /tmp
drwxr-xr-x - ubuntu supergroup 0 2016-11-08 01:22 /usr
24) setrep Command
The default replication factor for a file is 3 (set by dfs.replication; this cluster uses 2, as the fsck output above shows). The HDFS command below changes the replication factor of a file; the -w flag waits for the replication to complete.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -setrep -w 2 /usr/sample
Replication 2 set: /usr/sample
Waiting for /usr/sample ... done
25) distcp Command
Copies a directory from one cluster (or path) to another in parallel, using MapReduce. Note that distcp is a hadoop subcommand, not an hdfs dfs flag.
ubuntu@ubuntu-VirtualBox:~$ hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop
26) stat Command
Print statistics about the file/directory at <path> in the specified format. Format accepts file size in bytes (%b), type (%F), group name of owner (%g), name (%n), block size (%o), replication (%r), user name of owner (%u), and modification date (%y, %Y). %y shows the UTC date as “yyyy-MM-dd HH:mm:ss” and %Y shows milliseconds since January 1, 1970 UTC. If the format is not specified, %y is used by default.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -stat "%F %u:%g %b %y %n" /hadoop/test
regular file ubuntu:supergroup 16 2016-11-07 19:15:22 test
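Since %Y prints milliseconds since the epoch, it is often the easier format to script against; the milliseconds convert back to a human-readable UTC date with plain GNU date(1). The value below is the millisecond timestamp corresponding to the %y date in the stat output above:

```shell
# 1478546122000 ms corresponds to 2016-11-07 19:15:22 UTC; strip the milliseconds first
ms=1478546122000
date -u -d "@$((ms / 1000))" '+%Y-%m-%d %H:%M:%S'   # prints: 2016-11-07 19:15:22
```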
27) getfacl Command
Displays the Access Control Lists (ACLs) of files and directories. If a directory has a default ACL, then getfacl also displays the default ACL.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -getfacl /hadoop
# file: /hadoop
# owner: ubuntu
# group: supergroup
28) du -s Command
Displays a summary of file lengths.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -du -s /hadoop
59 /hadoop
29) checksum Command
Returns the checksum information of a file.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -checksum /hadoop/new
/hadoop/new MD5-of-0MD5-of-512CRC32C 000002000000000000000000639a5d8ac275be8d0c2b055d75208265
30) getmerge Command
Takes a source directory and a destination file as input and concatenates files in src into the destination local file.
ubuntu@ubuntu-VirtualBox:~$ cat test
This is a test.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -cat /hadoop/new
This is a new file.
Running HDFS commands.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -getmerge /hadoop/new test
ubuntu@ubuntu-VirtualBox:~$ cat test
This is a new file.
Running HDFS commands.
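Under the hood, -getmerge simply concatenates the files it matches, in listing order, into one local destination file. The same effect can be expressed locally with plain cat (the file names below are hypothetical stand-ins for part files in an HDFS directory):

```shell
# Two small files standing in for part files in an HDFS directory
printf 'part one\n' > /tmp/part-00000
printf 'part two\n' > /tmp/part-00001

# What getmerge does, expressed locally: concatenate into one destination file
cat /tmp/part-00000 /tmp/part-00001 > /tmp/merged
cat /tmp/merged
```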
Conclusion
This is the end of the HDFS command blog; we hope it was informative and that you were able to execute all the commands. We learned to create, upload, and list the contents of our HDFS directories, acquired the skills to download files from HDFS to our local file system, and explored a few advanced features of HDFS file management from the command line.