How to Split a Large Text File into Smaller Files in Linux

Linux provides several utilities for breaking large files into smaller ones; split and csplit are two of the most commonly used. They are handy for cutting big log files, and even archives, into pieces small enough to fit on limited storage media such as USB drives. Splitting can also speed up network transfers, since several small files can be sent in parallel. In this article, I'll explain how to use split and csplit to break down large files in Linux.

Split

The split command breaks a large file into smaller pieces of a fixed size, measured either in lines or in bytes.

Syntax

split [options] filename prefix

Replace filename with the name of the large file you wish to split, and prefix with the name you wish to give the small output files. You can omit [options], or replace it with any of the following:

-a, --suffix-length=N use suffixes of length N (default 2)
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes use numeric suffixes instead of alphabetic
-l, --lines=NUMBER put NUMBER lines per output file

The split command names each output file with the prefix followed by a suffix that indicates its order. By default, the first output file gets the suffix aa, and subsequent files proceed through the alphabet (ab, ac, and so on) up to zz. If you do not supply a prefix, split uses x.
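The naming scheme is easy to see on a tiny input. The following is a minimal sketch, assuming GNU split; the file name sample.txt and the prefix part_ are made up for illustration.

```shell
# Work in a scratch directory so the demo leaves nothing behind.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample input: 10 numbered lines.
seq 1 10 > sample.txt

# 4 lines per piece, prefix "part_", three-letter suffixes (-a 3).
split -l 4 -a 3 sample.txt part_

# Lists part_aaa, part_aab and part_aac (4 + 4 + 2 lines).
ls part_*
```

A longer suffix is mainly useful when a file would otherwise produce more pieces than two letters can name.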

Split Examples

The split command splits a file into pieces of n lines each and names them PREFIXaa, PREFIXab, PREFIXac, and so on. By default, the PREFIX is x and each piece holds 1000 lines.

Split a file into multiple pieces by default usage

I have a log file named systemlog with 1099 lines. Let's see what happens when I split it with the default settings.

# cat systemlog | wc -l
1099
# split systemlog
# ll
total 160
-rw-rw-r-- 1 root root 76294 Mar 25 12:02 systemlog
-rw-r--r-- 1 root root 68251 Mar 25 12:07 xaa
-rw-r--r-- 1 root root 8043 Mar 25 12:07 xab
# cat xaa | wc -l
1000
# cat xab | wc -l
99

The command splits the log file into two files, xaa and xab. The first holds 1000 lines and the second holds the remaining 99.

Split the file, based upon the number of lines

We can split the file into multiple pieces based on the number of lines using -l option. Here, I'm splitting my system log file with 1099 lines into smaller files with 200 lines each. Let's see the commands for the same:

# split -l 200 systemlog
# ll
total 172
-rw-rw-r-- 1 root root 76294 Mar 25 12:02 systemlog
-rw-r--r-- 1 root root 14369 Mar 25 12:16 xaa
-rw-r--r-- 1 root root 12795 Mar 25 12:16 xab
-rw-r--r-- 1 root root 13566 Mar 25 12:16 xac
-rw-r--r-- 1 root root 13681 Mar 25 12:16 xad
-rw-r--r-- 1 root root 13840 Mar 25 12:16 xae
-rw-r--r-- 1 root root 8043 Mar 25 12:16 xaf
# cat xaa | wc -l; cat xab | wc -l; cat xac | wc -l; cat xad | wc -l; cat xae | wc -l; cat xaf | wc -l
200
200
200
200
200
99

You can see that the command has split my log file into six smaller files: five with 200 lines each and a sixth with the remaining 99 lines.
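The line counts can also be checked with a single wc call instead of one cat per piece. This is a small sketch, assuming GNU coreutils; the systemlog generated here is only a stand-in with the same number of lines as the log above.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for the 1099-line log (the content is hypothetical).
seq 1 1099 > systemlog

split -l 200 systemlog

# wc accepts several files, so one call reports every piece plus a total.
wc -l xa?
```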

Split a large file into 500MB files

Use the -b option to set the size of each output file. Here I split my 1GB Apache log file into two 500MB files:

# split -b 500MB httpd.log
# ll -lh
total 1.9G
-rw-r--r-- 1 root root 954M Mar 25 12:35 httpd.log
-rw-r--r-- 1 root root 477M Mar 25 12:38 xaa
-rw-r--r-- 1 root root 477M Mar 25 12:38 xab
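One thing to keep in mind is that -b cuts at an exact byte count, so a log line can end up torn across two pieces. The -C option listed earlier caps the size in the same way but only breaks between lines. The sketch below contrasts the two on a tiny made-up file (lines.txt and both prefixes are assumptions for illustration), assuming GNU split.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input: three 10-byte lines (9 characters plus newline).
printf 'aaaaaaaaa\nbbbbbbbbb\nccccccccc\n' > lines.txt

# -b cuts at exactly 15 bytes, so the second line is torn in half.
split -b 15 lines.txt bytes_

# -C also caps each piece at 15 bytes but only breaks between lines.
split -C 15 lines.txt whole_

# The last byte of bytes_aa is a letter, not a newline.
tail -c 1 bytes_aa; echo

# Each whole_ piece holds exactly one complete line.
wc -l whole_*
```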

Split a large file into 200MB files with the given prefix

Pass the size with -b and the required prefix as the second argument. Note that split treats the suffix M as 1024*1024 bytes and MB as 1000*1000 bytes, so -b 200M produces slightly larger pieces than -b 200MB would. Here I split my 1GB Apache log into 200M pieces with the prefix split.log:

# split -b 200M httpd.log split.log
# ll -lh
total 1.9G
-rw-r--r-- 1 root root 954M Mar 25 12:35 httpd.log
-rw-r--r-- 1 root root 200M Mar 25 12:52 split.logaa
-rw-r--r-- 1 root root 200M Mar 25 12:52 split.logab
-rw-r--r-- 1 root root 200M Mar 25 12:52 split.logac
-rw-r--r-- 1 root root 200M Mar 25 12:52 split.logad
-rw-r--r-- 1 root root 154M Mar 25 12:52 split.logae

In this example, you can see that my log files are broken down into 200MB files with my required prefix.
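After transferring the pieces, they can be joined back together with cat. The following is a minimal sketch with a much smaller stand-in file, assuming GNU coreutils; rejoined.log is a hypothetical name and the generated httpd.log is not a real Apache log.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for a large log: just under 1 MB of numbered lines.
seq 1 150000 > httpd.log

# 200 KiB pieces with the prefix "split.log".
split -b 200K httpd.log split.log

# The shell expands the glob in sorted order, which matches the order
# in which split wrote the pieces.
cat split.log?? > rejoined.log

# cmp prints nothing and exits 0 when the files are byte-identical.
cmp httpd.log rejoined.log && echo "pieces rejoin cleanly"
```

Comparing the rejoined file with the original (or comparing checksums across machines) is a cheap way to confirm that no piece was lost or reordered.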

Split the file and name it with numbers

Use the -d option to name the files with numeric suffixes (00, 01, 02, and so on) instead of aa, ab, ac. Here I split my 1GB Apache log into 200M pieces with the prefix log and numeric suffixes:

# split -d -b 200M httpd.log log
# ll -lh
total 1.9G
-rw-r--r-- 1 root root 954M Mar 25 12:35 httpd.log
-rw-r--r-- 1 root root 200M Mar 25 12:58 log00
-rw-r--r-- 1 root root 200M Mar 25 12:58 log01
-rw-r--r-- 1 root root 200M Mar 25 12:58 log02
-rw-r--r-- 1 root root 200M Mar 25 12:58 log03
-rw-r--r-- 1 root root 154M Mar 25 12:58 log04

For more information, see the manual page with the command man split.

Csplit

Csplit is another command utility which divides single files into multiple files determined by context lines.

Syntax

csplit [option]... filename pattern...

The files created by csplit normally have names of the form

xxnumber
where number is a two digit decimal number which begins at zero and it increments by one for each new file that csplit creates.

csplit also displays the size, in bytes, of each file that it creates as output.

Options
Note that GNU csplit, the version shipped with most Linux distributions, has no -A or -a options for alphabetic suffixes; the number portion of output file names is always numeric unless you change its format with -b.

-b format, uses a printf-style format such as %03d for the number portion of output file names in place of the default %02d.

-f prefix, specifies a prefix to use in place of the default xx when naming files. If prefix causes a file name longer than NAME_MAX bytes an error occurs and csplit exits without creating any files.

-k, leaves all created files intact. Normally, when an error occurs, csplit removes files that it has created.

-n number, specifies the number of digits in the number portion of created file names.

-s, suppresses the display of file sizes.

Csplit Examples

By default, the files that csplit produces in output have 'xx' as the prefix and the numbers produced in the output are the byte count for the files the command produced.

Split a file at a given line number

I have a file which contains 8 lines of domain names, and I want to split it at the fourth line. This can be done by passing 4 as a command-line argument after the file name.

For example, in our case, domainslist contains the following information:

# cat domainslist
domain1.com
domain2.com
domain3.com
domain4.com
domain5.com
domain6.com
domain7.com
domain8.com

By passing 4 as a command-line argument, this command splits our domainslist file just before the 4th line, so lines 1 to 3 go to the first file and the rest go to the second. The numbers produced in the output are the byte counts of the files the command created. Two files were produced, namely xx00 and xx01.

# csplit domainslist 4
36
60
# ll
total 20
-rw-r--r-- 1 root root 96 Mar 25 14:08 domainslist
-rw-r--r-- 1 root root 36 Mar 25 14:08 xx00
-rw-r--r-- 1 root root 60 Mar 25 14:08 xx01

# cat xx00
domain1.com
domain2.com
domain3.com
# cat xx01
domain4.com
domain5.com
domain6.com
domain7.com
domain8.com

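Repeat a split with a repeat count

A pattern can be followed by a repeat count in braces, {N}, which tells csplit to apply the preceding pattern N more times. (csplit also accepts regular expressions, written as /REGEXP/, in place of a line number.) With a line number, GNU csplit treats each repetition as the next multiple of that number, so 4 {1} splits before line 4 and again before line 8: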

# csplit domainslist 4 {1}
36
48
12
# ll
total 24
-rw-r--r-- 1 root root 96 Mar 25 14:08 domainslist
-rw-r--r-- 1 root root 36 Mar 25 15:13 xx00
-rw-r--r-- 1 root root 48 Mar 25 15:13 xx01
-rw-r--r-- 1 root root 12 Mar 25 15:13 xx02

In this case, we can get three output files.

# cat xx00
domain1.com
domain2.com
domain3.com
# cat xx01
domain4.com
domain5.com
domain6.com
domain7.com
# cat xx02
domain8.com

You can use the asterisk wildcard {*} to tell csplit to repeat your split as many times as possible.
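Repeat counts are most useful together with a regular expression. The sketch below assumes GNU csplit and a made-up file named sections in which each group of domains starts with a bracketed header; it starts a new output file at every header line. The -z option removes the empty first file that would otherwise be created because the very first line already matches, and -s suppresses the byte counts.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input: three groups, each introduced by a [header] line.
printf '%s\n' '[web]' 'domain1.com' 'domain2.com' \
              '[mail]' 'domain3.com' \
              '[dns]' 'domain4.com' > sections

# Start a new piece at every line beginning with "[", as often as possible.
# The pattern and the repeat count are quoted to protect them from the shell.
csplit -s -z sections '/^\[/' '{*}'

# Print the first line of each piece to show where the splits happened.
head -n 1 xx*
```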

Split files with the given prefix

By default, csplit names its output files with the prefix xx. You can change that default with the -f option followed by the required prefix.

For example, the following command will produce files having 'domain' as prefix.

# csplit domainslist 4 {1} -f domain
36
48
12
# ll
total 24
-rw-r--r-- 1 root root 36 Mar 25 15:16 domain00
-rw-r--r-- 1 root root 48 Mar 25 15:16 domain01
-rw-r--r-- 1 root root 12 Mar 25 15:16 domain02
-rw-r--r-- 1 root root 96 Mar 25 14:08 domainslist

Split a file by suppressing a line that matches the input pattern

The csplit command provides an option to suppress lines that match the input pattern. The option in question is --suppress-matched.

For example, the following command splits our file at line 4 (xx00 will contain lines 1 to 3, while xx01 will contain the rest of the lines, excluding line 4).

# csplit --suppress-matched domainslist 4
36
48
# ll
total 20
-rw-r--r-- 1 root root 96 Mar 25 14:08 domainslist
-rw-r--r-- 1 root root 36 Mar 25 15:27 xx00
-rw-r--r-- 1 root root 48 Mar 25 15:27 xx01
# cat xx00
domain1.com
domain2.com
domain3.com
# cat xx01
domain5.com
domain6.com
domain7.com
domain8.com
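The same option is handy when a file uses separator lines that should not appear in the output. This is a minimal sketch, assuming GNU csplit (coreutils 8.22 or later) and a hypothetical file named records whose entries are separated by lines containing only three dashes.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input: records separated by "---" lines.
printf '%s\n' 'domain1.com' 'domain2.com' '---' \
              'domain3.com' '---' \
              'domain4.com' > records

# Split at every separator and drop the separator lines themselves.
csplit -s --suppress-matched records '/^---$/' '{*}'

# Show each piece with its file name.
head xx*
```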

Customize the number of digits in the output files names

By default, two digits follow the prefix in the output file names. Use the -n option to change that number. For example, to get names like xx001, pass -n 3 as below:

# csplit -n 3 domainslist 4
36
60
# ll
-rw-r--r-- 1 root root 96 Mar 25 14:08 domainslist
-rw-r--r-- 1 root root 36 Mar 25 15:34 xx000
-rw-r--r-- 1 root root 60 Mar 25 15:34 xx001

Forcing csplit to save the output file in case of error

By default, csplit removes the output files it has created if an error occurs. You can force it to keep them with the -k option. The following two examples show the difference with and without -k.

In the first example, the command splits 'domainslist' at line 3 and repeats the split twice ({2}). GNU csplit treats each repetition as the next multiple of the line number, so it tries to split before lines 3, 6, and 9. Since our source file has only eight lines, the last split fails with a "line number out of range" error, and csplit deletes the pieces it had already written. Hence, no output files are produced.

# csplit domainslist 3 {2}
24
36
36
csplit: ‘3’: line number out of range on repetition 2
# ll
total 12
drwxr-xr-x 2 root root 4096 Mar 25 15:41 ./
drwxr-xr-x 4 root root 4096 Mar 25 14:07 ../
-rw-r--r-- 1 root root 96 Mar 25 14:08 domainslist

When we run the same command with the -k option, the same error is reported but the output files are not deleted. Please see the result below:

# csplit -k domainslist 3 {2}
24
36
36
csplit: ‘3’: line number out of range on repetition 2
# ll
total 24
-rw-r--r-- 1 root root 96 Mar 25 14:08 domainslist
-rw-r--r-- 1 root root 24 Mar 25 15:41 xx00
-rw-r--r-- 1 root root 36 Mar 25 15:41 xx01
-rw-r--r-- 1 root root 36 Mar 25 15:41 xx02
# cat xx00
domain1.com
domain2.com
# cat xx01
domain3.com
domain4.com
domain5.com
# cat xx02
domain6.com
domain7.com
domain8.com
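In a script, the two cases can be told apart by checking the exit status and whether any pieces were left behind. The following sketch assumes GNU csplit and recreates the eight-line domainslist used above; the error message is discarded so only the summary lines are printed.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the eight-line sample file.
printf 'domain%d.com\n' 1 2 3 4 5 6 7 8 > domainslist

# Without -k: the failed repetition makes csplit delete its pieces.
csplit -s domainslist 3 '{2}' 2>/dev/null || echo "csplit failed (status $?)"
echo "pieces left without -k: $(find . -name 'xx*' | wc -l)"

# With -k: the same error occurs, but the pieces survive.
csplit -s -k domainslist 3 '{2}' 2>/dev/null || echo "csplit failed (status $?)"
echo "pieces left with -k: $(find . -name 'xx*' | wc -l)"
```

Because csplit still exits with a non-zero status when -k is given, a script should decide explicitly whether partial output is acceptable.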

For more information, check the manual page with man csplit.


Wrapping up

A Linux user may not need these command-line utilities on a daily basis, but they are important tools that will help you in server administration. I hope this article explained the basic options and uses of both tools. Please post your comments and suggestions.

Saheetha Shameer 12:05 am

