103.2. Process text streams using filters


Weight: 3

Description: Candidates should be able to apply filters to text streams.

Key Knowledge Areas:

Send text files and output streams through text utility filters to modify the output using standard UNIX commands found in the GNU textutils package

Terms and Utilities:

  • cat

  • cut

  • expand

  • fmt

  • head

  • join

  • less

  • nl

  • od

  • paste

  • pr

  • sed

  • sort

  • split

  • tail

  • tr

  • unexpand

  • uniq

  • wc

Everything in Linux revolves around streams of data—particularly text streams.

streams

A stream is nothing more than a sequence of bytes that is passed from one file, device, or program to another.

Input and output in the Linux environment are distributed across three streams (which are in fact special files).

These streams are:

  • standard input stream (stdin), which provides input to commands.

  • standard output stream (stdout), which displays output from commands.

  • standard error stream (stderr), which displays error output from commands.

The streams are also numbered: stdin (0), stdout (1), stderr (2).

piping with |

Piping is a mechanism for sending data from one program to another. The operator is | (typed with Shift and the backslash \ key on most keyboards). It feeds the output of the program on its left as input to the program on its right.

Either command can have options or arguments. We can also use | to redirect the output of the second command in the pipeline to a third command, and so on.

Constructing long pipelines of commands that each have limited capability is a common Linux and UNIX way of accomplishing tasks.
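As a small illustration of such a pipeline, the sketch below chains several single-purpose commands. The sample words are made up for the example.

```shell
# Each command does one small job; the pipes connect them.
# printf emits four lines, sort orders them, uniq -c counts
# adjacent duplicates, sort -rn puts the highest count first,
# and head keeps only that first line.
printf 'pear\napple\npear\nfig\n' | sort | uniq -c | sort -rn | head -n 1
```

The result is the most frequent word together with its count.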

Redirection

Linux provides redirection operators for each stream. We can use > to redirect the standard output stream, most often to a file. Likewise, < redirects standard input and 2> redirects standard error.

"|" vs ">"

The difference between > (redirection operator) and | (pipeline operator) is that while the > connects a command with a file, the | connects the output of a command with another command.
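The difference can be sketched as follows. The file names and the nonexistent path are only placeholders for the example.

```shell
# Scratch directory for the illustrative files.
workdir=$(mktemp -d)

# > connects a command with a file: echo's stdout lands in out.txt.
echo 'hello' > "$workdir/out.txt"

# 2> redirects stream number 2 (stderr) to a file.
# ls fails on purpose here; "|| true" ignores its exit status.
ls /nonexistent-path-for-demo 2> "$workdir/err.txt" || true

# | connects a command with another command: cat's output feeds wc.
cat "$workdir/out.txt" | wc -l
```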

Text filtering

Text filtering is the process of taking an input stream of text and performing some conversion on the text before sending it to an output stream.

cat

The cat (short for "concatenate") command is one of the most frequently used commands on Linux and Unix-like operating systems. It lets us create single or multiple files, view the contents of a file, concatenate files, and redirect output to the terminal or to files.

The simplest use of cat is displaying the contents of a file.

It can also show the contents of multiple files at once.

The cat command is also used to concatenate a number of files together.

Combined with output redirection, cat can create a new file.

A hyphen ("-") used alone generally signifies that input will be taken from stdin as opposed to a named file.

The full list of cat options is available from cat --help.

What is the opposite of cat? It is tac, which prints the lines of its input in reverse order, last line first. Try it for yourself.
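The cat and tac uses described in this section can be sketched as follows, assuming GNU coreutils. The file names and contents are invented for the example.

```shell
# Illustrative files in a temporary directory.
workdir=$(mktemp -d)
printf 'first\n'  > "$workdir/a.txt"
printf 'second\n' > "$workdir/b.txt"

# Display one file, then several files in sequence.
cat "$workdir/a.txt"
cat "$workdir/a.txt" "$workdir/b.txt"

# Concatenate files into a new one by redirecting the output.
cat "$workdir/a.txt" "$workdir/b.txt" > "$workdir/both.txt"

# Create a file from standard input (fed by a here-document here;
# interactively you would type the text and press Ctrl+D).
cat > "$workdir/new.txt" <<'EOF'
typed text
EOF

# A lone hyphen stands for stdin, so piped text can be placed
# between named files.
echo 'middle' | cat "$workdir/a.txt" - "$workdir/b.txt"

# tac prints the lines in reverse order.
tac "$workdir/both.txt"
```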

od

The od (octal dump) command outputs the contents of a file in different formats, with octal being the default. This command is especially useful when debugging Linux scripts for unwanted changes or characters.

With the -t option we can select the output format (traditional format specifications such as -c or -x may be intermixed with it).

The -A option selects the radix in which the offsets in the leftmost column are displayed:

  • Hexadecimal (-A x)

  • Octal (-A o, the default)

  • Decimal (-A d)

-A n suppresses the offsets altogether. For example, od -An -c displays the input in character format with no offset information.
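A short sketch of these od options, using a three-byte input invented for the example:

```shell
# Default output: octal words with octal offsets in the left column.
printf 'AB\n' | od

# -c shows characters (the newline appears as \n);
# -t x1 shows one hexadecimal byte at a time.
printf 'AB\n' | od -c
printf 'AB\n' | od -t x1

# -A picks the offset radix: x = hexadecimal, d = decimal, n = none.
printf 'AB\n' | od -A x -c
printf 'AB\n' | od -A d -c
printf 'AB\n' | od -An -c
```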

expand and unexpand

The expand command is used to convert tabs in files to spaces.

By default, expand converts each tab into the number of spaces needed to reach the next tab stop, with tab stops set every 8 columns. The -t N (--tabs=N) option changes this: N is the new distance between tab stops.
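A brief sketch of expand with and without -t, on made-up input:

```shell
# One leading tab with the default tab stops (every 8 columns);
# od -c makes the resulting spaces visible.
printf '\tx\n' | expand | od -c

# -t 4 sets tab stops every 4 columns instead.
printf '\tx\n' | expand -t 4 | od -c

# A tab after text only pads up to the next tab stop.
printf 'ab\tx\n' | expand -t 4 | od -c
```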

The remaining expand options are listed in expand --help.

The unexpand command does the reverse: it converts space characters (blanks) into tabs in each file (unexpand needs at least two spaces).

With no options, unexpand converts only the initial blanks of each line. The -a option converts all blanks, instead of just the initial ones.

unexpand only converts runs of two or more spaces that end at a tab stop; it does not convert single spaces.
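The sketch below shows this behaviour with GNU unexpand and the default tab stops (every 8 columns). The inputs are built with printf field widths so the number of spaces is exact; od -c makes tabs visible.

```shell
# Eight leading spaces reach the first tab stop and become one tab.
printf '%8sx\n' '' | unexpand | od -c

# 'a', seven spaces, 'b': blanks after text are left alone by default ...
printf 'a%7sb\n' '' | unexpand | od -c

# ... but -a converts them, because the run ends at column 8.
printf 'a%7sb\n' '' | unexpand -a | od -c

# A single space is not turned into a tab, even with -a.
printf 'a b\n' | unexpand -a | od -c
```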

The unexpand options are listed in unexpand --help.

tr command

tr stands for translate. The tr utility copies standard input to standard output, substituting or deleting selected characters. Its syntax is tr [OPTION]... SET1 [SET2].

A common use is converting lower case to upper case.

The same conversion can be written with the character classes [:lower:] and [:upper:].

tr can also translate white space to tabs.

If two or more spaces occur in a row, that translation turns each space into its own tab. We can use the -s option to squeeze repeated characters into one.

The -d option deletes the specified characters.

The -c option complements the set. For example, combining -c and -d with a set of digits removes all characters except digits.
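These tr uses can be sketched as follows. The input strings are invented for the example.

```shell
# Lower case to upper case, with ranges and with character classes.
echo 'hello world' | tr 'a-z' 'A-Z'
echo 'hello world' | tr '[:lower:]' '[:upper:]'

# Translate each space to a tab (two spaces give two tabs) ...
printf 'a  b\n' | tr ' ' '\t' | od -c

# ... or add -s to squeeze the repeated tabs into a single one.
printf 'a  b\n' | tr -s ' ' '\t' | od -c

# -d deletes characters; -cd deletes everything NOT in the set
# (the trailing newline included).
echo 'hello world' | tr -d 'l'
echo 'order 66, aisle 7' | tr -cd '0-9'
```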

tr has many more options and sets; try tr --help for more information.

pr

The pr command is used to format files for printing. The default header includes the file name, the file's last modification date and time, and a page number, and each page ends with a trailer of blank lines.

Note: When output is created from multiple files or the standard input stream, the current date and time are used instead of the file name and modification date.

We can print files side by side in columns and control many aspects of formatting through options.

--columns defines the number of columns created in the output. -l specifies the page length (default is 66 lines). As usual, refer to the man page for details.
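A minimal sketch of these options, assuming GNU pr; seq merely supplies eight numbered lines as sample input.

```shell
# -t omits the header and trailer so that only the columns remain;
# --columns=2 lays the eight input lines out in two columns.
seq 8 | pr -t --columns=2

# With the header and trailer left on, -l sets the page length in lines.
seq 8 | pr -l 20
```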

nl

nl is a Linux command that numbers the lines of files: it copies its input files to standard output, prepending line numbers.

-n FORMAT sets the line numbering format. Recognized formats are:

  • ln: left-justified, leading zeros suppressed

  • rn: right-justified, leading zeros suppressed (default)

  • rz: right-justified, leading zeros kept

By default nl skips blank lines and does not number them; use -b a to number them as well.
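The numbering options can be tried on a few made-up lines:

```shell
# Default: the blank second line gets no number.
printf 'alpha\n\nbeta\n' | nl

# -b a numbers all lines, blank ones included.
printf 'alpha\n\nbeta\n' | nl -b a

# -n rz keeps leading zeros; -w 3 sets the number width to 3.
printf 'alpha\nbeta\n' | nl -n rz -w 3

# -n ln left-justifies the numbers.
printf 'alpha\nbeta\n' | nl -n ln
```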

Other nl options are listed in nl --help.

cat -n filename produces similar output, except that it numbers blank lines too (like nl -b a).

fmt

fmt is a simple text formatter: it reformats the paragraphs in the specified file and prints the result to standard output.

By default fmt sets the maximum line width to 75 columns. This can be changed with the -w WIDTH (--width=WIDTH) option.
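A quick sketch of fmt on made-up text. The exact line breaks are chosen by fmt, so no particular output is assumed here beyond the width limit.

```shell
# Reflow one long line so that no output line exceeds 30 columns.
echo 'the quick brown fox jumps over the lazy dog and keeps on running' | fmt -w 30

# Short lines belonging to one paragraph are joined, up to the
# default width.
printf 'one\ntwo\nthree\n' | fmt
```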

The other fmt options are listed in fmt --help.

sort and uniq

sort is a Linux program that prints the lines of its input files, concatenated, in sorted order. By default it treats blank space as the field separator and uses the entire line as the sort key. Note that sort does not modify the files themselves; it only prints the sorted output, so redirect the output if you want to keep it.

If a file has lines beginning with both upper-case and lower-case characters, sort may place the upper-case ones at the top (the exact order depends on the locale). We can change this behavior with the -f option, which ignores case.

The -n option sorts the contents numerically. We can also sort a file on its nth column with the -k n option.

Use -r to reverse the result of the comparisons. The other options are listed in sort --help.

sort can also sort the combined contents of two files to standard output in one go, for example sort 1.txt 2.txt.
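The sort options above can be sketched on invented data. LC_ALL=C is set in the first two commands only to make the case-sensitive ordering reproducible; it is an assumption of this example, not something the text requires.

```shell
# Plain sort in the C locale: upper case sorts before lower case.
printf 'banana\nApple\nCherry\n' | LC_ALL=C sort

# -f folds case, so banana now lands between Apple and Cherry.
printf 'banana\nApple\nCherry\n' | LC_ALL=C sort -f

# -n compares numerically (as text, 10 and 100 would sort before 9).
printf '10\n9\n100\n' | sort -n

# -k 2 sorts on the second field; the n suffix makes that key numeric.
printf 'bob 30\namy 4\ncal 25\n' | sort -k 2n

# -r reverses the result.
printf '10\n9\n100\n' | sort -nr
```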

The uniq command reports or omits repeated lines: it filters lines from standard input and writes the outcome to standard output. It only detects repeated lines that are adjacent, so its input is usually sorted first.

Use -c to display the number of repetitions for each line.

-d displays only the repeated lines and, vice versa, -u shows only the unique ones.

Try -D to see all duplicated lines. The other options are listed in uniq --help.
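A short sketch of these uniq options on made-up input (-D assumes GNU uniq):

```shell
# uniq only compares neighbouring lines, so the input is sorted first.
printf 'b\na\nb\nc\na\nb\n' | sort | uniq

# -c prefixes each line with its number of occurrences.
printf 'b\na\nb\nc\na\nb\n' | sort | uniq -c

# -d prints one copy of each repeated line; -u prints the lines
# that were never repeated.
printf 'b\na\nb\nc\na\nb\n' | sort | uniq -d
printf 'b\na\nb\nc\na\nb\n' | sort | uniq -u

# -D prints every copy of the repeated lines.
printf 'b\na\nb\nc\na\nb\n' | sort | uniq -D
```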

split

The split command is used to split a file into pieces. Its general form is:

split [options] filename prefix

  • Replace filename with the name of the large file you wish to split.

  • Replace prefix with the name you wish to give the small output files.

  • We can exclude [options], or replace it with either of the following:

  • -l linenumber

  • -b bytes

If we use the -l (a lowercase L) option, replace linenumber with the number of lines we'd like in each of the smaller files (the default is 1,000).

The split command will give each output file it creates the name prefix with an extension tacked to the end that indicates its order. By default, the split command adds aa to the first output file, proceeding through the alphabet to zz for subsequent files. If you do not specify a prefix, most systems use x.

If we use the -b option, replace bytes with the number of bytes you'd like in each of the smaller files.
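Both modes can be tried on a small file. The 25-line input and the part_ and bytes_ prefixes are invented for this sketch.

```shell
# Illustrative 25-line input file in a temporary directory.
workdir=$(mktemp -d)
seq 25 > "$workdir/big.txt"

# -l 10: at most ten lines per piece. With the prefix "part_" the
# pieces are named part_aa, part_ab and part_ac.
split -l 10 "$workdir/big.txt" "$workdir/part_"
wc -l "$workdir"/part_*

# -b 16: pieces of at most 16 bytes each.
split -b 16 "$workdir/big.txt" "$workdir/bytes_"
ls "$workdir"

# Concatenating the pieces in name order restores the original;
# cmp prints nothing when the two inputs are identical.
cat "$workdir"/part_* | cmp - "$workdir/big.txt"
```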

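Other options are listed in split --help.

To rejoin the split files, use cat x* > originalfile (the shell expands x* in alphabetical order, which matches the order in which split wrote the pieces).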

wc

The wc (word count) command reports the number of newlines, words, bytes, and characters in a file.

A basic example of the wc command

With no options, wc prints three numbers followed by the file name, for example 17 (number of lines), 80 (number of words, delimited by white space by default), and 511 (number of bytes).
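The same counts can be reproduced on a tiny made-up input, where they are easy to check by hand:

```shell
# Two lines, three words, fourteen bytes (newlines included).
printf 'one two\nthree\n' | wc

# Individual counts: lines, words, bytes.
printf 'one two\nthree\n' | wc -l
printf 'one two\nthree\n' | wc -w
printf 'one two\nthree\n' | wc -c
```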

The -l, -w, -c, and -m options restrict the output to lines, words, bytes, or characters respectively.

head and tail

The head command prints the first ten lines of any given file.

For example, we could take a look at the /var/log/yum.log file.

To retrieve a specific number of lines, use the -n number option or the shorter -number form.
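Since /var/log/yum.log may not exist on every system, the sketch below uses seq as a stand-in input; it simply prints the numbers 1 to 20, one per line.

```shell
seq 20 | head          # first ten lines (the default)
seq 20 | head -n 3     # first three lines
seq 20 | head -3       # same, short form
```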

The remaining options are listed in head --help.

tail

The tail command displays the last ten lines of any text file.

Like head, tail supports -n number to select a number of lines and -c number to select a number of bytes.

The -f option makes tail loop forever, checking for new data at the end of the file(s). When new data appears, it is printed. This works great with log files and lets us see what is going on.
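A sketch of these tail options, again with seq as stand-in input. The tail -f line is left as a comment because it runs until interrupted, and its path is only an example.

```shell
seq 20 | tail          # last ten lines (the default)
seq 20 | tail -n 3     # last three lines
seq 20 | tail -c 3     # last three bytes: "20" plus the newline

# Follow a growing file until interrupted with Ctrl+C:
# tail -f /var/log/messages
```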

The remaining options are listed in tail --help.

less

The less command lets you view the contents of a file and navigate through it.

By default the only way to exit less is to press the q key. To exit automatically at the end of the file, use the -e option (exit the second time the end of the file is reached) or the -E option (exit the first time):

less -e /var/log/auth.log

less -E /var/log/auth.log

  • To open a file at the first occurrence of a pattern use the following syntax:

less +/sshd /var/log/auth.log

  • To keep following the content appended to a file opened in less, press the Shift+f key combination or run less with the following syntax:

less +F /var/log/messages

This makes less run in a live mode, displaying new content on the fly while waiting for new data to be written to the file. This behavior is similar to the tail -f command. To exit live mode, press Ctrl+c.

Tip: In combination with a pattern, you can watch the log file interactively with the Shift+f keystroke while matching a keyword.

less vs more

The less command is similar to more. The main difference between them is that less is faster, because it does not load the entire file at once, and that it allows navigation through the file using the Page Up and Page Down keys.

Whether you decide to use more or less, which is a personal choice, remember that less is more with more features.

cut

The cut command in UNIX is a command-line utility for cutting sections from each line of files and writing the result to standard output. It can cut parts of a line by byte position, by character, and by delimiter.

To cut by byte position, use the -b option.

To cut by character, use the -c option.

To cut using a delimiter, use the -d option. This is normally used in conjunction with the -f option to specify the field that should be cut.
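The three ways of cutting can be sketched on a passwd-style sample line (the line itself is invented for the example):

```shell
# Illustrative colon-separated record.
line='alice:x:1000:1000:Alice:/home/alice:/bin/bash'

# -b selects bytes and -c selects characters by position.
echo "$line" | cut -b 1-5
echo "$line" | cut -c 1,7

# -d sets the delimiter and -f picks the field or fields to keep.
echo "$line" | cut -d ':' -f 1
echo "$line" | cut -d ':' -f 1,7
```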

cut has many more options; see cut --help.

paste

The paste command displays the corresponding lines of multiple files side-by-side.

paste writes lines consisting of the sequentially corresponding lines from each FILE, separated by tabs. To use a colon (:) as the delimiting character instead of tabs, use the -d option.
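As a sketch, with two small files whose names and contents are invented for the example:

```shell
workdir=$(mktemp -d)
printf 'alice\nbob\n' > "$workdir/names.txt"
printf '30\n25\n'     > "$workdir/ages.txt"

# Default: corresponding lines joined with a tab.
paste "$workdir/names.txt" "$workdir/ages.txt"

# -d ':' uses a colon as the delimiter instead.
paste -d ':' "$workdir/names.txt" "$workdir/ages.txt"
```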

The paste options are listed in paste --help.

join

The join command joins the lines of two files that share a common field of data.

When using join, the input files must be sorted on the join field; otherwise join may warn that a file is not sorted.

By default, the join command only prints pairable lines; unpairable lines are left out of the output. If we want them in the output as well, we can use the -a option. This option requires a file number (1 or 2) so that the tool knows which file we are talking about.

To print only the unpaired lines (that is, to suppress the paired lines in the output), use the -v option. It takes a file number exactly the way -a does.

join combines lines of files on a common field, which is the first field by default. We can specify a different field for each file using the -1 and -2 options. For example, join -1 2 -2 2 file1 file2 joins on the second field of each line. The remaining options are listed in join --help.
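The pairing rules can be sketched with two small files that are already sorted on their first field. File names and contents are invented for the example.

```shell
workdir=$(mktemp -d)
printf '1 apple\n2 banana\n3 cherry\n' > "$workdir/fruit.txt"
printf '1 red\n3 dark\n4 green\n'      > "$workdir/color.txt"

# Only lines whose first field appears in both files.
join "$workdir/fruit.txt" "$workdir/color.txt"

# -a 1 also prints the unpairable lines of the first file.
join -a 1 "$workdir/fruit.txt" "$workdir/color.txt"

# -v 2 prints only the unpairable lines of the second file.
join -v 2 "$workdir/fruit.txt" "$workdir/color.txt"
```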

sed

The name sed is short for stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). sed uses regular expressions, and its most basic (and popular) use is the substitution of characters.

As an example, we can replace 'l' with 'L' in a sample text file using the s (substitute) command.

By default sed performs the substitution only once per line, for the first instance of the term. Append the g flag to the s command (it is a flag of the command, not a -g option) to substitute all instances of the term on every line of the file.

Additionally, we can use gi instead of g in order to ignore character case.

Another example is replacing blank spaces with tabs.
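These substitutions can be sketched as follows, assuming GNU sed. The i flag and \t in the replacement are GNU extensions, and echo stands in for the sample text file.

```shell
# Without a flag, only the first 'l' on each line is replaced.
echo 'hello world' | sed 's/l/L/'

# g flag: every 'l' on each line.
echo 'hello world' | sed 's/l/L/g'

# gi: every match, ignoring case.
echo 'Linux loves lists' | sed 's/l/_/gi'

# Replace each blank with a tab; od -c makes the tabs visible.
echo 'a b c' | sed 's/ /\t/g' | od -c
```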

sed is extremely powerful, and the tasks it can accomplish are limited only by your imagination.


