3.2 Searching and Extracting Data from Files

Weight: 3

Description: Search and extract data from files in the home directory.

Key Knowledge Areas:

  • Command line pipes

  • I/O redirection

  • Basic Regular Expressions using ., [ ], *, and ?

The following is a partial list of the used files, terms and utilities:

  • grep

  • less

  • cat, head, tail

  • sort

  • cut

  • wc

cat

The cat command in Linux is one of the most frequently used commands in Unix-like operating systems. It stands for “concatenate” and is primarily used to read, display, and concatenate text files.

  • Primarily used to read and display the contents of files on the terminal.

  • Can concatenate multiple files and display them as a single continuous output.

  • Allows users to create new files or append data to existing ones.

  • Useful for quick file inspection and merging without opening a text editor.

View the Content of a Single File using cat

The most basic use of cat is to display the contents of a file on the terminal. To do so, provide the file name as an argument.

Syntax: cat [OPTION]... [FILE]...

Example: if the file is named output.txt, run cat output.txt to print its contents.
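
The uses of cat listed above can be tried out with the following sketch. It is only an illustration: the scratch directory, the second file name (extra.txt) and the file contents are assumptions made for the example.

```shell
# Work in a scratch directory so nothing in your home directory is touched.
demo=$(mktemp -d)
cd "$demo" || exit 1

# Create two small sample files (names and contents are made up).
printf 'first line\nsecond line\n' > output.txt
printf 'third line\n' > extra.txt

# Display a single file.
cat output.txt

# Concatenate two files and display them as one continuous output.
cat output.txt extra.txt

# Number the output lines with -n.
cat -n output.txt
```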

streams

A stream is nothing more than a sequence of bytes that is passed from one file, device, or program to another.

In Linux, a stream is a fundamental concept for handling input, output, and communication between processes. At its core, a stream represents a sequence of bytes that can be read from or written to. Streams provide a uniform interface for data transfer and processing across various input/output operations.

Every command starts with three standard streams:

  • standard input stream (stdin), which provides input to commands.

  • standard output stream (stdout), which displays output from commands.

  • standard error stream (stderr), which displays error output from commands.

The streams are also numbered: stdin (0), stdout (1), stderr (2).

I/O Redirection

The shell provides redirection operators for each stream. These can be used to write standard output or standard error to a file, or to read standard input from a file. If you write to a file that does not exist, a new file with that name is created before writing.

Output operators with a single angle bracket overwrite the destination's existing contents.

Overwrite

  • > - standard output

  • < - standard input (the file is only read, so nothing is overwritten)

  • 2> - standard error

Output operators with a double angle bracket append to the destination instead of overwriting it.

Append

  • >> - standard output

  • << - standard input from a here-document: the lines that follow, up to a delimiter word, are passed to the command as input (this operator does not append to anything)

  • 2>> - standard error

Examples:
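
A minimal sketch of the redirection operators, assuming a scratch directory and made-up file names (out.txt, err.txt, no-such-file):

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1

# > creates out.txt, or overwrites it if it already exists.
echo "first run" > out.txt
echo "second run" > out.txt     # out.txt now holds only "second run"

# >> appends instead of overwriting.
echo "third run" >> out.txt     # out.txt now holds two lines

# 2> captures standard error; ls fails because the file does not exist.
ls no-such-file 2> err.txt || true

# 2>> appends further error messages to the same file.
ls another-missing-file 2>> err.txt || true

# < feeds a file to a command's standard input.
wc -l < out.txt

# << starts a here-document: the lines up to the delimiter (EOF) become stdin.
sort <<EOF
pear
apple
EOF
```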

piping with |

A pipe is a form of redirection used in Linux and other Unix-like operating systems to send the output of one command to another command for further processing. It connects the stdout of the command on its left to the stdin of the command on its right. You create a pipe with the pipe character | (on most keyboards it shares a key with the backslash \).

A pipe can combine two or more commands: the output of one command acts as input to the next command, whose output may in turn act as input to another command, and so on. The command line programs that do the further processing are referred to as filters.

Because the commands are connected directly, they operate simultaneously and data is transferred between them continuously, rather than passing through temporary text files or the display screen. Pipes are unidirectional, i.e. data flows from left to right through the pipeline.

Either command can have options or arguments. We can also use | to redirect the output of the second command in the pipeline to a third command, and so on.
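
As a small sketch of two-stage and three-stage pipelines (fruits.txt and its contents are invented for the example):

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1

printf 'banana\napple\ncherry\napple\n' > fruits.txt

# Two commands: cat writes the file to stdout, sort reads it on stdin.
cat fruits.txt | sort

# Three commands: keep the lines containing "apple", then count them.
cat fruits.txt | grep "apple" | wc -l
```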

View Kernel Messages in Linux

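The dmesg command prints the kernel's messages. Now let's redirect the output of dmesg to the input of the less command by running dmesg | less.

This way we have more control over reading the logs, using the navigation and search features of less.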

Filters

Filters are a class of programs that are commonly used with output piped from another program. Many of them are also useful on their own, but they illustrate piping behavior especially well.

  • grep - returns the lines that match the pattern passed to grep.

  • head - displays the first few lines of one or more text files directly in the terminal.

  • tail - displays the last part of a file, showing recent content such as logs or updates.

  • sort - sorts a file, arranging the records in a particular order.

  • wc - counts characters, lines, and words.

grep

Grep, short for “global regular expression print”, is one of the most useful tools in Linux and Unix systems. It is used to search for specific words, phrases, or patterns inside text files, and shows the matching lines on your screen.

The grep command is useful when you need to quickly find certain keywords or phrases in logs or documents. Let's consider an example:

Search for a word in a file

If you have a file called notes.txt and you want to find all lines containing the word Python, you can use grep "Python" notes.txt.

Syntax:

The basic syntax of the `grep` command is grep [options] [pattern] [file], where:

  • [options]: These are command-line flags that modify the behavior of grep.

  • [pattern]: This is the regular expression you want to search for.

  • [file]: This is the name of the file(s) you want to search within. You can specify multiple files for simultaneous searching.
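
The notes.txt search can be tried with the sketch below. The file contents are made up for the illustration.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1

printf 'I like Python\nBash is handy\npython in lowercase\n' > notes.txt

# Print every line that contains "Python". The match is case-sensitive,
# so the line with lowercase "python" is not printed.
grep "Python" notes.txt

# The same search with the text supplied on standard input.
cat notes.txt | grep "Python"
```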

Commonly Used grep Options

Each entry gives the option, what it does, and an example command:

  • -i: Case-insensitive search. Example: grep -i "unix" myfile.txt

  • -c: Displays the count of matching lines. Example: grep -c "unix" myfile.txt

  • -l: Displays only the names of the files that contain a match. Examples: grep -l "unix" * and grep -l "unix" f1.txt f2.txt f3.txt f4.txt

  • -w: Checks whole words. By default, grep matches the given string/pattern even if it is found as a substring in a file. The -w option makes grep match only whole words. Example: grep -w "unix" myfile.txt

  • -o: Displays only the matched pattern. By default, grep displays the entire line that contains the matched string. The -o option makes grep display only the matched string. Example: grep -o "unix" myfile.txt

  • -n: Shows line numbers. Example: grep -n "unix" myfile.txt

  • -v: Inverts the pattern match. It displays the lines that do not match the specified search pattern. Example: grep -v "unix" myfile.txt
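
The options can be compared side by side with a small test file. This is only a sketch: the contents of myfile.txt are an assumption chosen so that each option gives a different result.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1

printf 'unix is great\nUnix is old\nlearn unixlike systems\nlinux too\n' > myfile.txt

grep -i "unix" myfile.txt   # ignores case: lines 1, 2 and 3
grep -c "unix" myfile.txt   # counts matching lines
grep -w "unix" myfile.txt   # whole word only: "unixlike" is skipped
grep -o "unix" myfile.txt   # prints only the matched text
grep -n "unix" myfile.txt   # prefixes each match with its line number
grep -v "unix" myfile.txt   # prints the lines that do NOT match
grep -l "unix" *            # lists the files that contain a match
```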

Regular Expressions

Regex (or regexp) is short for regular expression. Regular expressions are special characters or sets of characters that help us search for data and match complex patterns. They are most commonly used with the Linux commands grep, sed, tr and vi.

The following are some basic regular expression symbols and their meaning:

  • . : The wildcard character. It matches any one character other than the newline.

  • ^ : Matches the start of a line.

  • $ : Matches the end of a line.

  • * : Matches zero or more occurrences of the preceding character.

  • \ : Escapes the following character.

  • ( ) : Groups a set of regular expressions (extended regular expressions).

  • ? : Matches zero or one occurrence of the preceding character (extended regular expressions). Unlike the shell wildcard ?, it does not stand for a character by itself.

  • [ ] : Matches any one of a set of characters.

  • [ - ] : Matches any one of a range of characters.

Globbing and Regex: So Similar, So Different

Beginners sometimes tend to confuse wildcards(globbing) with regular expressions when using grep but they are not the same. Wildcards are a feature provided by the shell to expand file names whereas regular expressions are a text filtering mechanism intended for use with utilities like grep, sed and awk.
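
The difference can be seen in a short sketch. The file names and the names.list file are made up for the illustration.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1

touch a.txt ab.txt notes.md
printf 'a.txt\nab.txt\naXtxt\n' > names.list

# Globbing: the shell expands a?.txt to matching file names before ls runs.
# As a wildcard, ? stands for exactly one character, so only ab.txt matches.
ls a?.txt

# Regex: the quoted pattern reaches grep unchanged. As a regex, . matches
# any single character, so both "a.txt" and "aXtxt" match, but "ab.txt"
# does not.
grep 'a.txt' names.list
```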

grep supports regex for advanced searching:

  • grep "^unix" myfile.txt: matches lines starting with a string.

  • grep "os$" myfile.txt: matches lines ending with a string.

Quotes: we need to put the regular expression between quotes, otherwise characters such as * and ? might be interpreted by the shell and give us different results. Single quotes are the safest choice, because inside double quotes the shell still expands $ variables and backticks.

To avoid mistakes while using extended regular expressions, use grep with the -E option, which treats the pattern as an extended regular expression (ERE).

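Each of the following commands prints the whole input line, aa ab ba aaa bbb AB BA, because grep outputs every line that contains at least one match. The patterns differ in which parts of the line they match:

  • echo "aa ab ba aaa bbb AB BA" | grep -E "a*b": matches ab and every single b, because a* also accepts zero occurrences of a.

  • echo "aa ab ba aaa bbb AB BA" | grep -E "a.b": matches only a b (the last a of aaa, the space, and the first b of bbb).

  • echo "aa ab ba aaa bbb AB BA" | grep -E "a?b": matches ab and every single b, because the a is optional.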
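
To see exactly which parts of the line each pattern matches, the -o option from the options list can be combined with -E. This is a sketch only; the variable name line is chosen for the example.

```shell
line="aa ab ba aaa bbb AB BA"

# a*b : zero or more "a" followed by "b"
echo "$line" | grep -oE 'a*b'

# a.b : "a", then any single character, then "b"
echo "$line" | grep -oE 'a.b'

# a?b : an optional "a" followed by "b"
echo "$line" | grep -oE 'a?b'
```

Each match is printed on its own line, which makes the difference between *, . and ? visible even though all three patterns select the same input line.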

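egrep

egrep is equivalent to grep -E. Recent GNU grep releases mark egrep as obsolescent, so prefer grep -E.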

Head and Tail Commands

The head command in Linux is used to display the first few lines of one or more text files directly in the terminal.

  • The head command reads a file and prints the top portion (default is the first 10 lines) to standard output.

  • It’s commonly used when you want to quickly preview the beginning of a file without opening it in an editor.

  • It supports options to specify the number of lines or bytes to display.

  • You can use it with multiple files at once to view the first lines of each.

The basic head command displays the first 10 lines of a file. For example, head sample.txt prints the first 10 lines of the sample.txt file.

Syntax: head [OPTION]... [FILE]...

If no file name is specified, head reads from standard input (stdin).

Common head options (short form, long form, description):

  • -n, --lines: shows the specified number of lines.

  • -c, --bytes: shows the specified number of bytes.

  • -v, --verbose: shows the file name tag.

  • -q, --quiet: doesn't separate the content of multiple files with a file name tag.

example:
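
A minimal sketch of these options. The files sample.txt and other.txt are generated with seq purely for the illustration.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1

seq 1 20 > sample.txt       # the numbers 1 to 20, one per line
seq 101 120 > other.txt

head sample.txt                      # first 10 lines (the default)
head -n 3 sample.txt                 # first 3 lines
head -c 5 sample.txt                 # first 5 bytes
head -n 2 sample.txt other.txt       # several files, each with a name tag
head -q -n 2 sample.txt other.txt    # several files, no name tags
seq 1 100 | head -n 4                # no file given: reads from stdin
```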

tail

Tail Command in Linux is used to display the last part of a file, showing recent content such as logs or updates.

  • By default, it shows the last 10 lines of a file.

  • Commonly used for monitoring log files and debugging.

  • You can customize the number of lines displayed using the -n option.

  • Useful for viewing the most recent entries without opening the entire file.

Without any option, it displays only the last 10 lines of the specified file.

Syntax: tail [OPTION]... [FILE]...

Common tail options (short form, long form, description):

  • -c, --bytes=[+]NUM: Shows the last NUM bytes of a file. Using +NUM shows the bytes starting from byte NUM of each file.

  • -f, --follow[={name|descriptor}]: Monitors the file for changes and outputs new data as the file grows. When no value is specified after --follow=, descriptor is used as the default value. This means that the update mode continues to run even when the file is renamed or moved. Specify the --max-unchanged-stats=N argument to reopen a file that has not changed size after N (default 5) iterations to check if it has been unlinked or renamed. Specify the --pid=PID argument to exit tail after the process with the PID process ID terminates.

  • -F, --follow=name --retry: Instructs tail to keep updating the output even if the original file is removed during log rotation and replaced by a new one with the same name.

  • -n, --lines=[+]NUM: Shows the last NUM lines instead of the default 10. Using -n +NUM causes the output to start with line NUM.

  • -q, --quiet, --silent: Omits the file names from the output, displaying only the contents.

  • -s, --sleep-interval=N: Used in combination with -f. Instructs tail to wait for N seconds (default 1) between iterations.

  • -v, --verbose: Makes tail always print the file name before displaying the contents.

  • -z, --zero-terminated: Uses NUL as the line delimiter instead of the newline character.

  • --help: Displays the help file.
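
A short sketch of the most common tail options. The file app.log is a stand-in for a real log and is generated with seq for the illustration.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1

seq 1 20 > app.log          # the numbers 1 to 20, one per line

tail app.log                # last 10 lines (the default)
tail -n 3 app.log           # last 3 lines
tail -n +18 app.log         # start at line 18 and print to the end
tail -c 3 app.log           # last 3 bytes
seq 1 100 | tail -n 2       # no file given: reads from stdin

# Following a growing file is interactive (stop it with Ctrl+C),
# so it is left commented out here:
# tail -f app.log
```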

sort

The sort command is a Linux program that prints the lines of its input text files, concatenated, in sorted order. By default it uses blank space as the field separator and the entire line as the sort key. It is important to notice that sort does not modify the input files: it only prints the sorted output, unless you redirect that output to a file.

Syntax: sort [OPTION]... [FILE]...

If a file has lines beginning with both upper case and lower case characters, their relative order depends on the locale; in the C locale, sort displays those with upper case at the top. We can make the comparison ignore case with the -f command line option.

The -n option sorts the contents numerically. We can also sort a file based on the nth column with the -k n option.

Use -r to reverse the result of comparisons.
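
The options described above can be tried with this sketch. The file fruits.txt and its two columns (a name and a number) are assumptions made for the example.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1

printf 'pear 10\nApple 9\nbanana 100\napple 25\n' > fruits.txt

sort fruits.txt               # default order (depends on the locale)
sort -f fruits.txt            # ignore case when comparing
sort -n -k 2 fruits.txt       # numeric sort on the 2nd column
sort -n -r -k 2 fruits.txt    # same, largest number first

# fruits.txt itself is unchanged; redirect to keep the sorted result.
sort -n -k 2 fruits.txt > sorted.txt
```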

Common sort options (short form, long form, description):

  • -b, --ignore-leading-blanks: Causes sort to ignore leading blanks.

  • -d, --dictionary-order: Causes sort to consider only blanks and alphanumeric characters.

  • -f, --ignore-case: Ignores the default case sorting rule and changes all lowercase letters to uppercase before comparison.

  • -M, --month-sort: Sorts lines according to months (Jan-Dec).

  • -h, --human-numeric-sort: Compares human-readable numbers (e.g., 2K 1G).

  • -n, --numeric-sort: Compares data according to string numerical values.

  • -R, --random-sort: Sorts data by a random hash of keys but groups identical keys together.

  • -r, --reverse: Reverses the comparison results.

  • --sort=WORD: Sorts data according to the specified WORD: general-numeric -g, human-numeric -h, month -M, numeric -n, random -R, version -V.

  • -c, --check, --check=diagnose-first: Checks if the input is already sorted but doesn't sort it.

  • --debug: Annotates the part of the line used for sorting.

  • -k, --key=KEYDEF: Sorts data using the specified KEYDEF, which gives the key location and type.

  • -m, --merge: Causes sort to merge already sorted files.

  • -o, --output=FILE: Redirects the output to FILE instead of printing it to standard output.

  • -t, --field-separator=SEP: Uses the specified SEP separator instead of the non-blank to blank transition.

  • -z, --zero-terminated: Causes sort to use NUL as the line delimiter instead of the newline character.

  • --help: Displays the help file with the full options list and exits.

cut

The cut command in UNIX is a command line utility for cutting sections from each line of files and writing the result to standard output. It can be used to cut parts of a line by byte position, character and delimiter.

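Syntax: cut OPTION... [FILE]...

Note: If FILE is not specified, `cut` reads from standard input (stdin).

Cut by byte position: use the -b option with a list of byte positions, for example cut -b 1-3 FILE.

Cut by character: use the -c option with a list of character positions, for example cut -c 1,3 FILE.

Cut based on a delimiter:

To cut using a delimiter, use the -d option. It is normally used in conjunction with the -f option, which specifies the field or fields to cut, for example cut -d ':' -f 1 FILE.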
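
The three ways of cutting can be tried with this sketch. The file users.txt and its colon-separated layout are made up for the illustration.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1

printf 'alice:x:1000:/home/alice\nbob:x:1001:/home/bob\n' > users.txt

cut -b 1-3 users.txt           # bytes 1 to 3 of each line
cut -c 1,3 users.txt           # characters 1 and 3 of each line
cut -d ':' -f 1 users.txt      # first colon-separated field
cut -d ':' -f 1,4 users.txt    # fields 1 and 4, still joined by ":"

# No file given: cut reads from standard input.
echo "red,green,blue" | cut -d ',' -f 2
```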

Options Available in cut Command

Here is a list of the most commonly used options with the Linux cut command:

  • -b, --bytes=LIST: Selects only the bytes specified in LIST (e.g., -b 1-3,7).

  • -c, --characters=LIST: Selects only the characters specified in LIST (e.g., -c 1-3,7).

  • -d, --delimiter=DELIM: Uses DELIM as the field delimiter character instead of the tab character.

  • -f, --fields=LIST: Selects only the fields specified in LIST, separated by the delimiter character (default is tab).

  • -n: Do not split multi-byte characters (no effect unless -b or -c is specified).

  • --complement: Inverts the selection of fields/characters. Prints the fields/characters not selected.

  • --output-delimiter=STRING: Uses STRING as the delimiter between fields in the output instead of the input delimiter.

wc

wc (short for word count) is a command line tool in Unix/Linux operating systems that counts the newlines, words, bytes and characters in the files given as arguments, writes the counts to standard output, and keeps a total count for all named files.

When you name a file, the wc command prints the file name as well as the requested counts. If you do not name a file, it prints only the counts to standard output.

By default wc prints three numbers. For example, an output of 16 76 490 means the file has 16 lines, 76 words (whitespace-delimited by default) and 490 bytes.

Syntax: wc [OPTION]... [FILE]...

If no file is specified, it will read from standard input, meaning you can type text manually or pipe it from another command.

The following are the options provided by the wc command:

  • wc -l – Prints the number of lines in a file.

  • wc -w – prints the number of words in a file.

  • wc -c – Displays the count of bytes in a file.

  • wc -m – prints the count of characters from a file.

  • wc -L – prints only the length of the longest line in a file.
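
A quick sketch of the counts, using a made-up two-line file named words.txt:

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1

printf 'one two three\nfour five\n' > words.txt

wc words.txt          # lines, words, bytes, then the file name
wc -l words.txt       # lines only
wc -w words.txt       # words only
wc -c words.txt       # bytes only

# Reading from standard input: no file name appears in the output.
wc -l < words.txt
cat words.txt | wc -w
```

The file holds 2 lines, 5 words and 24 bytes, so those are the three numbers the first command reports.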

That's all.


sources:

https://serveracademy.com/blog/the-linux-cat-command/
https://www.geeksforgeeks.org/linux-unix/cat-command-in-linux-with-examples/
https://www.geeksforgeeks.org/linux-unix/input-output-redirection-in-linux/
https://www.geeksforgeeks.org/linux-unix/redirect-output-to-a-file-and-stdout/
https://www.geeksforgeeks.org/linux-unix/dmesg-command-linux-driver-messages/
https://www.geeksforgeeks.org/linux-unix/grep-command-in-unixlinux/
https://www.geeksforgeeks.org/linux-unix/regular-expression-grep/
https://phoenixnap.com/kb/linux-head
https://www.geeksforgeeks.org/linux-unix/tail-command-linux-examples/
https://phoenixnap.com/kb/linux-tail
https://www.geeksforgeeks.org/linux-unix/sort-command-linuxunix-examples/
https://phoenixnap.com/kb/linux-sort
https://www.geeksforgeeks.org/linux-unix/cut-command-linux-examples/

example fruit file to play with it:
