3.2 Searching and Extracting Data from Files
Weight: 3
Description: Search and extract data from files in the home directory.
Key Knowledge Areas:
Command line pipes
I/O redirection
Basic Regular Expressions using ., [ ], *, and ?
The following is a partial list of the used files, terms and utilities:
grep
less
cat, head, tail
sort
cut
wc
cat
The cat command is one of the most frequently used commands on Linux and other Unix-like operating systems. Its name is short for “concatenate”, and it is primarily used to read, display, and concatenate text files.
Primarily used to read and display the contents of files on the terminal.
Can concatenate multiple files and display them as a single continuous output.
Allows users to create new files or append data to existing ones.
Useful for quick file inspection and merging without opening a text editor.
View the Content of a Single File using cat
The most basic use of cat is to display the contents of a file on the terminal. This is done by providing the filename as an argument:
Syntax: cat [OPTION]... [FILE]...
Example: to display a file named output.txt, run cat output.txt.
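The uses of cat listed above can be sketched as follows. This is a minimal illustration, not output from a real system: the contents of output.txt and the second file, more.txt, are made up here, and everything happens in a temporary directory.

```shell
# Work in a throwaway directory so nothing in the home directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create the example file output.txt (its contents are an assumption).
printf 'first line\nsecond line\n' > output.txt

# Display the contents of a single file.
cat output.txt

# Concatenate two files into one continuous stream (more.txt is hypothetical).
printf 'third line\n' > more.txt
combined=$(cat output.txt more.txt)
printf '%s\n' "$combined"
```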
Streams
A stream is a sequence of bytes that is passed from one file, device, or program to another.
In Linux, streams are the fundamental mechanism for input, output, and communication between processes. Because every stream is just a sequence of bytes that can be read from or written to, streams give programs a uniform interface for data transfer across different kinds of input/output operations.
Every process starts with three standard streams:
standard input stream (stdin), which provides input to commands.
standard output stream (stdout), which carries the normal output of commands.
standard error stream (stderr), which carries error messages from commands.
The streams are also numbered by file descriptor: stdin (0), stdout (1), stderr (2).

I/O Redirection
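The shell provides redirection operators for each stream. These can be used to write standard output or standard error to a file, or to read standard input from a file. If you write to a file that does not exist, a new file with that name is created before writing.
Output operators with a single bracket overwrite the destination’s existing contents.
Overwrite
> - standard output
< - standard input (the file is only read, so nothing is overwritten)
2> - standard error
Output operators with a double bracket append to the destination instead of overwriting its existing contents.
Append
>> - standard output
<< - standard input, but not as an append: << starts a here-document, which feeds the lines that follow (up to a chosen delimiter word) to the command's standard input
2>> - standard error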
Examples:
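A minimal sketch of each operator, run in a temporary directory. The file names out.txt and err.txt and the helper function warn are hypothetical; warn exists only to produce a predictable message on standard error.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical helper that writes its argument to standard error.
warn() { echo "warning: $1" >&2; }

# > creates out.txt (or overwrites it) with standard output.
echo "first run" > out.txt
echo "second run" > out.txt      # replaces "first run"

# >> appends to the file instead of overwriting it.
echo "third run" >> out.txt

# 2> writes standard error to a file; 2>> appends to it.
warn "disk almost full" 2> err.txt
warn "disk full" 2>> err.txt

# < feeds a file to a command's standard input.
line_count=$(wc -l < out.txt)

# << starts a here-document: the lines up to EOF become standard input.
heredoc_text=$(cat <<EOF
typed inline
EOF
)

out_contents=$(cat out.txt)
err_contents=$(cat err.txt)
printf '%s\n' "$out_contents" "$err_contents" "$heredoc_text"
```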
Piping with |
A pipe is a form of redirection used in Linux and other Unix-like operating systems to send the output of one command, program, or process to another for further processing. It connects the stdout of one command to the stdin of the next. You create a pipe with the pipe character '|' (found on the same key as the backslash \ on most keyboards).
A pipe can combine two or more commands: the output of one command acts as input to the next, whose output may in turn act as input to a third, and so on. It can be visualized as a temporary connection between commands. The command-line programs that do the further processing are referred to as filters.
This direct connection allows the commands to run simultaneously and lets data flow between them continuously, rather than passing through temporary text files or the display screen.
Pipes are unidirectional, i.e., data flows from left to right through the pipeline.
Any command in a pipeline can have options or arguments. We can also use | to send the output of the second command to a third command, and so on.
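A small sketch of a three-stage pipeline. The input words are invented for the illustration; printf stands in for any command that produces output.

```shell
# printf emits three unsorted lines, sort orders them, head keeps the first two.
first_two=$(printf 'pear\napple\nmango\n' | sort | head -n 2)
printf '%s\n' "$first_two"

# wc -l at the end of a pipeline counts the lines that reach it.
count=$(printf 'pear\napple\nmango\n' | wc -l)
echo "lines: $count"
```

Each command only sees its own standard input and output, which is why the stages can be swapped or extended freely.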
View Kernel Messages in Linux
The dmesg command (also called “driver message” or “display message”) examines the kernel ring buffer and prints the kernel's message buffer. Its output includes the messages produced by device drivers.
Now let's redirect the output of dmesg to the input of the less command by running dmesg | less.
This way we have more control over reading the log, using the paging and search features of less.
Filters
Filters are a class of programs that are commonly used with output piped from another program. Many of them are also useful on their own, but they illustrate piping behavior especially well.
grep - returns the lines of text that match the pattern passed to grep.
head - displays the first few lines of one or more text files directly in the terminal.
tail - displays the last part of a file, showing recent content such as logs or updates.
sort - sorts the lines of a file, arranging the records in a particular order.
wc - counts characters, lines, and words.
grep
grep, short for “global regular expression print”, is one of the most useful tools in Linux and Unix systems. It searches text files for specific words, phrases, or patterns and shows the matching lines on your screen.
The grep command is useful when you need to quickly find certain keywords or phrases in logs or documents. Let’s consider an example:
Search for a word in a file
If you have a file called notes.txt and you want to find all lines containing the word Python, you can use:
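A minimal sketch of that search. The contents of notes.txt are invented here so that the example can be run as-is in a temporary directory.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical notes.txt, created only for this demonstration.
printf 'Python is fun\nI also like Bash\nMore Python notes\n' > notes.txt

# Print every line that contains the word Python.
matches=$(grep "Python" notes.txt)
printf '%s\n' "$matches"
```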
Syntax:
The basic syntax of the `grep` command is as follows:
grep [options] pattern [file...]
[options]: These are command-line flags that modify the behavior of grep.
pattern: This is the regular expression you want to search for.
[file...]: This is the name of the file(s) you want to search within. You can specify multiple files for simultaneous searching.
Commonly Used grep Options
Each entry gives the option, what it does, and an example command.
-i - Case-insensitive search. Example: grep -i "unix" myfile.txt
-c - Displays the count of matching lines. Example: grep -c "unix" myfile.txt
-l - Displays only the names of the files that contain a match. Examples: grep -l "unix" * and grep -l "unix" f1.txt f2.txt f3.txt f4.txt
-w - Checks whole words. By default, grep matches the given string/pattern even if it is found as a substring. The -w option makes it match only whole words. Example: grep -w "unix" myfile.txt
-o - Displays only the matched pattern. By default, grep displays the entire line that contains the matched string; with -o it displays only the matched string. Example: grep -o "unix" myfile.txt
-n - Shows line numbers. Example: grep -n "unix" myfile.txt
-v - Inverts the pattern match: displays the lines that do not match the specified search pattern. Example: grep -v "unix" myfile.txt
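The options above can be tried on a small file. This is a sketch under the assumption that myfile.txt contains the four lines written below; the contents are invented for the demonstration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical myfile.txt used only for this demonstration.
printf 'unix is great\nUnix is old\nlinux is unixlike\nnothing here\n' > myfile.txt

ignore_case=$(grep -i "unix" myfile.txt)   # also matches "Unix"
count=$(grep -c "unix" myfile.txt)         # number of matching lines
numbered=$(grep -n "unix" myfile.txt)      # prefixes each match with its line number
whole_word=$(grep -w "unix" myfile.txt)    # "unixlike" is not a whole-word match
inverted=$(grep -v "unix" myfile.txt)      # lines without lowercase "unix"

printf '%s\n' "$numbered"
```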
Regular Expressions
Regexp (or regex) is short for regular expression. Regular expressions are special characters or sets of characters that help us search data and match complex patterns. They are most commonly used with the Linux commands grep, sed, and vi.
The following are some basic regular expressions:
. - The wildcard character: matches any single character other than a newline.
^ - Matches the start of the line.
$ - Matches the end of the line.
* - Matches zero or more occurrences of the preceding character.
\ - Escapes the following character, so that it is matched literally.
() - Groups several regular expressions together (extended regex).
? - Matches zero or one occurrence of the preceding character (extended regex). In shell filename patterns, by contrast, ? matches exactly one character.
[ ] - Matches any one of a set of characters.
[ - ] - Matches any one of a range of characters.
grep supports regex for advanced searching:
grep "^unix" myfile.txt - matches lines starting with a string.
grep "os$" myfile.txt - matches lines ending with a string.
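Quotes: we need to put our regex between quotes, otherwise it might be interpreted by the shell and give us different results. Double quotes " " work for the patterns used here; single quotes are safer when a pattern contains characters such as $ or a backslash, which the shell still interprets inside double quotes.
To avoid mistakes when using extended regular expressions, use grep with the -E option, which treats the pattern as an extended regular expression (ERE). In each example below the whole input line is printed, because grep prints every line that contains at least one match.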
echo "aa ab ba aaa bbb AB BA" | grep -E "a*b"
aa ab ba aaa bbb AB BA
echo "aa ab ba aaa bbb AB BA" | grep -E "a.b"
aa ab ba aaa bbb AB BA
echo "aa ab ba aaa bbb AB BA" | grep -E "a?b"
aa ab ba aaa bbb AB BA
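Because grep prints whole lines, the three commands above all print the same line. A minimal sketch that adds the -o option (available in GNU and BSD grep, though not required by POSIX) shows which parts of the line each pattern actually matches:

```shell
text="aa ab ba aaa bbb AB BA"

# a*b : zero or more "a" characters followed by "b"
star=$(echo "$text" | grep -oE "a*b")

# a.b : "a", then any single character, then "b"
dot=$(echo "$text" | grep -oE "a.b")

# a?b : an optional "a" followed by "b"
optional=$(echo "$text" | grep -oE "a?b")

printf 'a*b matches:\n%s\n' "$star"
printf 'a.b matches:\n%s\n' "$dot"
printf 'a?b matches:\n%s\n' "$optional"
```

Here a.b matches only "a b" (the space counts as the single character), while a*b and a?b match "ab" and each lone "b".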
egrep
egrep is a pattern-searching command that belongs to the grep family. It works the same way as grep -E: it treats the pattern as an extended regular expression and prints the lines that match. If several files are searched, it also prefixes each matching line with the file name.
Options: Most of the options for this command are the same as for grep.
So instead of grep -E in the examples above, we can simply use egrep. Note that recent GNU grep releases (3.8 and later) treat egrep as obsolescent and print a warning recommending grep -E.
Head and Tail Commands

head
The head command in Linux is used to display the first few lines of one or more text files directly in the terminal.
The head command reads a file and prints the top portion (default is the first 10 lines) to standard output.
It’s commonly used when you want to quickly preview the beginning of a file without opening it in an editor.
It supports options to specify the number of lines or bytes to display.
You can use it with multiple files at once to view the first lines of each.
The basic head command displays the first 10 lines of the sample.txt file:
Example: head sample.txt
Syntax: head [OPTION]... [FILE]...
If no file name is specified, head reads from standard input (stdin).
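A minimal sketch of both behaviors. The 15-line sample.txt is generated with seq purely for the demonstration; its contents are an assumption.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a hypothetical 15-line sample.txt (seq prints 1..15, one per line).
seq 1 15 > sample.txt

# Default: the first 10 lines of the file.
top=$(head sample.txt)
printf '%s\n' "$top"

# With no file name, head reads standard input (here, from a pipe).
piped=$(seq 1 15 | head)
```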
head command common options:
-n, --lines - show the specified number of lines
-c, --bytes - show the specified number of bytes
-v, --verbose - always show the file name header
-q, --quiet - don't separate the content of multiple files with a file name header
example:
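The options can be sketched as follows, assuming GNU coreutils head (the -v and -q options may be missing from other implementations). The files sample.txt and letters.txt are created here only for the demonstration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 15 > sample.txt            # hypothetical sample file: 1..15
printf 'a\nb\n' > letters.txt    # second hypothetical file

first_three=$(head -n 3 sample.txt)   # first 3 lines
first_bytes=$(head -c 4 sample.txt)   # first 4 bytes: "1", newline, "2", newline
with_header=$(head -v -n 1 sample.txt)              # header even for one file
no_headers=$(head -q -n 1 sample.txt letters.txt)   # no headers for several files

printf '%s\n' "$with_header"
```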
tail
The tail command in Linux is used to display the last part of a file, showing recent content such as logs or updates.
By default, it shows the last 10 lines of a file.
Commonly used for monitoring log files and debugging.
You can customize the number of lines displayed using the -n option.
Useful for viewing the most recent entries without opening the entire file.
Without any option, tail displays only the last 10 lines of the specified file.
Syntax: tail [OPTION]... [FILE]...
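A minimal sketch of the default behavior and the -n option, again using a hypothetical 15-line sample.txt generated with seq:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 15 > sample.txt    # hypothetical sample file: 1..15

# Default: the last 10 lines (6 through 15).
last_ten=$(tail sample.txt)

# -n 2: only the last two lines.
last_two=$(tail -n 2 sample.txt)

# Like head, tail reads standard input when no file is given.
piped=$(seq 1 15 | tail -n 1)

printf '%s\n' "$last_two"
```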
tail command common options:
-c, --bytes=[+]NUM - Shows the last NUM bytes of a file. Using + shows the bytes starting from byte NUM of each file.
-f, --follow[={name|descriptor}] - Monitors the file for changes and outputs new data as the file grows. When no value is specified after --follow=, descriptor is used as the default value. This means that the update mode continues to run even when the file is renamed or moved. Specify the --max-unchanged-stats=N argument to reopen a file that has not changed size after N (default 5) iterations, to check whether it has been unlinked or renamed. Specify the --pid=PID argument to exit tail after the process with the process ID PID terminates.
-F - Same as --follow=name --retry. Instructs tail to keep updating the output even if the original file is removed during log rotation and replaced by a new one with the same name.
-n, --lines=[+]NUM - Shows the last NUM lines instead of the default 10. Using -n +NUM causes the output to start with line NUM.
-q, --quiet, --silent - Omits the file names from the output, displaying only the contents.
-s, --sleep-interval=N - Used in combination with -f. Instructs tail to wait for N seconds (default 1) between iterations.
-v, --verbose - Makes tail always print the file name before displaying the contents.
-z, --zero-terminated - Uses NUL as the line delimiter instead of the newline character.
--help - Displays the help text.
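A short sketch of the +NUM form and the byte-oriented option, assuming GNU coreutils tail and a hypothetical 15-line sample.txt. The -f and -F options are left out because they keep running until interrupted.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 15 > sample.txt    # hypothetical sample file: 1..15

# -n +13: start the output at line 13 instead of counting from the end.
from_line_13=$(tail -n +13 sample.txt)

# -c 3: the last three bytes, here "15" plus the final newline.
last_bytes=$(tail -c 3 sample.txt)

# -v: print the file name header even for a single file.
with_header=$(tail -v -n 1 sample.txt)

printf '%s\n' "$from_line_13"
```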
sort
The sort command is a Linux program that prints the lines of its input text files, concatenated together, in sorted order. By default, sort uses the entire line as the sort key; when a key is selected with -k, fields are separated by blank space. It is important to notice that sort does not modify the input files: it only prints the sorted result to standard output, so the sorted data is saved only if you redirect the output (or use the -o option).
Syntax: sort [OPTION]... [FILE]...
Example: sort FILE prints the lines of FILE in sorted order.
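A minimal sketch showing that sort prints a sorted copy and leaves the file alone. The file fruits.txt and its contents are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'pear\napple\nmango\n' > fruits.txt   # hypothetical unsorted input

# Print the lines in sorted order.
sorted=$(sort fruits.txt)
printf '%s\n' "$sorted"

# The input file itself is unchanged: its first line is still "pear".
first_in_file=$(head -n 1 fruits.txt)

# Redirect the output to keep the sorted result.
sort fruits.txt > fruits_sorted.txt
first_in_sorted=$(head -n 1 fruits_sorted.txt)
```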
If a file has lines beginning with both upper-case and lower-case characters, sort may display those with upper case at the top (this is the behavior in the C locale; other locales can order them differently). We can make the comparison case-insensitive using the -f command-line option.
The -n option sorts the contents numerically. We can also sort a file based on its nth column with the -k n option.
Use -r to reverse the result of comparisons. Other options of the sort command are listed below.
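The options just described can be sketched as follows. LC_ALL=C is set on each command as an assumption, to force plain byte ordering so the results do not depend on the system locale; the input data is invented.

```shell
# Default order in the C locale: uppercase sorts before lowercase.
default_order=$(printf 'banana\nCherry\napple\n' | LC_ALL=C sort)

# -f folds lowercase to uppercase before comparing.
folded_order=$(printf 'banana\nCherry\napple\n' | LC_ALL=C sort -f)

# Without -n the numbers are compared as text; with -n by numeric value.
as_text=$(printf '10\n9\n100\n' | LC_ALL=C sort)
as_numbers=$(printf '10\n9\n100\n' | LC_ALL=C sort -n)

# -k 2 sorts on the second column; combined with -n it sorts by quantity.
by_column=$(printf 'banana 3\napple 10\ncherry 2\n' | LC_ALL=C sort -n -k 2)

# -r reverses the result.
reversed=$(printf 'a\nc\nb\n' | LC_ALL=C sort -r)

printf '%s\n' "$by_column"
```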
sort command common options (short option form, long option form - description):
-b, --ignore-leading-blanks - Causes sort to ignore leading blanks.
-d, --dictionary-order - Causes sort to consider only blanks and alphanumeric characters.
-f, --ignore-case - Folds lowercase letters to uppercase before comparison, making the sort case-insensitive.
-M, --month-sort - Sorts lines according to months (Jan-Dec).
-h, --human-numeric-sort - Compares human-readable numbers (e.g., 2K, 1G).
-n, --numeric-sort - Compares data according to string numerical values.
-R, --random-sort - Sorts data by a random hash of keys but groups identical keys together.
-r, --reverse - Reverses the comparison results.
--sort=WORD - Sorts data according to the specified WORD: general-numeric -g, human-numeric -h, month -M, numeric -n, random -R, version -V.
-c, --check, --check=diagnose-first - Checks if the input is already sorted but doesn't sort it.
--debug - Annotates the part of the line used for sorting.
-k, --key=KEYDEF - Sorts data using the specified KEYDEF, which gives the key location and type.
-m, --merge - Causes sort to merge already sorted files.
-o, --output=FILE - Writes the output to FILE instead of printing it to standard output.
-t, --field-separator=SEP - Uses the specified SEP separator instead of the non-blank to blank transition.
-z, --zero-terminated - Causes sort to use NUL as the line delimiter instead of the newline character.
--help - Displays the help text with the full options list and exits.
cut
The cut command is a command-line utility for cutting sections from each line of its input files and writing the result to standard output. It can select parts of a line by byte position, by character position, or by delimiter-separated field.
Syntax: cut OPTION... [FILE]...
Note: If FILE is not specified, cut reads from standard input (stdin).
Cut by byte position: use the -b option.
Cut by character: use the -c option.
Cut based on a delimiter:

To cut using a delimiter, use the -d option. This is normally used in conjunction with the -f option to specify the field that should be cut. For example, cut -d ':' -f 1 /etc/passwd prints the first colon-separated field of each line, which is the user name.
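The three selection modes can be sketched on a single line of text. The record below imitates a colon-separated passwd-style entry and is invented for the illustration.

```shell
record="alice:x:1001:/home/alice"   # hypothetical colon-separated record

# By byte position: bytes 1 through 5.
first_bytes=$(echo "$record" | cut -b 1-5)

# By character: the 7th character only.
seventh_char=$(echo "$record" | cut -c 7)

# By delimiter: fields 1 and 4, using ":" as the separator.
name_and_home=$(echo "$record" | cut -d ':' -f 1,4)

printf '%s\n' "$first_bytes" "$seventh_char" "$name_and_home"
```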
Options Available in cut Command
Here is a list of the most commonly used options with the Linux cut command (option - description):
-b, --bytes=LIST - Selects only the bytes specified in LIST (e.g., -b 1-3,7).
-c, --characters=LIST - Selects only the characters specified in LIST (e.g., -c 1-3,7).
-d, --delimiter=DELIM - Uses DELIM as the field delimiter character instead of the tab character.
-f, --fields=LIST - Selects only the fields specified in LIST, separated by the delimiter character (default is tab).
-n - Used with -b: do not split multi-byte characters. GNU cut accepts this option but ignores it.
--complement - Inverts the selection: prints the bytes, characters, or fields that were not selected.
--output-delimiter=STRING - Uses STRING as the output delimiter between fields instead of the input delimiter.
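A short sketch of the last two options, assuming GNU coreutils cut (both are GNU extensions and may be unavailable elsewhere). The record is the same kind of invented passwd-style line.

```shell
record="alice:x:1001:/home/alice"   # hypothetical colon-separated record

# --complement: print every field except the second one.
without_second=$(echo "$record" | cut -d ':' -f 2 --complement)

# --output-delimiter: join the selected fields with a space instead of ":".
spaced=$(echo "$record" | cut -d ':' -f 1,3 --output-delimiter=' ')

printf '%s\n' "$without_second" "$spaced"
```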
wc
wc (short for word count) is a command-line tool in Unix/Linux operating systems that reports the number of newlines, words, bytes, and characters in the files given as arguments. It writes the counts to standard output and, when several files are named, adds a line with the totals.
When you name a file, wc prints the file name along with the requested counts. If you do not name a file, it reads standard input and prints only the counts.
Example: for a file with 16 lines, 76 words, and 490 bytes, wc prints those three numbers in that order, followed by the file name. Words are whitespace-delimited by default.
Syntax: wc [OPTION]... [FILE]...
If no file is specified, it will read from standard input, meaning you can type text manually or pipe it from another command.
The following are the options provided by the wc command:
wc -l - prints the number of lines in a file.
wc -w - prints the number of words in a file.
wc -c - prints the number of bytes in a file.
wc -m - prints the number of characters in a file.
wc -L - prints only the length of the longest line in a file.
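A minimal sketch of the counting options on a small file. The file words.txt and its contents are hypothetical; reading it through < makes wc print only the number, without the file name.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one two three\nfour five\n' > words.txt   # hypothetical input

# With a file argument, wc prints lines, words, bytes, and the file name.
wc words.txt

# Reading from standard input prints only the counts.
lines=$(wc -l < words.txt)
words=$(wc -w < words.txt)
bytes=$(wc -c < words.txt)

echo "lines=$lines words=$words bytes=$bytes"
```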
That's all.
sources:
https://serveracademy.com/blog/the-linux-cat-command/
https://www.geeksforgeeks.org/linux-unix/cat-command-in-linux-with-examples/
https://www.geeksforgeeks.org/linux-unix/input-output-redirection-in-linux/
https://www.geeksforgeeks.org/linux-unix/redirect-output-to-a-file-and-stdout/
https://www.geeksforgeeks.org/linux-unix/dmesg-command-linux-driver-messages/
https://www.geeksforgeeks.org/linux-unix/grep-command-in-unixlinux/
https://www.geeksforgeeks.org/linux-unix/regular-expression-grep/
https://phoenixnap.com/kb/linux-head
https://www.geeksforgeeks.org/linux-unix/tail-command-linux-examples/
https://phoenixnap.com/kb/linux-tail
https://www.geeksforgeeks.org/linux-unix/sort-command-linuxunix-examples/
https://phoenixnap.com/kb/linux-sort
https://www.geeksforgeeks.org/linux-unix/cut-command-linux-examples/
example fruit file to play with it:
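One possible stand-in for such a file, with a few pipelines that combine the commands from this section. The file name fruits.txt, its three columns (name, color, quantity), and every value in it are assumptions made up for practice.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical practice file: name, color, quantity.
cat > fruits.txt <<'EOF'
apple red 3
banana yellow 12
cherry red 25
mango orange 7
EOF

# How many lines mention "red"?
red_count=$(grep -c "red" fruits.txt)

# Which fruit has the largest quantity (third column, numeric, descending)?
most_stocked=$(sort -n -r -k 3 fruits.txt | head -n 1)

# The fruit names only, in reverse alphabetical order, first two.
names=$(cut -d ' ' -f 1 fruits.txt | sort -r | head -n 2)

# Total number of lines in the file.
total=$(wc -l < fruits.txt)

echo "red=$red_count top=$most_stocked total=$total"
```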
