103.7. Search text files using regular expressions

Weight: 2

Description: Candidates should be able to manipulate files and text data using regular expressions. This objective includes creating simple regular expressions containing several notational elements. It also includes using regular expression tools to perform searches through a filesystem or file content.

Key Knowledge Areas:

  • Create simple regular expressions containing several notational elements

  • Use regular expression tools to perform searches through a filesystem or file content

Terms and Utilities:

  • grep

  • egrep

  • fgrep

  • sed

  • regex(7)

When working with text files, we often need to find some text inside a file, and what we are looking for is not always one exact string. For example, we might be looking for "linux" or "Linux" or some other variant. That is where regular expressions come in handy. (This is the most important of the lightweight sections!)

Regular Expression

Regular expressions (regex) are used when we want to search for specific lines of text containing a particular pattern. Regex can be used in a variety of programs such as grep, sed, vi, bash, rename and many more. Here we will use regex with the grep command.

A regex pattern is interpreted by a regular expression engine, which translates the pattern and performs the matching.

Linux has two regular expression engines:

  • The Basic Regular Expression (BRE) engine.

  • The Extended Regular Expression (ERE) engine.

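There is a difference between basic and extended regular expressions. Some utilities are written to support only basic regular expressions (BRE), while other utilities support extended regular expressions (ERE) as well. Most Linux programs work with the BRE syntax by default. With GNU grep, there is no difference in available functionality between the two; only the notation differs. In BRE, the metacharacters ?, +, {, |, ( and ) must be preceded by a backslash to get their special meaning, while in ERE they are special as written.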
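As a small sketch of that notational difference, the same pattern can be written in both syntaxes. It assumes a grep that supports the -o option (print only the matched text), as GNU grep does; the sample text and variable names are made up for illustration.

```shell
# Sample text (made up for this sketch).
text='ab abab (ab){2}'

# BRE: grouping and repetition need backslashes.
bre=$(printf '%s\n' "$text" | grep -o '\(ab\)\{2\}')

# ERE (-E): the same operators are written without backslashes.
ere=$(printf '%s\n' "$text" | grep -E -o '(ab){2}')

# BRE without backslashes: ( ) { } are ordinary characters,
# so this finds the literal text "(ab){2}".
literal=$(printf '%s\n' "$text" | grep -o '(ab){2}')

printf 'BRE: %s\nERE: %s\nunescaped BRE: %s\n' "$bre" "$ere" "$literal"
```

The first two commands find the same text; only the third treats the parentheses and braces as plain characters.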

What makes up regular expressions

There are two types of characters to be found in regular expressions:

  • literal characters

  • metacharacters

Literal characters are standard characters that make up our strings. Every character in this sentence is a literal character. You could use a regular expression to search for each literal character in that string.

Meta characters are a different beast altogether; they are what give regular expressions their power. With meta characters, we can do much more than searching for a single character. Meta characters allow us to search for combinations of strings and much more.

grep

grep stands for "global regular expression print", after the ed editor command g/re/p. The grep filter searches a file for a particular pattern of characters and displays all lines that contain that pattern. grep is a utility that benefits a lot from regular expressions.

Let's see some examples:

command                            output
echo "linux is my os" | grep l     linux is my os
echo "linux is my os" | grep i     linux is my os
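With more than one input line it becomes visible that grep prints only the lines that contain the pattern. The following is a minimal sketch; the three sample lines and the variable names are invented for illustration, and -i (ignore case) is used to catch both "linux" and "Linux".

```shell
# Three sample lines (made up for this sketch).
lines=$(printf '%s\n' 'linux is my os' 'Linux is a kernel' 'windows is an os')

# Only lines containing the pattern are printed; matching is case-sensitive.
exact=$(printf '%s\n' "$lines" | grep 'linux')

# -i ignores case, and -c counts matching lines instead of printing them.
any_case=$(printf '%s\n' "$lines" | grep -i -c 'linux')

printf '%s\n' "$exact"
printf 'lines matching with -i: %s\n' "$any_case"
```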

Concatenation

Concatenating two regular expressions creates a longer expression.

command                                      output
echo "aa ab ba aaa bbb AB BA" | grep a       aa ab ba aaa bbb AB BA
echo "aa ab ba aaa bbb AB BA" | grep ab      aa ab ba aaa bbb AB BA
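Because grep prints the whole line, both commands above produce the same output. A sketch using -o (print only the matched parts, one per line; available in GNU grep, assumed here) shows what each pattern actually matched. The variable names are illustrative.

```shell
text='aa ab ba aaa bbb AB BA'

# Count how many times the single-character pattern "a" matches.
single=$(printf '%s\n' "$text" | grep -o 'a' | wc -l | tr -d ' ')

# The concatenated pattern "ab" only matches where a is directly followed by b.
pair=$(printf '%s\n' "$text" | grep -o 'ab')

printf 'matches of a: %s\nmatches of ab: %s\n' "$single" "$pair"
```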

Extended Regular Expressions

Repetition

  • The * means preceding item will be matched 0 or more times.

  • The + means preceding item will be matched 1 or more times.

  • The ? means preceding item will be matched 0 or 1 times.

Globbing and Regex: So Similar, So Different

Beginners sometimes confuse wildcards (globbing) with regular expressions when using grep, but they are not the same. Wildcards are a feature provided by the shell to expand file names, whereas regular expressions are a text filtering mechanism intended for use with utilities like grep, sed and awk.

Special Character    Meaning in Globs            Meaning in Regex
*                    zero or more characters     zero or more of the character it follows
?                    exactly one character       zero or one of the character it follows
.                    literal "." character       any single character
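The difference can be tried out in a scratch directory. This is only a sketch: the three file names are hypothetical, and mktemp -d is assumed to be available.

```shell
# Scratch directory with three hypothetical file names.
dir=$(mktemp -d)
touch "$dir/a.txt" "$dir/ab.txt" "$dir/b.txt"

# Glob: the shell expands a*.txt to names that start with "a" and end in ".txt".
glob_count=$(cd "$dir" && ls a*.txt | wc -l | tr -d ' ')

# Regex: a* means "zero or more a" and . means "any character",
# so the same text used as a pattern also matches b.txt.
regex_count=$(ls "$dir" | grep -c 'a*.txt')

rm -r "$dir"
printf 'glob: %s files, regex: %s files\n' "$glob_count" "$regex_count"
```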

To avoid mistakes while using extended regular expressions, use grep with the -E option, which treats the pattern as an extended regular expression (ERE).

Quoting: we also need to put our extended regex between quotes, otherwise the shell might interpret characters such as * or | itself and give us different results. The examples here use double quotes " "; single quotes ' ' are safer still, because the shell performs no expansion inside them.

command                                             output
echo "aa ab ba aaa bbb AB BA" | grep -E "a*b"       aa ab ba aaa bbb AB BA
echo "aa ab ba aaa bbb AB BA" | grep -E "a+b"       aa ab ba aaa bbb AB BA
echo "aa ab ba aaa bbb AB BA" | grep -E "a?b"       aa ab ba aaa bbb AB BA
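All three commands print the same line, so the difference between the operators is easier to see by listing the matched parts. The sketch below assumes GNU grep's -o option; the helper name matches is hypothetical and simply joins the matches with commas.

```shell
text='aa ab ba aaa bbb AB BA'

# Helper (hypothetical name): list what an ERE matches, comma-separated.
matches() { printf '%s\n' "$text" | grep -E -o "$1" | paste -s -d, -; }

star=$(matches 'a*b')      # zero or more a, then b
plus=$(matches 'a+b')      # one or more a, then b
optional=$(matches 'a?b')  # zero or one a, then b

printf 'a*b: %s\na+b: %s\na?b: %s\n' "$star" "$plus" "$optional"
```

Only a+b insists on at least one a, which is why it skips every lone b that the other two patterns accept.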

This is where egrep comes into play; it behaves like grep -E:

egrep

Curly Braces

Curly braces enable us to specify the number of existence for a pattern, it has three formats:

  • {N} means preceding item is matched exactly N times.

  • {N,} means preceding item is matched N or more times.

  • {N,M} means preceding item is matched at least N times, but not more than M times.

command                                                 output
echo "ab aab aaab aaaab ba Ab" | egrep "a{1}b"          ab aab aaab aaaab ba Ab
echo "ab aab aaab aaaab ba Ab" | egrep "a{2,}b"         ab aab aaab aaaab ba Ab
echo "ab aab aaab aaaab ba Ab" | egrep "a{1,3}b"        ab aab aaab aaaab ba Ab
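To see which parts of the line each interval expression picks, the matches can be listed on their own. This sketch uses grep -E (equivalent to egrep) with GNU grep's -o option; the helper name matches is hypothetical.

```shell
text='ab aab aaab aaaab ba Ab'

# Helper (hypothetical name): list what an ERE matches, comma-separated.
matches() { printf '%s\n' "$text" | grep -E -o "$1" | paste -s -d, -; }

exactly_one=$(matches 'a{1}b')     # exactly one a before b
two_or_more=$(matches 'a{2,}b')    # at least two a before b
one_to_three=$(matches 'a{1,3}b')  # between one and three a before b

printf '%s\n%s\n%s\n' "$exactly_one" "$two_or_more" "$one_to_three"
```

Note that a{1}b still finds "ab" inside longer runs such as "aaab", because nothing in the pattern forbids extra characters before the match.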

Any Characters

The . (dot) is a metacharacter that stands for any character (except newline \n).

  • . matches any single character.

One of the most commonly used patterns is .*, which matches a string of arbitrary length containing any characters (or no characters at all).

command                                             output
echo "ac abc aaabbbccc abcz" | egrep "a.*c"         ac abc aaabbbccc abcz
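The .* pattern is greedy: it takes as much of the line as it can. A sketch with GNU grep's -o option makes that visible; the helper name matches is hypothetical.

```shell
text='ac abc aaabbbccc abcz'

# Helper (hypothetical name): list what an ERE matches, comma-separated.
matches() { printf '%s\n' "$text" | grep -E -o "$1" | paste -s -d, -; }

one_char=$(matches 'a.c')   # a, exactly one arbitrary character, then c
greedy=$(matches 'a.*c')    # a, anything of any length, then c

printf 'a.c:  %s\na.*c: %s\n' "$one_char" "$greedy"
```

Here a.*c produces a single match that runs from the first a to the last c on the line, leaving only the trailing z unmatched.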

Anchor Characters

Anchor characters, or line-positioning characters, are used to locate the beginning or end of a line in a text:

  • ^ matches the beginning of the line.

  • $ matches the end of the line.
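A minimal sketch of both anchors, using four invented sample lines; the variable names are illustrative only.

```shell
# Sample lines (made up for this sketch).
lines=$(printf '%s\n' 'linux is my os' 'my os is linux' 'i like linux a lot' 'linux')

# ^ anchors the pattern to the start of the line.
starts=$(printf '%s\n' "$lines" | grep '^linux' | paste -s -d, -)

# $ anchors the pattern to the end of the line.
ends=$(printf '%s\n' "$lines" | grep 'linux$' | paste -s -d, -)

# Both anchors together match lines that consist of the pattern only.
whole=$(printf '%s\n' "$lines" | grep '^linux$')

printf 'starts: %s\nends: %s\nwhole: %s\n' "$starts" "$ends" "$whole"
```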

Alternation (multiple strings)

The alternation operator (|) matches either the preceding or the following expression. For example:

  • cat|dog matches cat or dog.

command                                             output
echo "cat dog was a cartoon" | egrep "cat|dog"      cat dog was a cartoon
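Listing only the matched parts shows that each alternative is found separately. This is a sketch that assumes GNU grep's -o option, and it uses grep -E, which is equivalent to egrep.

```shell
text='cat dog was a cartoon'

# Each occurrence of either alternative is reported as its own match.
either=$(printf '%s\n' "$text" | grep -E -o 'cat|dog' | paste -s -d, -)

printf 'matched: %s\n' "$either"
```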

Character Classes

We can match any character with the dot special character, but what if we want to match only a set of characters? We can use a character class. A character class matches if any one of its characters is found at that position. It is defined using square brackets [ ].

Bracket expression

By placing a group of characters within brackets ("[" and "]"), we can specify that the character at that position can be any one character found within the bracket group.

  • [set_of_characters] Matches any single character in set_of_characters. By default, the match is case-sensitive.

command                                               output
echo "cat dog was a cartoon" | egrep "cart[o]*"       cat dog was a cartoon
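A bracket expression with more than one character shows the idea more clearly. The sketch below assumes GNU grep's -o option; the helper name matches and the second pattern are added for illustration.

```shell
text='cat dog was a cartoon'

# Helper (hypothetical name): list what an ERE matches, comma-separated.
matches() { printf '%s\n' "$text" | grep -E -o "$1" | paste -s -d, -; }

run_of_o=$(matches 'cart[o]*')  # "cart" followed by any number of o
r_or_t=$(matches 'ca[rt]')      # "ca" followed by either r or t

printf 'cart[o]*: %s\nca[rt]: %s\n' "$run_of_o" "$r_or_t"
```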

Negating Character Classes

What about searching for a character that is not in the character class? To achieve that, place a caret ^ immediately after the opening bracket.

  • [^set_of_characters] Negation: Matches any single character that is not in set_of_characters. By default, the match is case sensitive.

command                                                  output
echo "cat dog was a cartoon" | grep ".*[^cartoon]"       cat dog was a cartoon
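Note that [^cartoon] does not mean "not the word cartoon"; it means one character that is none of c, a, r, t, o, n. The following sketch (GNU grep -o assumed, helper name hypothetical) lists the matched text to make this visible.

```shell
text='cat dog was a cartoon'

# Helper (hypothetical name): list what a regex matches, comma-separated.
matches() { printf '%s\n' "$text" | grep -o "$1" | paste -s -d, -; }

not_t=$(matches 'ca[^t]')           # "ca" followed by anything except t
up_to_last=$(matches '.*[^cartoon]') # longest stretch ending in a char outside the set

printf '[%s]\n[%s]\n' "$not_t" "$up_to_last"
```

The second match stops at the space before "cartoon", since every letter after it belongs to the excluded set.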

Range expression

To specify a range of characters, we can use the (-) hyphen symbol, such as 0-9 for digits. Note that ranges are locale dependent.

command                                             output
echo "a12345a a54321a 123" | egrep "[a-z]"          a12345a a54321a 123
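A short sketch of a few ranges follows. Since ranges are locale dependent, it pins the locale with LC_ALL=C so the ranges behave as plain ASCII ranges; that choice, the helper name matches and GNU grep's -o option are assumptions of this example.

```shell
text='a12345a a54321a 123'

# Helper (hypothetical name): list what an ERE matches, comma-separated.
# LC_ALL=C pins the locale so that ranges mean plain ASCII ranges.
matches() { printf '%s\n' "$text" | LC_ALL=C grep -E -o "$1" | paste -s -d, -; }

letters=$(matches '[a-z]')   # single lowercase letters
digits=$(matches '[0-9]+')   # runs of digits
middle=$(matches '[2-4]+')   # runs made only of 2, 3 and 4

printf '%s\n%s\n%s\n' "$letters" "$digits" "$middle"
```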

Special Character Classes (Named classes)

Several named classes provide a convenient shorthand for commonly used classes. Named classes open with [: and close with :] and may be used within bracket expressions. Some examples:

  • [[:alnum:]] Alphanumeric characters.

  • [[:alpha:]] Alphabetic characters.

  • [[:blank:]] Space and tab characters.

  • [[:digit:]] The digits 0 through 9 (equivalent to 0-9).

  • [[:upper:]] and [[:lower:]] Upper and lower case letters, respectively.

command                                                       output
echo "a12345a a54321a 123 678 9" | egrep "[[:alpha:]]"        a12345a a54321a 123 678 9
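The named classes can be combined with repetition and with each other inside one bracket expression. This is a sketch under the same assumptions as before: GNU grep's -o option and a hypothetical helper named matches.

```shell
text='a12345a a54321a 123 678 9'

# Helper (hypothetical name): list what an ERE matches, comma-separated.
matches() { printf '%s\n' "$text" | grep -E -o "$1" | paste -s -d, -; }

alpha=$(matches '[[:alpha:]]')            # single letters
numbers=$(matches '[[:digit:]]+')         # runs of digits
words=$(matches '[[:alpha:][:digit:]]+')  # two classes in one bracket expression

printf '%s\n%s\n%s\n' "$alpha" "$numbers" "$words"
```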

Group expressions

By placing part of a regular expression inside parentheses, we can group that part of the regular expression together.

  • () Groups regular expressions

command                                                      output
echo "a12345a a54321a 123 678 9" | egrep "(1a.*)|(9)"        a12345a a54321a 123 678 9
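Grouping matters most when a repetition operator follows the group. The sketch below compares a grouped and an ungrouped pattern on an invented sample string; GNU grep's -o option is assumed and the helper name matches is hypothetical.

```shell
# Sample text (made up for this sketch).
text='ab abab ababab abb'

# Helper (hypothetical name): list what an ERE matches, comma-separated.
matches() { printf '%s\n' "$text" | grep -E -o "$1" | paste -s -d, -; }

grouped=$(matches '(ab)+')  # + applies to the whole group "ab"
ungrouped=$(matches 'ab+')  # + applies only to the b

printf '(ab)+: %s\nab+:   %s\n' "$grouped" "$ungrouped"
```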

Special Characters

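Regex patterns use some special characters. We cannot include them in a pattern as ordinary text, because regex gives them a special meaning and we will not get the expected result. These special characters are recognized by regex:

.*[]^${}\+?|()

So how can we search for them inside a text? One way is to precede each special character with a backslash (\), which makes it literal. When a pattern contains many of them, that is where fgrep comes into play.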

fgrep

If you don't need the power of regex, its special characters can be very frustrating. This is especially true if you're actually looking for some of those special characters in a bunch of text. You can use the fgrep command (or grep -F, which is the same thing) to skip regex interpretation altogether. Using fgrep, you'll search for exactly what you type, even if it contains special characters.
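A minimal sketch of the difference, comparing a plain regex, a backslash-escaped regex and a fixed-string search. The sample lines and variable names are invented; grep -F is used because it is the same thing as fgrep.

```shell
# Sample lines (made up for this sketch).
lines=$(printf '%s\n' 'version 3.14' 'version 3x14' 'a.*b' 'aXYZb')

# As a regex, . matches any character, so both version lines match.
regex_hits=$(printf '%s\n' "$lines" | grep -c '3.14')

# A backslash makes the dot literal.
escaped_hits=$(printf '%s\n' "$lines" | grep -c '3\.14')

# As a regex, a.*b matches any line with an a followed later by a b.
as_regex=$(printf '%s\n' "$lines" | grep 'a.*b' | paste -s -d, -)

# With -F the pattern is a fixed string: only the literal text a.*b matches.
as_fixed=$(printf '%s\n' "$lines" | grep -F 'a.*b')

printf '%s %s\n%s\n%s\n' "$regex_hits" "$escaped_hits" "$as_regex" "$as_fixed"
```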

sed

The sed command, or stream editor, is a very powerful utility offered by Linux/Unix systems. It is mainly used for text substitution (find and replace), but it can also perform other text manipulations such as insertion, deletion and search.

With sed, we can edit complete files without actually having to open them. sed also supports regular expressions, which makes it an even more powerful text manipulation tool.

We have previously used sed for text substitution; here we want to use regex with it.

-r, --regexp-extended (also available as -E) tells sed to treat the patterns in the script as extended regular expressions (ERE). Without it, sed uses basic regular expressions (BRE).
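As a sketch of extended regular expressions in a substitution, the commands below use -E, which GNU sed treats the same as -r. The sample strings and variable names are made up for illustration.

```shell
text='cat dog was a cartoon'

# ERE alternation inside a substitution; g replaces every occurrence.
pets=$(printf '%s\n' "$text" | sed -E 's/cat|dog/pet/g')

# ERE groups can be referred to in the replacement as \1, \2, ...
swapped=$(printf '%s\n' 'hello world' | sed -E 's/([a-z]+) ([a-z]+)/\2 \1/')

printf '%s\n%s\n' "$pets" "$swapped"
```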

By default, sed prints every line. If it makes a substitution, the new text is printed instead of the old one. With the -n option ("sed -n"), sed does not print anything unless it is explicitly told to.

When the "-n" option is used, the "p" flag of the substitute command causes just the modified lines to be printed.
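The three behaviours can be compared side by side. This is a minimal sketch with three invented input lines; the variable names are illustrative.

```shell
# Three sample lines (made up for this sketch).
lines=$(printf '%s\n' one two three)

# Default: every line is printed, with the substitution applied where it matches.
default_out=$(printf '%s\n' "$lines" | sed 's/two/2/' | paste -s -d, -)

# -n alone: nothing is printed at all.
quiet_out=$(printf '%s\n' "$lines" | sed -n 's/two/2/')

# -n plus the p flag: only the lines that were changed are printed.
changed_only=$(printf '%s\n' "$lines" | sed -n 's/two/2/p')

printf 'default: %s\nquiet: [%s]\nchanged: %s\n' "$default_out" "$quiet_out" "$changed_only"
```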

sed is a very powerful and complicated tool; we have just taken a quick look at it!


