WORKING WITH grep, sed, AND awk Pocket Primer: A Quick Guide to Mastering Powerful Command Line Tools
Ebook, 432 pages, 3 hours

About this ebook

This book introduces readers to three powerful command-line utilities—grep, sed, and awk—that can create simple yet powerful shell scripts. Using the bash shell, it focuses on small text files to help readers understand these tools. Grep searches for patterns in data, sed modifies data, and awk performs tasks on pattern matches. Aimed at those new to the bash environment, the book is also valuable for those with some experience.
The journey starts with grep, teaching how to search for specific words or patterns in data. It then moves to sed, showing how to change or modify data efficiently. Finally, it delves into awk, a versatile programming language for searching and processing data files. The book also includes a chapter on using regular expressions with these tools, enhancing your scripting capabilities.
Mastering these utilities is crucial for efficient data handling and automation in a bash environment. This book transitions readers from basic to advanced command-line skills, blending theory with practical examples. It is an essential resource for anyone looking to harness the full power of bash scripting.

Language: English
Release date: Aug 14, 2024
ISBN: 9781836641469
    Book preview

    WORKING WITH grep, sed, AND awk Pocket Primer - Mercury Learning and Information

    PREFACE

    WHAT IS THE GOAL?

    The goal of this book is to introduce readers to three powerful command line utilities that can be combined to create simple yet powerful shell scripts for performing a multitude of tasks. The code samples and scripts use the bash shell, and typically involve small text files, so you can focus on understanding the features of grep, sed, and awk. Aimed at a reader new to working in a bash environment, the book is comprehensive enough to be a good reference and teaches new tricks to those who already have some experience with these command line utilities.

    This book takes introductory concepts and demonstrates their use in simple yet powerful shell scripts. Keep in mind that this book does not cover pure system administration functionality.

    IS THIS BOOK FOR ME AND WHAT WILL I LEARN?

    This book is intended for general users as well as anyone who wants to perform a variety of tasks from the command line.

    You will acquire an understanding of how to use grep, sed, and awk, whose functionality is discussed in the first five chapters. Specifically, Chapter 1 introduces the grep command, Chapter 2 introduces the sed command, and Chapters 3 through 5 discuss the awk command. The sixth and final chapter introduces you to regular expressions.

    This book saves you the time required to search for relevant code samples and adapt them to your specific needs, which is a potentially time-consuming process.

    HOW WERE THE CODE SAMPLES CREATED?

    The code samples in this book were created and tested using bash on a MacBook Pro with OS X 10.15.7 (macOS Catalina). Regarding their content: the code samples are derived primarily from scripts prepared by the author, and in some cases, there are code samples that incorporate short sections of code from discussions in online forums. The key point to remember is that the code samples follow the Four Cs: they must be Clear, Concise, Complete, and Correct to the extent that it is possible to do so, given the size of this book.

    WHAT YOU NEED TO KNOW FOR THIS BOOK

    You need some familiarity with working from the command line in a Unix-like environment. However, there are subjective prerequisites, such as a desire to learn shell programming, along with the motivation and discipline to read and understand the code samples. In any case, if you’re not sure whether or not you can absorb the material in this book, glance through the code samples to get a feel for the level of complexity.

    HOW DO I SET UP A COMMAND SHELL?

    If you are a Mac user, there are three ways to do so. The first method is to use Finder to navigate to Applications > Utilities and then double-click the Terminal application. The second method applies if you already have a command shell available: you can launch a new command shell by typing the following command:

    open /Applications/Utilities/Terminal.app

    The third method is to open a new command shell from a command shell that is already visible by pressing command+n in that command shell, and your Mac will launch another command shell.

    If you are a PC user, you can install Cygwin (open source, https://cygwin.com/), which provides bash and common Unix commands on Windows, or use another toolkit such as MKS (a commercial product). Please read the online documentation that describes the download and installation process.

    If you use RStudio, you need to launch a command shell inside of RStudio by navigating to Tools > Command Line, and then you can launch bash commands. Note that custom aliases are not automatically set if they are defined in a file other than the main start-up file (such as .bash_login).

    WHAT ARE THE NEXT STEPS AFTER FINISHING THIS BOOK?

    The answer to this question varies widely, mainly because it depends heavily on your objectives. The best answer is to try out a new tool or technique from the book on a problem or task you care about, professionally or personally. Precisely what that might be depends on who you are, as the needs of a data scientist, manager, student, or developer are all different. In addition, keep what you learned in mind as you tackle new data cleaning or manipulation challenges. Sometimes knowing that a technique is possible will make finding a solution easier, even if you have to re-read the section to remember exactly how the syntax works.

    If you have reached the limits of what you have learned here and want to get further technical depth on these commands, there is a wide variety of literature published and online resources describing the bash shell, Unix programming, and the grep, sed, and awk commands.

    CHAPTER 1

    Working with GREP

    This chapter introduces you to the versatile grep command that can process an input text stream to generate a desired output text stream. This command also works well with other Unix commands. This chapter contains many short code samples that illustrate various options of the grep command.

    The first part of this chapter introduces the grep command used in isolation, in conjunction with meta characters (such as ^, $, and so forth), and with code snippets that illustrate how to use some of the options of the grep command. Next, you will learn how to match ranges of lines, how to use the back references in grep, and how to escape meta characters in grep.

    The second part of this chapter shows you how to use the grep command to find empty lines and common lines in datasets, as well as the use of keys to match rows in datasets. Next, you will learn how to use character classes with the grep command, as well as the backslash (\) character, and how to specify multiple matching patterns. You will learn how to combine the grep command with the find command and the xargs command, which is useful for matching a pattern in files that reside in different directories. This section contains some examples of common mistakes that people make with the grep command.

    The third section briefly discusses the egrep command and the fgrep command, which are related commands that provide additional functionality that is unavailable in the standard grep utility. The fourth section contains a use case that illustrates how to use the grep command to find matching lines that are then merged to create a new dataset.

    What is the grep Command?

    The grep (Global Regular Expression Print) command is useful for finding strings in one or more files. Several examples are here:

    grep abc *sh displays all the lines that contain abc in files with suffix sh.

    grep -i abc *sh is the same as the preceding query, but case-insensitive.

    grep -l abc *sh displays the names of all the files with suffix sh that contain abc.

    grep -n abc *sh displays the lines that contain the string abc in files with suffix sh, each prefixed with its line number.

    You can perform logical AND and logical OR operations with this syntax:

    grep abc *sh | grep def matches lines containing abc AND def.

    grep 'abc\|def' *sh matches lines containing abc OR def (the quotes keep the shell from removing the backslash before grep sees the pattern).
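
    The AND and OR forms above can be tried on a small file. The following is a minimal sketch, assuming a hypothetical file demo.sh whose four lines are invented for illustration; it uses the -E option for the OR case, where the | operator needs no backslash:

```shell
#!/usr/bin/env bash
# Hypothetical sample data: demo.sh and its contents are assumptions.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

printf '%s\n' 'abc only' 'def only' 'abc and def' 'neither' > demo.sh

# Logical AND: the second grep filters the lines kept by the first.
and_lines=$(grep abc demo.sh | grep def)

# Logical OR: -E selects extended syntax, where | means "either pattern".
or_lines=$(grep -E 'abc|def' demo.sh)

printf 'AND:\n%s\n' "$and_lines"
printf 'OR:\n%s\n' "$or_lines"
```

    The AND form keeps only the line that contains both strings, while the OR form keeps every line that contains at least one of them.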

    You can combine switches as well: the following command displays the names of the files that contain the string abc (case insensitive):

    grep -il abc *sh

    In other words, the preceding command lists the files whose contents include abc, Abc, ABc, ABC, abC, and so forth.

    Another (less efficient) way to display the lines containing abc (case insensitive) is here:

    cat file1 | grep -i abc

    The preceding command involves two processes, whereas passing the filename directly to grep (grep -i abc file1) involves a single process. The execution time is roughly the same for small text files, but the difference can become more significant if you are working with multiple large text files.

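    You can combine the sort command, the pipe symbol, and the grep command. For example, the following command displays the files with a Sep date in increasing size (the -k5,5n option sorts numerically on the fifth column of the ls -l output, which holds the file size):

    ls -l | grep Sep | sort -k5,5n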

    A sample output from the preceding command is here:

    -rw-r--r--  1 oswaldcampesato2  staff       3 Sep 27  2022 abc.txt

    -rw-r--r--  1 oswaldcampesato2  staff       6 Sep 21  2022 control1.txt

    -rw-r--r--  1 oswaldcampesato2  staff      27 Sep 28  2022 fiblist.txt

    -rw-r--r--  1 oswaldcampesato2  staff      28 Sep 14  2022 dest

    -rw-r--r--  1 oswaldcampesato2  staff      36 Sep 14  2022 source

    -rw-r--r--  1 oswaldcampesato2  staff     195 Sep 28  2022 Divisors.py

    -rw-r--r--  1 oswaldcampesato2  staff     267 Sep 28  2022 Divisors2.py
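
    A listing like the one above can be reproduced in a scratch directory. The following is a minimal sketch, assuming three hypothetical files of 27, 3, and 6 bytes; it filters on the txt suffix rather than on a month (the files are created at run time, so their dates vary) and sorts explicitly on the size column:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Hypothetical files of 27, 3, and 6 bytes (names and sizes are assumptions).
printf '%027d' 0 > big.txt
printf 'abc'     > small.txt
printf 'abcdef'  > medium.txt

# Column 5 of "ls -l" holds the size in bytes; -k5,5n sorts numerically on it.
# awk prints the last field, which is the filename.
by_size=$(ls -l | grep 'txt$' | sort -k5,5n | awk '{print $NF}')
printf '%s\n' "$by_size"
```

    Naming the sort key makes the ordering independent of how the other columns happen to compare.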

    Meta Characters and the grep Command

    The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any meta-character with special meaning may be quoted by preceding it with a backslash.

    A regular expression may be followed by one of several repetition operators, as shown here:

    . matches any single character.

    ? indicates that the preceding item is optional and will be matched at most once: Z? matches the empty string or Z.

    * indicates that the preceding item will be matched zero or more times: Z* matches the empty string, Z, ZZ, ZZZ, and so forth.

    + indicates that the preceding item will be matched one or more times: Z+ matches Z, ZZ, ZZZ, and so forth.

    {n} indicates that the preceding item is matched exactly n times: Z{3} matches ZZZ.

    {n,} indicates that the preceding item is matched n or more times: Z{3,} matches ZZZ, ZZZZ, and so forth.

    {,m} indicates that the preceding item is matched at most m times: Z{,3} matches the empty string, Z, ZZ, and ZZZ.

    {n,m} indicates that the preceding item is matched at least n times, but not more than m times: Z{2,4} matches ZZ, ZZZ, and ZZZZ.

    The empty regular expression matches the empty string (i.e., a line in the input stream with no data). Two regular expressions may be joined by the infix operator (|). When used in this manner, the infix operator behaves exactly like a logical OR statement, which directs the grep command to return any line that matches either regular expression.
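
    The repetition operators can be checked one at a time. The following is a minimal sketch, assuming a hypothetical helper function named whole_match; it uses grep -E so that ?, +, and {n,m} need no backslash, and -x so that the pattern must match the entire string rather than a substring:

```shell
#!/usr/bin/env bash
# whole_match is a hypothetical helper: it reports whether an extended
# regular expression ($1) matches an entire string ($2).
# -E selects extended syntax, -x requires a whole-line match, -q is quiet.
whole_match() {
  if printf '%s\n' "$2" | grep -Eqx -- "$1"; then
    echo yes
  else
    echo no
  fi
}

whole_match 'Z?'     ''       # zero occurrences are allowed
whole_match 'Z?'     'ZZ'     # two occurrences exceed "at most once"
whole_match 'Z*'     ''       # zero or more
whole_match 'Z+'     'Z'      # one or more
whole_match 'Z+'     ''       # at least one is required
whole_match 'Z{3}'   'ZZZ'    # exactly three
whole_match 'Z{3,}'  'ZZZZ'   # three or more
whole_match 'Z{2,4}' 'ZZZZZ'  # five exceeds the upper bound
```

    Without the -x option, grep reports a match whenever any part of the line satisfies the pattern, which is why whole-line matching is used here to show the bounds.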

    Escaping Meta Characters with the grep Command

    Listing 1.1 displays the content of lines.txt that contains lines with words and metacharacters.

    Listing 1.1: lines.txt

    abcd

    ab

    abc

    cd

    defg

    .*.

    ..

    The following grep command lists the lines of length 2 (using the ^ to begin and $ to end, with operators to restrict the length) in lines.txt:

    grep '^..$' lines.txt

    The following command lists the lines of length two in lines.txt that contain two dots (the backslash tells grep to interpret the dots as actual dots, not as metacharacters):

    grep '^\.\.$' lines.txt

    The result is shown here:

    ..

    The following command displays the lines that consist entirely of one or more dots; in lines.txt, only the line containing two dots matches. Note that the * applies to the escaped dot that precedes it, so \.* matches zero or more literal dots, and the * is used as a metacharacter because it is not preceded by a backslash:

    grep '^\.*\.$' lines.txt

    The following command lists the lines that consist of a period, followed by an asterisk, and then another period (the * is now a character that must be matched because it is preceded by a backslash):

    grep '^\.\*\.$' lines.txt
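
    The four commands in this section can be run together against Listing 1.1. The following is a minimal sketch that recreates lines.txt in a scratch directory (assuming the listing has one entry per line and no blank lines) and stores each result in a variable whose name is chosen here for illustration:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Recreate lines.txt from Listing 1.1.
printf '%s\n' 'abcd' 'ab' 'abc' 'cd' 'defg' '.*.' '..' > lines.txt

any_two=$(grep '^..$' lines.txt)           # any two characters
two_dots=$(grep '^\.\.$' lines.txt)        # exactly two literal dots
only_dots=$(grep '^\.*\.$' lines.txt)      # one or more literal dots
dot_star_dot=$(grep '^\.\*\.$' lines.txt)  # literal dot, asterisk, dot

printf '%s\n' "$any_two" '--' "$two_dots" '--' "$only_dots" '--' "$dot_star_dot"
```

    Comparing the first two results shows the effect of the backslash: the unescaped dots also match ab and cd, whereas the escaped dots match only the line of two dots.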

    Useful Options for the grep Command

    There are many types of pattern matching possibilities with the grep command, and this section contains an eclectic mix of such commands that handle common scenarios.

    In the following examples, we have four text files (two .sh and two .txt) and two Word documents in a directory. The string abc is found on one line in abc1.txt and three lines in abc3.sh. The string ABC is found on two lines in ABC2.txt and four lines in ABC4.sh. Notice that abc is not found in ABC files, and ABC is not found in abc files.

    ls *

    ABC.doc   ABC4.sh   abc1.txt   ABC2.txt   abc.doc   abc3.sh

    The following code snippet searches for occurrences of the string abc in all the files in the current directory that have sh as a suffix:

    grep abc *sh

    abc3.sh:abc at start

    abc3.sh:ends with -abc

    abc3.sh:the abc is in the middle

    The -c option counts the number of lines that match a string: even though ABC4.sh has no matches, grep still reports it with a count of zero:

    grep -c abc *sh

    The output of the preceding command is here:

    ABC4.sh:0

    abc3.sh:3

    The -e option lets you match patterns that would otherwise cause syntax problems (a leading - character is normally interpreted as the start of an option for grep):

    grep -e -abc *sh

    abc3.sh:ends with -abc

    The -e option also lets you match multiple patterns:

    grep -e -abc -e comment *sh

    ABC4.sh:# ABC in a comment

    abc3.sh:ends with -abc
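
    The -c and -e examples can be reproduced in a scratch directory. The following is a minimal sketch in which the contents of abc3.sh and ABC4.sh are assumptions, reconstructed from the sample output shown in this section (the position of the empty line in ABC4.sh is a guess):

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Assumed file contents, reconstructed from the sample output in the text.
printf '%s\n' 'abc at start' 'ends with -abc' 'the abc is in the middle' \
  "this line won't match" > abc3.sh
printf '%s\n' 'ABC at start' 'ends with ABC' 'the ABC is in the middle' \
  '# ABC in a comment' '' > ABC4.sh

# -c prints one count of matching lines per file, including files with none.
counts=$(grep -c abc *sh)

# -e marks the next argument as a pattern, even when it starts with a hyphen.
dash_match=$(grep -e -abc *sh)

# Repeating -e selects lines that match any one of the patterns.
either=$(grep -e -abc -e comment *sh)

printf '%s\n' "$counts" '--' "$dash_match" '--' "$either"
```

    The order in which the two files are listed depends on how the shell sorts the expansion of *sh, which varies with the locale.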

    The -i option performs a case-insensitive match:

    grep -i abc *sh

    ABC4.sh:ABC at start

    ABC4.sh:ends with ABC

    ABC4.sh:the ABC is in the middle

    ABC4.sh:# ABC in a comment

    abc3.sh:abc at start

    abc3.sh:ends with -abc

    abc3.sh:the abc is in the middle

    The -v option inverts the match, which means that the output consists of the lines that do not contain the specified string (lines containing ABC are displayed because -i is not used, and ABC4.sh also has an entirely empty line, which is displayed as well):

    grep -v abc *sh

    Use the -iv options to display the lines that do not contain a specified string using a case insensitive match:

    grep -iv abc *sh

    ABC4.sh:

    abc3.sh:this line won't match

    The -l option lists only the names of the files that contain a successful match (note that grep matches the contents of the files, not the filenames). The Word document matches because the actual text is still visible to grep; it is just surrounded by proprietary formatting gibberish. You can do similar things with other formats that contain text, such as XML, HTML, CSV, and so forth:

    grep -l abc *

    abc1.txt

    abc3.sh

    abc.doc

    You can restrict the same search to the files that have sh as a suffix:

    grep -l abc *sh

    Use the -il options to display the filenames that contain a specified string using a case insensitive match:

    grep -il abc *doc

    The preceding command is very useful when you want to check for the occurrence of a string in Word documents.

    The -n option displays the line number of each matching line:

    grep -n abc *sh

    abc3.sh:1:abc at start

    abc3.sh:2:ends with -abc

    abc3.sh:3:the abc is in the middle

    The -h option suppresses the display of the filename for a successful match:

    grep -h abc *sh

    abc at start

    ends with -abc

    the abc is in the middle
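
    The -iv, -l, -n, and -h options can be compared side by side. The following is a minimal sketch that again assumes the contents of abc3.sh and ABC4.sh (reconstructed from the sample output in this section, not taken from the original files):

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Assumed file contents, reconstructed from the sample output in the text.
printf '%s\n' 'abc at start' 'ends with -abc' 'the abc is in the middle' \
  "this line won't match" > abc3.sh
printf '%s\n' 'ABC at start' 'ends with ABC' 'the ABC is in the middle' \
  '# ABC in a comment' '' > ABC4.sh

inverted=$(grep -iv abc *sh)  # lines without abc in any letter case
names=$(grep -l abc *sh)      # names of files with at least one match
numbered=$(grep -n abc *sh)   # filename:line number:matching line
no_names=$(grep -h abc *sh)   # matching lines without the filename prefix

printf '%s\n' "$inverted" '--' "$names" '--' "$numbered" '--' "$no_names"
```

    When several files are searched, grep prefixes each line with the filename unless -h is given, which is why the -n output carries two prefixes.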

    For the next series of examples, we will use columns4.txt, as shown in Listing 1.2.

    Listing 1.2: columns4.txt

    123 ONE TWO

    456 three four

    ONE TWO THREE FOUR

    five 123 six

    one two three

    four five

    The -o option shows only the matched string (this is how you avoid returning the entire line that matches):

    grep -o one columns4.txt

    The -o option combined with the -b option shows the position of the matched string (it returns the zero-based byte offset within the file, not the line number; the o in one is at offset 59):

    grep -o -b one columns4.txt
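
    The offset can be verified by recreating Listing 1.2. The following is a minimal sketch, assuming columns4.txt has one entry per line with no blank lines and a newline at the end of each line; the four lines before the match occupy 12, 15, 19, and 13 bytes (newlines included), so the match starts at byte offset 59:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Recreate columns4.txt from Listing 1.2.
printf '%s\n' '123 ONE TWO' '456 three four' 'ONE TWO THREE FOUR' \
  'five 123 six' 'one two three' 'four five' > columns4.txt

# -o prints only the matched text instead of the whole line.
only_match=$(grep -o one columns4.txt)

# Adding -b prefixes each match with its byte offset, counted from zero.
with_offset=$(grep -o -b one columns4.txt)

printf '%s\n' "$only_match" "$with_offset"
```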

    You can specify a recursive search, as shown here (output is not shown because it differs on every system or account). This command searches not only every file in the directory /etc, but also every file in every subdirectory of /etc:

    grep -r abc /etc

    The preceding commands match lines where the specified string is a substring of a longer string in the file. For instance, the preceding commands will match occurrences of abc as well as abcd, dabc, abcde, and so forth.

    grep ABC *txt

    ABC2.txt:ABC at start or ABC in middle or end in ABC

    ABC2.txt:ABCD DABC

    If you want to exclude everything except for a whole-word match, you can use the -w option, as shown here:

    grep -w ABC *txt

    ABC2.txt:ABC at start or ABC in middle or end in ABC
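
    The difference between substring and whole-word matching can be seen with a single file. The following is a minimal sketch, assuming that ABC2.txt consists of the two lines shown in the sample output above:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Assumed contents of ABC2.txt, taken from the sample output in the text.
printf '%s\n' 'ABC at start or ABC in middle or end in ABC' 'ABCD DABC' > ABC2.txt

# Without -w, ABC also matches inside ABCD and DABC.
substring_count=$(grep -c ABC ABC2.txt)

# With -w, the match must be delimited by non-word characters or line edges.
whole_words=$(grep -w ABC ABC2.txt)

printf '%s\n' "$substring_count" "$whole_words"
```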

    The --color switch displays the matching string in color:

    grep --color abc *sh

    abc3.sh:abc at start

    abc3.sh:ends with -abc

    abc3.sh:the abc is in the middle

    You can use the pair of metacharacters (.*) to find the occurrences of two words that are separated by an arbitrary number of intermediate characters.

    The following command finds all lines that contain the strings one and three with any number of intermediate characters:

    grep one.*three columns4.txt

    one two three

    You can invert the preceding result by using the -v switch, as shown here:

    grep -v one.*three columns4.txt

    123 ONE TWO

    456 three four

    ONE TWO THREE FOUR

    five 123 six

    four five

    The following command finds all lines that contain the strings one and three with any number of intermediate characters, where the match involves a case-insensitive comparison:

    grep -i one.*three columns4.txt

    ONE TWO THREE FOUR

    one two three

    You can invert the preceding result by using the -v switch, as shown here:

    grep -iv one.*three columns4.txt

    123 ONE TWO

    456 three four

    five 123 six

    four five
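
    The four searches above can be repeated against a recreated copy of Listing 1.2. The following is a minimal sketch, assuming columns4.txt has one entry per line; the pattern is quoted here so that the shell cannot expand the * as a filename wildcard before grep receives it:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Recreate columns4.txt from Listing 1.2.
printf '%s\n' '123 ONE TWO' '456 three four' 'ONE TWO THREE FOUR' \
  'five 123 six' 'one two three' 'four five' > columns4.txt

# "one", then any run of characters, then "three".
sensitive=$(grep 'one.*three' columns4.txt)

# The same pattern, ignoring letter case.
insensitive=$(grep -i 'one.*three' columns4.txt)

# Every line that the case-insensitive search does not select.
remaining=$(grep -iv 'one.*three' columns4.txt)

printf '%s\n' "$sensitive" '--' "$insensitive" '--' "$remaining"
```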

    Sometimes you need to search a file for the presence of either of two strings. For example, the following command finds the files that contain start or end:

    grep -l 'start\|end' *

    ABC2.txt

    ABC4.sh

    abc3.sh

    Later in the chapter, you will see how to find files that contain a pair of strings via the grep and xargs commands.

    Character Classes and the grep Command

    This section contains some simple one-line commands that combine the grep command with character classes.

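    Note that a character class name such as [:alpha:] is valid only inside a bracket expression, which is why the patterns below use double brackets. Written with single brackets, '[:alpha:]' is either rejected by grep or treated as a list of the individual characters :, a, l, p, and h.

    echo abc | grep '[[:alpha:]]'

    abc

    echo 123 | grep '[[:alpha:]]'

    (returns nothing, no match)

    echo abc123 | grep '[[:alpha:]]'

    abc123

    echo abc | grep '[[:alnum:]]'

    abc

    echo 123 | grep '[[:alnum:]]'

    123

    echo abc123 | grep '[[:alnum:]]'

    abc123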

    echo abc | grep '[0-9]'

    (returns nothing, no match)

    echo 123 | grep '[0-9]'

    123

    echo abc123 | grep '[0-9]'

    abc123

    echo abc123 | grep -w '[0-9]'

    (returns nothing, no match)
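
    The character class tests can be collected into one script. The following is a minimal sketch, assuming a hypothetical helper function named check; POSIX character classes are written inside a bracket expression, as in [[:alpha:]], and the last call anchors the pattern so that the whole line must consist of digits:

```shell
#!/usr/bin/env bash
# check is a hypothetical helper: it prints "match" or "no match" for a
# string ($2) tested against a basic regular expression ($1).
check() {
  if printf '%s\n' "$2" | grep -q -- "$1"; then
    echo "match"
  else
    echo "no match"
  fi
}

check '[[:alpha:]]'    abc     # the string contains letters
check '[[:alpha:]]'    123     # digits only, so no letter is found
check '[[:alnum:]]'    123     # digits count as alphanumeric
check '[[:digit:]]'    abc123  # at least one digit is present
check '^[[:digit:]]*$' abc123  # the letters prevent an all-digit line
```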

    Working with the –c Option in grep

    Consider a scenario in which a directory (such as a log directory) has files created by an outside program. Your task is to write a shell script that determines which (if any) of the files contain two occurrences of a string, after which additional processing is performed on the matching files (e.g., use email to send log files containing two or more error messages to a system administrator for investigation).

    One solution involves the –c option for grep, followed by additional invocations of the grep command.
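
    The general shape of such a script can be sketched as follows. This is a minimal illustration under stated assumptions, not the book's own solution: the log file names, their contents, the search string, and the threshold are all hypothetical, and the script only reports the files that qualify instead of emailing them. Note that grep -c counts matching lines, so a line that contains the string twice is counted once:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Hypothetical log files; the names and contents are assumptions.
printf '%s\n' 'ok' 'error: disk' 'error: net' > app1.log
printf '%s\n' 'ok' 'error: disk'              > app2.log
printf '%s\n' 'ok' 'ok'                       > app3.log

pattern='error'   # string to look for (assumed)
threshold=2       # minimum number of matching lines (assumed)
flagged=''

for f in *.log; do
  # For a single file, grep -c prints only the number of matching lines.
  count=$(grep -c -- "$pattern" "$f")
  if [ "$count" -ge "$threshold" ]; then
    flagged="$flagged$f "
    echo "$f has $count matching lines"
  fi
done
```

    The list collected in the flagged variable could then be handed to whatever follow-up step is needed.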

    The command snippets in this section assume the following data files whose contents are shown below.

    The file hello1.txt contains the
