Command Line Text Processing
Table of Contents
Introduction 1.1
Cat, Less, Tail and Head 1.2
GNU grep 1.3
GNU sed 1.4
GNU awk 1.5
Perl the swiss knife 1.6
Sorting stuff 1.7
Restructure text 1.8
File attributes 1.9
Miscellaneous 1.10
Introduction
Chapters
Cat, Less, Tail and Head
cat, less, tail, head, Text Editors
GNU grep
GNU sed
GNU awk
Perl the swiss knife
Sorting stuff
sort, uniq, comm, shuf
Restructure text
paste, column, pr, fold
File attributes
wc, du, df, touch, file
Miscellaneous
cut, tr, basename, dirname, xargs, seq
Webinar recordings
Recorded a couple of videos based on the content in the chapters; not sure if I'll do more
exercises
Check out exercises on github to solve practice questions, right from the command line itself
As of now, only grep exercises have been added. Stay tuned for more
Acknowledgements
unix.stackexchange and stackoverflow - for getting answers to pertinent questions as well as
sharpening skills by understanding and answering questions
Forums like Linux users, /r/commandline/, /r/linux/, news.ycombinator, devup and others for valuable
feedback (especially spotting mistakes) and encouragement
See wikipedia entry 'Roses Are Red' for poem.txt used as sample text input file
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0
International License
Cat, Less, Tail and Head
cat
Concatenate files
Accepting input from stdin
Squeeze consecutive empty lines
Prefix line numbers
Viewing special characters
Writing text to file
tac
Useless use of cat
Further Reading for cat
less
Navigation commands
Further Reading for less
tail
linewise tail
characterwise tail
multiple file input for tail
Further Reading for tail
head
linewise head
characterwise head
multiple file input for head
combining head and tail
Further Reading for head
Text Editors
cat
$ man cat
CAT(1) User Commands CAT(1)
NAME
cat - concatenate files and print on the standard output
SYNOPSIS
cat [OPTION]... [FILE]...
DESCRIPTION
Concatenate FILE(s) to standard output.
Concatenate files
One or more files can be given as input; hence, a lot of times cat is used to quickly see the contents of a small file on the terminal
To save the output of concatenation, just redirect stdout
$ ls
marks_2015.txt marks_2016.txt marks_2017.txt
$ cat marks_201*
Name Maths Science
foo 67 78
bar 87 85
Name Maths Science
foo 70 75
bar 85 88
Name Maths Science
foo 68 76
bar 90 90
$ whatis nl
nl (1) - number lines of files
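As a quick sketch, line numbers can be prefixed either with cat -n or with the dedicated nl command (the two-line input here is just for illustration)
$ printf 'hello\nworld\n' | cat -n
     1  hello
     2  world
$ printf 'hello\nworld\n' | nl
     1  hello
     2  world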
$ cat -E marks_2015.txt
Name Maths Science $
foo 67 78$
bar 87 85$
TAB identified by ^I
$ cat -T marks_2015.txt
Name^IMaths^IScience
foo^I67^I78
bar^I87^I85
Non-printing characters
See Show Non-Printing Characters for more detailed info
$ # NUL character
$ printf 'foo\0bar\0baz\n' | cat -v
foo^@bar^@baz
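Writing text to file
Text typed on the terminal can be saved to a file by redirecting cat's output — a minimal sketch that creates the file shown below (press Ctrl+d on a newline to finish)
$ cat > sample.txt
This is an example of adding text to a new file using cat command.
Press Ctrl+d on a newline to save and quit.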
$ cat sample.txt
This is an example of adding text to a new file using cat command.
Press Ctrl+d on a newline to save and quit.
tac
$ whatis tac
tac (1) - concatenate and print files in reverse
$ seq 3 | tac
3
2
1
$ tac marks_2015.txt
bar 87 85
foo 67 78
Name Maths Science
Useful in cases where the logic is easier to write when working on the reversed file
Consider this made-up log file: there are many Warning lines, but only the text from the last such Warning up to the Error line needs to be extracted
$ cat report.log
blah blah
Warning: something went wrong
more blah
whatever
Warning: something else went wrong
some text
some more text
Error: something seriously went wrong
blah blah blah
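One way to get this is to reverse the file, extract between the now-adjacent markers and reverse again — a sketch using sed (covered in a later chapter)
$ tac report.log | sed -n '/Error:/,/Warning:/p' | tac
Warning: something else went wrong
some text
some more text
Error: something seriously went wrong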
$ whatis rev
rev (1) - reverse lines characterwise
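A quick sketch of rev in action
$ echo 'hello' | rev
olleh
$ printf '1 apple\n2 books\n' | rev
elppa 1
skoob 2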
less
$ whatis less
less (1) - opposite of more
$ realpath /usr/bin/pager
/bin/less
$ realpath /usr/bin/less
/bin/less
$ diff -s /usr/bin/pager /usr/bin/less
Files /usr/bin/pager and /usr/bin/less are identical
cat command is NOT suitable for viewing contents of large files on the Terminal
less displays contents of a file, automatically fits to size of Terminal, allows scrolling in either
direction and other options for effective viewing
Usually, man command uses less command to display the help page
The navigation commands are similar to vi editor
Navigation commands
Commonly used commands are given below, press h for summary of options
g go to start of file
G go to end of file
q quit
/pattern search for the given pattern in forward direction
?pattern search for the given pattern in backward direction
n go to next pattern
N go to previous pattern
tail
$ man tail
TAIL(1) User Commands TAIL(1)
NAME
tail - output the last part of files
SYNOPSIS
tail [OPTION]... [FILE]...
DESCRIPTION
Print the last 10 lines of each FILE to standard output. With more
than one FILE, precede each with a header giving the file name.
linewise tail
Consider this sample file, with line numbers prefixed
$ cat sample.txt
1) Hello World!
2)
3) Good day
4) How do you do?
5)
6) Just do it
7) Believe it!
8)
9) Today is sunny
10) Not a bit funny
11) No doubt you like it too
12)
13) Much ado about nothing
14) He he he
15) Adios amigo
$ tail sample.txt
6) Just do it
7) Believe it!
8)
9) Today is sunny
10) Not a bit funny
11) No doubt you like it too
12)
13) Much ado about nothing
14) He he he
15) Adios amigo
when number is prefixed with + sign, all lines are fetched from that particular line number to end of
file
$ seq 13 17 | tail -n +3
15
16
17
characterwise tail
Note that this works byte-wise and is not suitable for multi-byte character encodings
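For example, fetching the last few bytes with the -c option (a minimal sketch; the count includes the final newline character)
$ printf 'Hello World\n' | tail -c 6
World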
head
$ man head
HEAD(1) User Commands HEAD(1)
NAME
head - output the first part of files
SYNOPSIS
head [OPTION]... [FILE]...
DESCRIPTION
Print the first 10 lines of each FILE to standard output. With more
than one FILE, precede each with a header giving the file name.
linewise head
default behavior - display starting 10 lines
$ head sample.txt
1) Hello World!
2)
3) Good day
4) How do you do?
5)
6) Just do it
7) Believe it!
8)
9) Today is sunny
10) Not a bit funny
when the number is prefixed with a - sign, all lines are fetched except that many lines at the end of the file
characterwise head
Note that this works byte-wise and is not suitable for multi-byte character encodings
$ # if output of command doesn't end with newline, prompt will be on same line
$ # to highlight working of command, the prompt for such cases is not shown here
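For example, fetching the first few bytes with the -c option (a minimal sketch)
$ printf 'Hello World\n' | head -c 5
Hello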
Text Editors
For editing text files, the following applications can be used. Of these, gedit , nano , vi and/or
vim are available in most distros by default
Easy to use
gedit
geany
nano
vim
vim learning resources and vim reference for further info
emacs
atom
sublime
Check out this analysis for some performance/feature comparisons of various text editors
GNU grep
Table of Contents
Character Quantifiers
Character classes and backslash sequences
Pattern groups
Basic vs Extended Regular Expressions
Further Reading
$ grep -V | head -1
grep (GNU grep) 2.25
$ man grep
GREP(1) General Commands Manual GREP(1)
NAME
grep, egrep, fgrep, rgrep - print lines matching a pattern
SYNOPSIS
grep [OPTIONS] PATTERN [FILE...]
grep [OPTIONS] [-e PATTERN]... [-f FILE]... [FILE...]
DESCRIPTION
grep searches the named input FILEs for lines containing a match to the
given PATTERN. If no files are specified, or if the file “-” is given,
grep searches standard input. By default, grep prints the matching
lines.
In addition, the variant programs egrep, fgrep and rgrep are the same
as grep -E, grep -F, and grep -r, respectively. These variants are
deprecated, but are provided for backward compatibility.
...
Note: For more detailed documentation and examples, use info grep
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.
If search string contains any regular expression meta characters like ^$\.*[] (covered later), use
the -F option or fgrep if available
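For example, searching for a.b literally instead of letting . match any character (a sketch)
$ printf 'a.b\naXb\n' | grep 'a.b'
a.b
aXb
$ printf 'a.b\naXb\n' | grep -F 'a.b'
a.b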
Match all
Another useful option is -x , which matches only whole lines, not anywhere in the line
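For example (a sketch)
$ printf 'do it\nit\n' | grep 'it'
do it
it
$ printf 'do it\nit\n' | grep -x 'it'
it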
Colored output
Highlight search strings, line numbers, file name, etc in different colors
Depends on color support in terminal being used
options to --color are
auto when output is redirected (another command, file, etc) the color information won't be
passed
always when output is redirected (another command, file, etc) the color information will also
be passed
never explicitly specify no highlighting
Sample screenshot
Context matching
The -A , -B and -C options are useful to get lines after/before/around matching line
respectively
If there are multiple non-adjacent matching segments, by default grep adds a line -- to
separate them
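For example, getting one line after each matching line (a sketch using poem.txt)
$ grep -A1 'blue' poem.txt
Violets are blue,
Sugar is sweet,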
Use --no-group-separator option if the separator line is a hindrance, for example feeding the
output of grep to another program
Recursive search
First let's create some more test files
$ mkdir -p test_files/hidden_files
$ printf 'Red\nGreen\nBlue\nBlack\nWhite\n' > test_files/colors.txt
$ printf 'Violet\nIndigo\nBlue\nGreen\nYellow\nOrange\nRed\n' > test_files/vibgyor.txt
$ printf '#!/usr/bin/python3\n\nprint("Hello World")\n' > test_files/hello.py
$ printf 'I like yellow\nWhat about you\n' > test_files/hidden_files/.fav_color.info
-r, --recursive
Read all files under each directory, recursively, following
symbolic links only if they are on the command line. Note that
if no file operand is given, grep searches the working
directory. This is equivalent to the -d recurse option.
-R, --dereference-recursive
Read all files under each directory, recursively. Follow all
symbolic links, unlike -r.
To search only files with specific pattern in their names, use --include=GLOB
Note: exclusion/inclusion applies only to basename of file/directory, not the entire path
To follow all symbolic links (not directly specified as arguments, but found on recursive search),
use -R instead of -r
$ # exclude a directory
$ grep -ri --exclude-dir='hidden_files' 'you'
poem.txt:And so are you.
$ # recursive search
$ shopt -s globstar
$ grep -d skip -il 'yellow' **/*
test_files/vibgyor.txt
A simple example to show filenames with spaces causing an issue if -Z is not used
$ # no issues if -Z is used
$ grep -rilZ 'are' | xargs -0 grep 'you'
abc xyz.txt:hi how are you
poem.txt:And so are you.
Example for matching more than one search string anywhere in file
another example
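A sketch of using multiple -e options to match either of two search strings
$ grep -e 'blue' -e 'you' poem.txt
Violets are blue,
And so are you.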
$ if grep -qi 'rose' poem.txt; then echo 'match found!'; else echo 'match not found'; fi
match found!
$ if grep -qi 'lily' poem.txt; then echo 'match found!'; else echo 'match not found'; fi
match not found
$ touch foo.txt
$ chmod -r foo.txt
$ grep 'rose' foo.txt
grep: foo.txt: Permission denied
$ grep -s 'rose' foo.txt
$ echo $?
2
By default, grep treats the search pattern as BRE (Basic Regular Expression)
-G option can be used to specify explicitly that BRE is used
The -E option allows to use ERE (Extended Regular Expression) which in GNU grep's case only
differs in how meta characters are used, no difference in regular expression functionalities
If -F option is used, the search string is treated literally
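For example, + is an ordinary character in BRE but a quantifier in ERE (a sketch)
$ # BRE: matches the literal string a+b
$ echo 'abc a+b' | grep -o 'a+b'
a+b
$ # ERE: one or more a followed by b
$ echo 'abc a+b' | grep -oE 'a+b'
ab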
If available, one can also use -P which indicates PCRE (Perl Compatible Regular Expression)
Line Anchors
Often, search must match from beginning of line or towards end of line
For example, an integer variable declaration in C will start with optional white-space, the keyword
int , white-space and then variable(s)
This way one can avoid matching declarations inside single line comments as well.
Similarly, one might want to match a variable at end of statement
The meta characters for line anchoring are ^ for beginning of line and $ for end of line
$ # start of line
$ grep '^Fantasy' fav.txt
Fantasy is my favorite genre
$ # end of line
$ grep 'Fantasy$' fav.txt
My favorite genre is Fantasy
$ # without anchors
$ grep 'Fantasy' fav.txt
Fantasy is my favorite genre
My favorite genre is Fantasy
As the meta characters have special meaning (assuming -F option is not used), they have to be
escaped using \ to match literally
The \ itself is meta character, so to match it literally, use \\
The line anchors ^ and $ have special meaning only when they are present at start/end of
regular expression
Word Anchors
The -w option works well to match whole words. But what about matching only start or end of
words?
Anchors \< and \> will match start/end positions of a word
\b can also be used instead of \< and \> which matches either edge of a word
$ printf 'spar\npar\npart\napparent\n'
spar
par
part
apparent
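A sketch using the above input
$ # words starting with 'par'
$ printf 'spar\npar\npart\napparent\n' | grep '\<par'
par
part
$ # words ending with 'par'
$ printf 'spar\npar\npart\napparent\n' | grep 'par\>'
spar
par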
Alternation
The | meta character is similar to using multiple -e option
Each side of | is complete regular expression with their own start/end anchors
How each part of alternation is handled and order of evaluation/output is beyond the scope of this
tutorial
See this for more info on this topic.
| is one of meta characters that requires different syntax between BRE/ERE
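For example (a sketch)
$ # ERE
$ grep -E 'red|blue' poem.txt
Roses are red,
Violets are blue,
$ # BRE
$ grep 'red\|blue' poem.txt
Roses are red,
Violets are blue,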
Quantifiers
Defines how many times a character (simplified for now) should be matched
$ printf 'late\npale\nfactor\nrare\nact\n'
late
pale
factor
rare
act
For more precise control on number of times to match, {} ( \{\} for BRE) is useful
It can take one of four forms, {n} , {n,m} , {,m} and {n,}
$ # {,m} - 0 to m times
$ echo 'ac abc abbc abbbc' | grep -Eo 'ab{,2}c'
ac
abc
abbc
Character classes
The meta character pairs [] allow to match any of the multiple characters within []
Meta characters like ^ , $ have different meaning inside and outside of []
Simple example first, matching any of the characters within []
Adding a quantifier
Check out unix words and sample words file
Character ranges
Matching any alphabet, number, hexadecimal number etc becomes cumbersome if every character
has to be individually specified
So, there's a shortcut, using - to construct a range (has to be specified in ascending order)
See ascii codes table for reference
Note that behavior of range will differ for other character encodings
See Character Classes and Bracket Expressions as well as LC_COLLATE under
Environment Variables sections in info grep for more detail
Matching Numeric Ranges with a Regular Expression
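For example, matching numbers from 10 to 29 (a sketch)
$ printf '23\n154\n12\n26\n98234\n' | grep -x '[12][0-9]'
23
12
26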
Character classes    Description
[:digit:]            Same as [0-9]
Grouping
Character classes allow matching against a choice of multiple character list and then quantifier
added if needed
One of the uses of grouping is analogous to character classes for whole regular expressions,
instead of just list of characters
The meta characters () are used for grouping
requires \(\) for BRE
Similar to maths ab + ac = a(b+c) , think of regular expression a(b|c) = ab|ac
$ # nesting of () is allowed
$ grep -E '([as](p|c)[r-t]){2}' /usr/share/dict/words
scraps
Back reference
The matched string within () can also be used to be matched again by back referencing the
captured groups
\1 denotes the first matched group, \2 the second one and so on
Order is leftmost ( is \1 , next one is \2 and so on
Note that the matched string, not the regular expression itself is referenced
for ex: if ([0-9][a-f]) matches 3b , then back referencing will be 3b not any other valid
match of the regular expression like 8f , 0a etc
Other regular expressions like PCRE do allow referencing the regular expression itself
$ # note how first three and last three letters are same
$ grep -xE '([a-d]..)\1' /usr/share/dict/words
bonbon
cancan
chichi
$ # note how adding quantifier is not same as back-referencing
$ grep -m4 -xE '([a-d]..){2}' /usr/share/dict/words
abacus
abided
abides
ablaze
Note that there is an issue for certain usage of back-reference and quantifier
$ # no output
$ grep -m5 -xiE '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words
$ # works when nesting is unrolled
$ grep -m5 -xiE '[a-z]*([a-z])\1[a-z]*([a-z])\2[a-z]*' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed
$ cat story.txt
singing tin in the rain
walking for for a cause
have a nice day
day and night
Multiline matching
If input is small enough to meet memory requirements, the -z option comes in handy to match
across multiple lines
Instead of newline being line separator, the ASCII NUL character is used
So, multiline matching depends on whether or not input file itself contains the NUL character
Usually text files won't have occasion to use the NUL character and presence of it marks it as
binary file for grep
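A sketch using story.txt shown earlier; with -P , \n can be used in the pattern (each match in the output is followed by the ASCII NUL character)
$ grep -zoP 'nice day\nday' story.txt
nice day
day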
The man page informs that -P is highly experimental. So far, haven't faced any issues. But do
keep this in mind.
Only a few highlights are presented here
For more info
man pcrepattern or read it online
perldoc - re - Perl regular expression syntax, also links to other related tutorials
What does this regex mean?
Backslash sequences
Some of the backslash constructs available in PCRE over already seen ones in ERE
\d for [0-9]
\s for [\ \t\r\n\f]
\h for [ \t]
\n for newline character
\D , \S , \H , \N etc for their opposites
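For example (a sketch)
$ echo 'Sample123string54with908numbers' | grep -oP '\d+'
123
54
908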
See INTERNAL OPTION SETTING in man pcrepattern for more info on (?s) , (?m) etc
Specifying Modes Inside The Regular Expression also has some detail on such options
Non-greedy matching
Both BRE/ERE support only greedy matching quantifiers
match as much as possible
PCRE supports non-greedy version by adding ? after quantifiers
match as minimal as possible
See this Python notebook for an interesting project on palindrome sentences
$ echo 'foo and bar and baz went shopping bytes' | grep -oi '\w.*and'
foo and bar and
$ echo 'foo and bar and baz went shopping bytes' | grep -oiP '\w.*?and'
foo and
bar and
Lookarounds
Ability to add conditions to match before/after required pattern
There are four types
positive lookahead (?=
negative lookahead (?!
positive lookbehind (?<=
negative lookbehind (?<!
One way to remember is that behind uses < and negative uses ! instead of =
When used with -o option, lookarounds portion won't be part of output
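For example, extracting a number only if it is preceded by bar= (a sketch; bar= itself is not part of the output)
$ echo 'foo=5, bar=3' | grep -oP '(?<=bar=)\d+'
3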
The simple way to use this: the regular expression to be discarded is written first, (*SKIP)(*F) is appended to it, and then whatever is actually required is added after |
See Excluding Unwanted Matches for more info
Another common problem is that an unquoted search string will be subject to the shell's own globbing rules
Use double quotes for variable expansion, command substitution, etc (Note: could vary based on
shell used)
See mywiki.wooledge Quotes for detailed discussion of quoting in bash shell
$ # workaround by using \-
$ echo '5*3-2=13' | grep '\-2'
5*3-2=13
Tip: Options can be specified at end of command as well, useful if option was forgotten and have to
quickly add it to previous command from history
$ time grep -xP '([a-z]..)\1' /usr/share/dict/words
bonbon
cancan
chichi
murmur
muumuu
pawpaw
pompom
tartar
testes
real 0m0.008s
Anchors
^ match from start of line
$ match end of line
\< match beginning of word
\> match end of word
\b match edge of word
\B match other than edge of word
Character Quantifiers
. match any single character
Pattern groups
| matches either of the given patterns
() patterns within () are grouped and treated as one pattern, useful in conjunction with |
\1 backreference to first grouped pattern within ()
\2 backreference to second grouped pattern within () and so on
Further Reading
GNU sed
Table of Contents
w modifier
e modifier
m modifier
Shell substitutions
Variable substitution
Command substitution
z and s command line options
change command
insert command
append command
adding contents of file
r for entire file
R for line by line
n and N commands
Control structures
if then else
replacing in specific column
overlapping substitutions
Lines between two REGEXPs
Include or Exclude matching REGEXPs
First or Last block
Broken blocks
sed scripts
Gotchas and Tips
Further Reading
$ man sed
SED(1) User Commands SED(1)
NAME
sed - stream editor for filtering and transforming text
SYNOPSIS
sed [OPTION]... {script-only-if-no-other-script} [input-file]...
DESCRIPTION
Sed is a stream editor. A stream editor is used to perform basic text
transformations on an input stream (a file or input from a pipeline).
While in some ways similar to an editor which permits scripted edits
(such as ed), sed works by making only one pass over the input(s), and
is consequently more efficient. But it is sed's ability to filter text
in a pipeline which particularly distinguishes it from other types of
editors.
...
Note: Multiline and manipulating pattern space with h,x,D,G,H,P etc is not covered in this chapter and
examples/information is based on ASCII encoded text input only
s/REGEXP/REPLACEMENT/FLAGS
The / character is idiomatically used as delimiter character. See also Using different delimiter for
REGEXP
editing stdin
Note: As a good practice, all examples use single quotes around arguments to prevent shell
interpretation. See Shell substitutions section on use of double quotes
$ cat greeting.txt
Hi there
Have a nice day
$ # change all 'e' to 'E' and save changed text to another file
$ sed 's/e/E/g' greeting.txt > out.txt
$ cat out.txt
Hi thErE
HavE a nicE day
Note:
Refer to man sed for details of how to use the -i option. It varies with different sed
implementations. As mentioned at start of this chapter, sed (GNU sed) 4.2.2 is being used here
See this Q&A when working with symlinks
With backup
When extension is given, the original input file is preserved with name changed according to
extension provided
Without backup
Use this option with caution, changes made cannot be undone
Multiple files
Multiple input files are treated individually and changes are written back to respective files
$ cat f1
I ate 3 apples
$ cat f2
I bought two bananas and 3 mangoes
$ cat var.txt
foo
bar
baz
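A sketch of how such a backup can be created — if the suffix given to -i contains * , the * is replaced by the input file name, so bkp.* results in bkp.var.txt
$ sed -i'bkp.*' 's/foo/hello/' var.txt
$ cat var.txt
hello
bar
baz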
$ cat bkp.var.txt
foo
bar
baz
$ mkdir bkp_dir
$ sed -i'bkp_dir/*' 's/bar/hi/' var.txt
$ cat var.txt
hello
hi
baz
$ cat bkp_dir/var.txt
hello
bar
baz
Print command
It is usually used in conjunction with -n option
By default, sed prints every input line, including any changes made by commands like substitution
printing here refers to line being part of sed output which may be shown on terminal,
redirected to file, etc
Using -n option and p command together, only specific lines needed can be filtered
Examples below use the /REGEXP/ addressing, other forms will be seen in sections to follow
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.
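For example, filtering only the lines containing are (a sketch)
$ sed -n '/are/p' poem.txt
Roses are red,
Violets are blue,
And so are you.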
Delete command
By default, sed prints every input line, including any changes like substitution
Using the d command, those specific lines will NOT be printed
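For example (a sketch)
$ # delete lines containing 'are'
$ sed '/are/d' poem.txt
Sugar is sweet,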
Quit commands
Exit sed without processing further input
$ # useful to print from beginning of file up to but not including line matching REGEXP
$ sed '/is/Q' poem.txt
Roses are red,
Violets are blue,
Use tac to get all lines starting from last occurrence of search string
$ # all lines from last occurrence of '7' excluding line with '7'
$ seq 50 | tac | sed '/7/Q' | tac
48
49
50
Note
This way of using quit commands won't work for inplace editing with multiple file input
See this Q&A for alternate solution, also has solutions using gawk and perl
Different ways to do same things. See also Alternation and Control structures
$ time seq 3542 4623452 | sed -n '2452p' > /dev/null
real 0m0.334s
user 0m0.396s
sys 0m0.024s
If needed, matching line can also be printed. But there will be newline separation
$ # or
$ sed -n '/blue/{p;=}' poem.txt
Violets are blue,
2
Address range
So far, we've seen how to filter specific line based on REGEXP and line numbers
sed also allows to combine them to enable selecting a range of lines
Consider the sample input file for this section
$ cat addr_range.txt
Hello World
Good day
How are you
Just do-it
Believe it
Today is sunny
Not a bit funny
No doubt you like it too
$ # the second REGEXP will always be checked after the line matching first address
$ sed -n '/No/,/No/p' addr_range.txt
Not a bit funny
No doubt you like it too
Just do-it
No doubt you like it too
Relative addressing
Prefixing + to a number for second address gives relative filtering
Similar to using grep -A<num> --no-group-separator 'REGEXP' but grep merges adjacent
groups while sed does not
Another relative format is i~j which acts on ith line and i+j, i+2j, i+3j, etc
1~2 means 1st, 3rd, 5th, 7th, etc (i.e odd numbered lines)
5~3 means 5th, 8th, 11th, etc
$ # instead of this
$ echo '/home/learnbyexample/reports' | sed 's/\/home\/learnbyexample\//~\//'
~/reports
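A cleaner alternative is to use a different delimiter, for example # , so that / doesn't need escaping (a sketch)
$ echo '/home/learnbyexample/reports' | sed 's#/home/learnbyexample/#~/#'
~/reports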
$ printf '/foo/bar/1\n/foo/baz/1\n'
/foo/bar/1
/foo/baz/1
Regular Expressions
By default, sed treats REGEXP as BRE (Basic Regular Expression)
The -E option enables ERE (Extended Regular Expression) which in GNU sed's case only differs
in how meta characters are used, no difference in functionalities
Initially GNU sed only had the -r option to enable ERE and man sed doesn't even mention -E
Other sed versions use -E and grep uses -E as well. So -r won't be used in examples in this tutorial
See also sed manual - BRE-vs-ERE
See sed manual - Regular Expressions for more details
Line Anchors
Often, search must match from beginning of line or towards end of line
For example, an integer variable declaration in C will start with optional white-space, the keyword
int , white-space and then variable(s)
This way one can avoid matching declarations inside single line comments as well
Similarly, one might want to match a variable at end of statement
Consider the input file and sample substitution without using any anchoring
$ cat anchors.txt
cat and dog
too many cats around here
to concatenate, use the cmd cat
catapults laid waste to the village
just scat and quit bothering me
that is quite a fabricated tale
try the grape variety muscat
Word Anchors
A word character is any alphabet (irrespective of case) or any digit or the underscore character
The word anchors help in matching or not matching boundaries of a word
For example, to distinguish between par , spar and apparent
\b matches word boundary
\ is meta character and certain combinations like \b and \B have special meaning
One can also use these alternatives for \b
\< for start of word
\> for end of word
Certain characters like & and \ have special meaning in REPLACEMENT section of substitute
as well. They too have to be escaped using \
And the delimiter character has to be escaped of course
See back reference section for use of & in REPLACEMENT section
Alternation
Two or more REGEXP can be combined as logical OR using the | meta character
syntax is \| for BRE and | for ERE
Each side of | is complete regular expression with their own start/end anchors
How each part of alternation is handled and order of evaluation/output is beyond the scope of this
tutorial
See this for more info on this topic.
$ # BRE
$ sed -n '/red\|blue/p' poem.txt
Roses are red,
Violets are blue,
$ # ERE
$ sed -nE '/red|blue/p' poem.txt
Roses are red,
Violets are blue,
$ # replace all sequence of 3 characters starting with 'c' and ending with 't'
$ echo 'coat cut fit c#t' | sed 's/c.t/XYZ/g'
coat XYZ fit XYZ
$ # replace all sequence of 4 characters starting with 'c' and ending with 't'
$ echo 'coat cut fit c#t' | sed 's/c..t/ABCD/g'
ABCD cut fit c#t
$ # space, tab etc are also characters which will be matched by '.'
$ echo 'coat cut fit c#t' | sed 's/t.f/IJK/g'
coat cuIJKit c#t
Quantifiers
All quantifiers in sed are greedy, i.e longest match wins as long as overall REGEXP is satisfied and
precedence is left to right. In this section, we'll cover usage of quantifiers on characters
$ printf 'late\npale\nfactor\nrare\nact\n'
late
pale
factor
rare
act
$ printf 'abc\nac\nadc\nabbc\nbbb\nbc\nabbbbbc\n'
abc
ac
adc
abbc
bbb
bc
abbbbbc
$ # ERE
$ printf 'abc\nac\nadc\nabbc\nbbb\nbc\nabbbbbc\n' | sed -nE '/ab+c/p'
abc
abbc
abbbbbc
$ # exactly 5 times
$ printf 'abc\nac\nadc\nabbc\nbbb\nbc\nabbbbbc\n' | sed -nE '/ab{5}c/p'
abbbbbc
$ # minimum of 2 times
$ printf 'abc\nac\nadc\nabbc\nbbb\nbc\nabbbbbc\n' | sed -nE '/ab{2,}c/p'
abbc
abbbbbc
$ # BRE
$ printf 'abc\nac\nadc\nabbc\nbbb\nbc\nabbbbbc\n' | sed -n '/ab\{2,\}c/p'
abbc
abbbbbc
Character classes
The . meta character provides a way to match any character
Character class provides a way to match any character among a specified set of characters
enclosed within []
Character ranges
Matching any alphabet, number, hexadecimal number etc becomes cumbersome if every character
has to be individually specified
So, there's a shortcut, using - to construct a range (has to be specified in ascending order)
See ascii codes table for reference
Note that behavior of range will depend on locale settings
arch wiki - locale
Linux: Define Locale and Language Settings
$ # filter lines made up entirely of lower case alphabets and digits, at least one
$ printf 'cat5\nfoo\n123\n42\n' | sed -nE '/^[a-z0-9]+$/p'
cat5
foo
123
42
Numeric ranges, easy for certain cases but not suitable always. Use awk or perl for arithmetic
computation
See also Matching Numeric Ranges with a Regular Expression
$ # numbers between 10 to 29
$ printf '23\n154\n12\n26\n98234\n' | sed -n '/^[12][0-9]$/p'
23
12
26
Character
Description
classes
[:digit:] Same as [0-9]
[:lower:] Same as [a-z]
\s Same as [[:space:]]
\S Same as [^[:space:]]
Escape sequences
Certain ASCII characters like tab, carriage return, newline, etc have escape sequence to represent
them
Unlike backslash character classes, these can be used within [] as well
Any ASCII character can be also represented using their decimal or octal or hexadecimal value
See ascii codes table for reference
See sed manual - Escapes for more details
$ # most common use case for hex escape sequence is to represent single quotes
$ # equivalent is '\d039' and '\o047' for decimal and octal respectively
$ echo "foo: '34'"
foo: '34'
$ echo "foo: '34'" | sed 's/\x27/"/g'
foo: "34"
$ echo 'foo: "34"' | sed 's/"/\x27/g'
foo: '34'
Grouping
Character classes allow matching against a choice of multiple character list and then quantifier
added if needed
One of the uses of grouping is analogous to character classes for whole regular expressions,
instead of just list of characters
The meta characters () are used for grouping
requires \(\) for BRE
Similar to maths ab + ac = a(b+c) , think of regular expression a(b|c) = ab|ac
$ # quantifier example
$ printf 'handed\nhand\nhandy\nhands\nhandle\n' | sed -nE '/^hand([sy]|le)?$/p'
hand
handy
hands
handle
Back reference
The matched string within () can also be used to be matched again by back referencing the
captured groups
\1 denotes the first matched group, \2 the second one and so on
Order is leftmost ( is \1 , next one is \2 and so on
Can be used both in REGEXP as well as in REPLACEMENT sections
& or \0 represents entire matched string in REPLACEMENT section
Note that the matched string, not the regular expression itself is referenced
for ex: if ([0-9][a-f]) matches 3b , then back referencing will be 3b not any other valid
match of the regular expression like 8f , 0a etc
As \ and & are special characters in REPLACEMENT section, use \\ and \& respectively
for literal representation
Changing case
Case conversion escapes like \u , \U , \l , \L and \E apply only to the REPLACEMENT section, unlike perl where these can be used in the REGEXP portion as well
See sed manual - The s Command for more details and corner cases
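For example, \u uppercases the next character of the replacement while \U ... \E uppercases a whole portion (a sketch)
$ echo 'hello world' | sed 's/\w\+/\u&/g'
Hello World
$ echo 'hello world' | sed 's/hello/\U&\E/'
HELLO world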
s/REGEXP/REPLACEMENT/FLAGS
Modifiers (or FLAGS) like g , p and I have been already seen. For completeness, they will be
discussed again along with rest of the modifiers
See sed manual - The s Command for more details and corner cases
g modifier
By default, substitute command will replace only first occurrence of match. g modifier is needed to
replace all occurrences
Replacing Nth match from end of line when number of matches is unknown
Makes use of greediness of quantifiers
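For example, replacing only the last occurrence of : (a sketch; the greedy .* consumes everything up to the final match)
$ echo 'foo:123:bar:baz' | sed -E 's/(.*):/\1-/'
foo:123:bar-baz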
Ignoring case
Either i or I can be used for replacing in case-insensitive manner
Since only I can be used for address filtering (for ex: sed '/rose/Id' poem.txt ), use I for
substitute command as well for consistency
p modifier
Usually used in conjunction with -n option to output only modified lines
$ # no output if no substitution
$ echo 'hi there. have a nice day' | sed -n 's/xyz/XYZ/p'
$ # modified line if there is substitution
$ echo 'hi there. have a nice day' | sed -n 's/\bh/H/pg'
Hi there. Have a nice day
w modifier
Allows to write only the changes to specified file name instead of default stdout
e modifier
Allows to use shell command output in REPLACEMENT section
Trailing newline from command output is suppressed
m modifier
Before seeing example with m modifier, let's see a simple example to get two lines in pattern space
Shell substitutions
Examples presented work with the bash shell; they might differ for other shells
See also Difference between single and double quotes in Bash
For robust substitutions taking care of meta characters in REGEXP and REPLACEMENT sections,
see
How to ensure that string interpolated into sed substitution escapes all metachars
What characters do I need to escape when using sed in a sh script?
Is it possible to escape regex metacharacters reliably with sed
Variable substitution
Entire command in double quotes can be used for simple use cases
$ word='are'
$ sed -n "/$word/p" poem.txt
Roses are red,
Violets are blue,
And so are you.
$ replace='ARE'
$ sed "s/$word/$replace/g" poem.txt
Roses ARE red,
Violets ARE blue,
Sugar is sweet,
And so ARE you.
If command has characters like \ , backtick, ! etc, double quote only the variable
Command substitution
Much more flexible than using e modifier as part of line can be modified as well
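For example, using command substitution to build part of the REPLACEMENT (a sketch; the text line count: is just an illustration)
$ echo 'line count: xyz' | sed "s/xyz/$(wc -l < poem.txt)/"
line count: 4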
The -z option will cause sed to separate input based on ASCII NUL character instead of newlines
$ # also useful to process whole file(not having NUL characters) as a single string
$ # adds ; to previous line if current line starts with c
$ printf 'cat\ndog\ncoat\ncut\nmat\n' | sed -z 's/\nc/;&/g'
cat
dog;
coat;
cut
mat
The -s option will cause sed to treat multiple input files separately instead of treating them as single
concatenated input. If -i is being used, -s is implied
change command
The change command c will delete line(s) represented by address or address range and replace it
with given string
Note the string used cannot have literal newline character, use escape sequence instead
\ is special immediately after c , see sed manual - other commands for details
If escape sequence is needed at beginning of replacement string, use an additional \
Since ; cannot be used to distinguish between string and end of command, use -e for multiple
commands
$ text='good day'
$ seq 3 | sed '2c'"$text"
1
good day
3
insert command
The insert command allows to add string before a line matching given address
Note the string used cannot have literal newline character, use escape sequence instead
\ is special immediately after i , see sed manual - other commands for details
If escape sequence is needed at beginning of replacement string, use an additional \
Since ; cannot be used to distinguish between string and end of command, use -e for multiple
commands
$ text='good day'
$ seq 3 | sed '2i'"$text"
1
good day
2
3
append command
The append command allows to add string after a line matching given address
Note the string used cannot have literal newline character, use escape sequence instead
\ is special immediately after a , see sed manual - other commands for details
If escape sequence is needed at beginning of replacement string, use an additional \
Since ; cannot be used to distinguish between string and end of command, use -e for multiple
commands
$ text='good day'
$ seq 3 | sed '2a'"$text"
1
2
good day
3
See this Q&A for using a command to make sure last line of input has a newline character
$ cat 5.txt
five
1five
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.
$ # replacing a line
$ seq 3 | sed -e '3r /dev/stdin' -e '3d' poem.txt
Roses are red,
Violets are blue,
1
2
3
And so are you.
number of lines from file to be read different from number of matching address lines
n and N commands
These two commands will fetch next line (newline or NUL character separated, depending on
options)
If auto-print is not disabled, print the pattern space, then, regardless, replace the pattern space with
the next line of input. If there is no more input then sed exits without processing any more
commands.
$ # if line contains 'blue', replace 'e' with 'E' only for following line
$ sed '/blue/{n;s/e/E/g}' poem.txt
Roses are red,
Violets are blue,
Sugar is swEEt,
And so are you.
$ # if line contains 'blue', replace 'e' with 'E' only for next to next line
$ sed -n '/blue/{n;n;s/e/E/pg}' poem.txt
And so arE you.
Add a newline to the pattern space, then append the next line of input to the pattern space. If there
is no more input then sed exits without processing any more commands
When -z is used, a zero byte (the ascii ‘NUL’ character) is added between the lines (instead of a
new line)
See this Q&A for an interesting case of applying substitution every 4 lines but excluding the 4th line
$ # if line contains 'blue', replace 'e' with 'E' both in current line and next
$ sed '/blue/{N;s/e/E/g}' poem.txt
Roses are red,
ViolEts arE bluE,
Sugar is swEEt,
And so are you.
Combination
Control structures
Using :label one can mark a command location to branch to conditionally or unconditionally
See sed manual - Commands for sed gurus for more details
if then else
Simple if-then-else can be simulated using b command
b command will unconditionally branch to specified label
Without label, b will skip rest of commands and start next cycle
See processing only lines between REGEXPs for interesting use case
$ # whether or not 'R' is found on lines containing 'are', branch will happen
$ sed '/are/{s/R/*/g;b}; s/e/#/g' poem.txt
*oses are red,
Violets are blue,
Sugar is sw##t,
And so are you.
overlapping substitutions
t command looping with label comes in handy for overlapping substitutions as well
Note that in general this method will work recursively, see substitute recursively for example
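A sketch of this looping technique, grouping digits with commas: each pass inserts one comma and t repeats the substitution until no further change happens
$ echo '1234567890' | sed -E ':a; s/([0-9])([0-9]{3})\b/\1,\2/; ta'
1,234,567,890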
$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
$ # remember that empty REGEXP section will reuse previously matched REGEXP
$ sed -n '/BEGIN/,/END/{//!p}' range.txt
1234
6789
a
b
c
$ # remember that empty REGEXP section will reuse previously matched REGEXP
$ sed '/BEGIN/,/END/{//!d}' range.txt
foo
BEGIN
END
bar
BEGIN
END
baz
To get last block, reverse the input linewise, the order of REGEXPs and finally reverse again
To get a specific block, say 3rd one, awk or perl would be a better choice
See Specific blocks for awk examples
Broken blocks
If there are blocks with ending REGEXP but without corresponding starting REGEXP, sed -n
'/BEGIN/,/END/p' will suffice
Consider the modified input file where final starting REGEXP doesn't have corresponding ending
$ cat broken_range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
baz
All lines till end of file gets printed with simple use of sed -n '/BEGIN/,/END/p'
The file reversing trick comes in handy here as well
But if both kinds of broken blocks are present, further processing will be required. Better to use
awk or perl in such cases
See Broken blocks for awk examples
If there are multiple starting REGEXP but single ending REGEXP, the reversing trick comes handy
again
$ cat uneven_range.txt
foo
BEGIN
1234
BEGIN
42
6789
END
bar
BEGIN
a
BEGIN
b
BEGIN
c
BEGIN
d
BEGIN
e
END
baz
sed scripts
sed commands can be placed in a file and called using the -f option or directly executed using a shebang
See sed manual - Some Sample Scripts for more examples
See sed manual - Often-Used Commands for more details on using comments
$ cat script.sed
# each line is a command
/is/cfoo bar
/you/r 3.txt
/you/d
# single quotes can be used freely
s/are/'are'/g
command line options can be specified along with shebang as well as added at time of invocation
Note usage of options along with shebang depends on lot of factors
$ type sed
sed is /bin/sed
$ cat executable.sed
#!/bin/sed -f
/is/cfoo bar
/you/r 3.txt
/you/d
s/are/'are'/g
$ chmod +x executable.sed
$ ./executable.sed poem.txt
Roses 'are' red,
Violets 'are' blue,
foo bar
3
13
$ ./executable.sed -n poem.txt
foo bar
3
13
$ # bash functions
$ unix2dos() { sed -i 's/$/\r/' "$@" ; }
$ dos2unix() { sed -i 's/\r$//' "$@" ; }
$ cat -A 5.txt
five$
1five$
$ unix2dos 5.txt
$ cat -A 5.txt
five^M$
1five^M$
$ dos2unix 5.txt
$ cat -A 5.txt
five$
1five$
variable/command substitution
$ time LC_ALL=C sed -nE '/^([a-d][r-z]){3}$/p' /usr/share/dict/words
avatar
awards
cravat
real 0m0.038s
$ time LC_ALL=C sed -nE '/^([a-z]..)\1$/p' /usr/share/dict/words > /dev/null
real 0m0.073s
Further Reading
Manual and related
man sed and info sed for more details, known issues/limitations as well as
options/commands not covered in this tutorial
GNU sed manual has even more detailed information and examples
sed FAQ, but last modified '10 March 2003'
BSD/macOS Sed vs GNU Sed vs the POSIX Sed specification
Differences between sed on Mac OSX and other standard sed
Tutorials and Q&A
sed basics
sed detailed tutorial - has details on differences between various sed versions as well
sed one-liners explained
cheat sheet
common search and replace examples
sed Q&A on unix stackexchange
sed Q&A on stackoverflow
Selected examples - portable solutions, commands not covered in this tutorial, same problem solved
using different tools, etc
replace multiline string
deleting empty lines with optional white spaces
GNU awk
Table of Contents
Field processing
Default field separation
Specifying different input field separator
Specifying different output field separator
Filtering
Idiomatic print usage
Field comparison
Regular expressions based filtering
Fixed string matching
Line number based filtering
Case Insensitive filtering
Changing record separators
Paragraph mode
Multicharacter RS
Substitute functions
Inplace file editing
Using shell variables
Multiple file input
Control Structures
if-else and loops
next and nextfile
Multiline processing
Two file processing
Comparing whole lines
Comparing specific fields
getline
Creating new fields
Dealing with duplicates
Lines between two REGEXPs
All unbroken blocks
Specific blocks
Broken blocks
Arrays
awk scripts
Miscellaneous
FPAT and FIELDWIDTHS
String functions
$ man awk
GAWK(1) Utility Commands GAWK(1)
NAME
gawk - pattern scanning and processing language
SYNOPSIS
gawk [ POSIX or GNU style options ] -f program-file [ -- ] file ...
gawk [ POSIX or GNU style options ] [ -- ] program-text file ...
DESCRIPTION
Gawk is the GNU Project's implementation of the AWK programming lan‐
guage. It conforms to the definition of the language in the POSIX
1003.1 Standard. This version in turn is based on the description in
The AWK Programming Language, by Aho, Kernighan, and Weinberger. Gawk
provides the additional features found in the current version of Brian
Kernighan's awk and a number of GNU-specific extensions.
...
familiarity with programming concepts like variables, printing, control structures, arrays, etc
familiarity with regular expressions
if not, check out ERE portion of GNU sed regular expressions which is close enough to features
available in gawk
this tutorial is primarily focussed on short programs that are easily usable from command line,
similar to using grep , sed , etc
see Gawk: Effective AWK Programming manual for complete reference, has information on other
awk versions as well as notes on POSIX standard
Field processing
$ cat fruits.txt
fruit qty
apple 42
banana 31
fig 90
guava 6
See FPAT and FIELDWIDTHS section for other ways of defining input fields
$ # last field
$ echo 'foo:123:bar:789' | awk -F: '{print $NF}'
789
default input field separator is one or more of continuous space, tab or newline characters (will be
termed as whitespace here on)
exact same behavior if FS is assigned single space character
in addition, leading and trailing whitespaces won't be considered when splitting the input record
assigning empty string to FS will split the input record character wise
note the use of command line option -v to set FS
Further Reading
$ # statements inside BEGIN are executed before processing any input text
$ echo 'foo:123:bar:789' | awk 'BEGIN{FS=OFS=":"} {print $1, $NF}'
foo:789
$ # can also be set using command line option -v
$ echo 'foo:123:bar:789' | awk -F: -v OFS=':' '{print $1, $NF}'
foo:789
Filtering
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.
Field comparison
Each block of statements within {} can be prefixed by an optional condition so that those
statements will execute only if condition evaluates to true
Condition specified without corresponding statements will lead to printing contents of $0 if
condition evaluates to true
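For example, filtering fruits.txt based on the second field (a sketch; NR>1 skips the header line)
$ awk 'NR>1 && $2>40' fruits.txt
apple 42
fig 90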
Further Reading
$ cat eqns.txt
a=b,a+b=c,c*d
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b
$ # start of line
$ awk 'index($0,"a+b")==1' eqns.txt
a+b,pi=3.14,5e12
$ # end of line
$ # length function returns number of characters, by default acts on $0
$ awk 'index($0,"a+b")==length()-length("a+b")+1' eqns.txt
i*(t+9-g)/8,4-a+b
$ # to avoid repetitions, save the search string in variable
$ awk -v s="a+b" 'index($0,s)==length()-length(s)+1' eqns.txt
i*(t+9-g)/8,4-a+b
$ time seq 14323 14563435 | awk 'NR==234{print}'
14556
real 0m2.167s
user 0m2.280s
sys 0m0.092s
See also unix.stackexchange - filtering list of lines from every X number of lines
$ # for small enough set, can also use REGEXP character class
$ awk '/[rR]ose/' poem.txt
Roses are red,
Paragraph mode
When RS is set to empty string, one or more consecutive empty lines is used as input record
separator
Can also use regular expression RS=\n\n+ but there are subtle differences, see gawk manual -
multiline records. Important points from that link quoted below
However, there is an important difference between ‘RS = ""’ and ‘RS = "\n\n+"’. In the first case,
leading newlines in the input data file are ignored, and if a file ends without extra blank lines after
the last record, the final newline is removed from the record. In the second case, this special
processing is not done
Now that the input is separated into records, the second step is to separate the fields in the records.
One way to do this is to divide each of the lines into fields in the normal manner. This happens by
default as the result of a special feature. When RS is set to the empty string and FS is set to a
single character, the newline character always acts as a field separator. This is in addition to
whatever field separations result from FS
When FS is the null string ("") or a regexp, this special feature of RS does not apply. It does apply to
the default field separator of a single space: ‘FS = " "’
$ cat sample.txt
Hello World
Good day
How are you
Just do-it
Believe it
Today is sunny
Not a bit funny
No doubt you like it too
Filtering paragraphs
Today is sunny
Not a bit funny
No doubt you like it too
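The block above could be obtained with something like this (a sketch)
$ # print paragraphs containing 'sunny'
$ awk -v RS= '/sunny/' sample.txt
Today is sunny
Not a bit funny
No doubt you like it too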
Re-structuring paragraphs
$ # a better usecase
$ awk 'BEGIN{FS="\n"; OFS=". "; RS=""; ORS="\n\n"} {$1=$1} 1' sample.txt
Hello World
Further Reading
Multicharacter RS
Some marker like Error or Warning etc
$ cat report.log
blah blah
Error: something went wrong
more blah
whatever
Error: something surely went wrong
some text
some more text
blah blah blah
$ s='Sample123string54with908numbers'
$ printf "$s" | awk -v RS='[0-9]+' 'NR==1'
Sample
$ cat msg.txt
Hello there.
It will rain to-
day. Have a safe
and pleasant jou-
rney.
Further Reading
Substitute functions
Use sub string function for replacing first occurrence
Use gsub for replacing all occurrences
By default, $0 which contains input record is modified, can specify any other field or variable as
needed
Use gensub to return the modified string unlike sub or gsub which modifies inplace
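For example (a sketch)
$ # replace first occurrence of : with -
$ echo 'foo:123:bar' | awk '{sub(/:/, "-")} 1'
foo-123:bar
$ # replace all occurrences
$ echo 'foo:123:bar' | awk '{gsub(/:/, "-")} 1'
foo-123-bar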
back-reference examples
use \" within double-quotes to represent " character in replacement string
use \\1 to represent \1 - the first captured group and so on
& or \0 will back-reference entire matched string
$ # replacing last occurrence without knowing how many occurrences are there
$ echo 'foo:123:bar:baz' | awk '{$0=gensub(/(.*):/, "\\1-", 1)} 1'
foo:123:bar-baz
$ echo 'foo and bar and baz land good' | awk '{$0=gensub(/(.*)and/, "\\1XYZ", 1)} 1'
foo and bar and baz lXYZ good
saving quotes in variables - to avoid escaping double quotes or having to use octal code for single
quotes
Further Reading
$ cat greeting.txt
Hi there
Have a nice day
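An illustrative sketch using gawk's inplace extension (gawk 4.1+), changing every e to E directly in the file
$ awk -i inplace '{gsub(/e/, "E")} 1' greeting.txt
$ cat greeting.txt
Hi thErE
HavE a nicE day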
Multiple input files are treated individually and changes are written back to respective files
$ cat f1
I ate 3 apples
$ cat f2
I bought two bananas and 3 mangoes
See gawk manual - Enabling In-Place File Editing for implementation details
$ f='apple'
$ awk -v word="$f" '$1==word' fruits.txt
apple 42
$ f='fig'
$ awk -v word="$f" '$1==word' fruits.txt
fig 90
$ q='20'
$ awk -v threshold="$q" 'NR==1 || $2>threshold' fruits.txt
fruit qty
apple 42
banana 31
fig 90
passing REGEXP
See also gawk manual - Using Dynamic Regexps
$ s='are'
$ # for: awk '!/are/' poem.txt
$ awk -v s="$s" '$0 !~ s' poem.txt
Sugar is sweet,
$ # for: awk '/are/ && !/so/' poem.txt
$ awk -v s="$s" '$0 ~ s && !/so/' poem.txt
Roses are red,
Violets are blue,
$ r='[^-]+'
$ echo '1-2-3-4-5' | awk -v r="$r" '{gsub(r, "abc")} 1'
abc-abc-abc-abc-abc
$ # or use ENVIRON
$ r='(.*)\<and\>'
$ echo "$s" | r="$r" awk '{$0=gensub(ENVIRON["r"], "\\1XYZ", 1)} 1'
foo and bar XYZ baz land good
Constructs to do some processing before starting each file as well as at the end
BEGINFILE - to add code to be executed before start of each input file
Further Reading
Control Structures
Syntax is similar to C language and single statements inside control structures don't require to be
grouped within {}
See gawk manual - Control Statements for details
Remember that by default there is a loop that goes over all input records and constructs like BEGIN
and END fall outside that loop
$ cat nums.txt
42
-2
10101
-3.14
-75
$ awk '{sum += $1} END{print sum}' nums.txt
10062.9
$ # if-else example
$ awk 'NR>1{if($2>40) $0="+"$0; else $0="-"$0} 1' fruits.txt
fruit qty
+apple 42
-banana 31
+fig 90
-guava 6
ternary operator
See also stackoverflow - finding min and max value of a column
$ cat nums.txt
42
-2
10101
-3.14
-75
for loop
similar to C language, break and continue statements are also available
See also stackoverflow - find missing numbers from sequential list
while loop
do-while is also available
$ # recursive substitution
$ # here again return value of sub/gsub is useful
$ echo 'titillate' | awk '{while( gsub(/til/, "") ) print}'
tilate
ate
nextfile is useful to skip remaining lines from current file being processed and move on to next
file
$ # specific field
$ awk 'FNR>2{nextfile} {print $1}' poem.txt greeting.txt fruits.txt
Roses
Violets
Hi
HavE
fruit
apple
Multiline processing
Processing consecutive lines
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.
$ # common mistake
$ sed -n '/are/{N;/is/p}' poem.txt
$ # would need something like this and not practical to extend for other cases
$ sed '$!N; /are.*\n.*is/p; D' poem.txt
Violets are blue,
Sugar is sweet,
$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
Checking if multiple strings are present at least once in entire input file
If there are lots of strings to check, use arrays
$ cat colors_1.txt
Blue
Brown
Purple
Red
Teal
Yellow
$ cat colors_2.txt
Black
Blue
Green
Red
White
just referencing a key will create it if it doesn't already exist, with value as empty string (will also
act as zero in numeric context)
$0 in a will be true if key already exists in array a
$ # common lines
$ # same as: grep -Fxf colors_1.txt colors_2.txt
$ awk 'NR==FNR{a[$0]; next} $0 in a' colors_1.txt colors_2.txt
Blue
Red
$ cat marks.txt
Dept Name Marks
ECE Raj 53
ECE Joel 72
EEE Moi 68
CSE Surya 81
EEE Tia 59
ECE Om 92
CSE Amy 67
single field
For ex: only first field comparison by using $1 instead of $0 as key
$ cat list1
ECE
CSE
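A sketch that prints lines of marks.txt whose first field is one of the departments listed in list1
$ awk 'NR==FNR{a[$1]; next} $1 in a' list1 marks.txt
ECE Raj 53
ECE Joel 72
CSE Surya 81
ECE Om 92
CSE Amy 67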
multiple fields
create a string by adding some character between the fields to act as key
for ex: to avoid matching two field values abc and 123 to match with two other field values
ab and c123
by adding character, say _ , the key would be abc_123 for first case and ab_c123 for
second case
this can still lead to false match if input data has _
there is also a built-in way to do this using gawk manual - Multidimensional Arrays
$ cat list2
EEE Moi
CSE Amy
ECE Raj
$ cat list3
ECE 70
EEE 65
CSE 80
getline
If entire line (instead of fields) from one file is needed to change the other file, using getline
would be faster
But use it with caution
gawk manual - getline for details, especially about corner cases, errors, etc
gawk manual - Closing Input and Output Redirections if you have to start from beginning of file
again
$ # without getline, but slower due to NR==FNR check for every line processed
$ awk -v m=3 -v n=2 'NR==FNR{if(FNR==n){s=$0; nextfile} next}
FNR==m{$0=s} 1' nums.txt poem.txt
Roses are red,
Violets are blue,
-2
And so are you.
Another use case is if two files are to be processed exactly for same line numbers
$ # print line from fruits.txt if corresponding line from nums.txt is +ve number
$ awk -v file='nums.txt' '{getline num < file; if(num>0) print}' fruits.txt
fruit qty
banana 31
Further Reading
stackoverflow - Fastest way to find lines of a text file from another larger text file
unix.stackexchange - filter lines based on line numbers specified in another file
stackoverflow - three file processing to extract a matrix subset
unix.stackexchange - column wise merging
stackoverflow - extract specific rows from a text file using an index file
$ # reducing fields
$ echo 'foo,bar,123,baz' | awk -F, -v OFS=, '{NF=2} 1'
foo,bar
$ cat list4
Raj class_rep
Amy sports_rep
Tia placement_rep
$ printf 'mad\n42\n42\ndam\n42\n'
mad
42
42
dam
42
$ cat duplicates.txt
abc 7 4
food toy ****
abc 7 4
test toy 123
good toy ****
$ # whole line
$ awk '!seen[$0]++' duplicates.txt
abc 7 4
food toy ****
test toy 123
good toy ****
$ # particular column
$ awk '!seen[$2]++' duplicates.txt
abc 7 4
food toy ****
$ # total count
$ awk '!seen[$2]++{c++} END{print +c}' duplicates.txt
2
For multiple fields, separate them using , or form a string with some character in between
choose a character unlikely to appear in input data, else there can be false matches
$ # reverse the input line-wise, retain first copy and then reverse again
$ tac duplicates.txt | awk '!seen[$2]++' | tac
abc 7 4
good toy ****
$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
Specific blocks
Getting first block
$ # reverse input linewise, change the order of REGEXPs, finally reverse again
$ tac range.txt | awk '/END/{f=1} f; /BEGIN/{exit}' | tac
BEGIN
a
b
c
END
$ # or, save the blocks in a buffer and print the last one alone
$ # ORS contains output record separator, which is newline by default
$ seq 30 | awk '/4/{f=1; b=$0; next} f{b=b ORS $0} /6/{f=0} END{print b}'
24
25
26
$ # all blocks
$ seq 30 | sed -n '/4/,/6/p'
4
5
6
14
15
16
24
25
26
Broken blocks
If there are blocks with ending REGEXP but without corresponding start, awk '/BEGIN/{f=1} f;
/END/{f=0}' will suffice
183
GNU awk
Consider the modified input file where starting REGEXP doesn't have corresponding ending
$ cat broken_range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
baz
But if both kinds of broken blocks are present, accumulate the records and print accordingly
184
GNU awk
$ cat multiple_broken.txt
qqqqqqq
BEGIN
foo
BEGIN
1234
6789
END
bar
END
0-42-1
BEGIN
a
BEGIN
b
END
;as;s;sd;
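One way (essentially what the buf.awk script shown later in this chapter does) is to buffer lines and print only when the ending REGEXP is seen:
$ awk '/BEGIN/{f=1; buf=$0; next} f{buf=buf ORS $0}
       /END/{f=0; if(buf) print buf; buf=""}' multiple_broken.txt
BEGIN
1234
6789
END
BEGIN
b
END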
Further Reading
Arrays
We've already seen examples using arrays, some more examples discussed in this section
array looping
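A small sketch of looping over array elements (iteration order is unspecified by default):
$ printf 'a 5\nb 2\na 3\n' | awk '{sum[$1]+=$2} END{for(k in sum) print k, sum[k]}'
a 8
b 2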
185
GNU awk
Sorting
See gawk manual - Predefined Array Scanning Orders for more details
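For instance, a sketch asking gawk to scan an array in ascending numeric order of values:
$ printf 'a 5\nb 2\nc 3\n' | awk '{v[$1]=$2}
      END{PROCINFO["sorted_in"]="@val_num_asc"; for(k in v) print k, v[k]}'
b 2
c 3
a 5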
186
GNU awk
$ cat list5
CSE Surya 75
EEE Jai 69
ECE Kal 83
Further Reading
awk scripts
For larger programs, save the code in a file and use -f command line option
; is not needed to terminate a statement
187
GNU awk
See also gawk manual - Command-Line Options for other related options
$ cat buf.awk
/BEGIN/{
f=1
buf=$0
next
}
f{
buf=buf ORS $0
}
/END/{
f=0
if(buf)
print buf
buf=""
}
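Invoking the saved script, for example on broken_range.txt:
$ awk -f buf.awk broken_range.txt
BEGIN
1234
6789
END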
$ cat quotes.awk
{
$0 = gensub(/[^:]+/, "'&'", "g")
}
If the code has been first tried out on command line, add -o option to get a pretty printed version
188
GNU awk
A file name can be passed along with the -o option, otherwise awkprof.out is used by default
$ cat awkprof.out
# gawk profile, created Tue Oct 24 15:10:02 2017
# Rule(s)
NR == FNR {
r[$1] = $2
next
}
{
NF++
if (FNR == 1) {
$NF = "Role"
} else {
$NF = r[$2]
}
}
1 {
print $0
}
Miscellaneous
189
GNU awk
$ s='Sample123string54with908numbers'
$ # define fields to be one or more consecutive digits
$ echo "$s" | awk -v FPAT='[0-9]+' '{print $1, $2, $3}'
123 54 908
$ # define fields to be one or more consecutive alphabets
$ echo "$s" | awk -v FPAT='[a-zA-Z]+' '{print $1, $2, $3, $4}'
Sample string with numbers
For simple csv input where quoted fields may themselves contain , using FPAT is a reasonable approach
Use a proper parser if input can have other cases like newlines in fields
See unix.stackexchange - using csv parser for a sample program in perl
$ s='foo,"bar,123",baz,abc'
$ echo "$s" | awk -F, '{print $2}'
"bar
$ echo "$s" | awk -v FPAT='"[^"]*"|[^,]*' '{print $2}'
"bar,123"
if input has well defined fields based on number of characters, FIELDWIDTHS can be used to
specify width of each field
$ # without FIELDWIDTHS
$ awk '/fig/{$2=35} 1' fruits.txt
fruit qty
apple 42
banana 31
fig 35
guava 6
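As a minimal illustration of FIELDWIDTHS itself (the widths here are made up):
$ echo 'abcde12345fghij' | awk -v FIELDWIDTHS='5 5 5' '{print $2}'
12345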
Further Reading
190
GNU awk
String functions
length function - returns length of string, by default acts on $0
$ # character count and not byte count is calculated, similar to 'wc -m'
$ printf 'hi👍' | awk '{print length()}'
3
191
GNU awk
The substr function allows extracting a specified number of characters from a given string
indexing starts with 1
See gawk manual - substr function for corner cases and details
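A small sketch:
$ # 4 characters starting from the 3rd
$ echo 'abcdefghij' | awk '{print substr($0, 3, 4)}'
cdef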
192
GNU awk
$ # if only few characters are needed from input line, can use empty FS
$ echo 'abcdefghij' | awk -v FS= '{print $3}'
c
$ echo 'abcdefghij' | awk -v FS= '{print $3, $5}'
c e
193
GNU awk
$ wc poem.txt
4 13 65 poem.txt
$ awk 'BEGIN{system("wc poem.txt")}'
4 13 65 poem.txt
$ ls xyz.txt
ls: cannot access 'xyz.txt': No such file or directory
$ echo $?
2
$ awk 'BEGIN{s=system("ls xyz.txt"); print "Status: " s}'
ls: cannot access 'xyz.txt': No such file or directory
Status: 2
$ cat f2
I bought two bananas and three mangoes
$ echo 'f1,f2,odd.txt' | awk -F, '{system("cat " $2)}'
I bought two bananas and three mangoes
printf formatting
Similar to printf function in C and shell built-in command
use sprintf function to save result in variable instead of printing
See also gawk manual - printf
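A couple of small sketches:
$ awk 'BEGIN{printf "%d %.3f\n", 42, 3.14159}'
42 3.142
$ # sprintf saves the result instead of printing it
$ awk 'BEGIN{s = sprintf("%05d", 42); print s}'
00042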
194
GNU awk
strings
195
GNU awk
$ # truncate
$ awk 'BEGIN{printf "%.2s\n", "foobar"}'
fo
196
GNU awk
$ # same as: echo 'foo good 123' | awk '{printf $2 $3 | "wc -c"}'
$ echo 'foo good 123' | awk '{printf $2 | "wc -c"; printf $3 | "wc -c"}'
7
Further Reading
197
GNU awk
$ # wrong
$ awk -v word="apple" '$1==$word' fruits.txt
$ # right
$ awk -v word="apple" '$1==word' fruits.txt
apple 42
198
GNU awk
199
GNU awk
$ cat misc.txt
foo
good bad ugly
123 xyz
a b c d
good
a b
real 0m0.075s
real 0m0.045s
Further Reading
Manual and related
man awk and info awk for quick reference from command line
200
GNU awk
201
Perl the swiss knife
202
Perl the swiss knife
$ man perl
PERL(1) Perl Programmers Reference Guide PERL(1)
NAME
perl - The Perl 5 language interpreter
SYNOPSIS
perl [ -sTtuUWX ] [ -hv ] [ -V[:configvar] ]
[ -cw ] [ -d[t][:debugger] ] [ -D[number/list] ]
[ -pna ] [ -Fpattern ] [ -l[octal] ] [ -0[octal/hexadecimal] ]
[ -Idir ] [ -m[-]module ] [ -M[-]'module...' ] [ -f ]
[ -C [number/list] ] [ -S ] [ -x[dir] ]
[ -i[extension] ]
[ [-e|-E] 'command' ] [ -- ] [ programfile ] [ argument ]...
For more information on these options, you can run "perldoc perlrun".
...
familiarity with programming concepts like variables, printing, control structures, arrays, etc
Perl borrows syntax/features from C, shell scripting, awk, sed etc. Prior experience working with
them would help a lot
familiarity with regular expression basics
if not, check out ERE portion of GNU sed regular expressions
examples for non-greedy, lookarounds, etc will be covered here
this tutorial is primarily focussed on short programs that are easily usable from command line,
similar to using grep , sed , awk etc
do NOT use style/syntax presented here when writing full fledged Perl programs which should
use strict, warnings etc
203
Perl the swiss knife
see perldoc - perlintro and learnxinyminutes - perl for quick intro to using Perl for full fledged
programs
links to Perl documentation will be added as necessary
unless otherwise specified, consider input as ASCII encoded text only
see also stackoverflow - why UTF-8 is not default
$ cat code.pl
print "Hello Perl\n"
$ perl code.pl
Hello Perl
$ # similar to bash
$ cat code.sh
echo 'Hello Bash'
$ bash code.sh
Hello Bash
For short programs, one can use -e commandline option to provide code from command line itself
Use -E option to use newer features like say . See perldoc - new features
This entire chapter is about using perl this way from commandline
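For instance:
$ perl -e 'print "Hello Perl\n"'
Hello Perl
$ perl -E 'say "Hello Perl"'
Hello Perl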
$ # similar to
$ bash -c 'echo "Hello Bash"'
Hello Bash
204
Perl the swiss knife
Perl is (in)famous for being able to do things in more than one way
examples in this chapter will mostly try to use the syntax that avoids (){}
Further Reading
205
Perl the swiss knife
$ cat greeting.txt
Hi there
Have a nice day
$ # same as: sed 's/nice day/safe journey/' greeting.txt
$ perl -pe 's/nice day/safe journey/' greeting.txt
Hi there
Have a safe journey
inplace editing
similar to GNU sed - using * with inplace option, one can also use * to either prefix the backup
name or place the backup files in another existing directory
See also effectiveperlprogramming - caveats of using -i option
206
Perl the swiss knife
Multiple input files are treated individually and changes are written back to respective files
$ cat f1
I ate 3 apples
$ cat f2
I bought two bananas and 3 mangoes
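A sketch of in-place editing with a backup (the .bkp suffix here is arbitrary):
$ perl -i.bkp -pe 's/3/three/' f1 f2
$ cat f1
I ate three apples
$ cat f1.bkp
I ate 3 apples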
Line filtering
207
Perl the swiss knife
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.
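For instance, filtering lines like grep:
$ # same as: grep 'are' poem.txt
$ perl -ne 'print if /are/' poem.txt
Roses are red,
Violets are blue,
And so are you.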
With the match operator m, you can use any pair of non-alphanumeric, non-whitespace characters as delimiters
$ cat paths.txt
/foo/a/report.log
/foo/y/power.log
/foo/abc/errors.log
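A sketch using a different delimiter to avoid escaping /:
$ perl -ne 'print if m{/abc/}' paths.txt
/foo/abc/errors.log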
208
Perl the swiss knife
209
Perl the swiss knife
$ cat eqns.txt
a=b,a-b=c,c*d
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b
$ # start of line
$ # same as: s='a+b' awk 'index($0, ENVIRON["s"])==1' eqns.txt
$ s='a+b' perl -ne 'print if index($_, $ENV{s})==0' eqns.txt
a+b,pi=3.14,5e12
$ # end of line
$ # length function returns number of characters, by default acts on $_
$ s='a+b' perl -ne '$pos = length() - length($ENV{s}) - 1;
print if index($_, $ENV{s}) == $pos' eqns.txt
i*(t+9-g)/8,4-a+b
210
Perl the swiss knife
Field processing
-a option will auto-split each input record based on one or more continuous white-space, similar
to default behavior in awk
211
Perl the swiss knife
$ cat fruits.txt
fruit qty
apple 42
banana 31
fig 90
guava 6
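For instance:
$ # same as: awk '{print $1}' fruits.txt
$ perl -lane 'print $F[0]' fruits.txt
fruit
apple
banana
fig
guava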
by default, leading and trailing whitespace won't be considered when splitting the input record, mimicking awk's default behavior
212
Perl the swiss knife
Field comparison
for numeric context, Perl automatically tries to convert the string to number, ignoring white-space
for string comparison, use eq for == , ne for != and so on
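A couple of sketches on fruits.txt:
$ # numeric comparison on the 2nd field
$ perl -lane 'print if $F[1] > 35' fruits.txt
apple 42
fig 90
$ # string comparison on the 1st field
$ perl -lane 'print if $F[0] eq "banana"' fruits.txt
banana 31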
213
Perl the swiss knife
214
Perl the swiss knife
215
Perl the swiss knife
216
Perl the swiss knife
$ echo 'foo:123:bar:789' | perl -F: -lane 'print join "-", $F[1], $F[-1]'
123-789
$ echo 'foo:123:bar:789' | perl -F: -lane 'print join " - ", @F'
foo - 123 - bar - 789
Method 3: use $" to change separator when array is interpolated, default is space character
could be remembered easily by noting that interpolation happens within double quotes
$ # default is space
$ echo 'foo:123:bar:789' | perl -F: -lane 'print "@F[1,-1]"'
123 789
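Changing $" changes the separator used when the array is interpolated:
$ echo 'foo:123:bar:789' | perl -F: -lane '$"="-"; print "@F[1,-1]"'
123-789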
217
Perl the swiss knife
$ # -l option will chomp off the record separator (among other things)
$ echo 'foo' | perl -l -pe 's/\n/ 123\n/'
foo
$ # -l also sets output record separator which gets added to print statements
$ # ORS gets input record separator value if no argument is passed to -l
$ # hence the newline automatically getting added for print in this example
$ perl -lane 'print $F[0] if $F[1]<35 && $.>1' fruits.txt
banana
guava
218
Perl the swiss knife
-0 option used without argument will use the ASCII NUL character as input record separator
for paragraph mode (two or more consecutive newline characters), use -00 or assign an empty string to $/
219
Perl the swiss knife
$ cat sample.txt
Hello World
Good day
How are you
Just do-it
Believe it
Today is sunny
Not a bit funny
No doubt you like it too
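A sketch of paragraph mode on this file (the blank line in the output comes from the record separator being retained):
$ perl -00 -ne 'print if /do/' sample.txt
Just do-it
Believe it

Today is sunny
Not a bit funny
No doubt you like it too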
again, input record will have the separator too and using -l will chomp it
however, if more than two consecutive newline characters separate the paragraphs, only two
newlines will be preserved and the rest discarded
use $/="\n\n" to avoid this behavior
220
Perl the swiss knife
Today is sunny
Not a bit funny
No doubt you like it too
Re-structuring paragraphs
$ # same as: awk 'BEGIN{FS="\n"; OFS=". "; RS=""; ORS="\n\n"} {$1=$1} 1'
$ perl -F'\n' -00 -ane 'print join ". ", @F; print "\n\n"' sample.txt
Hello World. Good day. How are you

Just do-it. Believe it

Today is sunny. Not a bit funny. No doubt you like it too
multi-character separator
221
Perl the swiss knife
$ cat report.log
blah blah
Error: something went wrong
more blah
whatever
Error: something surely went wrong
some text
some more text
blah blah blah
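A small sketch treating Error: as the input record separator for this log:
$ perl -ne 'BEGIN{$/="Error:"} END{print "$. records\n"}' report.log
3 records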
$ cat msg.txt
Hello there.
It will rain to-
day. Have a safe
and pleasant jou-
rney.
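A sketch joining the words split across lines by a trailing -:
$ perl -pe 's/-\n//' msg.txt
Hello there.
It will rain today. Have a safe
and pleasant journey.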
222
Perl the swiss knife
223
Perl the swiss knife
Multiline processing
Processing consecutive lines
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.
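For instance, printing the matching line along with the line before it:
$ perl -ne 'print $prev, $_ if /blue/; $prev=$_' poem.txt
Roses are red,
Violets are blue,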
224
Perl the swiss knife
$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
225
Perl the swiss knife
$ # print only line after matching line, same as: awk 'n && n--; /BEGIN/{n=1}'
$ perl -ne 'print if $n && $n--; $n=1 if /BEGIN/' range.txt
1234
a
$ # generic case: print nth line after match, awk 'n && !--n; /BEGIN/{n=3}'
$ perl -ne 'print if $n && !--$n; $n=3 if /BEGIN/' range.txt
END
c
$ # use reversing trick for generic case of nth line before match
$ # same as: tac range.txt | awk 'n && !--n; /END/{n=3}' | tac
$ tac range.txt | perl -ne 'print if $n && !--$n; $n=3 if /END/' | tac
BEGIN
a
Further Reading
226
Perl the swiss knife
227
Perl the swiss knife
228
Perl the swiss knife
back reference
See also perldoc - Warning on \1 Instead of $1
$ # use \1, \2, etc or \g1, \g2 etc for back referencing in search section
$ # use $1, $2, etc in replacement section
$ echo 'a a a walking for for a cause' | perl -pe 's/\b(\w+)( \1)+\b/$1/g'
a walking for a cause
Backslash sequences
\d for [0-9]
\s for [ \t\r\n\f\v]
\h for [ \t]
\n for newline character
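A couple of sketches:
$ echo 'Sample123string54' | perl -pe 's/\d+/-/g'
Sample-string-
$ echo 'foo bar baz' | perl -pe 's/\h/:/g'
foo:bar:baz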
229
Perl the swiss knife
Non-greedy quantifier
adding a ? to ? or * or + or {} quantifiers will change matching from greedy to non-
greedy. In other words, to match as minimally as possible
also known as lazy quantifier
See also regular-expressions.info - Possessive Quantifiers
230
Perl the swiss knife
$ # greedy matching
$ echo 'foo and bar and baz land good' | perl -pe 's/foo.*and//'
good
$ # non-greedy matching
$ echo 'foo and bar and baz land good' | perl -pe 's/foo.*?and//'
bar and baz land good
Lookarounds
Ability to add if conditions to match before/after required pattern
There are four types
positive lookahead (?=
negative lookahead (?!
positive lookbehind (?<=
negative lookbehind (?<!
One way to remember is that behind uses < and negative uses ! instead of =
Like word boundaries and anchors, the text matched by lookarounds does not form part of the matched string. They are termed zero-width patterns
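A couple of sketches:
$ # digits only if preceded by 'foo'
$ echo 'foo123 bar456' | perl -lne 'print $& if /(?<=foo)\d+/'
123
$ # words not followed by a digit
$ echo 'hi5 hello world2 bye' | perl -lne 'print join ",", /\b[a-z]+\b(?!\d)/g'
hello,bye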
231
Perl the swiss knife
232
Perl the swiss knife
233
Perl the swiss knife
Further Reading
234
Perl the swiss knife
$ cat nums.txt
42
-2
10101
-3.14
-75
Further Reading
235
Perl the swiss knife
use (?: to group regular expressions without capturing, so the group won't be counted for backreferencing
See also
stackoverflow - what is non-capturing group
stackoverflow - extract specific fields and key-value pairs
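For instance:
$ # (?:...) groups without creating a backreference
$ echo 'car bat cod map' | perl -lne 'print join ",", /(?:car|map)/g'
car,map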
236
Perl the swiss knife
Further Reading
Modifiers
some are already seen, like the g (global match) and i (case insensitive matching)
first up, the r modifier which returns the substitution result instead of modifying the variable it is
acting upon
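A small sketch of the r modifier:
$ echo 'foo bar' | perl -lne 'print uc($_) . " <=> " . s/bar/baz/r'
FOO BAR <=> foo baz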
237
Perl the swiss knife
$ # formatting string
$ echo 'a1-2-deed' | perl -lpe 's/[^-]+/sprintf "%04s", $&/ge'
00a1-0002-deed
$ # calling a function
$ echo 'food:12:explain:789' | perl -pe 's/\w+/length($&)/ge'
4:2:7:3
multiline modifiers
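A sketch of the m modifier, which makes ^ and $ match at line boundaries within the string:
$ printf 'foo\nbar\nbaz\n' | perl -0777 -ne 'print join ",", /^\w+$/mg; print "\n"'
foo,bar,baz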
Further Reading
238
Perl the swiss knife
Quoting metacharacters
part of a regular expression can be surrounded by \Q and \E so that metacharacters within that portion are treated literally
however, $ and @ would still be interpolated as long as the delimiter isn't single quotes
\E is optional if \Q applies till the end of the search expression
a typical use case is when the string to be protected is already present in a variable, for ex: user input or the result of another command
quotemeta will add a backslash to all characters other than \w characters
See also perldoc - Quoting metacharacters
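A sketch of \Q in action, reusing the eqns.txt example from earlier:
$ s='a+b' perl -lne 'print if /^\Q$ENV{s}\E/' eqns.txt
a+b,pi=3.14,5e12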
$ # quotemeta in action
$ perl -le '$x="[a].b+c^"; print quotemeta $x'
\[a\]\.b\+c\^
239
Perl the swiss knife
$ # q in action
$ perl -le '$x="[a].b+c^$@123"; print $x'
[a].b+c^123
$ perl -le '$x=q([a].b+c^$@123); print $x'
[a].b+c^$@123
$ perl -le '$x=q([a].b+c^$@123); print quotemeta $x'
\[a\]\.b\+c\^\$\@123
Matching position
From perldoc - perlvar
$+[0] is the offset into the string of the end of the entire match
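For instance:
$ # start and end offsets of the match
$ echo 'foobar123' | perl -lne '/\d+/; print "$-[0] $+[0]"'
6 9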
240
Perl the swiss knife
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.
for multiple matches, use while loop to go over all the matches
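A sketch using a while loop with the g modifier:
$ echo 'car bat cod map' | perl -lne 'print "$& at $-[0]" while /\w+/g'
car at 0
bat at 4
cod at 8
map at 12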
Using modules
There are many standard modules available that come with Perl installation
and many more available from Comprehensive Perl Archive Network (CPAN)
stackoverflow - easiest way to install a missing module
241
Perl the swiss knife
$ echo '34,17,6' | perl -F, -lane 'BEGIN{use List::Util qw(max)} print max @F'
34
$ # -M option provides a way to specify modules from command line
$ echo '34,17,6' | perl -MList::Util=max -F, -lane 'print max @F'
34
$ echo '34,17,6' | perl -MList::Util=sum0 -F, -lane 'print sum0 @F'
57
$ echo '34,17,6' | perl -MList::Util=product -F, -lane 'print product @F'
3468
$ s='1,2,3,4,5'
$ echo "$s" | perl -MList::Util=shuffle -F, -lane 'print join ",",shuffle @F'
5,3,4,1,2
$ s='3,b,a,c,d,1,d,c,2,3,1,b'
$ echo "$s" | perl -MList::MoreUtils=uniq -F, -lane 'print join ",",uniq @F'
3,b,a,c,d,1,2
242
Perl the swiss knife
Further Reading
perldoc - perlmodlib
perldoc - Core modules
unix.stackexchange - example for Algorithm::Combinatorics
unix.stackexchange - example for Text::ParseWords
stackoverflow - regular expression modules
metacpan - String::Approx - Perl extension for approximate matching (fuzzy matching)
metacpan - Tie::IxHash - ordered associative arrays for Perl
243
Perl the swiss knife
$ cat colors_1.txt
Blue
Brown
Purple
Red
Teal
Yellow
$ cat colors_2.txt
Black
Blue
Green
Red
White
For two files as input, $#ARGV will be 0 only when first file is being processed
Using next will skip rest of code
entire line is used as key
$ # common lines
$ # same as: grep -Fxf colors_1.txt colors_2.txt
$ # same as: awk 'NR==FNR{a[$0]; next} $0 in a' colors_1.txt colors_2.txt
$ perl -ne 'if(!$#ARGV){$h{$_}=1; next}
print if $h{$_}' colors_1.txt colors_2.txt
Blue
Red
alternative constructs
<FILEHANDLE> reads line(s) from the specified file
defaults to the current file argument (includes stdin as well), so <> can be used as a shortcut
<STDIN> will read only from stdin, there are also predefined handles for stdout/stderr
in list context, all the lines would be read
See perldoc - I/O Operators for details
244
Perl the swiss knife
$ cat marks.txt
Dept Name Marks
ECE Raj 53
ECE Joel 72
EEE Moi 68
CSE Surya 81
EEE Tia 59
ECE Om 92
CSE Amy 67
single field
For ex: only first field comparison instead of entire line as key
245
Perl the swiss knife
$ cat list1
ECE
CSE
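A sketch filtering marks.txt on the first field using list1:
$ # same as: awk 'NR==FNR{a[$1]; next} $1 in a' list1 marks.txt
$ perl -ane 'if(!$#ARGV){$h{$F[0]}=1; next}
             print if $h{$F[0]}' list1 marks.txt
ECE Raj 53
ECE Joel 72
CSE Surya 81
ECE Om 92
CSE Amy 67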
246
Perl the swiss knife
$ cat list2
EEE Moi
CSE Amy
ECE Raj
$ cat list3
ECE 70
EEE 65
CSE 80
See also stackoverflow - Fastest way to find lines of a text file from another larger text file
247
Perl the swiss knife
$ # print line from fruits.txt if corresponding line from nums.txt is +ve number
$ # same as: awk -v file='nums.txt' '{getline num < file; if(num>0) print}'
$ file='nums.txt' perl -ne 'BEGIN{open($f,$ENV{file})}
$num=<$f>; print if $num>0' fruits.txt
fruit qty
banana 31
$ # or pass contents of nums.txt as standard input
$ <nums.txt perl -ne '$num=<STDIN>; print if $num>0' fruits.txt
fruit qty
banana 31
$ s='foo,bar,123,baz'
$ # reducing fields
$ # same as: awk -F, -v OFS=, '{NF=2} 1'
$ echo "$s" | perl -F, -lane '$,=","; $#F=1; print @F'
foo,bar
$ # assigning to field greater than $#F will create empty fields as needed
$ # same as: awk -F, -v OFS=, '{$7=42} 1'
$ echo "$s" | perl -F, -lane '$,=","; $F[6]=42; print @F'
foo,bar,123,baz,,,42
248
Perl the swiss knife
$ cat list4
Raj class_rep
Amy sports_rep
Tia placement_rep
249
Perl the swiss knife
$ cat duplicates.txt
abc 7 4
food toy ****
abc 7 4
test toy 123
good toy ****
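For instance:
$ # same as: awk '!seen[$0]++' duplicates.txt
$ perl -ne 'print if !$seen{$_}++' duplicates.txt
abc 7 4
food toy ****
test toy 123
good toy ****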
250
Perl the swiss knife
multiple fields
See also unix.stackexchange - based on same fields that could be in different order
251
Perl the swiss knife
252
Perl the swiss knife
$ cat range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
END
baz
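A sketch using the range (flip-flop) operator to get all blocks:
$ # same as: awk '/BEGIN/,/END/' range.txt
$ perl -ne 'print if /BEGIN/../END/' range.txt
BEGIN
1234
6789
END
BEGIN
a
b
c
END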
other variations
253
Perl the swiss knife
Specific blocks
Getting first block
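A minimal sketch, exiting after the first ending REGEXP:
$ perl -ne '$f=1 if /BEGIN/; print if $f; exit if /END/' range.txt
BEGIN
1234
6789
END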
254
Perl the swiss knife
$ # reverse input linewise, change the order of REGEXPs, finally reverse again
$ # same as: tac range.txt | awk '/END/{f=1} f; /BEGIN/{exit}' | tac
$ tac range.txt | perl -ne '$f=1 if /END/; print if $f; exit if /BEGIN/' | tac
BEGIN
a
b
c
END
$ # or, save the blocks in a buffer and print the last one alone
$ # same as: awk '/4/{f=1; b=$0; next} f{b=b ORS $0} /6/{f=0} END{print b}'
$ seq 30 | perl -ne 'if(/4/){$f=1; $b=$_; next}
$b.=$_ if $f; $f=0 if /6/; END{print $b}'
24
25
26
255
Perl the swiss knife
Broken blocks
If there are blocks with ending REGEXP but without corresponding start, earlier techniques used will
suffice
Consider the modified input file where starting REGEXP doesn't have corresponding ending
256
Perl the swiss knife
$ cat broken_range.txt
foo
BEGIN
1234
6789
END
bar
BEGIN
a
b
c
baz
$ cat multiple_broken.txt
qqqqqqq
BEGIN
foo
BEGIN
1234
6789
END
bar
END
0-42-1
BEGIN
a
BEGIN
b
END
;as;s;sd;
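One way is to buffer lines and print only when the ending REGEXP is seen; the same sketch also handles broken_range.txt:
$ perl -ne 'if(/BEGIN/){$f=1; $b=$_; next}
            $b.=$_ if $f; if(/END/){print $b; $f=0; $b=""}' multiple_broken.txt
BEGIN
1234
6789
END
BEGIN
b
END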
257
Perl the swiss knife
Array operations
initialization
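A couple of sketches:
$ perl -le '@a = (4, 2, 9); print "@a"'
4 2 9
$ perl -le '@a = qw(foo bar baz); print scalar @a'
3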
258
Perl the swiss knife
array slices
See also perldoc - Range Operators
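For instance:
$ perl -le '@a = qw(a b c d e); print "@a[1..3]"'
b c d
$ perl -le '@a = qw(a b c d e); print "@a[0,-1]"'
a e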
259
Perl the swiss knife
260
Perl the swiss knife
$ # compare values
$ s='23 756 -983 5'
$ echo "$s" | perl -lane 'print join " ", grep $_<100, @F'
23 -983 5
more examples
$ cat split.txt
foo,1:2:5,baz
wry,4,look
free,3:8,oh
$ # print line if more than one column has a digit
$ perl -F: -lane 'print if (grep /\d/, @F) > 1' split.txt
foo,1:2:5,baz
free,3:8,oh
261
Perl the swiss knife
Sorting
See perldoc - sort for details
$a and $b are special variables used for sorting, avoid using them as user defined variables
262
Perl the swiss knife
$ cat words.txt
bot
art
are
boat
toe
flee
reed
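A couple of sketches on words.txt:
$ # default string sort
$ perl -lne 'push @w, $_; END{print join ",", sort @w}' words.txt
are,art,boat,bot,flee,reed,toe
$ # sort by length, then alphabetically
$ perl -lne 'push @w, $_;
     END{print join ",", sort {length($a) <=> length($b) || $a cmp $b} @w}' words.txt
are,art,bot,toe,boat,flee,reed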
263
Perl the swiss knife
$ # need to get indexes of order required for header, then use it for all lines
$ perl -lane '@i = sort {$F[$a] cmp $F[$b]} 0..$#F if $.==1;
print join "\t", @F[@i]' marks.txt
Dept Marks Name
ECE 53 Raj
ECE 72 Joel
EEE 68 Moi
CSE 81 Surya
EEE 59 Tia
ECE 92 Om
CSE 67 Amy
Further Reading
Transforming
shuffling list elements
264
Perl the swiss knife
265
Perl the swiss knife
$ echo '23 756 -983 5' | perl -lane 'print join " ", map {$_*$_} @F'
529 571536 966289 25
$ echo 'a b c' | perl -lane 'print join ",", map {"\"$_\""} @F'
"a","b","c"
$ echo 'a b c' | perl -lane 'print join ",", map {uc "\"$_\""} @F'
"A","B","C"
$ cat para.txt
Why cannot I go back to my ignorant days with wild imaginations and fantasies?
Perhaps the answer lies in not being able to adapt to my freedom.
Those little dreams, goal setting, anticipation of results, used to be my world.
All joy within the soul and less dependent on outside world.
But all these are absent for a long time now.
Hope I can wake those dreams all over again.
reverse array
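For instance:
$ echo '23 756 -983 5' | perl -lane 'print join " ", reverse @F'
5 -983 756 23
$ # in scalar context, reverse acts on the string
$ echo 'hello' | perl -lne 'print scalar reverse'
olleh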
266
Perl the swiss knife
Miscellaneous
split
the -a command line option uses split and automatically saves the results in @F array
default separator is \s+
by default acts on $_
and by default all splits are performed
See also perldoc - split function
267
Perl the swiss knife
$ cat split.txt
foo,1:2:5,baz
wry,4,look
free,3:8,oh
$ perl -F, -ane 'print join ",", $F[0],$_,$F[2] for split /:/,$F[1]' split.txt
foo,1,baz
foo,2,baz
foo,5,baz
wry,4,look
free,3,oh
free,8,oh
268
Perl the swiss knife
Further Reading
269
Perl the swiss knife
perldoc - substr
stackoverflow - extract columns from a fixed-width format
stackoverflow - build fixed-width template from header
stackoverflow - convert fixed-width to delimited format
$ # replicate a string
$ perl -le 'print "abc" x 5'
abcabcabcabcabc
$ # replicating file
$ wc -c poem.txt
65 poem.txt
$ perl -0777 -ne 'print $_ x 100' poem.txt | wc -c
6500
270
Perl the swiss knife
$ # hacking
$ # same as: echo {1,3}{a,b}
$ perl -le '@x=glob q/{1,3}{a,b}/; print "@x"'
1a 1b 3a 3b
$ # same as: echo {1,3}{1,3}{1,3}
$ perl -le '@x=glob "{1,3}" x 3; print "@x"'
111 113 131 133 311 313 331 333
$ cat f2
I bought two bananas and three mangoes
$ echo 'f1,f2,odd.txt' | perl -F, -lane 'system "cat $F[1]"'
I bought two bananas and three mangoes
return value of system will have exit status information or $? can be used
see perldoc - system for details
271
Perl the swiss knife
See also stackoverflow - difference between backticks, system, exec and open
Further Reading
Manual and related
perldoc - overview
perldoc - faqs
perldoc - tutorials
perldoc - functions
perldoc - special variables
perldoc - perlretut
Tutorials and Q&A
Perl one-liners explained
perl Q&A on stackoverflow
regex FAQ on SO
regexone - interactive tutorial
regexcrossword - practice by solving crosswords, read 'How to play' section before you start
Alternatives
bioperl
ruby
unix.stackexchange - When to use grep, sed, awk, perl, etc
272
Perl the swiss knife
273
Sorting stuff
Sorting stuff
Table of Contents
sort
Default sort
Reverse sort
Various number sorting
Random sort
Specifying output file
Unique sort
Column based sorting
Further reading for sort
uniq
Default uniq
Only duplicates
Only unique
Prefix count
Ignoring case
Combining multiple files
Column options
Further reading for uniq
comm
Default three column output
Suppressing columns
Files with duplicates
Further reading for comm
shuf
Random lines
Random integer numbers
Further reading for shuf
sort
274
Sorting stuff
$ man sort
SORT(1) User Commands SORT(1)
NAME
sort - sort lines of text files
SYNOPSIS
sort [OPTION]... [FILE]...
sort [OPTION]... --files0-from=F
DESCRIPTION
Write sorted concatenation of all FILE(s) to standard output.
Note: All examples shown here assumes ASCII encoded input file
Default sort
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.
$ sort poem.txt
And so are you.
Roses are red,
Sugar is sweet,
Violets are blue,
Well, that was easy. The lines were sorted alphabetically (ascending order by default) and it so
happened that first letter alone was enough to decide the order
For next example, let's extract all the words and sort them
also allows to showcase sort accepting stdin
See GNU grep chapter if the grep command used below looks alien
275
Sorting stuff
heed hereunto
See also
arch wiki - locale
Linux: Define Locale and Language Settings
276
Sorting stuff
Reverse sort
This is simply reversing from default ascending order to descending order
$ sort -r poem.txt
Violets are blue,
Sugar is sweet,
Roses are red,
And so are you.
$ cat numbers.txt
20
53
3
101
$ sort numbers.txt
101
20
3
53
277
Sorting stuff
Whoops, what happened there? sort won't know to treat them as numbers unless specified
Depending on format of numbers, different options have to be used
First up is -n option, which sorts based on numerical value
$ sort -n numbers.txt
3
20
53
101
278
Sorting stuff
$ cat generic_numbers.txt
+120
-1.53
3.14e+4
42.1e-2
$ sort -g generic_numbers.txt
-1.53
42.1e-2
+120
3.14e+4
$ du -sh *
104K power.log
746M projects
316K report.log
20K sample.txt
$ du -sh * | sort -h
20K sample.txt
104K power.log
316K report.log
746M projects
279
Sorting stuff
$ cat versions.txt
foo_v1.2
bar_v2.1.3
foobar_v2
foo_v1.2.1
foo_v1.3
$ sort -V versions.txt
bar_v2.1.3
foobar_v2
foo_v1.2
foo_v1.2.1
foo_v1.3
Another common use case is when there are multiple filenames differentiated by numbers
$ cat files.txt
file0
file10
file3
file4
$ sort -V files.txt
file0
file3
file4
file10
Can be used when dealing with numbers reported by time command as well
280
Sorting stuff
Random sort
Note that duplicate lines will always end up next to each other
might be useful as a feature for some cases ;)
Use shuf if this is not desirable
See also How can I shuffle the lines of a text file on the Unix command line or in a shell script?
281
Sorting stuff
$ cat nums.txt
1
10
10
12
23
563
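A sketch of random sort on nums.txt (order will vary between runs; note the duplicate 10s stay together):
$ sort -R nums.txt
23
1
563
12
10
10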
282
Sorting stuff
Unique sort
Keep only first copy of lines that are deemed to be same according to sort option used
283
Sorting stuff
$ cat duplicates.txt
foo
12 carrots
foo
12 apples
5 guavas
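For instance:
$ sort -u duplicates.txt
12 apples
12 carrots
5 guavas
foo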
284
Sorting stuff
$ cat words.txt
CAR
are
car
Are
foot
are
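A sketch of case-insensitive unique sort (which copy of equal lines is kept can vary):
$ sort -fu words.txt
are
CAR
foot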
‘-k POS1[,POS2]’
‘--key=POS1[,POS2]’
Specify a sort field that consists of the part of the line between
POS1 and POS2 (or the end of the line, if POS2 is omitted),
_inclusive_.
Each POS has the form ‘F[.C][OPTS]’, where F is the number of the
field to use, and C is the number of the first character from the
beginning of the field. Fields and character positions are
numbered starting with 1; a character position of zero in POS2
indicates the field’s last character. If ‘.C’ is omitted from
POS1, it defaults to 1 (the beginning of the field); if omitted
from POS2, it defaults to 0 (the end of the field). OPTS are
ordering options, allowing individual keys to be sorted according
to different rules; see below for details. Keys can span multiple
fields.
285
Sorting stuff
$ cat fruits.txt
apple 42
guava 6
fig 90
banana 31
$ sort fruits.txt
apple 42
banana 31
fig 90
guava 6
$ # name:pet_name:no_of_pets
$ cat pets.txt
foo:dog:2
xyz:cat:1
baz:parrot:5
abcd:cat:3
joe:dog:1
bar:fox:1
temp_var:squirrel:4
boss:dog:10
286
Sorting stuff
287
Sorting stuff
$ # default sort for 2nd column, numeric sort on 3rd column to resolve ties
$ sort -t: -k2,2 -k3,3n pets.txt
xyz:cat:1
abcd:cat:3
joe:dog:1
foo:dog:2
boss:dog:10
bar:fox:1
baz:parrot:5
temp_var:squirrel:4
$ # numeric sort on 3rd column, default sort for 2nd column to resolve ties
$ sort -t: -k3,3n -k2,2 pets.txt
xyz:cat:1
joe:dog:1
bar:fox:1
foo:dog:2
abcd:cat:3
temp_var:squirrel:4
baz:parrot:5
boss:dog:10
288
Sorting stuff
Sometimes, the input has to be sorted first and then -u used on the sorted output
See also remove duplicates based on the value of another column
289
Sorting stuff
$ cat marks.txt
fork,ap_12,54
flat,up_342,1.2
fold,tn_48,211
more,ap_93,7
rest,up_5,63
$ # for 2nd column, sort numerically only from 4th character to end
$ sort -t, -k2.4,2n marks.txt
rest,up_5,63
fork,ap_12,54
fold,tn_48,211
more,ap_93,7
flat,up_342,1.2
$ cat header.txt
fruit qty
apple 42
guava 6
fig 90
banana 31
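A sketch keeping the header line and sorting the rest numerically by the 2nd column:
$ (head -n1 header.txt; tail -n +2 header.txt | sort -k2,2n)
fruit qty
guava 6
banana 31
apple 42
fig 90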
See also sort by last field value when number of fields varies
290
Sorting stuff
uniq
$ man uniq
UNIQ(1) User Commands UNIQ(1)
NAME
uniq - report or omit repeated lines
SYNOPSIS
uniq [OPTION]... [INPUT [OUTPUT]]
DESCRIPTION
Filter adjacent matching lines from INPUT (or standard input), writing
to OUTPUT (or standard output).
Default uniq
291
Sorting stuff
$ cat word_list.txt
are
are
to
good
bad
bad
bad
good
are
bad
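For instance (uniq only collapses adjacent duplicates; sort first to get unique values of the whole file):
$ uniq word_list.txt
are
to
good
bad
good
are
bad
$ sort -u word_list.txt
are
bad
good
to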
Only duplicates
292
Sorting stuff
$ uniq -D word_list.txt
are
are
bad
bad
bad
$ # using --all-repeated=prepend will add a newline before the first group as well
$ sort word_list.txt | uniq --all-repeated=separate
are
are
are
bad
bad
bad
bad
good
good
Only unique
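A sketch printing lines that never repeat (sorting first since uniq works on adjacent lines):
$ sort word_list.txt | uniq -u
to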
293
Sorting stuff
Prefix count
$ # adjacent lines
$ uniq -c word_list.txt
2 are
1 to
1 good
3 bad
1 good
1 are
1 bad
$ # entire file
$ sort word_list.txt | uniq -c
3 are
4 bad
2 good
1 to
Sorting by count
294
Sorting stuff
$ # sort by count
$ sort word_list.txt | uniq -c | sort -n
1 to
2 good
3 are
4 bad
To get only entries with min/max count, bit of awk magic would help
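For instance, keeping only the entries with the maximum count:
$ sort word_list.txt | uniq -c | sort -rn | awk 'NR==1{m=$1} $1==m'
4 bad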
295
Sorting stuff
Ignoring case
296
Sorting stuff
$ cat another_list.txt
food
Food
good
are
bad
Are
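A sketch of case-insensitive detection of adjacent duplicates:
$ uniq -Di another_list.txt
food
Food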
297
Sorting stuff
If only adjacent lines (not sorted) are required, the files need to be concatenated using another command
Column options
uniq has a few options dealing with column manipulation. Not as extensive as sort -k, but handy for some cases
First up, skipping fields
No option to specify different delimiter
298
Sorting stuff
From info uniq : Fields are sequences of non-space non-tab characters that are separated
from each other by at least one space or tab
Number of spaces/tabs between fields should be same
$ cat shopping.txt
lemon 5
mango 5
banana 8
bread 1
orange 5
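For instance, ignoring the first field and comparing only the count column:
$ uniq -f1 -D shopping.txt
lemon 5
mango 5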
Skipping characters
$ cat text
glue
blue
black
stack
stuck
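For instance, ignoring the first 2 characters when comparing:
$ uniq -s2 -D text
glue
blue
black
stack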
299
Sorting stuff
Combining -s and -w
Can be combined with -f as well
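A sketch skipping 1 character and comparing only the next 2:
$ uniq -s1 -w2 -D text
glue
blue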
comm
300
Sorting stuff
$ man comm
COMM(1) User Commands COMM(1)
NAME
comm - compare two sorted files line by line
SYNOPSIS
comm [OPTION]... FILE1 FILE2
DESCRIPTION
Compare sorted files FILE1 and FILE2 line by line.
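A sketch of the default three column output using the colors files (column 2 is indented by one tab, column 3 by two tabs; spaces stand in for tabs here):
$ comm colors_1.txt colors_2.txt
        Black
                Blue
Brown
        Green
Purple
                Red
Teal
        White
Yellow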
301
Sorting stuff
Suppressing columns
-1 suppress lines unique to first file
-2 suppress lines unique to second file
-3 suppress lines common to both files
$ # suppressing column 3
$ comm -3 colors_1.txt colors_2.txt
Black
Brown
Green
Purple
Teal
White
Yellow
302
Sorting stuff
See also how the above three cases can be done using grep alone
Note input files do not need to be sorted for grep solution
If different sort order than default is required, use --nocheck-order to ignore error message
303
Sorting stuff
shuf
304
Sorting stuff
$ man shuf
SHUF(1) User Commands SHUF(1)
NAME
shuf - generate random permutations
SYNOPSIS
shuf [OPTION]... [FILE]
shuf -e [OPTION]... [ARG]...
shuf -i LO-HI [OPTION]...
DESCRIPTION
Write a random permutation of the input lines to standard output.
Random lines
Without repeating input lines
305
Sorting stuff
$ cat nums.txt
1
10
10
12
23
563
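For instance (output will vary between runs):
$ shuf nums.txt
10
563
1
12
10
23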
306
Sorting stuff
use -e option to specify multiple input lines from command line itself
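A sketch (output will vary between runs):
$ shuf -e red green blue
blue
red
green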
307
Sorting stuff
$ shuf -i 3-8
3
7
6
4
8
5
Use seq input if negative numbers, floating point, etc are needed
$ seq 2 -1 -2 | shuf
2
-1
-2
0
1
308
Sorting stuff
man shuf and info shuf for more options and detailed documentation
Generate random numbers in specific range
Variable - randomly choose among three numbers
Related to 'random' stuff:
How to generate a random string?
How can I populate a file with random data?
Run commands at random
309
Restructure text
Restructure text
Table of Contents
paste
Concatenating files column wise
Interleaving lines
Lines to multiple columns
Different delimiters between columns
Multiple lines to single row
Further reading for paste
column
Pretty printing tables
Specifying different input delimiter
Further reading for column
pr
Converting lines to columns
Changing PAGE_WIDTH
Combining multiple input files
Transposing a table
Further reading for pr
fold
Examples
Further reading for fold
paste
310
Restructure text
$ man paste
PASTE(1) User Commands PASTE(1)
NAME
paste - merge lines of files
SYNOPSIS
paste [OPTION]... [FILE]...
DESCRIPTION
Write lines consisting of the sequentially corresponding lines from
each FILE, separated by TABs, to standard output.
311
Restructure text
$ # empty cells if number of lines is not same for all input files
$ # -d\| can also be used
$ paste -d'|' <(seq 3) <(seq 4 6) <(seq 7 10)
1|4|7
2|5|8
3|6|9
||10
Interleaving lines
Interleave lines by using newline as delimiter
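For instance:
$ paste -d'\n' <(seq 3) <(seq 4 6)
1
4
2
5
3
6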
312
Restructure text
313
Restructure text
combination of -d and /dev/null (empty file) can give multi-character separation between
columns
If this is too confusing to use, consider pr instead
314
Restructure text
$ paste -d' : ' <(seq 3) /dev/null /dev/null <(seq 4 6) /dev/null /dev/null <(seq 7 9)
1 : 4 : 7
2 : 5 : 8
3 : 6 : 9
315
Restructure text
For multiple character delimiter, post-process if separator is unique or use another tool like perl
$ # post-process
$ seq 10 | paste -sd, | sed 's/,/ : /g'
1 : 2 : 3 : 4 : 5 : 6 : 7 : 8 : 9 : 10
column
316
Restructure text
NAME
column — columnate lists
SYNOPSIS
column [-entx] [-c columns] [-s sep] [file ...]
DESCRIPTION
The column utility formats its input into multiple columns. Rows are
filled before columns. Input is taken from file operands, or, by
default, from the standard input. Empty lines are ignored unless the -e
option is used.
...
$ cat dishes.txt
North alootikki baati khichdi makkiroti poha
South appam bisibelebath dosa koottu sevai
West dhokla khakhra modak shiro vadapav
East handoguri litti momo rosgulla shondesh
$ column -t dishes.txt
North alootikki baati khichdi makkiroti poha
South appam bisibelebath dosa koottu sevai
West dhokla khakhra modak shiro vadapav
East handoguri litti momo rosgulla shondesh
often useful to get neatly aligned columns from output of another command
317
Restructure text
318
Restructure text
pr
$ man pr
PR(1) User Commands PR(1)
NAME
pr - convert text files for printing
SYNOPSIS
pr [OPTION]... [FILE]...
DESCRIPTION
Paginate or columnate FILE(s) for printing.
319
Restructure text
Fruits
apple
guava
watermelon
banana
pomegranate
$ # note how the input got split into two and resulting splits joined by ,
$ seq 6 | pr -2ts,
1,4
2,5
3,6
320
Restructure text
$ seq 8 | pr -4ats:
1:2:3:4
5:6:7:8
321
Restructure text
Changing PAGE_WIDTH
The default PAGE_WIDTH is 72
The formula (col-1)*len(delimiter) + col seems to work in determining minimum
PAGE_WIDTH required for multiple column output
col is number of columns required
$ # (37-1)*1 + 37 = 73
$ seq 74 | pr -J -w73 -37ats,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,
32,33,34,35,36,37
38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,
66,67,68,69,70,71,72,73,74
$ # (3-1)*4 + 3 = 11
$ seq 6 | pr -J -w10 -3ats'::::'
pr: page width too narrow
$ seq 6 | pr -J -w11 -3ats'::::'
1::::2::::3
4::::5::::6
322
Restructure text
Transposing a table
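As a generic sketch (not necessarily the approach this section intends), awk can build the transpose of dishes.txt column by column:
$ awk '{for(i=1;i<=NF;i++) a[i]=a[i] (NR==1 ? "" : " ") $i}
       END{for(i=1;i<=NF;i++) print a[i]}' dishes.txt
North South West East
alootikki appam dhokla handoguri
baati bisibelebath khakhra litti
khichdi dosa modak momo
makkiroti koottu shiro rosgulla
poha sevai vadapav shondesh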
323
Restructure text
fold
324
Restructure text
$ man fold
FOLD(1) User Commands FOLD(1)
NAME
fold - wrap each input line to fit in specified width
SYNOPSIS
fold [OPTION]... [FILE]...
DESCRIPTION
Wrap input lines in each FILE, writing to standard output.
Examples
$ nl story.txt
1 The princess of a far away land fought bravely to rescue a travelling group from bandits. And the happy story ends here. Have a nice day.
2 Still here? okay, read on: The prince of Happalakkahuhu wished he could be as brave as his sister and vowed to train harder
$ fold story.txt | nl
1 The princess of a far away land fought bravely to rescue a travelling group from
2 bandits. And the happy story ends here. Have a nice day.
3 Still here? okay, read on: The prince of Happalakkahuhu wished he could be as br
4 ave as his sister and vowed to train harder
325
Restructure text
$ fold -s story.txt
The princess of a far away land fought bravely to rescue a travelling group
from bandits. And the happy story ends here. Have a nice day.
Still here? okay, read on: The prince of Happalakkahuhu wished he could be as
brave as his sister and vowed to train harder
326
File attributes
File attributes
Table of Contents
wc
Various counts
subtle differences
Further reading for wc
du
Default size
Various size formats
Dereferencing links
Filtering options
Further reading for du
df
Examples
Further reading for df
touch
Creating empty file
Updating timestamps
Preserving timestamp
Further reading for touch
file
File type examples
Further reading for file
wc
327
File attributes
$ man wc
WC(1) User Commands WC(1)
NAME
wc - print newline, word, and byte counts for each file
SYNOPSIS
wc [OPTION]... [FILE]...
wc [OPTION]... --files0-from=F
DESCRIPTION
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified. A word is a non-zero-length sequence
of characters delimited by white space.
Various counts
328
File attributes
$ cat sample.txt
Hello World
Good day
No doubt you like it too
Much ado about nothing
He he he
$ cat greeting.txt
Hello there
Have a safe journey
$ cat fruits.txt
Fruit Price
apple 42
banana 31
fig 90
guava 6
$ wc *.txt
5 10 57 fruits.txt
2 6 32 greeting.txt
5 17 78 sample.txt
12 33 167 total
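Individual counts, consistent with the combined output above:
$ wc -l sample.txt
5 sample.txt
$ wc -w sample.txt
17 sample.txt
$ wc -c sample.txt
78 sample.txt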
329
File attributes
$ wc -L < sample.txt
24
$ # last line will show max value, not sum of all input
$ wc -L *.txt
13 fruits.txt
19 greeting.txt
24 sample.txt
24 total
subtle differences
byte count vs character count
$ printf 'hi👍' | wc -m
3
$ printf 'hi👍' | wc -c
6
330
File attributes
From man wc "A word is a non-zero-length sequence of characters delimited by white space"
-L won't count non-printable characters and tabs are converted to equivalent spaces
$ printf 'food\tgood' | wc -L
12
$ printf 'food\tgood' | wc -m
9
$ printf 'food\tgood' | awk '{print length()}'
9
$ printf 'foo\0bar\0baz' | wc -L
9
$ printf 'foo\0bar\0baz' | wc -m
11
$ printf 'foo\0bar\0baz' | awk '{print length()}'
11
331
File attributes
du
$ man du
DU(1) User Commands DU(1)
NAME
du - estimate file space usage
SYNOPSIS
du [OPTION]... [FILE]...
du [OPTION]... --files0-from=F
DESCRIPTION
Summarize disk usage of the set of FILEs, recursively for directories.
...
Default size
By default, size is given in units of 1024 bytes
Files are ignored, all directories and sub-directories are recursively reported
$ ls -F
projs/ py_learn@ words.txt
$ du
17920 ./projs/full_addr
14316 ./projs/half_addr
32952 ./projs
33880 .
332
File attributes
$ du -a
712 ./projs/report.log
17916 ./projs/full_addr/faddr.v
17920 ./projs/full_addr
14312 ./projs/half_addr/haddr.v
14316 ./projs/half_addr
32952 ./projs
0 ./py_learn
924 ./words.txt
33880 .
$ du -s
33880 .
$ du -s projs words.txt
32952 projs
924 words.txt
use -S to show directory size without taking into account size of its sub-directories
$ du -S
17920 ./projs/full_addr
14316 ./projs/half_addr
716 ./projs
928 .
333
File attributes
$ # number of bytes
$ stat -c %s words.txt
938848
$ du -b words.txt
938848 words.txt
sorting
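For instance, sorting du output by human-readable size (sizes here match the du -ah example shown further below):
$ du -ah projs | sort -h
712K projs/report.log
14M projs/half_addr
14M projs/half_addr/haddr.v
18M projs/full_addr
18M projs/full_addr/faddr.v
33M projs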
334
File attributes
to get size based on the number of characters in the file rather than the disk space allotted
$ du -b words.txt
938848 words.txt
$ du -h words.txt
924K words.txt
$ # 938848/1024 = 916.84
$ du --apparent-size -h words.txt
917K words.txt
Dereferencing links
See man and info pages for other related options
Filtering options
335
File attributes
$ du -ah projs
712K projs/report.log
18M projs/full_addr/faddr.v
18M projs/full_addr
14M projs/half_addr/haddr.v
14M projs/half_addr
33M projs
$ # >= 15M
$ du -Sh -t 15M
18M ./projs/full_addr
$ # <= 1M
$ du -ah -t -1M
712K ./projs/report.log
0 ./py_learn
924K ./words.txt
336
File attributes
df
$ man df
DF(1) User Commands DF(1)
NAME
df - report file system disk space usage
SYNOPSIS
df [OPTION]... [FILE]...
DESCRIPTION
This manual page documents the GNU version of df. df displays the
amount of disk space available on the file system containing each file
name argument. If no file name is given, the space available on all
currently mounted file systems is shown.
...
337
File attributes
Examples
$ # use df without arguments to get information on all currently mounted file systems
$ df .
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 98298500 58563816 34734748 63% /
$ df -h --output=size,used,file / /media/learnbyexample/projs
Size Used File
94G 56G /
92G 35G /media/learnbyexample/projs
$ df -h --output=pcent .
Use%
63%
touch
338
File attributes
$ man touch
TOUCH(1) User Commands TOUCH(1)
NAME
touch - change file timestamps
SYNOPSIS
touch [OPTION]... FILE...
DESCRIPTION
Update the access and modification times of each FILE to the current
time.
$ ls foo.txt
ls: cannot access 'foo.txt': No such file or directory
$ touch foo.txt
$ ls foo.txt
foo.txt
Updating timestamps
Updating both access and modification timestamp to current time
339
File attributes
$ touch fruits.txt
$ stat -c %x fruits.txt
2017-07-21 10:11:44.241921229 +0530
$ stat -c %y fruits.txt
2017-07-21 10:11:44.241921229 +0530
$ touch -a greeting.txt
$ stat -c %x greeting.txt
2017-07-21 10:14:08.457268564 +0530
$ stat -c %y greeting.txt
2017-07-13 13:54:26.004499660 +0530
$ touch -m sample.txt
$ stat -c %x sample.txt
2017-07-13 13:48:24.945450646 +0530
$ stat -c %y sample.txt
2017-07-21 10:14:40.770006144 +0530
340
File attributes
$ # add -a or -m as needed
$ touch -d '2010-03-17 17:04:23' report.log
$ stat -c $'%x\n%y' report.log
2010-03-17 17:04:23.000000000 +0530
2010-03-17 17:04:23.000000000 +0530
Preserving timestamp
Text processing on files would update the timestamps
$ # save the original timestamps to a temporary file first
$ touch -r story.txt tmp.txt
$ # after text processing, copy back the timestamps and remove temporary file
$ sed -i 's/cat/dog/g' story.txt
$ touch -r tmp.txt story.txt && rm tmp.txt
$ stat -c $'%x\n%y' story.txt
2017-06-24 13:00:31.773583923 +0530
2017-06-24 12:59:53.316751651 +0530
341
File attributes
file
$ man file
FILE(1) BSD General Commands Manual FILE(1)
NAME
file — determine file type
SYNOPSIS
file [-bcEhiklLNnprsvzZ0] [--apple] [--extension] [--mime-encoding]
[--mime-type] [-e testname] [-F separator] [-f namefile]
[-m magicfiles] [-P name=value] file ...
file -C [-m magicfiles]
file [--help]
DESCRIPTION
This manual page documents version 5.25 of the file command.
file tests each argument in an attempt to classify it. There are three
sets of tests, performed in this order: filesystem tests, magic tests,
and language tests. The first test that succeeds causes the file type to
be printed.
...
342
File attributes
$ file sample.txt
sample.txt: ASCII text
$ # without file name in output
$ file -b sample.txt
ASCII text
$ file ch
ch: Bourne-Again shell script, ASCII text executable
find all files of particular type in current directory, for example image files
$ find -type f -exec bash -c '(file -b "$0" | grep -wq "image data") && echo "$0"' {} \;
./sunset.jpg
./moon.png
343
File attributes
344
Miscellaneous
Miscellaneous
Table of Contents
cut
select specific fields
suppressing lines without delimiter
specifying delimiters
complement
select specific characters
Further reading for cut
tr
translation
escape sequences and character classes
deletion
squeeze
Further reading for tr
basename
dirname
xargs
seq
integer sequences
specifying separator
floating point sequences
Further reading for seq
cut
345
Miscellaneous
$ man cut
CUT(1) User Commands CUT(1)
NAME
cut - remove sections from each line of files
SYNOPSIS
cut OPTION... [FILE]...
DESCRIPTION
Print selected parts of lines from each FILE to standard output.
346
Miscellaneous
$ printf 'foo\tbar\t123\tbaz\n'
foo bar 123 baz
$ # single field
$ printf 'foo\tbar\t123\tbaz\n' | cut -f2
bar
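Multiple fields and ranges can also be selected:
$ printf 'foo\tbar\t123\tbaz\n' | cut -f2,4
bar baz
$ printf 'foo\tbar\t123\tbaz\n' | cut -f1-3
foo bar 123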
$ cat marks.txt
jan 2017
foobar 12 45 23
feb 2017
foobar 18 38 19
347
Miscellaneous
specifying delimiters
use -d option to specify input delimiter other than default tab character
only single character can be used, for multi-character/regex based delimiter use awk or perl
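For instance:
$ echo 'foo:bar:123:baz' | cut -d: -f2
bar
$ echo 'foo:bar:123:baz' | cut -d: -f2,4
bar:baz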
complement
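A sketch of --complement, selecting everything except the given fields:
$ echo 'foo:bar:123:baz' | cut -d: --complement -f2
foo:123:baz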
348
Miscellaneous
349
Miscellaneous
tr
$ man tr
TR(1) User Commands TR(1)
NAME
tr - translate or delete characters
SYNOPSIS
tr [OPTION]... SET1 [SET2]
DESCRIPTION
Translate, squeeze, and/or delete characters from standard input, writ‐
ing to standard output.
...
translation
one-to-one mapping of characters, all occurrences are translated
as good practice, enclose the arguments in single quotes to avoid issues due to shell interpretation
350
Miscellaneous
$ # changing case
$ echo 'foo bar cat baz' | tr 'a-z' 'A-Z'
FOO BAR CAT BAZ
$ echo 'Hello World' | tr 'a-zA-Z' 'A-Za-z'
hELLO wORLD
$ echo 'foo;bar;baz' | tr ; :
tr: missing operand
Try 'tr --help' for more information.
$ echo 'foo;bar;baz' | tr ';' ':'
foo:bar:baz
rot13 example
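For instance:
$ echo 'Hello World' | tr 'a-zA-Z' 'n-za-mN-ZA-M'
Uryyb Jbeyq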
$ cat marks.txt
jan 2017
foobar 12 45 23
feb 2017
foobar 18 38 19
351
Miscellaneous
since - is used for character ranges, place it at the end to represent it literally
cannot be used at start of argument as it would get treated as option
or use -- to indicate end of option processing
similarly, to represent \ literally, use \\
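For instance, translating digits as well as the literal - character:
$ echo '2015-03-17' | tr '0-9-' 'a-j:'
cabf:ad:bh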
352
Miscellaneous
deletion
use -d option to specify characters to be deleted
add complement option -c if it is easier to define which characters are to be retained
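A couple of sketches:
$ echo '2015-03-17' | tr -d '-'
20150317
$ # retain only letters, space, dot and newline
$ echo 'Hi123 there. How a32re you' | tr -cd 'a-zA-Z .\n'
Hi there. How are you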
squeeze
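A sketch squeezing repeated characters to a single occurrence:
$ echo 'FFoooo    BBaarr' | tr -s 'a-zA-Z '
Fo Bar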
353
Miscellaneous
basename
354
Miscellaneous
$ man basename
BASENAME(1) User Commands BASENAME(1)
NAME
basename - strip directory and suffix from filenames
SYNOPSIS
basename NAME [SUFFIX]
basename OPTION... NAME...
DESCRIPTION
Print NAME with any leading directory components removed. If speci‐
fied, also remove a trailing SUFFIX.
...
Examples
$ basename "$PWD"
learnbyexample
$ # use single quotes if arguments contain space and other special shell characters
$ # use suffix option -s to strip file extension from filename
$ basename -s '.log' '/home/learnbyexample/proj adder/power.log'
power
$ # -a is implied when using -s option
$ basename -s'.log' foo/a/report.log bar/y/power.log
report
power
Can also use Parameter expansion if working on file paths saved in variables
assumes bash shell and similar that support this feature
355
Miscellaneous
$ t="${file##*/}"
$ # remove .log from end of string
$ echo "${t%.log}"
power
dirname
$ man dirname
DIRNAME(1) User Commands DIRNAME(1)
NAME
dirname - strip last component from file name
SYNOPSIS
dirname [OPTION] NAME...
DESCRIPTION
Output each NAME with its last non-slash component and trailing slashes
removed; if NAME contains no /'s, output '.' (meaning the current
directory).
...
Examples
356
Miscellaneous
$ echo "$PWD"
/home/learnbyexample
$ dirname "$PWD"
/home
$ # use single quotes if arguments contain space and other special shell characters
$ dirname '/home/learnbyexample/proj adder/power.log'
/home/learnbyexample/proj adder
Can also use Parameter expansion if working on file paths saved in variables
assumes bash shell and similar that support this feature
357
Miscellaneous
$ # apply basename trick to get just directory name instead of full path
$ t="${file%/*}"
$ echo "${t##*/}"
proj adder
xargs
$ whatis xargs
xargs (1) - build and execute command lines from standard input
While xargs is primarily used for passing output of command or file contents to another command as
input arguments and/or parallel processing, it can be quite handy for certain text processing stuff with
default echo command
358
Miscellaneous
$ cat marks.txt
jan 2017
foobar 12 45 23
feb 2017
foobar 18 38 19
$ xargs -a marks.txt
jan 2017 foobar 12 45 23 feb 2017 foobar 18 38 19
Note since echo is the command being executed, it will cause issue with option interpretation
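For instance, an argument that looks like an echo option gets consumed:
$ printf -- '-e foo\nbar\n' | xargs
foo bar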
359
Miscellaneous
seq
$ man seq
SEQ(1) User Commands SEQ(1)
NAME
seq - print a sequence of numbers
SYNOPSIS
seq [OPTION]... LAST
seq [OPTION]... FIRST LAST
seq [OPTION]... FIRST INCREMENT LAST
DESCRIPTION
Print numbers from FIRST to LAST, in steps of INCREMENT.
...
integer sequences
see info seq for details of how large numbers are handled
for ex: seq 50000000000000000000 2 50000000000000000004 may not work
360
Miscellaneous
$ # default increment=1
$ seq 25434 25437
25434
25435
25436
25437
$ seq -5 -3
-5
-4
-3
361
Miscellaneous
$ seq -w 0003
0001
0002
0003
specifying separator
As seen already, default is newline separator between numbers
-s option allows to use custom string between numbers
A newline is always added at end
$ seq -s: 4
1:2:3:4
362
Miscellaneous
$ # default increment=1
$ seq 0.5 2.5
0.5
1.5
2.5
363