Awk - A Pattern Scanning and Processing Language (Second Edition)
Alfred V. Aho
Brian W. Kernighan
Peter J. Weinberger
Bell Laboratories
Murray Hill, New Jersey 07974
ABSTRACT
     length > 72
prints all input lines whose length exceeds 72 characters; the program
     NF % 2 == 0
prints all lines with an even number of fields; and the program
     { $1 = log($1); print }
September 1, 1978
1. Introduction

Awk is a programming language designed to make many common
information retrieval and text manipulation tasks easy to state and
to perform.

The basic operation of awk is to scan a set of input lines in order,
searching for lines which match any of a set of patterns which the
user has specified. For each pattern, an action can be specified;
this action will be performed on each line that matches the pattern.

Readers familiar with the UNIX† program grep[1] will recognize the
approach, although in awk the patterns may be more general than in
grep, and the actions allowed are more involved than merely printing
the matching line. For example, the awk program
     {print $3, $2}
prints the third and second columns of a table in that order. The
program
     $2 ~ /A|B|C/
prints all input lines with an A, B, or C in the second field. The
program
     $1 != prev { print; prev = $1 }
prints all lines in which the first field is different from the
previous first field.
_____________________
†UNIX is a Trademark of Bell Laboratories.

1.1. Usage

The command
     awk program [files]
executes the awk commands in the string program on the set of named
files, or on the standard input if there are no files. The statements
can also be placed in a file pfile, and executed by the command
     awk -f pfile [files]

1.2. Program Structure

An awk program is a sequence of statements of the form:
     pattern { action }
     pattern { action }
     ...
Each line of input is matched against each of the patterns in turn.
For each pattern that matches, the associated action is executed.
When all the patterns have been tested, the next line is fetched and
the matching starts over.

Either the pattern or the action may be left out, but not both. If
there is no action for a pattern, the matching line is simply copied
to the output. (Thus a line which matches several patterns can be
printed several times.) If there is no pattern for an action, then
the action is performed for every input line. A line which matches no
pattern is ignored.

Since patterns and actions are both optional, actions must be
enclosed in braces to distinguish them from patterns.

1.3. Records and Fields

Awk input is divided into "records" terminated by a record separator.
The default record separator is a newline, so by default awk
processes its input a line at a time. The number of the current
record is available in a variable named NR.

Each input record is considered to be divided into "fields." Fields
are normally separated by white space (blanks or tabs), but the input
field separator may be changed, as described below. Fields are
referred to as $1, $2, and so forth, where $1 is the first field, and
$0 is the whole input record itself. Fields may be assigned to. The
number of fields in the current record is available in a variable
named NF.

The variables FS and RS refer to the input field and record
separators; they may be changed at
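
The program structure and the record and field variables described in
Sections 1.2 and 1.3 can be tried out with a short sketch. It assumes
a POSIX-style awk run from a shell; the two-line sample input and the
shell variable names (sample, fields, long) are made up for
illustration and do not come from the paper. Only NR, NF, $1, and the
pattern/action rules are taken from the text.

```shell
# Illustrative sketch: the sample input and shell variable names are assumptions.
sample='alpha beta gamma
one two'

# Action with no pattern: executed for every record.
# Prints the record number (NR), the field count (NF), and the first field ($1).
fields=$(printf '%s\n' "$sample" | awk '{ print NR, NF, $1 }')

# Pattern with no action: each matching record is copied to the output.
long=$(printf '%s\n' "$sample" | awk 'NF > 2')

printf '%s\n' "$fields"
printf '%s\n' "$long"
```

The first program runs its action on both records; the second has no
action, so only the record with more than two fields is copied through.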
it is clearly a string. Strings are converted to numbers and vice
versa whenever context demands it. For instance,
     x = "3" + "4"
assigns 7 to x. Strings which cannot be interpreted as numbers in a
numerical context will generally have numeric value zero, but it is
unwise to count on this behavior.

By default, variables (other than built-ins) are initialized to the
null string, which has numerical value zero; this eliminates the need
for most BEGIN sections. For example, the sums of the first two
fields can be computed by
     { s1 += $1; s2 += $2 }
     END { print s1, s2 }

Arithmetic is done internally in floating point. The arithmetic
operators are +, -, *, /, and % (mod). The C increment ++ and
decrement -- operators are also available, and so are the assignment
operators +=, -=, *=, /=, and %=. These operators may all be used in
expressions.

3.3. Field Variables

Fields in awk share essentially all of the properties of variables:
they may be used in arithmetic or string operations, and may be
assigned to. Thus one can replace the first field with a sequence
number like this:

splits the string s into array[1], ..., array[n]. The number of
elements found is returned. If the sep argument is provided, it is
used as the field separator; otherwise FS is used as the separator.

3.4. String Concatenation

Strings may be concatenated. For example
     length($1 $2 $3)
returns the combined length of the first three fields. Or in a print
statement,
     print $1 " is " $2
prints the two fields separated by " is ". Variables and numeric
expressions may also appear in concatenations.

3.5. Arrays

Array elements are not declared; they spring into existence by being
mentioned. Subscripts may have any non-null value, including
non-numeric strings. As an example of a conventional numeric
subscript, the statement
     x[NR] = $0
assigns the current input record to the NR-th element of the array x.
In fact, it is possible in principle (though perhaps slow) to process
the entire input in a random order with the awk program
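
The rules stated in this part of the paper (string-to-number
conversion, null-initialized variables, concatenation, split, and
arrays that spring into existence) can be checked with a small
sketch. It assumes a POSIX-style awk run from a shell. The sample
inputs, the ":" separator, and the names sum, totals, len, parts, and
last are illustrative assumptions, and the sketch is not a
reconstruction of any example missing from the text.

```shell
# Illustrative sketch: inputs and shell variable names are assumptions.

# String operands are converted to numbers by the arithmetic context.
sum=$(awk 'BEGIN { x = "3" + "4"; print x }')

# s1 and s2 start as the null string (numeric zero), so no BEGIN section is needed.
totals=$(printf '1 10\n2 20\n3 30\n' |
  awk '{ s1 += $1; s2 += $2 } END { print s1, s2 }')

# Adjacent expressions are concatenated; length() measures the joined string.
len=$(printf 'ab cde f\n' | awk '{ print length($1 $2 $3) }')

# split() fills parts[1..n] and returns n; ":" is passed as the separator argument.
parts=$(awk 'BEGIN { n = split("a:b:c", parts, ":"); print n, parts[1], parts[3] }')

# Array elements are created on first mention; x[NR] keeps each record by number.
last=$(printf 'first\nsecond\nthird\n' |
  awk '{ x[NR] = $0 } END { print x[NR], x[1] }')

printf '%s\n' "$sum" "$totals" "$len" "$parts" "$last"
```

Each command isolates one rule, so a surprising result points
directly at the rule that behaves differently in a given awk.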
In practice, awk usage seems to fall into two broad categories. One
is what might be called "report generation": processing an input to
extract counts, sums, sub-totals, etc. This also includes the writing
of trivial data validation programs, such as verifying that a field
contains only numeric information or that certain delimiters are
properly balanced. The combination of textual and numeric processing
is invaluable here.

A second area of use is as a data transformer, converting data from
the form produced by one program into that expected by another. The
simplest examples merely select fields, perhaps with rearrangements.

5. Implementation

The actual implementation of awk uses the language development tools
available on the UNIX operating system. The grammar is specified with
yacc[4]; the lexical analysis is done by lex; the regular expression
recognizers are deterministic finite automata constructed directly
from the expressions. An awk program is translated into a parse tree
which is then directly executed by a simple interpreter.

Awk was designed for ease of use rather than processing speed; the
delayed evaluation of variable types and the necessity to break input
into fields make high speed difficult to achieve in any case.
Nonetheless, the program has not proven to be unworkably slow.

Table I below shows the execution (user + system) time on a
PDP-11/70 of the UNIX programs wc, grep, egrep, fgrep, sed, lex, and
awk on the following simple tasks:
1. count the number of lines.
2. print all lines containing "doug".
3. print all lines containing "doug", "ken" or "dmr".
4. print the third field of each line.
5. print the third and second fields of each line, in that order.
6. append all lines containing "doug", "ken", and "dmr" to files
   "jdoug", "jken", and "jdmr", respectively.
7. print each line prefixed by "line-number : ".
8. sum the fourth column of a table.
The program wc merely counts words, lines and characters in its
input; we have already mentioned the others. In all cases the input
was a file containing 10,000 lines as created by the command ls -l;
each line has the form
     -rw-rw-rw- 1 ava 123 Oct 15 17:05 xxx
The total length of this input is 452,960 characters. Times for lex
do not include compile or load.

As might be expected, awk is not as fast as the specialized tools wc,
sed, or the programs in the grep family, but is faster than the more
general tool lex. In all cases, the tasks were about as easy to
express as awk programs as programs in these other languages; tasks
involving fields were considerably easier to express as awk programs.
Some of the test programs are shown in awk, sed and lex.

References

1. K. Thompson and D. M. Ritchie, UNIX Programmer's Manual, Bell
   Laboratories (May 1975). Sixth Edition.
2. B. W. Kernighan and D. M. Ritchie, The C Programming Language,
   Prentice-Hall, Englewood Cliffs, New Jersey (1978).
3. M. E. Lesk, "Lex - A Lexical Analyzer Generator," Comp. Sci. Tech.
   Rep. No. 39, Bell Laboratories, Murray Hill, New Jersey (1975).
4. S. C. Johnson, "Yacc - Yet Another Compiler-Compiler," Comp. Sci.
   Tech. Rep. No. 32, Bell Laboratories, Murray Hill, New Jersey
   (July 1975).
                                Task
Program      1      2      3      4      5      6      7      8
_______________________________________________________________
wc         8.6
grep      11.7   13.1
egrep      6.2   11.5   11.6
fgrep      7.7   13.8   16.1
sed       10.2   11.6   15.8   29.0   30.5   16.1
lex       65.1  150.1  144.2   67.7   70.3  104.0   81.7   92.8
awk       15.0   25.6   29.9   33.3   38.9   46.4   71.4   31.1
_______________________________________________________________
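
Each of the eight benchmark tasks can be written as a one-line awk
program. The sketch below is one plausible formulation under a
POSIX-style awk, not a reproduction of the authors' test programs.
The three-line sample input, the run helper, the scratch directory,
and the names t1 through t8 are assumptions made for illustration.

```shell
# Illustrative sketch: sample data, helper name, and scratch directory are assumptions.
data='-rw-rw-rw- 1 doug 123 Oct 15 17:05 a
-rw-rw-rw- 1 ken 200 Oct 15 17:05 b
-rw-rw-rw- 1 ava 77 Oct 15 17:05 c'

# Feed the sample input to awk with the given program.
run() { printf '%s\n' "$data" | awk "$@"; }

t1=$(run 'END { print NR }')                  # 1. count the lines
t2=$(run '/doug/')                            # 2. lines containing "doug"
t3=$(run '/doug|ken|dmr/')                    # 3. lines containing any of the names
t4=$(run '{ print $3 }')                      # 4. third field of each line
t5=$(run '{ print $3, $2 }')                  # 5. third and second fields
t7=$(run '{ print NR ": " $0 }')              # 7. prefix each line with its number
t8=$(run '{ sum += $4 } END { print sum }')   # 8. sum the fourth column

# 6. append matching lines to per-name files, written in a scratch directory
dir=$(mktemp -d)
( cd "$dir" &&
  run '/doug/ { print >> "jdoug" } /ken/ { print >> "jken" } /dmr/ { print >> "jdmr" }' )
t6=$(cat "$dir/jken")
rm -rf "$dir"

printf '%s\n' "$t1" "$t4" "$t8"
```

Tasks 4, 5, and 8 rely on awk's automatic field splitting, which is
where the paper reports awk programs being considerably easier to
write than the alternatives.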
SED:
1.   $=
2.   /doug/p
3.   /doug/p
     /doug/d
     /ken/p
     /ken/d
     /dmr/p
     /dmr/d
6.   /ken/w jken
     /doug/w jdoug
     /dmr/w jdmr