
Awk — A Pattern Scanning and Processing Language

(Second Edition)

Alfred V. Aho
Brian W. Kernighan
Peter J. Weinberger
Bell Laboratories
Murray Hill, New Jersey 07974

ABSTRACT

Awk is a programming language whose basic operation is to search a set of files for patterns, and to perform specified actions upon lines or fields of lines which contain instances of those patterns. Awk makes certain data selection and transformation operations easy to express; for example, the awk program

length > 72

prints all input lines whose length exceeds 72 characters; the program

NF % 2 == 0

prints all lines with an even number of fields; and the program

{ $1 = log($1); print }

replaces the first field of each line by its logarithm.


Awk patterns may include arbitrary boolean combinations of regular expressions and of relational operators on strings, numbers, fields, variables, and array elements. Actions may include the same pattern-matching constructions as in patterns, as well as arithmetic and string expressions and assignments, if-else, while, for statements, and multiple output streams.

This report contains a user's guide, a discussion of the design and implementation of awk, and some timing statistics.

September 1, 1978
1. Introduction

Awk is a programming language designed to make many common information retrieval and text manipulation tasks easy to state and to perform.

The basic operation of awk is to scan a set of input lines in order, searching for lines which match any of a set of patterns which the user has specified. For each pattern, an action can be specified; this action will be performed on each line that matches the pattern.

Readers familiar with the UNIX† program grep [1] will recognize the approach, although in awk the patterns may be more general than in grep, and the actions allowed are more involved than merely printing the matching line. For example, the awk program

{print $3, $2}

prints the third and second columns of a table in that order. The program

$2 ~ /A|B|C/

prints all input lines with an A, B, or C in the second field. The program

$1 != prev { print; prev = $1 }

prints all lines in which the first field is different from the previous first field.

† UNIX is a Trademark of Bell Laboratories.

1.1. Usage

The command

awk program [files]

executes the awk commands in the string program on the set of named files, or on the standard input if there are no files. The statements can also be placed in a file pfile, and executed by the command

awk -f pfile [files]
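In practice the program is usually written as a single quoted argument, so that the shell passes it to awk unchanged. As an illustrative sketch (the file name here is invented), the command

awk '{ print $1, $3 }' input.data

prints the first and third fields of every line of input.data.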
1.2. Program Structure

An awk program is a sequence of statements of the form:

pattern { action }
pattern { action }
...

Each line of input is matched against each of the patterns in turn. For each pattern that matches, the associated action is executed. When all the patterns have been tested, the next line is fetched and the matching starts over.

Either the pattern or the action may be left out, but not both. If there is no action for a pattern, the matching line is simply copied to the output. (Thus a line which matches several patterns can be printed several times.) If there is no pattern for an action, then the action is performed for every input line. A line which matches no pattern is ignored.

Since patterns and actions are both optional, actions must be enclosed in braces to distinguish them from patterns.

1.3. Records and Fields

Awk input is divided into "records" terminated by a record separator. The default record separator is a newline, so by default awk processes its input a line at a time. The number of the current record is available in a variable named NR.

Each input record is considered to be divided into "fields." Fields are normally separated by white space — blanks or tabs — but the input field separator may be changed, as described below. Fields are referred to as $1, $2, and so forth, where $1 is the first field, and $0 is the whole input record itself. Fields may be assigned to. The number of fields in the current record is available in a variable named NF.

The variables FS and RS refer to the input field and record separators; they may be changed at any time to any single character. The optional command-line argument -Fc may also be used to set FS to the character c.

If the record separator is empty, an empty input line is taken as the record separator, and blanks, tabs and newlines are treated as field separators.

The variable FILENAME contains the name of the current input file.
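As a sketch of the separator mechanisms described above (the file name is only illustrative), the command

awk -F: '{ print $1 }' data

prints the first colon-separated field of each line of data, and the program

BEGIN { RS = "" }
{ print $1 }

treats each block of text delimited by empty lines as one record and prints its first field.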
1.4. Printing

An action may have no pattern, in which case the action is executed for all lines. The simplest action is to print some or all of a record; this is accomplished by the awk command print. The awk program

{ print }

prints each record, thus copying the input to the output intact. More useful is to print a field or fields from each record. For instance,

print $2, $1

prints the first two fields in reverse order. Items separated by a comma in the print statement will be separated by the current output field separator when output. Items not separated by commas will be concatenated, so

print $1 $2

runs the first and second fields together.

The predefined variables NF and NR can be used; for example

{ print NR, NF, $0 }

prints each record preceded by the record number and the number of fields.

Output may be diverted to multiple files; the program

{ print $1 >"foo1"; print $2 >"foo2" }

writes the first field, $1, on the file foo1, and the second field on file foo2. The >> notation can also be used:

print $1 >>"foo"

appends the output to the file foo. (In each case, the output files are created if necessary.) The file name can be a variable or a field as well as a constant; for example,

print $1 >$2

uses the contents of field 2 as a file name.

Naturally there is a limit on the number of output files; currently it is 10.

Similarly, output can be piped into another process (on UNIX only); for instance,

print | "mail bwk"

mails the output to bwk.

The variables OFS and ORS may be used to change the current output field separator and output record separator. The output record separator is appended to the output of the print statement.
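For example, a program along these lines

BEGIN { OFS = ":" }
{ print $1, $2 }

prints the first two fields separated by a colon rather than a blank, and setting ORS in the same way changes what is appended after each output record.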
Awk also provides the printf statement for output formatting:

printf format expr, expr, ...

formats the expressions in the list according to the specification in format and prints them. For example,

printf "%8.2f %10ld\n", $1, $2

prints $1 as a floating point number 8 digits wide, with two after the decimal point, and $2 as a 10-digit long decimal number, followed by a newline. No output separators are produced automatically; you must add them yourself, as in this example. The version of printf is identical to that used with C [2].

2. Patterns

A pattern in front of an action acts as a selector that determines whether the action is to be executed. A variety of expressions may be used as patterns: regular expressions, arithmetic relational expressions, string-valued expressions, and arbitrary boolean combinations of these.

2.1. BEGIN and END

The special pattern BEGIN matches the beginning of the input, before the first record is read. The pattern END matches the end of the input, after the last record has been processed. BEGIN and END thus provide a way to gain control before and after processing, for initialization and wrapup.

As an example, the field separator can be set to a colon by

BEGIN { FS = ":" }
... rest of program ...

Or the input lines may be counted by

END { print NR }

If BEGIN is present, it must be the first pattern; END must be the last if used.

2.2. Regular Expressions

The simplest regular expression is a literal string of characters enclosed in slashes, like

/smith/

This is actually a complete awk program which will print all lines which contain any occurrence of the name "smith". If a line contains "smith" as part of a larger word, it will also be printed, as in

blacksmithing

Awk regular expressions include the regular expression forms found in the UNIX text editor ed [1] and grep (without back-referencing). In addition, awk allows parentheses for grouping, | for alternatives, + for "one or more", and ? for "zero or one", all as in lex. Character classes may be abbreviated: [a-zA-Z0-9] is the set of all letters and digits. As an example, the awk program

/[Aa]ho|[Ww]einberger|[Kk]ernighan/

will print all lines which contain any of the names "Aho," "Weinberger" or "Kernighan," whether capitalized or not.

Regular expressions (with the extensions listed above) must be enclosed in slashes, just as in ed and sed. Within a regular expression, blanks and the regular expression metacharacters are significant. To turn off the magic meaning of one of the regular expression characters, precede it with a backslash. An example is the pattern

/\/.*\//

which matches any string of characters enclosed in slashes.

One can also specify that any field or variable matches a regular expression (or does not match it) with the operators ~ and !~. The program

$1 ~ /[jJ]ohn/

prints all lines where the first field matches "john" or "John." Notice that this will also match "Johnson", "St. Johnsbury", and so on. To restrict it to exactly [jJ]ohn, use

$1 ~ /^[jJ]ohn$/

The caret ^ refers to the beginning of a line or field; the dollar sign $ refers to the end.

2.3. Relational Expressions

An awk pattern can be a relational expression involving the usual relational operators <, <=, ==, !=, >=, and >. An example is

$2 > $1 + 100

which selects lines where the second field is at least 100 greater than the first field. Similarly,

NF % 2 == 0

prints lines with an even number of fields.

In relational tests, if neither operand is numeric, a string comparison is made; otherwise it is numeric. Thus,

$1 >= "s"

selects lines that begin with an s, t, u, etc. In the absence of any other information, fields are treated as strings, so the program

$1 > $2

will perform a string comparison.

2.4. Combinations of Patterns

A pattern can be any boolean combination of patterns, using the operators || (or), && (and), and ! (not). For example,

$1 >= "s" && $1 < "t" && $1 != "smith"

selects lines where the first field begins with "s", but is not "smith". && and || guarantee that their operands will be evaluated from left to right; evaluation stops as soon as the truth or falsehood is determined.

2.5. Pattern Ranges

The "pattern" that selects an action may also consist of two patterns separated by a comma, as in

pat1, pat2 { ... }

In this case, the action is performed for each line between an occurrence of pat1 and the next occurrence of pat2 (inclusive). For example,

/start/, /stop/

prints all lines between start and stop, while

NR == 100, NR == 200 { ... }

does the action for lines 100 through 200 of the input.

3. Actions

An awk action is a sequence of action statements terminated by newlines or semicolons. These action statements can be used to do a variety of bookkeeping and string manipulating tasks.

3.1. Built-in Functions

Awk provides a "length" function to compute the length of a string of characters. This program prints each record, preceded by its length:

{print length, $0}

length by itself is a "pseudo-variable" which yields the length of the current record; length(argument) is a function which yields the length of its argument, as in the equivalent

{print length($0), $0}

The argument may be any expression.

Awk also provides the arithmetic functions sqrt, log, exp, and int, for square root, base e logarithm, exponential, and integer part of their respective arguments. The name of one of these built-in functions, without argument or parentheses, stands for the value of the function on the whole record. The program

length < 10 || length > 20

prints lines whose length is less than 10 or greater than 20.
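As a small sketch of the arithmetic functions mentioned above, the program

{ print $1, sqrt($1), int($1 + 0.5) }

prints each first field along with its square root and its value rounded to the nearest integer (assuming the field is a non-negative number).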
The function substr(s, m, n) produces the substring of s that begins at position m (origin 1) and is at most n characters long. If n is omitted, the substring goes to the end of s. The function index(s1, s2) returns the position where the string s2 occurs in s1, or zero if it does not.
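For example, assuming the first field carries a date written like 19780901, the program

{ print substr($1, 1, 4), index($0, "smith") }

prints the four-character year part of that field and the position of the first "smith" in the record (zero if there is none).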
The function sprintf(f, e1, e2, ...) produces the value of the expressions e1, e2, etc., in the printf format specified by f. Thus, for example,

x = sprintf("%8.2f %10ld", $1, $2)

sets x to the string produced by formatting the values of $1 and $2.

3.2. Variables, Expressions, and Assignments

Awk variables take on numeric (floating point) or string values according to context. For example, in

x = 1

x is clearly a number, while in

x = "smith"

it is clearly a string. Strings are converted to numbers and vice versa whenever context demands it. For instance,

x = "3" + "4"

assigns 7 to x. Strings which cannot be interpreted as numbers in a numerical context will generally have numeric value zero, but it is unwise to count on this behavior.

By default, variables (other than built-ins) are initialized to the null string, which has numerical value zero; this eliminates the need for most BEGIN sections. For example, the sums of the first two fields can be computed by

{ s1 += $1; s2 += $2 }
END { print s1, s2 }

Arithmetic is done internally in floating point. The arithmetic operators are +, -, *, /, and % (mod). The C increment ++ and decrement -- operators are also available, and so are the assignment operators +=, -=, *=, /=, and %=. These operators may all be used in expressions.
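As a further sketch, the average of the first field over all input can be computed in the same style:

{ s += $1 }
END { if (NR > 0) print s/NR }

s starts out as zero without any declaration, and NR holds the number of records read by the time the END action runs.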
3.3. Field Variables

Fields in awk share essentially all of the properties of variables — they may be used in arithmetic or string operations, and may be assigned to. Thus one can replace the first field with a sequence number like this:

{ $1 = NR; print }

or accumulate two fields into a third, like this:

{ $1 = $2 + $3; print $0 }

or assign a string to a field:

{ if ($3 > 1000)
      $3 = "too big"
  print
}

which replaces the third field by "too big" when it is, and in any case prints the record.

Field references may be numerical expressions, as in

{ print $i, $(i+1), $(i+n) }

Whether a field is deemed numeric or string depends on context; in ambiguous cases like

if ($1 == $2) ...

fields are treated as strings.

Each input line is split into fields automatically as necessary. It is also possible to split any variable or string into fields:

n = split(s, array, sep)

splits the string s into array[1], ..., array[n]. The number of elements found is returned. If the sep argument is provided, it is used as the field separator; otherwise FS is used as the separator.
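As an illustration (the string being split is only an example), the action

{ n = split("1978-09-01", d, "-"); print n, d[1] }

sets n to 3, puts the three pieces in d[1] through d[3], and prints 3 followed by 1978 for every input line.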
3.4. String Concatenation

Strings may be concatenated. For example

length($1 $2 $3)

returns the length of the first three fields. Or in a print statement,

print $1 " is " $2

prints the two fields separated by " is ". Variables and numeric expressions may also appear in concatenations.

3.5. Arrays

Array elements are not declared; they spring into existence by being mentioned. Subscripts may have any non-null value, including non-numeric strings. As an example of a conventional numeric subscript, the statement

x[NR] = $0

assigns the current input record to the NR-th element of the array x. In fact, it is possible in principle (though perhaps slow) to process the entire input in a random order with the awk program

{ x[NR] = $0 }
END { ... program ... }

The first action merely records each input line in the array x.

Array elements may be named by non-numeric values, which gives awk a capability rather like the associative memory of Snobol tables. Suppose the input contains fields with values like apple, orange, etc. Then the program

/apple/  { x["apple"]++ }
/orange/ { x["orange"]++ }
END      { print x["apple"], x["orange"] }

increments counts for the named array elements, and prints them at the end of the input.

3.6. Flow-of-Control Statements

Awk provides the basic flow-of-control statements if-else, while, for, and statement grouping with braces, as in C. We showed the if statement in section 3.3 without describing it. The condition in parentheses is evaluated; if it is true, the statement following the if is done. The else part is optional.

The while statement is exactly like that of C. For example, to print all input fields one per line,

i = 1
while (i <= NF) {
    print $i
    ++i
}

The for statement is also exactly that of C:

for (i = 1; i <= NF; i++)
    print $i

does the same job as the while statement above.

There is an alternate form of the for statement which is suited for accessing the elements of an associative array:

for (i in array)
    statement

does statement with i set in turn to each element of array. The elements are accessed in an apparently random order. Chaos will ensue if i is altered, or if any new elements are accessed during the loop.
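Continuing the fruit-counting sketch above, the accumulated counts could be printed at the end with

END { for (w in x) print w, x[w] }

which prints each subscript of x together with its count, in no particular order.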
The expression in the condition part of an if, while or for can include relational operators like <, <=, >, >=, == ("is equal to"), and != ("not equal to"); regular expression matches with the match operators ~ and !~; the logical operators ||, &&, and !; and of course parentheses for grouping.

The break statement causes an immediate exit from an enclosing while or for; the continue statement causes the next iteration to begin.

The statement next causes awk to skip immediately to the next record and begin scanning the patterns from the top. The statement exit causes the program to behave as if the end of the input had occurred.
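For instance, assuming lines whose first character is # are to be ignored and only the first 1000 records are of interest, a sketch such as

/^#/ { next }
NR > 1000 { exit }
{ print $1 }

skips the marked lines, quits as soon as more than 1000 records have been read, and prints the first field of everything else.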
Comments may be placed in awk programs: they begin with the character # and end with the end of the line, as in

print x, y   # this is a comment

4. Design

The UNIX system already provides several programs that operate by passing input through a selection mechanism. Grep, the first and simplest, merely prints all lines which match a single specified pattern. Egrep provides more general patterns, i.e., regular expressions in full generality; fgrep searches for a set of keywords with a particularly fast algorithm. Sed [1] provides most of the editing facilities of the editor ed, applied to a stream of input. None of these programs provides numeric capabilities, logical relations, or variables.

Lex [3] provides general regular expression recognition capabilities, and, by serving as a C program generator, is essentially open-ended in its capabilities. The use of lex, however, requires a knowledge of C programming, and a lex program must be compiled and loaded before use, which discourages its use for one-shot applications.

Awk is an attempt to fill in another part of the matrix of possibilities. It provides general regular expression capabilities and an implicit input/output loop. But it also provides convenient numeric processing, variables, more general selection, and control flow in the actions. It does not require compilation or a knowledge of C. Finally, awk provides a convenient way to access fields within lines; it is unique in this respect.

Awk also tries to integrate strings and numbers completely, by treating all quantities as both string and numeric, deciding which representation is appropriate as late as possible. In most cases the user can simply ignore the differences.

Most of the effort in developing awk went into deciding what awk should or should not do (for instance, it doesn't do string substitution) and what the syntax should be (no explicit operator for concatenation) rather than on writing or debugging the code. We have tried to make the syntax powerful but easy to use and well adapted to scanning files. For example, the absence of declarations and implicit initializations, while probably a bad idea for a general-purpose programming language, is desirable in a language that is meant to be used for tiny programs that may even be composed on the command line.

In practice, awk usage seems to fall into two broad categories. One is what might be called "report generation" — processing an input to extract counts, sums, sub-totals, etc. This also includes the writing of trivial data validation programs, such as verifying that a field contains only numeric information or that certain delimiters are properly balanced. The combination of textual and numeric processing is invaluable here.
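A minimal sketch of such a validation program (the field layout is only an assumption) is

$2 !~ /^[0-9]+$/ { print "bad second field on line", NR }

which complains about every line whose second field is not made up entirely of digits.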
A second area of use is as a data transformer, converting data from the form produced by one program into that expected by another. The simplest examples merely select fields, perhaps with rearrangements.

5. Implementation

The actual implementation of awk uses the language development tools available on the UNIX operating system. The grammar is specified with yacc [4]; the lexical analysis is done by lex; the regular expression recognizers are deterministic finite automata constructed directly from the expressions. An awk program is translated into a parse tree which is then directly executed by a simple interpreter.

Awk was designed for ease of use rather than processing speed; the delayed evaluation of variable types and the necessity to break input into fields makes high speed difficult to achieve in any case. Nonetheless, the program has not proven to be unworkably slow.

Table I below shows the execution (user + system) time on a PDP-11/70 of the UNIX programs wc, grep, egrep, fgrep, sed, lex, and awk on the following simple tasks:

1. count the number of lines.
2. print all lines containing "doug".
3. print all lines containing "doug", "ken" or "dmr".
4. print the third field of each line.
5. print the third and second fields of each line, in that order.
6. append all lines containing "doug", "ken", and "dmr" to files "jdoug", "jken", and "jdmr", respectively.
7. print each line prefixed by "line-number : ".
8. sum the fourth column of a table.

The program wc merely counts words, lines and characters in its input; we have already mentioned the others. In all cases the input was a file containing 10,000 lines as created by the command ls -l; each line has the form

-rw-rw-rw-  1 ava      123 Oct 15 17:05 xxx

The total length of this input is 452,960 characters. Times for lex do not include compile or load.
                                   Task
Program        1       2       3       4       5       6       7       8

wc           8.6
grep        11.7    13.1
egrep        6.2    11.5    11.6
fgrep        7.7    13.8    16.1
sed         10.2    11.6    15.8    29.0    30.5    16.1
lex         65.1   150.1   144.2    67.7    70.3   104.0    81.7    92.8
awk         15.0    25.6    29.9    33.3    38.9    46.4    71.4    31.1

Table I. Execution Times of Programs. (Times are in sec.)

As might be expected, awk is not as fast as the specialized tools wc, sed, or the programs in the grep family, but is faster than the more general tool lex. In all cases, the tasks were about as easy to express as awk programs as programs in these other languages; tasks involving fields were considerably easier to express as awk programs. Some of the test programs are shown in awk, sed and lex.

The programs for some of these jobs are shown below. The lex programs are generally too long to show.

AWK:

1.  END {print NR}

2.  /doug/

3.  /ken|doug|dmr/

4.  {print $3}

5.  {print $3, $2}

6.  /ken/ {print >"jken"}
    /doug/ {print >"jdoug"}
    /dmr/ {print >"jdmr"}

7.  {print NR ": " $0}

8.  {sum = sum + $4}
    END {print sum}

SED:

1.  $=

2.  /doug/p

3.  /doug/p
    /doug/d
    /ken/p
    /ken/d
    /dmr/p
    /dmr/d

4.  /[^ ]* [ ]*[^ ]* [ ]*\([^ ]*\) .*/s//\1/p

5.  /[^ ]* [ ]*\([^ ]*\) [ ]*\([^ ]*\) .*/s//\2 \1/p

6.  /ken/w jken
    /doug/w jdoug
    /dmr/w jdmr

LEX:

1.  %{
    int i;
    %}
    %%
    \n      i++;
    .       ;
    %%
    yywrap() {
        printf("%d\n", i);
    }

2.  %%
    ^.*doug.*$      printf("%s\n", yytext);
    .       ;
    \n      ;

References

1. K. Thompson and D. M. Ritchie, UNIX Programmer's Manual, Sixth Edition, Bell Laboratories (May 1975).

2. B. W. Kernighan and D. M. Ritchie, The C Programming Language, Prentice-Hall, Englewood Cliffs, New Jersey (1978).

3. M. E. Lesk, "Lex — A Lexical Analyzer Generator," Comp. Sci. Tech. Rep. No. 39, Bell Laboratories, Murray Hill, New Jersey (1975).

4. S. C. Johnson, "Yacc — Yet Another Compiler-Compiler," Comp. Sci. Tech. Rep. No. 32, Bell Laboratories, Murray Hill, New Jersey (July 1975).
