# labels.awk --- print mailing labels

# Each label is 5 lines of data that may have blank lines.
# The label sheets have 2 blank lines at the top and 2 at
# the bottom.

BEGIN { RS = "" ; MAXLINES = 100 }

function printpage(    i, j)
{
    if (Nlines <= 0)
        return

    printf "\n\n"        # header

    for (i = 1; i <= Nlines; i += 10) {
        if (i == 21 || i == 61)
            print ""
        for (j = 0; j < 5; j++) {
            if (i + j > MAXLINES)
                break
            printf "   %-41s %s\n", line[i+j], line[i+j+5]
        }
        print ""
    }

    printf "\n\n"        # footer

    delete line
}
# main rule
{
    if (Count >= 20) {
        printpage()
        Count = 0
        Nlines = 0
    }
    n = split($0, a, "\n")
    for (i = 1; i <= n; i++)
        line[++Nlines] = a[i]
    for (; i <= 5; i++)
        line[++Nlines] = ""
    Count++
}

END {
    printpage()
}
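Assuming the program has been saved in a file named labels.awk and the address data is in a file named addresses (both filenames are hypothetical), a run might look like this:

$ gawk -f labels.awk addresses > labels.out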
Generating Word-Usage Counts
When working with large amounts of text, it can be interesting to know how often
different words appear. For example, an author may overuse certain words, in which case he or she might wish to find synonyms to substitute for words that appear too often. This subsection develops a program for counting words and presenting the frequency
information in a useful format.
At first glance, a program like this would seem to do the job:
# wordfreq-first-try.awk --- print list of word frequencies
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
The program relies on awk’s default field-splitting mechanism to break each line up into
“words” and uses an associative array named freq, indexed by each word, to count the number of times the word occurs. In the END rule, it prints the counts.
This program has several problems that prevent it from being useful on real text files:

The awk language considers upper- and lowercase characters to be distinct. Therefore, “bartender” and “Bartender” are not treated as the same word. This is undesirable, because words are capitalized if they begin sentences in normal text, and a frequency analyzer should not be sensitive to capitalization.

Words are detected using the awk convention that fields are separated just by whitespace. Other characters in the input (except newlines) don’t have any special meaning to awk. This means that punctuation characters count as part of words.

The output does not come out in any useful order. You’re more likely to be interested in which words occur most frequently or in having an alphabetized table of how frequently each word occurs.
The first problem can be solved by using tolower() to remove case distinctions. The
second problem can be solved by using gsub() to remove punctuation characters. Finally, we solve the third problem by using the system sort utility to process the output of the awk script. Here is the new version of the program:
# wordfreq.awk --- print list of word frequencies
{
    $0 = tolower($0)    # remove case distinctions
    # remove punctuation
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
The regexp /[^[:alnum:]_[:blank:]]/ might have been written /[[:punct:]]/, but
then underscores would also be removed, and we want to keep them.
Assuming we have saved this program in a file named wordfreq.awk, and that the data is in file1, the following pipeline:
awk -f wordfreq.awk file1 | sort -k 2nr
produces a table of the words appearing in file1 in order of decreasing frequency.
The awk program suitably massages the data and produces a word frequency table, which is not ordered. The awk script’s output is then sorted by the sort utility and printed on the screen.
The options given to sort specify a sort that uses the second field of each input line (skipping one field), that the sort keys should be treated as numeric quantities (otherwise
‘15’ would come before ‘5’), and that the sorting should be done in descending (reverse) order.
The sort could even be done from within the program, by changing the END action to:
END {
    sort = "sort -k 2nr"
    for (word in freq)
        printf "%s\t%d\n", word, freq[word] | sort
    close(sort)
}
This way of sorting must be used on systems that do not have true pipes at the command-line (or batch-file) level. See the general operating system documentation for more
information on how to use the sort program.
Removing Duplicates from Unsorted Text
The uniq program (see Printing Nonduplicated Lines of Text) removes duplicate lines from sorted data.
Suppose, however, you need to remove duplicate lines from a datafile but that you want to preserve the order the lines are in. A good example of this might be a shell history file.
The history file keeps a copy of all the commands you have entered, and it is not unusual to repeat a command several times in a row. Occasionally you might want to compact the history by removing duplicate entries. Yet it is desirable to maintain the order of the original commands.
This simple program does the job. It uses two arrays. The data array is indexed by the text of each line. For each line, data[$0] is incremented. If a particular line has not been seen before, then data[$0] is zero. In this case, the text of the line is stored in lines[count].
Each element of lines is a unique command, and the indices of lines indicate the order in which those lines are encountered. The END rule simply prints out the lines, in order:
# histsort.awk --- compact a shell history file
# Thanks to Byron Rakitzis for the general idea

{
    if (data[$0]++ == 0)
        lines[++count] = $0
}

END {
    for (i = 1; i <= count; i++)
        print lines[i]
}
This program also provides a foundation for generating other useful information. For example, using the following print statement in the END rule indicates how often a
particular command is used:
print data[lines[i]], lines[i]
This works because data[$0] is incremented each time a line is seen.
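For instance, here is a minimal sketch of the END rule modified along these lines (the rest of histsort.awk is unchanged):

END {
    # print each unique line preceded by its frequency
    for (i = 1; i <= count; i++)
        print data[lines[i]], lines[i]
}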
Extracting Programs from Texinfo Source Files
Both this chapter and the previous chapter (Chapter 10) present a large number of awk programs. If you want to experiment with these programs, it is tedious to type them in by hand. Here we present a program that can extract parts of a Texinfo input file into separate files.
This book is written in Texinfo, the GNU Project’s document formatting language. A single Texinfo source file can be used to produce both printed documentation, with TeX, and online documentation. (Texinfo is fully documented in the book Texinfo — The GNU
Documentation Format, available from the Free Software Foundation, and also available
online.)
For our purposes, it is enough to know three things about Texinfo input files:
The “at” symbol (‘@’) is special in Texinfo, much as the backslash (‘\’) is in C or awk.
Literal ‘@’ symbols are represented in Texinfo source files as ‘@@’.
Comments start with either ‘@c’ or ‘@comment’. The file-extraction program works by
using special comments that start at the beginning of a line.
Lines containing ‘@group’ and ‘@end group’ commands bracket example text that
should not be split across a page boundary. (Unfortunately, TeX isn’t always smart
enough to do things exactly right, so we have to give it some help.)
The following program, extract.awk, reads through a Texinfo source file and does two things, based on the special comments. Upon seeing ‘@c system …’, it runs a command, by extracting the command text from the control line and passing it on to the system() function (see Input/Output Functions). Upon seeing ‘@c file filename’, each subsequent line is sent to the file filename, until ‘@c endfile’ is encountered. The rules in extract.awk match either ‘@c’ or ‘@comment’ by letting the ‘omment’ part be optional.
Lines containing ‘@group’ and ‘@end group’ are simply removed. extract.awk uses the
join() library function (see Merging an Array into a String).
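For reference, a version of join() along the lines of the library function presented in Merging an Array into a String looks like this; it treats SUBSEP as a magic value meaning “no separator at all”:

# join.awk --- join an array into a string
function join(array, start, end, sep,    result, i)
{
    if (sep == "")
        sep = " "
    else if (sep == SUBSEP)    # magic value
        sep = ""
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result sep array[i]
    return result
}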
The example programs in the online Texinfo source for Effective awk Programming
(gawktexi.in) have all been bracketed inside ‘file’ and ‘endfile’ lines. The gawk
distribution uses a copy of extract.awk to extract the sample programs and install many of them in a standard directory where gawk can find them. The Texinfo file looks
something like this:
…
This program has a @code{BEGIN} rule
that prints a nice message:
@example
@c file examples/messages.awk
BEGIN @{ print "Don't panic!" @}
@c end file
@end example
It also prints some final advice:
@example
@c file examples/messages.awk
END @{ print "Always avoid bored archaeologists!" @}
@c end file
@end example
…
extract.awk begins by setting IGNORECASE to one, so that mixed upper- and lowercase
letters in the directives won’t matter.
The first rule handles calling system(), checking that a command is given (NF is at least three), and also checking that the command exits with a zero exit status, signifying OK:
# extract.awk --- extract files and run programs from Texinfo files

BEGIN { IGNORECASE = 1 }

/^@c(omment)?[ \t]+system/ {
    if (NF < 3) {
        e = ("extract: " FILENAME ":" FNR)
        e = (e ": badly formed `system' line")
        print e > "/dev/stderr"
        next
    }
    $1 = ""
    $2 = ""
    stat = system($0)
    if (stat != 0) {
        e = ("extract: " FILENAME ":" FNR)
        e = (e ": warning: system returned " stat)
        print e > "/dev/stderr"
    }
}
The variable e is used so that the rule fits nicely on the page.
The second rule handles moving data into files. It verifies that a filename is given in the directive. If the file named is not the current file, then the current file is closed. Keeping the current file open until a new file is encountered allows the use of the ‘>’ redirection for printing the contents, keeping open-file management simple.
The for loop does the work. It reads lines using getline (see Explicit Input with getline).
For an unexpected end-of-file, it calls the unexpected_eof() function. If the line is an
“endfile” line, then it breaks out of the loop. If the line is an ‘@group’ or ‘@end group’ line, then it ignores it and goes on to the next line. Similarly, comments within examples are also ignored.
Most of the work is in the following few lines. If the line has no ‘@’ symbols, the program can print it directly. Otherwise, each leading ‘@’ must be stripped off. To remove the ‘@’
symbols, the line is split into separate elements of the array a, using the split() function (see String-Manipulation Functions). The ‘@’ symbol is used as the separator character.
Each element of a that is empty indicates two successive ‘@’ symbols in the original line.
For each two empty elements (‘@@’ in the original file), we have to add a single ‘@’ symbol back in.
When the processing of the array is finished, join() is called with the value of SUBSEP
(see Multidimensional Arrays), to rejoin the pieces back into a single line. That line is then printed to the output file:
/^@c(omment)?[ \t]+file/ {
    if (NF != 3) {
        e = ("extract: " FILENAME ":" FNR ": badly formed `file' line")
        print e > "/dev/stderr"
        next
    }
    if ($3 != curfile) {
        if (curfile != "")
            close(curfile)
        curfile = $3
    }

    for (;;) {
        if ((getline line) <= 0)
            unexpected_eof()
        if (line ~ /^@c(omment)?[ \t]+endfile/)
            break
        else if (line ~ /^@(end[ \t]+)?group/)
            continue
        else if (line ~ /^@c(omment+)?[ \t]+/)
            continue
        if (index(line, "@") == 0) {
            print line > curfile
            continue
        }
        n = split(line, a, "@")
        # if a[1] == "", means leading @,
        # don't add one back in.
        for (i = 2; i <= n; i++) {
            if (a[i] == "") { # was an @@
                a[i] = "@"
                if (a[i+1] == "")
                    i++
            }
        }
        print join(a, 1, n, SUBSEP) > curfile
    }
}
An important thing to note is the use of the ‘>’ redirection. Output done with ‘>’ only opens the file once; it stays open and subsequent output is appended to the file (see
Redirecting Output of print and printf). This makes it easy to mix program text and explanatory prose for the same sample source file (as has been done here!) without any hassle. The file is only closed when a new datafile name is encountered or at the end of the input file.
Finally, the function unexpected_eof() prints an appropriate error message and then
exits. The END rule handles the final cleanup, closing the open file:
function unexpected_eof()
{
    printf("extract: %s:%d: unexpected EOF or error\n",
        FILENAME, FNR) > "/dev/stderr"
    exit 1
}

END {
    if (curfile)
        close(curfile)
}
A Simple Stream Editor
The sed utility is a stream editor, a program that reads a stream of data, makes changes to it, and passes it on. It is often used to make global changes to a large file or to a stream of data generated by a pipeline of commands. Although sed is a complicated program in its own right, its most common use is to perform global substitutions in the middle of a pipeline:
command1 < orig.data | sed 's/old/new/g' | command2 > result
Here, ‘s/old/new/g’ tells sed to look for the regexp ‘old’ on each input line and globally replace it with the text ‘new’ (i.e., all the occurrences on a line). This is similar to awk’s gsub() function (see String-Manipulation Functions).
The following program, awksed.awk, accepts at least two command-line arguments: the
pattern to look for and the text to replace it with. Any additional arguments are treated as datafile names to process. If none are provided, the standard input is used:
# awksed.awk --- do s/foo/bar/g using just print
# Thanks to Michael Brennan for the idea

function usage()
{
    print "usage: awksed pat repl [files…]" > "/dev/stderr"
    exit 1
}

BEGIN {
    # validate arguments
    if (ARGC < 3)
        usage()

    RS = ARGV[1]
    ORS = ARGV[2]

    # don't use arguments as files
    ARGV[1] = ARGV[2] = ""
}

# look ma, no hands!
{
    if (RT == "")
        printf "%s", $0
    else
        print
}
The program relies on gawk’s ability to have RS be a regexp, as well as on the setting of RT
to the actual text that terminates the record (see How Input Is Split into Records).
The idea is to have RS be the pattern to look for. gawk automatically sets $0 to the text between matches of the pattern. This is text that we want to keep, unmodified. Then, by setting ORS to the replacement text, a simple print statement outputs the text we want to keep, followed by the replacement text.
There is one wrinkle to this scheme, which is what to do if the last record doesn’t end with text that matches RS. Using a print statement unconditionally prints the replacement text, which is not correct. However, if the file did not end in text that matches RS, RT is set to
the null string. In this case, we can print $0 using printf (see Using printf Statements for
Fancier Printing).
The BEGIN rule handles the setup, checking for the right number of arguments and calling usage() if there is a problem. Then it sets RS and ORS from the command-line arguments and sets ARGV[1] and ARGV[2] to the null string, so that they are not treated as filenames (see Using ARGC and ARGV).
The usage() function prints an error message and exits. Finally, the single rule handles the printing scheme outlined earlier, using print or printf as appropriate, depending upon the value of RT.
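For example, assuming the program has been saved in a file named awksed.awk, a sample run might look like this:

$ echo hello world | gawk -f awksed.awk 'hello' 'goodbye'
goodbye world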
An Easy Way to Use Library Functions
In Including Other Files into Your Program, we saw how gawk provides a built-in file-inclusion capability. However, this is a gawk extension. This section provides the
motivation for making file inclusion available for standard awk, and shows how to do it using a combination of shell and awk programming.
Using library functions in awk can be very beneficial. It encourages code reuse and the writing of general functions. Programs are smaller and therefore clearer. However, using library functions is only easy when writing awk programs; it is painful when running them, requiring multiple -f options. If gawk is unavailable, then so too is the AWKPATH
environment variable and the ability to put awk functions into a library directory (see
Command-Line Options). It would be nice to be able to write programs in the following manner:
# library functions
@include getopt.awk
@include join.awk
…
# main program
BEGIN {
    while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
        …
    …
}
The following program, igawk.sh, provides this service. It simulates gawk’s searching of the AWKPATH variable and also allows nested includes (i.e., a file that is included with
@include can contain further @include statements). igawk makes an effort to only include files once, so that nested includes don’t accidentally include a library function twice.
igawk should behave just like gawk externally. This means it should accept all of gawk’s command-line arguments, including the ability to have multiple source files specified via
-f and the ability to mix command-line and library source files.
The program is written using the POSIX Shell (sh) command language.[75] It works as follows:
1. Loop through the arguments, saving anything that doesn’t represent awk source code for later, when the expanded program is run.
2. For any arguments that do represent awk text, put the arguments into a shell variable that will be expanded. There are two cases:
a. Literal text, provided with -e or --source. This text is just appended directly.
b. Source filenames, provided with -f. We use a neat trick and append ‘@include
filename’ to the shell variable’s contents. Because the file-inclusion program works the way gawk does, this gets the text of the file included in the program
at the correct point.
3. Run an awk program (naturally) over the shell variable’s contents to expand
@include statements. The expanded program is placed in a second shell variable.
4. Run the expanded program with gawk and any other original command-line
arguments that the user supplied (such as the datafile names).
This program uses shell variables extensively: for storing command-line arguments and the text of the awk program that will expand the user’s program, for the user’s original program, and for the expanded program. Doing so removes some potential problems that might arise were we to use temporary files instead, at the cost of making the script somewhat more complicated.
The initial part of the program turns on shell tracing if the first argument is ‘debug’.
The next part loops through all the command-line arguments. There are several cases of interest:
--
This ends the arguments to igawk. Anything else should be passed on to the user’s awk program without being evaluated.
-W
This indicates that the next option is specific to gawk. To make argument processing easier, the -W is appended to the front of the remaining arguments and the loop
continues. (This is an sh programming trick. Don’t worry about it if you are not familiar with sh.)
-v, -F
These are saved and passed on to gawk.
-f, --file, --file=, -Wfile=
The filename is appended to the shell variable program with an @include statement.
The expr utility is used to remove the leading option part of the argument (e.g., ‘--
file=’). (Typical sh usage would be to use the echo and sed utilities to do this work.
Unfortunately, some versions of echo evaluate escape sequences in their arguments,
possibly mangling the program text. Using expr avoids this problem.)
--source, --source=, -Wsource=
The source text is appended to program.
--version, -Wversion
igawk prints its version number, runs ‘gawk --version’ to get the gawk version
information, and then exits.
If none of the -f, --file, -Wfile, --source, or -Wsource arguments are supplied, then the first nonoption argument should be the awk program. If there are no command-line
arguments left, igawk prints an error message and exits. Otherwise, the first argument is appended to program. In any case, after the arguments have been processed, the shell variable program contains the complete text of the original awk program.
The program is as follows:
#! /bin/sh
# igawk --- like gawk but do @include processing

if [ "$1" = debug ]
then
    set -x
    shift
fi

# A literal newline, so that program text is formatted correctly
n='
'

# Initialize variables to empty
program=
opts=

while [ $# -ne 0 ] # loop over arguments
do
    case $1 in
    --)     shift
            break ;;

    -W)     shift
            # The ${x?'message here'} construct prints a
            # diagnostic if $x is the null string
            set -- -W"${@?'missing operand'}"
            continue ;;

    -[vF])  opts="$opts $1 '${2?'missing operand'}'"
            shift ;;

    -[vF]*) opts="$opts '$1'" ;;

    -f)     program="$program$n@include ${2?'missing operand'}"
            shift ;;

    -f*)    f=$(expr "$1" : '-f\(.*\)')
            program="$program$n@include $f" ;;

    -[W-]file=*)
            f=$(expr "$1" : '-.file=\(.*\)')
            program="$program$n@include $f" ;;

    -[W-]file)
            program="$program$n@include ${2?'missing operand'}"
            shift ;;

    -[W-]source=*)
            t=$(expr "$1" : '-.source=\(.*\)')
            program="$program$n$t" ;;

    -[W-]source)
            program="$program$n${2?'missing operand'}"
            shift ;;

    -[W-]version)
            echo igawk: version 3.0 1>&2
            gawk --version
            exit 0 ;;

    -[W-]*) opts="$opts '$1'" ;;

    *)      break ;;
    esac
    shift
done

if [ -z "$program" ]
then
    program=${1?'missing program'}
    shift
fi

# At this point, `program' has the program.
# At this point, `program' has the program.
The awk program to process @include directives is stored in the shell variable
expand_prog. Doing this keeps the shell script readable. The awk program reads through the user’s program, one line at a time, using getline (see Explicit Input with getline). The input filenames and @include statements are managed using a stack. As each @include is encountered, the current filename is “pushed” onto the stack and the file named in the
@include directive becomes the current filename. As each file is finished, the stack is
“popped,” and the previous input file becomes the current input file again. The process is started by making the original file the first one on the stack.
The pathto() function does the work of finding the full path to a file. It simulates gawk’s
behavior when searching the AWKPATH environment variable (see The AWKPATH
Environment Variable). If a filename has a ‘/’ in it, no path search is done. Similarly, if the filename is "-", then that string is used as-is. Otherwise, the filename is concatenated with the name of each directory in the path, and an attempt is made to open the generated filename. The only way to test if a file can be read in awk is to go ahead and try to read it with getline; this is what pathto() does.[76] If the file can be read, it is closed and the filename is returned:
expand_prog='

function pathto(file,    i, t, junk)
{
    if (index(file, "/") != 0)
        return file

    if (file == "-")
        return file

    for (i = 1; i <= ndirs; i++) {
        t = (pathlist[i] "/" file)
        if ((getline junk < t) > 0) {
            # found it
            close(t)
            return t
        }
    }
    return ""
}
The main program is contained inside one BEGIN rule. The first thing it does is set up the pathlist array that pathto() uses. After splitting the path on ‘:’, null elements are replaced with ".", which represents the current directory:
BEGIN {
    path = ENVIRON["AWKPATH"]
    ndirs = split(path, pathlist, ":")
    for (i = 1; i <= ndirs; i++) {
        if (pathlist[i] == "")
            pathlist[i] = "."
    }
The stack is initialized with ARGV[1], which will be "/dev/stdin". The main loop comes next. Input lines are read in succession. Lines that do not start with @include are printed verbatim. If the line does start with @include, the filename is in $2. pathto() is called to generate the full path. If it cannot, then the program prints an error message and continues.
The next thing to check is if the file is included already. The processed array is indexed by the full filename of each included file and it tracks this information for us. If the file is seen again, a warning message is printed. Otherwise, the new filename is pushed onto the stack and processing continues.
Finally, when getline encounters the end of the input file, the file is closed and the stack is popped. When stackptr is less than zero, the program is done:
    stackptr = 0
    input[stackptr] = ARGV[1] # ARGV[1] is first file

    for (; stackptr >= 0; stackptr--) {
        while ((getline < input[stackptr]) > 0) {
            if (tolower($1) != "@include") {
                print
                continue
            }
            fpath = pathto($2)
            if (fpath == "") {
                printf("igawk: %s:%d: cannot find %s\n",
                    input[stackptr], FNR, $2) > "/dev/stderr"
                continue
            }
            if (! (fpath in processed)) {
                processed[fpath] = input[stackptr]
                input[++stackptr] = fpath # push onto stack
            } else
                print $2, "included in", input[stackptr],
                    "already included in",
                    processed[fpath] > "/dev/stderr"
        }
        close(input[stackptr])
    }
}' # close quote ends `expand_prog' variable
processed_program=$(gawk -- "$expand_prog" /dev/stdin << EOF
$program
EOF
)
The shell construct ‘ command << marker’ is called a here document. Everything in the shell script up to the marker is fed to command as input. The shell processes the contents of the here document for variable and command substitution (and possibly other things as well, depending upon the shell).
The shell construct ‘$(…)’ is called command substitution. The output of the command inside the parentheses is substituted into the command line. Because the result is used in a variable assignment, it is saved as a single string, even if the results contain whitespace.
The expanded program is saved in the variable processed_program. It’s done in these
steps:
1. Run gawk with the @include-processing program (the value of the expand_prog
shell variable) reading standard input.
2. Standard input is the contents of the user’s program, from the shell variable program.
Feed its contents to gawk via a here document.
3. Save the results of this processing in the shell variable processed_program by using command substitution.
The last step is to call gawk with the expanded program, along with the original options and command-line arguments that the user supplied:
eval gawk $opts -- '"$processed_program"' '"$@"'
The eval command is a shell construct that reruns the shell’s parsing process. This keeps things properly quoted.
This version of igawk represents the fifth version of this program. There are four key simplifications that make the program work better:
Using @include even for the files named with -f makes building the initial collected awk program much simpler; all the @include processing can be done once.
Not trying to save the line read with getline in the pathto() function when testing for the file’s accessibility for use with the main program simplifies things considerably.
Using a getline loop in the BEGIN rule does it all in one place. It is not necessary to call out to a separate loop for processing nested @include statements.
Instead of saving the expanded program in a temporary file, putting it in a shell
variable avoids some potential security problems. This has the disadvantage that the script relies upon more features of the sh language, making it harder to follow for those who aren’t familiar with sh.
Also, this program illustrates that it is often worthwhile to combine sh and awk
programming together. You can usually accomplish quite a lot, without having to resort to low-level programming in C or C++, and it is frequently easier to do certain kinds of string and argument manipulation using the shell than it is in awk.
Finally, igawk shows that it is not always necessary to add new features to a program; they can often be layered on top. [77]
Finding Anagrams from a Dictionary
An interesting programming challenge is to search for anagrams in a word list (such as
/usr/share/dict/words on many GNU/Linux systems). One word is an anagram of
another if both words contain the same letters (e.g., “babbling” and “blabbing”).
Column 2, Problem C, of Jon Bentley’s Programming Pearls, Second Edition, presents an elegant algorithm. The idea is to give words that are anagrams a common signature, sort all the words together by their signatures, and then print them. Dr. Bentley observes that taking the letters in each word and sorting them produces those common signatures.
The following program uses arrays of arrays to bring together words with the same
signature and array sorting to print the words in sorted order:
# anagram.awk --- An implementation of the anagram-finding algorithm
# from Jon Bentley's "Programming Pearls," 2nd edition.
# Addison Wesley, 2000, ISBN 0-201-65788-0.
# Column 2, Problem C, section 2.8, pp 18-20.
/'s$/ { next } # Skip possessives
The program starts with a header, and then a rule to skip possessives in the dictionary file.
The next rule builds up the data structure. The first dimension of the array is indexed by the signature; the second dimension is the word itself:
{
    key = word2key($1)    # Build signature
    data[key][$1] = $1    # Store word with signature
}
The word2key() function creates the signature. It splits the word apart into individual letters, sorts the letters, and then joins them back together:
# word2key --- split word apart into letters, sort, and join back together
function word2key(word,    a, i, n, result)
{
    n = split(word, a, "")
    asort(a)
    for (i = 1; i <= n; i++)
        result = result a[i]
    return result
}
Finally, the END rule traverses the array and prints out the anagram lists. It sends the output to the system sort command because otherwise the anagrams would appear in arbitrary
order:
END {
    sort = "sort"
    for (key in data) {
        # Sort words with same key
        nwords = asorti(data[key], words)
        if (nwords == 1)
            continue
        # And print. Minor glitch: trailing space at end of each line
        for (j = 1; j <= nwords; j++)
            printf("%s ", words[j]) | sort
        print "" | sort
    }
    close(sort)
}
Here is some partial output when the program is run:
$ gawk -f anagram.awk /usr/share/dict/words | grep '^b'
…
babbled blabbed
babbler blabber brabble
babblers blabbers brabbles
babbling blabbing
babbly blabby
babel bable
babels beslab
babery yabber
…
And Now for Something Completely Different
The following program was written by Davide Brini and is published on his website. It serves as his signature in the Usenet group comp.lang.awk. He supplies the following copyright terms:
Copyright © 2008 Davide Brini
Copying and distribution of the code published in this page, with or without modification, are permitted in any medium without royalty provided the copyright notice and this notice are preserved.
Here is the program:
awk 'BEGIN{O="~"~"~";o="=="=="==";o+=+o;x=O""O;while(X++<=x+o+o)c=c"%c"; printf c,(x-O)*(x-O),x*(x-o)-o,x*(x-O)+x-O-o,+x*(x-O)-x+o,X*(o*o+O)+x-O,
X*(X-x)-o*o,(x+X)*o*o+o,x*(X-x)-O-O,x-O+(O+o+X+x)*(o+O),X*X-X*(x-O)-x+O,
O+X*(o*(o+O)+O),+x+O+X*o,x*(x-o),(o+X+x)*o*o-(x-O-O),O+(X-x)*(X+O),x-O}'
We leave it to you to determine what the program does. (If you are truly desperate to understand it, see Chris Johansen’s explanation, which is embedded in the Texinfo source file for this book.)
Summary
The programs provided in this chapter continue on the theme that reading programs is an excellent way to learn Good Programming.
Using ‘#!’ to make awk programs directly runnable makes them easier to use.
Otherwise, invoke the program using ‘awk -f …’.
Reimplementing standard POSIX programs in awk is a pleasant exercise; awk’s
expressive power lets you write such programs in relatively few lines of code, yet they are functionally complete and usable.
One of standard awk’s weaknesses is working with individual characters. The ability to use split() with the empty string as the separator can considerably simplify such
tasks.
The examples here demonstrate the usefulness of the library functions from Chapter 10
for a number of real (if small) programs.
Besides reinventing POSIX wheels, other programs solved a selection of interesting
problems, such as finding duplicate words in text, printing mailing labels, and finding anagrams.
[69] It also introduces a subtle bug; if a match happens, we output the translated line, not the original.
[70] This is the traditional usage. The POSIX usage is different, but not relevant for what the program aims to demonstrate.
[71] This is the definition returned from entering define: state machine into Google.
[72] Because gawk understands multibyte locales, this code counts characters, not bytes.
[73] On some older systems, including Solaris, the system version of tr may require that the lists be written as range expressions enclosed in square brackets (‘[a-z]’) and quoted, to prevent the shell from attempting a filename expansion.
This is not a feature.
[74] “Real world” is defined as “a program actually used to get something done.”
[75] Fully explaining the sh language is beyond the scope of this book. We provide some minimal explanations, but see a good shell programming book if you wish to understand things in more depth.
[76] On some very old versions of awk, the test ‘getline junk < t’ can loop forever if the file exists but is empty.
[77] gawk does @include processing itself in order to support the use of awk programs as Web CGI scripts.
Part III. Moving Beyond Standard awk with gawk
Part III focuses on features specific to gawk. It contains the following chapters:
Chapter 12, Advanced Features of gawk
Chapter 13, Internationalization with gawk
Chapter 14, Debugging awk Programs
Chapter 15, Arithmetic and Arbitrary-Precision Arithmetic with gawk
Chapter 16, Writing Extensions for gawk
Chapter 12. Advanced Features of gawk
Write documentation as if whoever reads it is a violent psychopath who knows where you live.
— Steve English, as quoted by Peter Langston
This chapter discusses advanced features in gawk. It’s a bit of a “grab bag” of items that are otherwise unrelated to each other. First, we look at a command-line option that allows gawk to recognize nondecimal numbers in input data, not just in awk programs. Then,
gawk’s special features for sorting arrays are presented. Next, two-way I/O, discussed briefly in earlier parts of this book, is described in full detail, along with the basics of TCP/IP networking. Finally, we see how gawk can profile an awk program, making it possible to tune it for performance.
Additional advanced features are discussed in separate chapters of their own:
Chapter 13, Internationalization with gawk, discusses how to internationalize your awk programs, so that they can speak multiple national languages.
Chapter 14, Debugging awk Programs, describes gawk’s built-in command-line debugger for debugging awk programs.
Chapter 15, Arithmetic and Arbitrary-Precision Arithmetic with gawk, describes how you can use gawk to perform arbitrary-precision arithmetic.
Chapter 16, Writing Extensions for gawk, discusses the ability to dynamically add new built-in functions to gawk.
Allowing Nondecimal Input Data
If you run gawk with the --non-decimal-data option, you can have nondecimal values in your input data:
$ echo 0123 123 0x123 |
> gawk --non-decimal-data '{ printf "%d, %d, %d\n", $1, $2, $3 }'
83, 123, 291
For this feature to work, write your program so that gawk treats your data as numeric:
$ echo 0123 123 0x123 | gawk '{ print $1, $2, $3 }'
0123 123 0x123
The print statement treats its expressions as strings. Although the fields can act as numbers when necessary, they are still strings, so print does not try to treat them numerically. You need to add zero to a field to force it to be treated as a number. For example:
$ echo 0123 123 0x123 | gawk --non-decimal-data '
> { print $1, $2, $3
>   print $1 + 0, $2 + 0, $3 + 0 }'
0123 123 0x123
83 123 291
Because it is common to have decimal data with leading zeros, and because using this facility could lead to surprising results, the default is to leave it disabled. If you want it, you must explicitly request it.
CAUTION
Use of this option is not recommended. It can break old programs very badly. Instead, use the strtonum() function to convert your data (see String-Manipulation Functions). This makes your programs easier to write and easier to read, and leads to less surprising results.
This option may disappear in a future version of gawk.
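For example, a strtonum()-based version of the earlier example needs no special option at all:

$ echo 0123 123 0x123 | gawk '{ print strtonum($1), strtonum($2), strtonum($3) }'
83 123 291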
Controlling Array Traversal and Array Sorting
gawk lets you control the order in which a ‘for (indx in array)’ loop traverses an array.
In addition, two built-in functions, asort() and asorti(), let you sort arrays based on the array values and indices, respectively. These two functions also provide control over the sorting criteria used to order the elements during sorting.
Controlling Array Traversal
By default, the order in which a ‘for (indx in array)’ loop scans an array is not defined; it is generally based upon the internal implementation of arrays inside awk.
Often, though, it is desirable to be able to loop over the elements in a particular order that you, the programmer, choose. gawk lets you do this.
Using Predefined Array Scanning Orders with gawk describes how you can assign special, predefined values to PROCINFO["sorted_in"] in order to control the order in which gawk traverses an array during a for loop.
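For example, here is a minimal sketch using one of those predefined values, "@ind_str_asc" (ascending order, comparing indices as strings); the data is hypothetical:

BEGIN {
    data["banana"] = 2; data["apple"] = 1; data["cherry"] = 3
    PROCINFO["sorted_in"] = "@ind_str_asc"    # predefined scanning order
    for (k in data)
        print k, data[k]    # prints apple, banana, cherry
}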
In addition, the value of PROCINFO["sorted_in"] can be a function name. [78] This lets you traverse an array based on any custom criterion. The array elements are ordered according to the return value of this function. The comparison function should be defined with at least four arguments:
function comp_func(i1, v1, i2, v2)
{
    compare elements 1 and 2 in some fashion
    return < 0; 0; or > 0
}
Here, i1 and i2 are the indices, and v1 and v2 are the corresponding values of the two elements being compared. Either v1 or v2, or both, can be arrays if the array being
traversed contains subarrays as values. (See Arrays of Arrays for more information about subarrays.) The three possible return values are interpreted as follows:
comp_func(i1, v1, i2, v2) < 0
Index i1 comes before index i2 during loop traversal.
comp_func(i1, v1, i2, v2) == 0
Indices i1 and i2 come together, but the relative order with respect to each other is undefined.
comp_func(i1, v1, i2, v2) > 0
Index i1 comes after index i2 during loop traversal.
Our first comparison function can be used to scan an array in numerical order of the indices:
function cmp_num_idx(i1, v1, i2, v2)
{
    # numerical index comparison, ascending order
    return (i1 - i2)
}
Our second function traverses an array based on the string order of the element values rather than by indices:
function cmp_str_val(i1, v1, i2, v2)
{
    # string value comparison, ascending order
    v1 = v1 ""
    v2 = v2 ""
    if (v1 < v2)
        return -1
    return (v1 != v2)
}
The third comparison function makes all numbers, and numeric strings without any
leading or trailing spaces, come out first during loop traversal:
function cmp_num_str_val(i1, v1, i2, v2,    n1, n2)
{
    # numbers before string value comparison, ascending order
    n1 = v1 + 0
    n2 = v2 + 0
    if (n1 == v1)
        return (n2 == v2) ? (n1 - n2) : -1
    else if (n2 == v2)
        return 1
    return (v1 < v2) ? -1 : (v1 != v2)
}
Here is a main program to demonstrate how gawk behaves using each of the previous
functions:
BEGIN {
    data["one"] = 10
    data["two"] = 20
    data[10] = "one"
    data[100] = 100
    data[20] = "two"

    f[1] = "cmp_num_idx"
    f[2] = "cmp_str_val"
    f[3] = "cmp_num_str_val"

    for (i = 1; i <= 3; i++) {
        printf("Sort function: %s\n", f[i])
        PROCINFO["sorted_in"] = f[i]
        for (j in data)
            printf("\tdata[%s] = %s\n", j, data[j])
        print ""
    }
}
Here are the results when the program is run:
$ gawk -f compdemo.awk
Sort function: cmp_num_idx        Sort by numeric index
    data[two] = 20
    data[one] = 10                Both strings are numerically zero
    data[10] = one
    data[20] = two
    data[100] = 100

Sort function: cmp_str_val        Sort by element values as strings
    data[one] = 10
    data[100] = 100               String 100 is less than string 20
    data[two] = 20
    data[10] = one
    data[20] = two

Sort function: cmp_num_str_val    Sort all numeric values before all strings
    data[one] = 10
    data[two] = 20
    data[100] = 100
    data[10] = one
    data[20] = two
Consider sorting the entries of a GNU/Linux system password file according to login
name. The following program sorts records by a specific field position and can be used for this purpose:
# passwd-sort.awk --- simple program to sort by field position
# field position is specified by the global variable POS

function cmp_field(i1, v1, i2, v2)
{
    # comparison by value, as string, and ascending order
    return v1[POS] < v2[POS] ? -1 : (v1[POS] != v2[POS])
}

{
    for (i = 1; i <= NF; i++)
        a[NR][i] = $i
}

END {
    PROCINFO["sorted_in"] = "cmp_field"
    if (POS < 1 || POS > NF)
        POS = 1
    for (i in a) {
        for (j = 1; j <= NF; j++)
            printf("%s%c", a[i][j], j < NF ? ":" : "")
        print ""
    }
}
The first field in each entry of the password file is the user’s login name, and the fields are separated by colons. Each record defines a subarray, with each field as an element in the subarray. Running the program produces the following output:
$ gawk -v POS=1 -F: -f passwd-sort.awk /etc/passwd
adm:x:3:4:adm:/var/adm:/sbin/nologin
apache:x:48:48:Apache:/var/www:/sbin/nologin
avahi:x:70:70:Avahi daemon:/:/sbin/nologin
…
The comparison should normally always return the same value when given a specific pair of array elements as its arguments. If inconsistent results are returned, then the order is undefined. This behavior can be exploited to introduce random order into otherwise
seemingly ordered data:
function cmp_randomize(i1, v1, i2, v2)
{
    # random order (caution: this may never terminate!)
    return (2 - 4 * rand())
}
As already mentioned, the order of the indices is arbitrary if two elements compare equal.
This is usually not a problem, but letting the tied elements come out in arbitrary order can be an issue, especially when comparing item values. The partial ordering of the equal elements may change the next time the array is traversed, if other elements are added to or removed from the array. One way to resolve ties when comparing elements with otherwise equal values is to include the indices in the comparison rules. Note that doing this may make the loop traversal less efficient, so consider it only if necessary. The following comparison functions force a deterministic order, and are based on the fact that the (string) indices of two elements are never equal:
function cmp_numeric(i1, v1, i2, v2)
{
    # numerical value (and index) comparison, descending order
    return (v1 != v2) ? (v2 - v1) : (i2 - i1)
}

function cmp_string(i1, v1, i2, v2)
{
    # string value (and index) comparison, descending order
    v1 = v1 i1
    v2 = v2 i2
    return (v1 > v2) ? -1 : (v1 != v2)
}
A custom comparison function can often simplify ordered loop traversal, and the sky is really the limit when it comes to designing such a function.
When string comparisons are made during a sort, either for element values where one or both aren’t numbers, or for element indices handled as strings, the value of IGNORECASE
(see Predefined Variables) controls whether the comparisons treat corresponding upper- and lowercase letters as equivalent or distinct.
Another point to keep in mind is that in the case of subarrays, the element values can themselves be arrays; a production comparison function should use the isarray()
function (see Getting Type Information) to check for this, and choose a defined sorting order for subarrays.
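For example, one possible sketch of such a function, which sorts all scalar values (as strings) before all subarrays and breaks ties by index, might look like this:

function cmp_scalars_first(i1, v1, i2, v2)
{
    # subarrays sort after scalars; ties broken by index
    if (isarray(v1) != isarray(v2))
        return isarray(v1) ? 1 : -1
    if (isarray(v1))    # both are subarrays: order by index
        return (i1 < i2) ? -1 : (i1 != i2)
    v1 = v1 ""          # force string comparison of scalars
    v2 = v2 ""
    return (v1 < v2) ? -1 : (v1 != v2)
}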
All sorting based on PROCINFO["sorted_in"] is disabled in POSIX mode, because the PROCINFO array is not special in that case.
As a side note, sorting the array indices before traversing the array has been reported to add a 15% to 20% overhead to the execution time of awk programs. For this reason, sorted array traversal is not the default.
Sorting Array Values and Indices with gawk
In most awk implementations, sorting an array requires writing a sort() function. This can be educational for exploring different sorting algorithms, but usually that’s not the point of
the program. gawk provides the built-in asort() and asorti() functions (see String-
Manipulation Functions) for sorting arrays. For example:
populate the array data
n = asort(data)
for (i = 1; i <= n; i++)
    do something with data[i]
After the call to asort(), the array data is indexed from 1 to some number n, the total number of elements in data. (This count is asort()’s return value.) data[1] ≤ data[2] ≤
data[3], and so on. The default comparison is based on the type of the elements (see
Variable Typing and Comparison Expressions). All numeric values come before all string values, which in turn come before all subarrays.
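Here is a minimal, self-contained sketch of this pattern (the data is hypothetical):

BEGIN {
    data[10] = "c"; data[20] = "a"; data[30] = "b"
    n = asort(data)       # sort the values; the original indices are lost
    for (i = 1; i <= n; i++)
        print i, data[i]  # prints a, b, c in order
}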
An important side effect of calling asort() is that the array’s original indices are irrevocably lost. As this isn’t always desirable, asort() accepts a second argument:
populate the array source
n = asort(source, dest)
for (i = 1; i <= n; i++)
    do something with dest[i]
In this case, gawk copies the source array into the dest array and then sorts dest,
destroying its indices. However, the source array is not affected.
Often, what’s needed is to sort on the values of the indices instead of the values of the elements. To do that, use the asorti() function. The interface and behavior are identical to that of asort(), except that the index values are used for sorting and become the values of the result array:
{ source[$0] = some_func($0) }

END {
    n = asorti(source, dest)
    for (i = 1; i <= n; i++) {
        Work with sorted indices directly:
        do something with dest[i]
        …
        Access original array via sorted indices:
        do something with source[dest[i]]
    }
}
So far, so good. Now it starts to get interesting. Both asort() and asorti() accept a third string argument to control comparison of array elements. When we introduced asort()
and asorti() in String-Manipulation Functions, we ignored this third argument; however, now is the time to describe how this argument affects these two functions.
Basically, the third argument specifies how the array is to be sorted. There are two possibilities. As with PROCINFO["sorted_in"], this argument may be one of the
predefined names that gawk provides (see Using Predefined Array Scanning Orders with
gawk), or it may be the name of a user-defined function (see Controlling Array Traversal).
In the latter case, the function can compare elements in any way it chooses, taking into account just the indices, just the values, or both. This is extremely powerful.
Once the array is sorted, asort() takes the values in their final order and uses them to fill in the result array, whereas asorti() takes the indices in their final order and uses them to fill in the result array.
NOTE
Copying array indices and elements isn’t expensive in terms of memory. Internally, gawk maintains reference counts to data. For example, when asort() copies the first array to the second one, there is only one copy of the original array elements’ data, even though both arrays use the values.
Because IGNORECASE affects string comparisons, the value of IGNORECASE also affects
sorting for both asort() and asorti(). Note also that the locale’s sorting order does not come into play; comparisons are based on character values only. [79]
Two-Way Communications with Another Process
It is often useful to be able to send data to a separate program for processing and then read the result. This can always be done with temporary files:
# Write the data for processing
tempfile = ("mydata." PROCINFO["pid"])
while (not done with data)
    print data | ("subprogram > " tempfile)
close("subprogram > " tempfile)

# Read the results, remove tempfile when done
while ((getline newdata < tempfile) > 0)
    process newdata appropriately
close(tempfile)
system("rm " tempfile)
This works, but not elegantly. Among other things, it requires that the program be run in a directory that cannot be shared among users; for example, /tmp will not do, as another user might happen to be using a temporary file with the same name. [80]
However, with gawk, it is possible to open a two-way pipe to another process. The second process is termed a coprocess, as it runs in parallel with gawk. The two-way connection is created using the ‘|&’ operator (borrowed from the Korn shell, ksh):[81]
do {
    print data |& "subprogram"
    "subprogram" |& getline results
} while (data left to process)

close("subprogram")
The first time an I/O operation is executed using the ‘|&’ operator, gawk creates a two-way pipeline to a child process that runs the other program. Output created with print or printf is written to the program’s standard input, and output from the program’s standard output can be read by the gawk program using getline. As is the case with processes
started by ‘|’, the subprogram can be any program, or pipeline of programs, that can be started by the shell.
There are some cautionary items to be aware of:
As the code inside gawk currently stands, the coprocess’s standard error goes to the same place that the parent gawk’s standard error goes. It is not possible to read the child’s standard error separately.
I/O buffering may be a problem. gawk automatically flushes all output down the pipe to the coprocess. However, if the coprocess does not flush its output, gawk may hang
when doing a getline in order to read the coprocess’s results. This could lead to a
situation known as deadlock, where each process is waiting for the other one to do something.
It is possible to close just one end of the two-way pipe to a coprocess, by supplying a
second argument to the close() function of either "to" or "from" (see Closing Input and
Output Redirections). These strings tell gawk to close the end of the pipe that sends data to
the coprocess or the end that reads from it, respectively.
This is particularly necessary in order to use the system sort utility as part of a coprocess; sort must read all of its input data before it can produce any output. The sort program does not receive an end-of-file indication until gawk closes the write end of the pipe.
When you have finished writing data to the sort utility, you can close the "to" end of the pipe, and then start reading sorted data via getline. For example:
BEGIN {
    command = "LC_ALL=C sort"
    n = split("abcdefghijklmnopqrstuvwxyz", a, "")
    for (i = n; i > 0; i--)
        print a[i] |& command

    close(command, "to")

    while ((command |& getline line) > 0)
        print "got", line
    close(command)
}
This program writes the letters of the alphabet in reverse order, one per line, down the two-way pipe to sort. It then closes the write end of the pipe, so that sort receives an end-of-file indication. This causes sort to sort the data and write the sorted data back to the gawk program. Once all of the data has been read, gawk terminates the coprocess and exits.
As a side note, the assignment ‘LC_ALL=C’ in the sort command ensures traditional Unix (ASCII) sorting from sort. This is not strictly necessary here, but it’s good to know how to do this.
You may also use pseudo-ttys (ptys) for two-way communication instead of pipes, if your system supports them. This is done on a per-command basis, by setting a special element in the PROCINFO array (see Built-in Variables That Convey Information), like so:
command = "sort -nr"           # command, save in convenience variable
PROCINFO[command, "pty"] = 1   # update PROCINFO
print … |& command             # start two-way pipe
…
Using ptys usually avoids the buffer deadlock issues described earlier, at some loss in performance. If your system does not have ptys, or if all the system’s ptys are in use, gawk automatically falls back to using regular pipes.
Using gawk for Network Programming
EMISTERED:
A host is a host from coast to coast,
and nobody talks to a host that’s close,
unless the host that isn’t close
is busy, hung, or dead.
— Mike O’Brien (aka Mr. Protocol)
In addition to being able to open a two-way pipeline to a coprocess on the same system (see Two-Way Communications with Another Process), it is possible to make a two-way connection to another process on another system across an IP network connection.
You can think of this as just a very long two-way pipeline to a coprocess. The way gawk decides that you want to use TCP/IP networking is by recognizing special filenames that begin with one of ‘/inet/’, ‘/inet4/’, or ‘/inet6/’.
The full syntax of the special filename is /net-type/protocol/local-port/remote-host/remote-port (a sample filename appears after the following list). The components are:
net-type
Specifies the kind of Internet connection to make. Use ‘/inet4/’ to force IPv4, and
‘/inet6/’ to force IPv6. Plain ‘/inet/’ (which used to be the only option) uses the
system default, most likely IPv4.
protocol
The protocol to use over IP. This must be either ‘tcp’, or ‘udp’, for a TCP or UDP IP
connection, respectively. TCP should be used for most applications.
local-port
The local TCP or UDP port number to use. Use a port number of ‘0’ when you want the
system to pick a port. This is what you should do when writing a TCP or UDP client.
You may also use a well-known service name, such as ‘smtp’ or ‘http’, in which case
gawk attempts to determine the predefined port number using the C getaddrinfo()
function.
remote-host
The IP address or fully qualified domain name of the Internet host to which you want to connect.
remote-port
The TCP or UDP port number to use on the given remote-host. Again, use ‘0’ if you don’t care, or else a well-known service name.
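For example, a hypothetical special filename for a TCP client connection to a web server’s HTTP port, letting the system choose the local port, would be:

/inet/tcp/0/www.example.com/http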
NOTE
Failure in opening a two-way socket will result in a nonfatal error being returned to the calling code. The value of ERRNO indicates the error (see Built-in Variables That Convey Information).
Consider the following very simple example:
BEGIN {
    Service = "/inet/tcp/0/localhost/daytime"
    Service |& getline
    print $0
    close(Service)
}
This program reads the current date and time from the local system’s TCP daytime server.
It then prints the results and closes the connection.
Because this topic is extensive, the use of gawk for TCP/IP programming is documented separately. See TCP/IP Internetworking with gawk, which comes as part of the gawk distribution, for a much more complete introduction and discussion, as well as extensive examples.
Profiling Your awk Programs
You may produce execution traces of your awk programs. This is done by passing the
option --profile to gawk. When gawk has finished running, it creates a profile of your program in a file named awkprof.out. Because it is profiling, it also executes up to 45%
slower than gawk normally does.
As shown in the following example, the --profile option can be used to change the name of the file where gawk will write the profile:
gawk --profile=myprog.prof -f myprog.awk data1 data2
In the preceding example, gawk places the profile in myprog.prof instead of in
awkprof.out.
Here is a sample session showing a simple awk program, its input data, and the results from running gawk with the --profile option. First, the awk program:
BEGIN { print "First BEGIN rule" }
END   { print "First END rule" }

/foo/ {
    print "matched /foo/, gosh"
    for (i = 1; i <= 3; i++)
        sing()
}

{
    if (/foo/)
        print "if is true"
    else
        print "else is true"
}

BEGIN { print "Second BEGIN rule" }
END   { print "Second END rule" }

function sing(    dummy)
{
    print "I gotta be me!"
}
Following is the input data:
foo
bar
baz
foo
junk
Here is the awkprof.out that results from running the gawk profiler on this program and data (this example also illustrates that awk programmers sometimes get up very early in the morning to work):
# gawk profile, created Mon Sep 29 05:16:21 2014
# BEGIN rule(s)
BEGIN {
1 print "First BEGIN rule"
}
BEGIN {
1 print "Second BEGIN rule"
}
# Rule(s)
5 /foo/ { # 2
2 print "matched /foo/, gosh"
6 for (i = 1; i <= 3; i++) {
6 sing()
}
}
5 {
5 if (/foo/) { # 2
2 print "if is true"
3 } else {
3 print "else is true"
}
}
# END rule(s)
END {
1 print "First END rule"
}
END {
1 print "Second END rule"
}
# Functions, listed alphabetically
6 function sing(dummy)
{
6 print "I gotta be me!"
}
This example illustrates many of the basic features of profiling output. They are as follows:
The program is printed in the order BEGIN rules, BEGINFILE rules, pattern–action rules, ENDFILE rules, END rules, and functions, listed alphabetically. Multiple BEGIN and END
rules retain their separate identities, as do multiple BEGINFILE and ENDFILE rules.
Pattern–action rules have two counts. The first count, to the left of the rule, shows how many times the rule’s pattern was tested. The second count, to the right of the rule’s opening left brace in a comment, shows how many times the rule’s action was
executed. The difference between the two indicates how many times the rule’s pattern evaluated to false.
Similarly, the count for an if-else statement shows how many times the condition was tested. To the right of the opening left brace for the if’s body is a count showing how many times the condition was true. The count for the else indicates how many times
the test failed.
The count for a loop header (such as for or while) shows how many times the loop test was executed. (Because of this, you can’t just look at the count on the first statement in a rule to determine how many times the rule was executed. If the first statement is a loop, the count is misleading.)
For user-defined functions, the count next to the function keyword indicates how
many times the function was called. The counts next to the statements in the body show how many times those statements were executed.
The layout uses “K&R” style with TABs. Braces are used everywhere, even when the body of an if, else, or loop is only a single statement.
Parentheses are used only where needed, as indicated by the structure of the program and the precedence rules. For example, ‘(3 + 5) * 4’ means add three and five, then multiply the total by four. However, ‘3 + 5 * 4’ has no parentheses, and means ‘3 + (5 * 4)’.
Parentheses are used around the arguments to print and printf only when the print
or printf statement is followed by a redirection. Similarly, if the target of a redirection isn’t a scalar, it gets parenthesized.
gawk supplies leading comments in front of the BEGIN and END rules, the BEGINFILE and ENDFILE rules, the pattern–action rules, and the functions.
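For instance, applying the parenthesization rule above to a redirection (a constructed illustration, not part of the profile shown earlier), a statement typed as:
print "hi" > "out.txt"
is pretty-printed with its argument parenthesized:
print ("hi") > "out.txt"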
The profiled version of your program may not look exactly like what you typed when you wrote it. This is because gawk creates the profiled version by “pretty-printing” its internal representation of the program. The advantage to this is that gawk can produce a standard representation. The disadvantage is that all source code comments are lost. Also, things such as:
/foo/
come out as:
/foo/ {
print $0
}
which is correct, but possibly unexpected.
Besides creating profiles when a program has completed, gawk can produce a profile while it is running. This is useful if your awk program goes into an infinite loop and you want to see what has been executed. To use this feature, run gawk with the --profile option in the background:
$ gawk --profile -f myprog &
[1] 13992
The shell prints a job number and process ID number; in this case, 13992. Use the kill command to send the USR1 signal to gawk:
$ kill -USR1 13992
As usual, the profiled version of the program is written to awkprof.out, or to a different file if one was specified with the --profile option.
Along with the regular profile, as shown earlier, the profile file includes a trace of any active functions:
# Function Call Stack:

#   3. baz
#   2. bar
#   1. foo
# -- main --
You may send gawk the USR1 signal as many times as you like. Each time, the profile and function call trace are appended to the output profile file.
If you use the HUP signal instead of the USR1 signal, gawk produces the profile and the function call trace and then exits.
When gawk runs on MS-Windows systems, it uses the INT and QUIT signals for producing the profile, and in the case of the INT signal, gawk exits. This is because these systems don’t support the kill command, so the only signals you can deliver to a program are those generated by the keyboard. The INT signal is generated by the Ctrl-c or Ctrl-BREAK
key, while the QUIT signal is generated by the Ctrl-\ key.
Finally, gawk also accepts another option, --pretty-print. When called this way, gawk
“pretty-prints” the program into awkprof.out, without any execution counts.
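For example (reusing the hypothetical file names from earlier):
gawk --pretty-print -f myprog.awk data1 data2
This writes the pretty-printed version of myprog.awk to awkprof.out.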
NOTE
The --pretty-print option still runs your program. This will change in the next major release.
Summary
The --non-decimal-data option causes gawk to treat octal- and hexadecimal-looking
input data as octal and hexadecimal. This option should be used with caution or not at all; use of strtonum() is preferable. Note that this option may disappear in a future version of gawk.
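For example, a minimal sketch of the explicit approach with strtonum():
# explicit, per-value conversion instead of --non-decimal-data
BEGIN {
    print strtonum("0x11")    # hexadecimal: prints 17
    print strtonum("011")     # leading zero means octal: prints 9
    print strtonum("11")      # plain decimal: prints 11
}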
You can take over complete control of sorting in ‘for (indx in array)’ array traversal by setting PROCINFO["sorted_in"] to the name of a user-defined function that does the comparison of array elements based on index and value.
Similarly, you can supply the name of a user-defined comparison function as the third argument to either asort() or asorti() to control how those functions sort arrays. Or you may provide one of the predefined control strings that work for
PROCINFO["sorted_in"].
You can use the ‘|&’ operator to create a two-way pipe to a coprocess. You read from the coprocess with getline and write to it with print or printf. Use close() to close off the coprocess completely, or optionally close off one side of the two-way
communications.
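A minimal sketch of the coprocess pattern, using the system sort utility:
BEGIN {
    cmd = "sort"
    print "banana" |& cmd
    print "apple"  |& cmd
    close(cmd, "to")                  # close the write side so sort sees end-of-file
    while ((cmd |& getline line) > 0)
        print "sorted:", line
    close(cmd)                        # finally close the read side as well
}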
By using special filenames with the ‘|&’ operator, you can open a TCP/IP (or UDP/IP) connection to remote hosts on the Internet. gawk supports both IPv4 and IPv6.
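For example, this sketch reads one line from the daytime service on the local host (it assumes a daytime server is actually listening on port 13):
BEGIN {
    service = "/inet/tcp/0/localhost/daytime"   # 0 means "pick any local port"
    if ((service |& getline line) > 0)
        print line
    close(service)
}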
You can generate statement count profiles of your program. This can help you
determine which parts of your program may be taking the most time and let you tune
them more easily. Sending the USR1 signal while profiling causes gawk to dump the
profile and keep going, including a function call stack.
You can also just “pretty-print” the program. This currently also runs the program, but that will change in the next major release.
[78] This is why the predefined sorting orders start with an ‘@’ character, which cannot be part of an identifier.
[79] This is true because locale-based comparison occurs only when in POSIX-compatibility mode, and because asort() and asorti() are gawk extensions, they are not available in that case.
[80] Michael Brennan suggests the use of rand() to generate unique filenames. This is a valid point; nevertheless, temporary files remain more difficult to use than two-way pipes.
[81] This is very different from the same operator in the C shell and in Bash.
Chapter 13. Internationalization with
gawk
Once upon a time, computer makers wrote software that worked only in English.
Eventually, hardware and software vendors noticed that if their systems worked in the native languages of non-English-speaking countries, they were able to sell more systems.
As a result, internationalization and localization of programs and software systems
became a common practice.
For many years, the ability to provide internationalization was largely restricted to programs written in C and C++. This chapter describes the underlying library gawk uses for internationalization, as well as how gawk makes internationalization features available at the awk program level. Having internationalization available at the awk level gives software developers additional flexibility — they are no longer forced to write in C or C++ when internationalization is a requirement.
Internationalization and Localization
Internationalization means writing (or modifying) a program once, in such a way that it can use multiple languages without requiring further source code changes. Localization means providing the data necessary for an internationalized program to work in a
particular language. Most typically, these terms refer to features such as the language used for printing error messages, the language used to read responses, and information related to how numerical and monetary values are printed and read.
GNU gettext
gawk uses GNU gettext to provide its internationalization features. The facilities in GNU
gettext focus on messages: strings printed by a program, either directly or via formatting with printf or sprintf().[82]
When using GNU gettext, each application has its own text domain. This is a unique name, such as ‘kpilot’ or ‘gawk’, that identifies the application. A complete application may have multiple components — programs written in C or C++, as well as scripts written in sh or awk. All of the components use the same text domain.
To make the discussion concrete, assume we’re writing an application named guide.
Internationalization consists of the following steps, in this order:
1. The programmer reviews the source for all of guide’s components and marks each
string that is a candidate for translation. For example, "`-F': option required" is a good candidate for translation. A table with strings of option names is not (e.g., gawk’s --profile option should remain the same, no matter what the local
language).
2. The programmer indicates the application’s text domain ("guide") to the gettext library, by calling the textdomain() function.
3. Messages from the application are extracted from the source code and collected into a portable object template file (guide.pot), which lists the strings and their
translations. The translations are initially empty. The original (usually English)
messages serve as the key for lookup of the translations.
4. For each language with a translator, guide.pot is copied to a portable object file (.po) and translations are created and shipped with the application. For example,
there might be a fr.po for a French translation.
5. Each language’s .po file is converted into a binary message object (.gmo) file. A message object file contains the original messages and their translations in a binary format that allows fast lookup of translations at runtime.
6. When guide is built and installed, the binary translation files are installed in a standard place.
7. For testing and development, it is possible to tell gettext to use .gmo files in a different directory than the standard one by using the bindtextdomain() function.
8. At runtime, guide looks up each string via a call to gettext(). The returned string is the translated string if available, or the original string if not.
9. If necessary, it is possible to access messages from a different text domain than the one belonging to the application, without having to switch the application’s default text domain back and forth.
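For a C or C++ component, steps 3 through 5 typically correspond to shell commands along these lines (the file names are illustrative, and each GNU gettext tool accepts many more options):
xgettext --keyword=_ -o guide.pot guide.c    # step 3: extract marked strings
cp guide.pot fr.po                           # step 4: translator fills in fr.po
msgfmt -o fr.gmo fr.po                       # step 5: compile to binary form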
In C (or C++), the string marking and dynamic translation lookup are accomplished by wrapping each string in a call to gettext():
printf("%s", gettext("Don't Panic!\n"));
The tools that extract messages from source code pull out all strings enclosed in calls to gettext().
The GNU gettext developers, recognizing that typing ‘gettext(…)’ over and over again is both painful and ugly to look at, use the macro ‘_’ (an underscore) to make things easier:
/* In the standard header file: */
#define _(str) gettext(str)
/* In the program text: */
printf("%s", _("Don't Panic!\n"));
This reduces the typing overhead to just three extra characters per string and is
considerably easier to read as well.
There are locale categories for different types of locale-related information. The defined locale categories that gettext knows about are:
LC_MESSAGES
Text messages. This is the default category for gettext operations, but it is possible to supply a different one explicitly, if necessary. (It is almost never necessary to supply a different category.)
LC_COLLATE
Text-collation information (i.e., how different characters and/or groups of characters sort in a given language).
LC_CTYPE
Character-type information (alphabetic, digit, upper- or lowercase, and so on) as well as character encoding. This information is accessed via the POSIX character classes in
regular expressions, such as /[[:alnum:]]/ (see Using Bracket Expressions).
LC_MONETARY
Monetary information, such as the currency symbol, and whether the symbol goes
before or after a number.
LC_NUMERIC
Numeric information, such as which characters to use for the decimal point and the thousands separator.[83]
LC_TIME
Time- and date-related information, such as 12- or 24-hour clock, month printed before or after the day in a date, local month abbreviations, and so on.
LC_ALL
All of the above. (Not too useful in the context of gettext.)
Internationalizing awk Programs
gawk provides the following variables for internationalization:
TEXTDOMAIN
This variable indicates the application’s text domain. For compatibility with GNU
gettext, the default value is "messages".
_"your message here"
String constants marked with a leading underscore are candidates for translation at
runtime. String constants without a leading underscore are not translated.
gawk provides the following functions for internationalization:
dcgettext( string [, domain [, category]])
Return the translation of string in text domain domain for locale category category.
The default value for domain is the current value of TEXTDOMAIN. The default value for category is "LC_MESSAGES".
If you supply a value for category, it must be a string equal to one of the known locale categories described in the previous section. You must also supply a text domain. Use TEXTDOMAIN if you want to use the current domain.
CAUTION
The order of arguments to the awk version of the dcgettext() function is purposely different from the order for the C version. The awk version’s order was chosen to be simple and to allow for reasonable awk-style default arguments.
dcngettext( string1, string2, number [, domain [, category]])
Return the plural form used for number of the translation of string1 and string2 in text domain domain for locale category category. string1 is the English singular variant of a message, and string2 is the English plural variant of the same message.
The default value for domain is the current value of TEXTDOMAIN. The default value for category is "LC_MESSAGES".
The same remarks about argument order as for the dcgettext() function apply.
bindtextdomain( directory [, domain ])
Change the directory in which gettext looks for .gmo files, in case they will not or cannot be placed in the standard locations (e.g., during testing). Return the directory in which domain is “bound.”
The default domain is the value of TEXTDOMAIN. If directory is the null string (""), then bindtextdomain() returns the current binding for the given domain.
To use these facilities in your awk program, follow these steps:
1. Set the variable TEXTDOMAIN to the text domain of your program. This is best done in a BEGIN rule (see The BEGIN and END Special Patterns), or it can also be done via the -v command-line option (see Command-Line Options):
BEGIN {
TEXTDOMAIN = "guide"
…
}
2. Mark all translatable strings with a leading underscore (‘_’) character. It must be adjacent to the opening quote of the string. For example:
print _"hello, world"
x = _"you goofed"
printf(_"Number of users is %d\n", nusers)
3. If you are creating strings dynamically, you can still translate them, using the dcgettext() built-in function:[84]
if (groggy)
    message = dcgettext("%d customers disturbing me\n", "adminprog")
else
    message = dcgettext("enjoying %d customers\n", "adminprog")
printf(message, ncustomers)
Here, the call to dcgettext() supplies a different text domain ("adminprog") in which to find the message, but it uses the default "LC_MESSAGES" category.
The previous example only works if ncustomers is greater than one. This example
would be better done with dcngettext():
if (groggy)
    message = dcngettext("%d customer disturbing me\n",
                         "%d customers disturbing me\n",
                         ncustomers, "adminprog")
else
    message = dcngettext("enjoying %d customer\n",
                         "enjoying %d customers\n",
                         ncustomers, "adminprog")
printf(message, ncustomers)
4. During development, you might want to put the .gmo file in a private directory for testing. This is done with the bindtextdomain() built-in function:
BEGIN {
TEXTDOMAIN = "guide" # our text domain
if (Testing) {
# where to find our files
bindtextdomain("testdir")
# joe is in charge of adminprog
bindtextdomain("../joe/testdir", "adminprog")
}
…
}
See A Simple Internationalization Example for an example program showing the steps to create and use translations from awk.
Translating awk Programs
Once a program’s translatable strings have been marked, they must be extracted to create the initial .pot file. As part of translation, it is often helpful to rearrange the order in which arguments to printf are output.
gawk’s --gen-pot command-line option extracts the messages and is discussed next. After that, printf’s ability to rearrange the order for printf arguments at runtime is covered.
Extracting Marked Strings
Once your awk program is working, all the strings have been marked, and you have set (and perhaps bound) the text domain, it is time to produce translations. First, use the --gen-pot command-line option to create the initial .pot file:
gawk --gen-pot -f guide.awk > guide.pot
When run with --gen-pot, gawk does not execute your program. Instead, it parses it as usual and prints all marked strings to standard output in the format of a GNU gettext Portable Object file. Also included in the output are any constant strings that appear as the first argument to dcgettext() or as the first and second argument to dcngettext().[85]
You should distribute the generated .pot file with your awk program; translators will
eventually use it to provide you translations that you can also then distribute. See A
Simple Internationalization Example for the full list of steps to go through to create and test translations for guide.
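For example, a string marked as _"hello, world" shows up in guide.pot as an entry along these lines (header comments and exact layout vary with tool versions):
msgid "hello, world"
msgstr ""
A translator then fills in the msgstr line for each entry in the language’s .po file.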
Rearranging printf Arguments
Format strings for printf and sprintf() (see Using printf Statements for Fancier
Printing) present a special problem for translation. Consider the following:[86]
printf(_"String `%s' has %d characters\n",
string, length(string)))
A possible German translation for this might be:
"%d Zeichen lang ist die Zeichenkettè%s'\n"
The problem should be obvious: the order of the format specifications is different from the original! Even though gettext() can return the translated string at runtime, it cannot change the argument order in the call to printf.
To solve this problem, printf format specifiers may have an additional optional element, which we call a positional specifier.
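Anticipating that discussion, the German translation above could be rewritten with positional specifiers, which count printf arguments starting from one:
"%2$d Zeichen lang ist die Zeichenkette `%1$s'\n"
Here ‘%1$s’ refers to the first argument (the string) and ‘%2$d’ to the second (its length), so the translation can reorder the output without changing the printf call itself.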