The Linux Command Line by William E. Shotts Jr

Author: William E. Shotts Jr.
Language: eng
Format: mobi, epub
Tags: COMPUTERS / Operating Systems / Linux
ISBN: 9781593274269
Publisher: No Starch Press, Inc.
Published: 2012-01-13T05:00:00+00:00


POSIX Character Classes

The traditional character ranges are an easily understood and effective way to handle the problem of quickly specifying sets of characters. Unfortunately, they don’t always work. While we have not encountered any problems with our use of grep so far, we might run into problems using other programs.

Back in Chapter 4, we looked at how wildcards are used to perform pathname expansion. In that discussion, we said that character ranges could be used in a manner almost identical to the way they are used in regular expressions, but here’s the problem:

[me@linuxbox ~]$ ls /usr/sbin/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]*
/usr/sbin/MAKEFLOPPIES
/usr/sbin/NetworkManagerDispatcher
/usr/sbin/NetworkManager

This command produces the expected result: a list of only the files whose names begin with an uppercase letter. (The list varies by Linux distribution and may be empty; this example is from Ubuntu.) But with the following command we get an entirely different result (only part of the output is shown):

[me@linuxbox ~]$ ls /usr/sbin/[A-Z]*
/usr/sbin/biosdecode
/usr/sbin/chat
/usr/sbin/chgpasswd
/usr/sbin/chpasswd
/usr/sbin/chroot
/usr/sbin/cleanup-info
/usr/sbin/complain
/usr/sbin/console-kit-daemon

Why is that? It’s a long story, but here’s the short version.

Back when Unix was first developed, it knew only about ASCII characters, and this feature reflects that fact. In ASCII, the first 32 characters (numbers 0–31) are control codes (things like tabs, backspaces, and carriage returns). The next 32 (numbers 32–63) contain printable characters, including most punctuation characters and the numerals zero through nine. The next 32 (numbers 64–95) contain the uppercase letters and a few more punctuation symbols. The final 32 (numbers 96–127) contain the lowercase letters and yet more punctuation symbols. Based on this arrangement, systems using ASCII used a collation order that looked like this:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

This differs from proper dictionary order, which is like this:

aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
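The ASCII arrangement described above can be checked from the shell. The following is a minimal sketch, not part of the original text: the helper name ascii_code is hypothetical, and the sort is forced into the C locale (an assumption made here so that comparison follows raw byte values rather than the system's dictionary order).

```shell
# Print the ASCII code of a single character. In printf, a numeric
# argument that starts with a quote yields the code of the next character.
ascii_code() {
    printf '%d\n' "'$1"
}

ascii_code A   # uppercase letters live in the 64-95 block
ascii_code a   # lowercase letters live in the 96-127 block

# Sort a few letters by byte value. LC_ALL=C is an assumption of this
# sketch: it selects ASCII ordering regardless of the system locale.
c_sorted=$(printf '%s\n' b A a B | LC_ALL=C sort | tr -d '\n')
echo "$c_sorted"
```

Under these assumptions, both uppercase letters sort ahead of both lowercase ones, which mirrors the ASCII collation order shown above.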

As the popularity of Unix spread beyond the United States, there grew a need to support characters not found in US English. The ASCII table was expanded to use a full 8 bits, adding character numbers 128–255, which accommodated many more languages. To support this ability, the POSIX standards introduced a concept called a locale, which could be adjusted to select the character set needed for a particular location. We can see the language setting of our system using this command:

[me@linuxbox ~]$ echo $LANG
en_US.UTF-8

With this setting, POSIX-compliant applications will use a dictionary collation order rather than ASCII order. This explains the behavior of the commands above. A character range of [A-Z], when interpreted in dictionary order, includes all of the alphabetic characters except the lowercase a — hence our results.
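To see that the locale is what changes the meaning of a range, one option is to force the C locale while matching. This is only a sketch under stated assumptions: the function name starts_with_ascii_upper is hypothetical, and overriding LC_ALL is an alternative workaround that the text above does not itself prescribe.

```shell
# Succeed if the argument begins with A-Z as judged in the C locale.
# The parenthesized body runs in a subshell, so the locale change
# does not leak into the calling shell.
starts_with_ascii_upper() (
    LC_ALL=C
    case $1 in
        [A-Z]*) exit 0 ;;
        *)      exit 1 ;;
    esac
)

for name in NetworkManager chroot; do
    if starts_with_ascii_upper "$name"; then
        echo "$name: matches [A-Z]*"
    else
        echo "$name: no match"
    fi
done
```

Depending on the shell version and its options, [A-Z] may already behave this way without the override, so results on a given system can differ from the listing shown earlier.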

To partially work around this problem, the POSIX standard includes a number of character classes, which provide useful ranges of characters. They are described in Table 19-2.

Table 19-2. POSIX Character Classes

Character Class    Description
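As an illustration of how a character class sidesteps the range problem, the sketch below uses [[:upper:]], the standard POSIX class for uppercase letters, in a pathname expansion. The scratch directory and the file names in it are hypothetical stand-ins for the /usr/sbin listing, chosen so the example does not depend on what a particular distribution installs.

```shell
# Create a scratch directory holding a few hypothetical file names.
demo_dir=$(mktemp -d)
touch "$demo_dir/MAKEFLOPPIES" "$demo_dir/NetworkManager" \
      "$demo_dir/chroot" "$demo_dir/biosdecode"

# [[:upper:]] matches uppercase letters by class membership rather
# than by position in the collation order, so the locale's ordering
# of upper- and lowercase letters does not affect the match.
upper_matches=$(cd "$demo_dir" && echo [[:upper:]]*)
echo "$upper_matches"

# Clean up the scratch directory.
rm -r "$demo_dir"
```

Only the two names that begin with an uppercase letter should be listed, whichever collation order is in effect.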


