C# 4.0 in a Nutshell by Joseph Albahari & Ben Albahari


Author: Joseph Albahari & Ben Albahari
Language: eng
Format: epub
Tags: COMPUTERS / Programming Languages / Visual BASIC
ISBN: 9781449380458
Publisher: O'Reilly Media
Published: 2010-01-19T16:00:00+00:00


Note

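A StreamReader or StreamWriter can throw an exception if it encounters bytes that do not have a valid string translation for their encoding. Whether it throws or silently substitutes a replacement character depends on how the encoding object was configured.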
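The behavior this note describes can be sketched as follows. This is a minimal illustration, assuming the reader is handed a UTF8Encoding constructed with throwOnInvalidBytes set to true; a reader built with a differently configured encoding may substitute a replacement character instead of throwing. The class and method names (StrictDecodingSketch, ThrowsOnInvalidBytes) are hypothetical and exist only for this example.

```csharp
using System;
using System.IO;
using System.Text;

static class StrictDecodingSketch
{
    // Returns true if decoding the given bytes as strict UTF-8 throws.
    public static bool ThrowsOnInvalidBytes (byte[] data)
    {
        // throwOnInvalidBytes: true asks the encoding to raise an exception
        // rather than substitute a replacement character.
        var strictUtf8 = new UTF8Encoding (encoderShouldEmitUTF8Identifier: false,
                                           throwOnInvalidBytes: true);
        try
        {
            using (var reader = new StreamReader (new MemoryStream (data), strictUtf8))
                reader.ReadToEnd();
            return false;
        }
        catch (DecoderFallbackException)
        {
            return true;
        }
    }

    static void Main()
    {
        byte[] valid   = { 98, 117, 116 };         // "but"
        byte[] invalid = { 98, 117, 255, 116 };    // 255 never occurs in valid UTF-8

        Console.WriteLine (ThrowsOnInvalidBytes (valid));
        Console.WriteLine (ThrowsOnInvalidBytes (invalid));
    }
}
```

Catching DecoderFallbackException specifically, rather than a general Exception, keeps unrelated I/O failures visible to the caller.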

The simplest of the encodings is ASCII, because each character is represented by one byte. The ASCII encoding maps the first 128 characters of the Unicode set (U+0000 to U+007F) into its single byte, covering what you see on a U.S.-style keyboard. Most other characters, including specialized symbols and non-English characters, cannot be represented and are converted to a substitute character (a question mark, ?, with .NET's Encoding.ASCII). The default UTF-8 encoding can map all allocated Unicode characters, but it is more complex. The first 128 characters encode to a single byte, for ASCII compatibility; the remaining characters encode to a variable number of bytes (most commonly two or three). Consider this:

using (TextWriter w = File.CreateText ("but.txt"))    // Use default UTF-8
  w.WriteLine ("but—");                               // encoding.

using (Stream s = File.OpenRead ("but.txt"))
  for (int b; (b = s.ReadByte()) > -1;)               // ReadByte returns -1 at end of stream
    Console.WriteLine (b);

The word “but” is followed not by a stock-standard hyphen, but by the longer em dash (—) character, U+2014. This is the one that won’t get you into trouble with your book editor! Let’s examine the output:

98     // b
117    // u
116    // t
226    // em dash byte 1     Note that the byte values
128    // em dash byte 2     are >= 128 for each part
148    // em dash byte 3     of the multibyte sequence.
13     // <CR>
10     // <LF>
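The contrast between ASCII and UTF-8 can also be seen without touching the file system. The following is a small sketch that encodes the same string in memory with Encoding.ASCII and Encoding.UTF8; the class and helper names (EncodingComparisonSketch, ToAscii, ToUtf8) are hypothetical. It assumes the default Encoding.ASCII instance, which substitutes a question mark (byte 63) for characters it cannot represent.

```csharp
using System;
using System.Text;

static class EncodingComparisonSketch
{
    // Encodes with ASCII; unrepresentable characters become '?' (byte 63).
    public static byte[] ToAscii (string s) { return Encoding.ASCII.GetBytes (s); }

    // Encodes with UTF-8; GetBytes does not prepend a byte-order mark.
    public static byte[] ToUtf8 (string s) { return Encoding.UTF8.GetBytes (s); }

    static void Main()
    {
        string text = "but\u2014";    // "but" followed by an em dash (U+2014)

        Console.WriteLine (string.Join (" ", ToAscii (text)));   // 98 117 116 63
        Console.WriteLine (string.Join (" ", ToUtf8 (text)));    // 98 117 116 226 128 148
    }
}
```

The UTF-8 result matches the first six bytes of the file output above; the ASCII result loses the em dash entirely, which is why ASCII is only safe for text known to stay within its range.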

Because the em dash is outside the first 128 characters of the Unicode set, it requires more than a single byte to encode in UTF-8 (in this case, three). UTF-8 is efficient with the Western alphabet, as most popular characters consume just one byte. It also downgrades easily to ASCII simply by ignoring all bytes above 127. Its disadvantage is that seeking within a stream is troublesome, since a character’s position does not correspond to its byte position in the stream. An alternative is UTF-16 (labeled just “Unicode” in the Encoding class). Here’s how we write the same string with UTF-16:

using (Stream s = File.Create ("but.txt"))
using (TextWriter w = new StreamWriter (s, Encoding.Unicode))
  w.WriteLine ("but—");

foreach (byte b in File.ReadAllBytes ("but.txt"))
  Console.WriteLine (b);

The output is then:

255    // Byte-order mark 1
254    // Byte-order mark 2
98     // 'b' byte 1
0      // 'b' byte 2
117    // 'u' byte 1
0      // 'u' byte 2
116    // 't' byte 1
0      // 't' byte 2
20     // '—' byte 1
32     // '—' byte 2
13     // <CR> byte 1
0      // <CR> byte 2
10     // <LF> byte 1
0      // <LF> byte 2

Technically, UTF-16 uses either two or four bytes per character (there are close to a million Unicode characters allocated or reserved, so 2 bytes is not always enough). However, because the C# char type is itself only 16 bits wide, a UTF-16 encoding will always use exactly two bytes per .NET char. This makes it easy to jump to a particular character index within a stream.
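The fixed two-bytes-per-char layout can be put to use as follows. This is a rough sketch, assuming a little-endian UTF-16 stream that begins with the two-byte byte-order mark seen in the output above; the names Utf16SeekSketch, CharAt, and WriteUtf16 are hypothetical. It also shows that a character outside the 16-bit range occupies two .NET chars (a surrogate pair) and therefore four bytes.

```csharp
using System;
using System.IO;
using System.Text;

static class Utf16SeekSketch
{
    // Reads the char at the given index directly, without decoding the
    // characters before it. Assumes little-endian UTF-16 with a 2-byte BOM.
    public static char CharAt (Stream s, int index)
    {
        const int bomLength = 2;
        s.Position = bomLength + 2L * index;      // two bytes per .NET char
        int lo = s.ReadByte();
        int hi = s.ReadByte();
        if (lo < 0 || hi < 0) throw new EndOfStreamException();
        return (char) (lo | (hi << 8));
    }

    // Writes text to an in-memory stream as UTF-16, BOM included.
    public static MemoryStream WriteUtf16 (string text)
    {
        var ms = new MemoryStream();
        var w = new StreamWriter (ms, Encoding.Unicode);
        w.Write (text);
        w.Flush();                                // leave ms open for reading
        ms.Position = 0;
        return ms;
    }

    static void Main()
    {
        using (Stream s = WriteUtf16 ("but\u2014"))
        {
            Console.WriteLine (CharAt (s, 1));            // u
            Console.WriteLine ((int) CharAt (s, 3));      // 8212 (U+2014)
        }

        string smiley = "\U0001F600";                     // outside the 16-bit range
        Console.WriteLine (smiley.Length);                         // 2 chars
        Console.WriteLine (Encoding.Unicode.GetByteCount (smiley)); // 4 bytes
    }
}
```

Note that CharAt indexes .NET chars, not Unicode characters: asking for one half of a surrogate pair returns a lone surrogate, so callers that care about characters beyond U+FFFF still need to check for pairs.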

UTF-16 uses a two-byte prefix


