4.7. Character Encoding

For many years Americans have been using the ASCII encoding of characters, permitting easy exchange of English texts. Unfortunately, ASCII is completely inadequate in handling the character sets of most other languages. For many years different countries have adopted different techniques for exchanging text in different languages. More recently, ISO has developed ISO 10646, a single 31-bit encoding for all of the world's characters termed the Universal Character Set (UCS). Characters fitting into 16 bits (the first 65536 values of the UCS) are termed the ``Basic Multilingual Plane'' (BMP), and the BMP is intended to cover nearly all spoken languages. The Unicode forum develops the Unicode standard, which concentrates on the 16-bit set and adds some additional conventions to aid interoperability.

However, most software is not designed to handle 16 bit or 32 bit characters, so a special format called ``UTF-8'' was developed to encode these potentially international characters in a format more easily handled by existing programs and libraries. UTF-8 is defined, among other places, in IETF RFC 2279, so it's a well-defined standard that can be freely read and used. UTF-8 is a variable-width encoding; characters numbered 0 to 0x7f (127) encode to themselves as a single byte, while characters with larger values are encoded into 2 to 6 bytes of information (depending on their value). The encoding has been specially designed to have the following nice properties (this information is from the RFC and Linux utf-8 man page):

In short, the UTF-8 transformation format is becoming a dominant method for exchanging international text information because it can support all of the world's languages, yet it is backward compatible with U.S. ASCII files as well as having other nice properties. For many purposes I recommend its use, particularly when storing data in a ``text'' file.

The reason to mention UTF-8 is that some byte sequences are not legal UTF-8, and this might be an exploitable security hole. The RFC notes the following:

Implementors of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.

A particularly subtle form of this attack could be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters. For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but allow the illegal two-octet sequence C0 80 and interpret it as a NUL character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F.

A longer discussion about this is available at Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux at http://www.cl.cam.ac.uk/~mgk25/unicode.html.

The UTF-8 character set is one case where it's possible to enumerate all illegal values (and prove that you've enumerated them all). If you need to determine if you have a legal UTF-8 sequence, you need to check for two things: (1) is the initial sequence legal, and (2) if it is, is the first byte followed by the required number of valid continuation characters? Performing the first check is easy; the following is provably the complete list of all illegal UTF-8 initial sequences:

Table 4-1. Illegal UTF-8 initial sequences

UTF-8 SequenceReason for Illegality
10xxxxxxillegal as initial byte of character (80..BF)
1100000x illegal, overlong (C0 80..BF)
11100000 100xxxxx illegal, overlong (E0 80..9F)
11110000 1000xxxx illegal, overlong (F0 80..8F)
11111000 10000xxx illegal, overlong (F8 80..87)
11111100 100000xx illegal, overlong (FC 80..83)
1111111x illegal; prohibited by spec

I should note that in some cases, you might want to cut slack (or use internally) the hexadecimal sequence C0 80. This is an overlong sequence that could represent ASCII NUL (NIL). Since C/C++ have trouble including a NIL character in an ordinary string, some people have taken to using this sequence when they want to represent NIL as part of the data stream; Java even enshrines the practice. Feel free to use C0 80 internally while processing data, but technically you really should translate this back to 00 before saving the data. Depending on your needs, you might decide to be ``sloppy'' and accept C0 80 as input in a UTF-8 data stream.

The second step is to check if the correct number of continuation characters are included in the string. If the first byte has the top 2 bits set, you count the number of ``one'' bits set after the top one, and then check that there are that many continuation bytes which begin with the bits ``10''. So, binary 11100001 requires two more continuation bytes.

A related issue is that some phrases can be expressed in more than one way in ISO 10646/Unicode. For example, some accented characters can be represented as a single character (with the accent) and also as a set of characters (e.g., the base character plus a separate composing accent). These two forms may appear identical. There's also a zero-width space that could be inserted, with the result that apparently-similar items are considered different. Beware of situations where such hidden text could interfere with the program.