Monday, October 12, 2009

Of UNICODE, UTF-8, Character sets part 2

Welcome to this second post in this series on UNICODE, Character sets and what have you. In the first of these posts, I went through some of the history of character set support, and some anomalies, and finished around the mid-1990's, when we had a bunch of reasonably well standardized 8-bit character sets. And then something happened...

Jacques Chirac becomes president of France. Now wait, that wasn't it. No, what happened was the Internet, and suddenly the time of IT as isolated islands, where we could determine ourselves how we wanted our computers to operate and what character set to use, came to an end. Suddenly, a user in France could view a webpage created in Japan. And now the issue with character sets became a real problem. Luckily, stuff had been going on since the late 1980's, more specifically UNICODE. Whoa, UNICODE comes to the rescue.

The first installments of UNICODE utilized a 16-bit character format. This later sneaked into operating systems, libraries and all over the place. The char datatype in C was supposed to be replaced by the wchar_t datatype (W as in Wide). This did not become very popular, but the scheme persisted and is still in use. Windows has many hooks for this, and most Windows API functions have wchar_t counterparts, and there are even portability macros (TEXT() being one of them, working together with datatypes such as LPTSTR and TCHAR). This turned out to be used mostly in countries, largely in Asia, with large character sets, which were present in the UNICODE character set. The Windows macros and types made it easier for far-east developers to keep their code portable to more ISO-8859-1-friendly languages.
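To make the narrow/wide distinction concrete, here is a minimal C sketch; the comment about TCHAR and TEXT() describes the Windows portability mechanism mentioned above, but the program itself sticks to standard C so it compiles anywhere.

    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        /* Classic 8-bit C string: one byte per character. */
        const char *narrow = "Hello";

        /* Wide string: each character is a wchar_t (16 bits on Windows,
           typically 32 bits on Linux), written with an L prefix. */
        const wchar_t *wide = L"Hello";

        printf("narrow: %zu bytes\n", strlen(narrow));               /* 5 */
        printf("wide:   %zu characters of %zu bytes each\n",
               wcslen(wide), sizeof(wchar_t));

        /* On Windows, the TCHAR datatype and the TEXT() macro expand to
           either the char or the wchar_t flavour, depending on whether
           UNICODE is defined at compile time - that is the portability
           trick referred to in the text. */
        return 0;
    }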

What made this not so popular was that it still was a headache creating portable code, and also, the UNICODE character set was moving on, and soon needed more than 16 bits. So the 2-byte datatype we used for characters (wchar_t is usually mapped to a short) wasn't enough. Now, working with 2 bytes instead of one was bad enough, but working with 4, 5 or 6 per character was just too much.

The full UNICODE character set can no longer be represented in a fixed 16-bit unit; a character may need up to 4 bytes (and the original UTF-8 design even reserved room for 6). So much for all that hard work by Microsoft and all those macros and API rewrites. Just consider looking for the end of a C-style NULL-terminated string: even with a fixed 2 bytes per character, this is much more difficult than it used to be, and with wider characters even more so!
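To see why, here is a small sketch in C: the classic strlen() looks for a single zero byte, so a string of 16-bit units needs its own length routine (ucs2_len() below is a made-up name for illustration, not an API from any particular library).

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Length of a NUL-terminated string of 16-bit units (UCS-2 style).
       strlen() cannot be used here: the 16-bit encoding of an ordinary
       Latin letter contains a zero byte, which strlen() would mistake
       for the terminator. */
    static size_t ucs2_len(const uint16_t *s)
    {
        size_t n = 0;
        while (s[n] != 0)
            n++;
        return n;
    }

    int main(void)
    {
        const char     *narrow = "Hi!";
        const uint16_t  wide[] = { 'H', 'i', '!', 0 };  /* "Hi!" as 16-bit units */

        printf("strlen(narrow) = %zu\n", strlen(narrow)); /* 3 */
        printf("ucs2_len(wide) = %zu\n", ucs2_len(wide)); /* 3 */
        return 0;
    }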

So along come some schemes that allow you to "encode" a UNICODE character in some other, hopefully easier to manage, form. The most popular of these, by far, is UTF-8. This is a means of encoding a UNICODE character as a variable-length sequence of bytes (up to 4 in the current standard, up to 6 in the original design). The nice thing with UTF-8 is that the first 128 positions are encoded exactly like old-style 7-bit ASCII: one byte, highest bit being 0! The way UTF-8 works means that these 128 byte values will never appear as part of any other UNICODE character, or in other words, the high-order bit is ALWAYS 1 in every byte of a multi-byte character.
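A quick C sketch of what that looks like byte by byte; the values in the comments are the standard UTF-8 encodings of the characters named:

    #include <stdio.h>
    #include <string.h>

    /* Print the bytes of a few UTF-8 encoded characters, showing that plain
       ASCII stays a single byte with the high bit 0, while every byte of a
       multi-byte character has the high bit set to 1. */
    static void dump(const char *label, const char *s)
    {
        printf("%-7s:", label);
        for (size_t i = 0; i < strlen(s); i++)
            printf(" %02X", (unsigned char)s[i]);
        printf("\n");
    }

    int main(void)
    {
        dump("A",      "A");             /* 41       - 7-bit ASCII, one byte   */
        dump("U+00E9", "\xC3\xA9");      /* C3 A9    - 'e with acute', 2 bytes */
        dump("U+20AC", "\xE2\x82\xAC");  /* E2 82 AC - the Euro sign, 3 bytes  */
        return 0;
    }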

So, given all this, it seems like functions like strlen will work with a UTF-8 encoded UNICODE string? Well, sort of, but it will give you the length in bytes, not in characters. But besides that, it will work. And so will strcpy, strcat etc.
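For example, here is a hypothetical utf8_strlen() helper (not a standard C function) that counts characters rather than bytes, by skipping the continuation bytes of the form 10xxxxxx:

    #include <stdio.h>
    #include <string.h>

    /* Count characters in a UTF-8 string by counting the bytes that are NOT
       continuation bytes (continuation bytes always look like 10xxxxxx). */
    static size_t utf8_strlen(const char *s)
    {
        size_t chars = 0;
        for (; *s != '\0'; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                chars++;
        return chars;
    }

    int main(void)
    {
        /* "h" + 'e with acute' (C3 A9) + "llo": 6 bytes, 5 characters */
        const char *s = "h\xC3\xA9llo";

        printf("strlen:      %zu bytes\n", strlen(s));            /* 6 */
        printf("utf8_strlen: %zu characters\n", utf8_strlen(s));  /* 5 */
        return 0;
    }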

So is everything hunky-dory then? Nah! Let's look at a UNICODE transformation besides UTF-8, namely UTF-16. This is still a variable length encoding, like UTF-8, and it's not that much used actually. But some people tend to THINK it is. What is being heavily used, as mentioned above in Windows, as well as in many Linux technologies, is UCS-2. So what is that? Well, UCS-2 is based on an ISO standard, ISO 10646. This preceded UNICODE slightly, and was, like early UNICODE, fixed 16-bit. UCS-2 means "Universal Character Set, 2 bytes"! When UNICODE came around, the (then) fixed encoding was UTF-16. These two, UCS-2 and UTF-16, are very often confused. But whereas UCS-2 is still a fixed 16-bit character set encoding, UTF-16 has developed and is now a variable length encoding, though of course still very similar to UCS-2. Gosh, I wonder what these folks were smoking when they figured this one out.
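The difference shows up as soon as a character outside the 16-bit range appears: UTF-16 stores it as a surrogate pair, UCS-2 simply cannot represent it. A small C sketch (assuming well-formed input; utf16_codepoints() is just an illustrative name):

    #include <stdio.h>
    #include <stdint.h>

    /* Count code points in a NUL-terminated array of 16-bit units. Under
       UCS-2 every unit is one character; under UTF-16 a character outside
       the 16-bit range is stored as a surrogate pair: a high surrogate
       (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF). */
    static size_t utf16_codepoints(const uint16_t *s)
    {
        size_t n = 0;
        for (; *s != 0; s++) {
            if (*s >= 0xD800 && *s <= 0xDBFF && s[1] != 0)
                s++;                 /* skip the low surrogate that follows */
            n++;
        }
        return n;
    }

    int main(void)
    {
        /* "A" followed by U+1D11E (musical G clef), which UTF-16 stores as
           the surrogate pair D834 DD1E. */
        const uint16_t s[] = { 0x0041, 0xD834, 0xDD1E, 0x0000 };

        printf("16-bit units: 3\n");
        printf("code points:  %zu\n", utf16_codepoints(s));  /* 2 */
        return 0;
    }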

There is one UNICODE encoding that is fixed length, and that is called UTF-32 (or, if you ask ISO, UCS-4). This is very seldom used in practice.

But UNICODE seems to persist, in particular UTF-8, which is, for example, standardized in Java. As for MySQL, it supports UTF-8 as well as classic ISO-8859-1 (which is called latin1 in MySQL, as you know by now, if you did last week's lesson) and several other character sets. One character set not well supported though is UCS-2. You can define it as a server character set, but a UCS-2 client is not accepted. And I think this may be difficult to implement, there are just too many places where MySQL regards a character as being 8 bits. And UTF-8 is much easier: as long as we don't care about the actual length, we can treat a UTF-8 string as a long ASCII string, and all special characters that we may be looking for in a UTF-8 string, such as CR/LF, ASCII NULL and semicolon, are in the 0-127 range.
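That is why plain byte-oriented parsing keeps working. As a sketch (nothing to do with MySQL's actual parser code), splitting a UTF-8 buffer on semicolons with ordinary C string functions is safe, because no multi-byte UTF-8 character ever contains a byte below 0x80:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Two "statements"; the first contains the two-byte character U+00E9. */
        char buf[] = "SELECT 'caf\xC3\xA9';SELECT 2";

        /* Splitting on ';' byte by byte cannot cut a multi-byte character in
           half, since ';' (0x3B) never occurs inside a UTF-8 sequence. */
        for (char *stmt = strtok(buf, ";"); stmt != NULL; stmt = strtok(NULL, ";"))
            printf("statement: %s\n", stmt);
        return 0;
    }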

Now, let's have a quick look at UTF-8 and MySQL. As UTF-8 is variable length, any string storage, if defined by the number of characters, is bound to also be variable length, right? Ha ha, got you there! No, MySQL will allocate 3 bytes for each character in a UTF-8 string. What, 3 bytes, isn't UTF-8 up to 4 bytes? Yes, but MySQL only supports UNICODE "codepoints" that can be represented with up to 3 UTF-8 bytes.

OK, that's enough for now. I'll finish off this series with a post on collations, some general ideas on character set development and testing, and a few beers.

/Karlsson
