Thursday, November 22, 2012

Character sets, Collations, UTF-8 and all that

Yesterday at the first Swedish MySQL User Group real meeting here in Stockholm, I presented a talk on character sets, collations and stuff like that. If you read this blog, you know that I have written about this before, but the presentation I did yesterday was a fair bit more detailed. You can view the full presentation on slideshare:
 One thing I talked a lot on was collations and how they affect matters, and this has more of an impact than you think, in particular when using UTF-8. You would think that using UTF-8 most character set problems are solved (at least when using 4-byte UTF-8), but no. Collations are still added to this, and there are many of them and the effect of choosing the wrong one can be real bad.

Let me take an example. You would think think that using a UNIQUE or PRIMARY KEY on a text-based column (using something like a VARCHAR or CHAR type) in a table would ensure that any two strings are unique, but that two strings values that are different may coexist in two different rows. Think again.

A collation defines how characters in a character set are sorted and compared. And most localized collations have some weird attributes to them. There are things that linguistics think are reasonable for a particular language, and that are hence present in the UNICODE standard, but it might not be widely accepted by the community at large. So back to my original example. Let's say we are in Sweden, then 4 (yes, four) different collations may be applicable:
  • utf8 binary - This is a plain binary collation, comparisons are done on the binary value of the characters.
  • utf8_unicode - This is a pretty reasonable collations based on some generic compromise in UNICODE on how things are sorted, and are not sorted across the globe. Sort of.
  • utf8_general - This is a simplified, faster general variation compared to utf8_Unicode
  • utf8_swedish - This is a collations that is specific to Sweden with some interesting Swedish specifics.
So lets's see how this work in practice. Lets try a table that looks like this:
CREATE TABLE `utf8_table` (
  `swedishname` char(10) CHARACTER SET utf8

    COLLATE utf8_general_ci NOT NULL,
  PRIMARY KEY (`swedishname`)
) ENGINE=InnoDB;

What happens with this data:
INSERT INTO utf8_table VALUES('A');
INSERT INTO utf8_table VALUES('Ä');
In Sweden, these two are unique (the second A has an umlaut). In the rest of the world, these two are the same, so the above will not work, a PRIMARY KEY error will happen on the second row, despite the characters being different! So we try this instead:
CREATE TABLE `utf8_table` (
  `swedishname` char(10) CHARACTER SET utf8

    COLLATE utf8_swedish_ci NOT NULL,
  PRIMARY KEY (`swedishname`)
) ENGINE=InnoDB;

And with the same data:
INSERT INTO utf8_table VALUES('A');
INSERT INTO utf8_table VALUES('Ä');
And this works as it should work, both rows are inserted!

I will write another blog post on this soon, with some more examples eventually, but for now:
Cheers
/Karlsson
PS. I apologize if you have problems reading the above, it probably comes from the fact that there are embedded Swedish characters in the text :-(

No comments: