Character Sets in Linux or: Why do I See Those Weird Characters?
In the Beginning was ASCII
November 12, 2009
I bet you've seen it too -- email looking like Figure 1.
Sometimes you can figure out from context that â€œ means a left double quote and â€™ means an apostrophe. But who wants to try to read messages like that? What causes them?
Today's article will explain character encoding, how it works in email and web browsers, and how you can make sure that your messages don't look like that.
In the beginning: ASCII
In the beginning there was ASCII: a simple set of 128 characters (7 bits), numbered 0 through 127. You can see the ASCII table by typing man ascii.
ASCII was fine for English and most programming languages. But pretty soon those pesky Spanish, French and German speakers started complaining: Écoutez! ¡Oye! Paß auf!
128 characters weren't enough for all the characters those languages needed. So OS vendors started using that 8th bit. That solved the problem ... for about a month, until Greeks, Russians, Chinese and the rest started demanding ways to type their languages.
Before long there were dozens of encodings for different languages and OSes. ISO-8859 was an attempt to standardize them, and included ISO-8859-1 or "Latin 1" for western European languages, ISO-8859-2 for Central European, and so on.
Some programs switched to the new ISO-8859 standards, but others clung to older encodings, like Windows-1252 (Western European languages) and cp1251 (Microsoft Windows 3.1 Cyrillic). Meanwhile, there was another problem: lots of Asian languages have more than 256 characters.
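To see why competing 8-bit encodings cause trouble, here is a small Python sketch that decodes one and the same byte under several of the code pages mentioned above. The helper name decode_with and the choice of byte 0xE9 are only illustrative assumptions, and ISO-8859-7 (Greek) is added as one more example of the ISO-8859 family.

```python
def decode_with(raw, codecs):
    """Decode the same bytes under each named codec and collect the results."""
    return {name: raw.decode(name) for name in codecs}


# A single byte with the 8th bit set: its meaning depends on the code page.
raw = bytes([0xE9])

results = decode_with(raw, ["latin-1", "cp1252", "cp1251", "iso-8859-7"])

for name, text in results.items():
    print(f"{name:12} {text}")
```

The Western European code pages agree on this byte (é), while the Cyrillic one reads it as й and the Greek one as ι. A program that guesses the wrong code page shows the wrong letter without ever reporting an error.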
Unicode and UTF
Unicode was the solution: a table that includes all characters from all the world's languages. Originally it was intended to fit in two bytes -- but 65,536 (2^16) "code points" turned out not to be enough. Today Unicode has 1,114,112 code points.
Representing 1,114,112 characters requires 21 bits, though, and no one wanted
to make every character three bytes long.
So the Unicode Transformation Formats were developed.
UTF-16 can represent most Unicode characters in two bytes, spilling
over to four bytes for unusual and seldom-used characters. UTF-8 uses
a single byte when possible -- and for simple English messages containing
only the original 128 ASCII characters, UTF-8 is the same as ASCII.
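The trade-offs between these formats can be checked with a short Python sketch. It is only an illustration: the sample characters and the helper name encoded_lengths are assumptions chosen for the example. The last two lines show one common way the garbled quote sequences described at the start of the article can arise, namely UTF-8 bytes being read as Windows-1252.

```python
def encoded_lengths(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for a string.

    "utf-16-le" is used so that no byte-order mark is counted.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# The highest Unicode code point is 0x10FFFF, which needs 21 bits.
max_bits = (0x10FFFF).bit_length()
print("bits needed:", max_bits)

# One sample each: ASCII, Latin-1 range, other two-byte range, beyond two bytes.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = {ch: encoded_lengths(ch) for ch in samples}
for ch, (utf8_len, utf16_len) in lengths.items():
    print(f"U+{ord(ch):04X}  UTF-8: {utf8_len} bytes  UTF-16: {utf16_len} bytes")

# For pure ASCII text, the UTF-8 bytes are identical to the ASCII bytes.
same_as_ascii = "hello".encode("utf-8") == "hello".encode("ascii")
print("UTF-8 equals ASCII for plain English:", same_as_ascii)

# UTF-8 bytes misread as Windows-1252 turn curly quotes into three odd characters.
garbled_quote = "\u201c".encode("utf-8").decode("cp1252")
garbled_apostrophe = "\u2019".encode("utf-8").decode("cp1252")
print(garbled_quote, garbled_apostrophe)
```

In this sketch the ASCII letter takes one byte in UTF-8 and two in UTF-16, while the emoji takes four bytes in both.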