 
 

Mastering Character Sets in Linux (Weird Characters, part 2) - page 3

gucharmap and recode

  • November 25, 2009
  • By Akkana Peck

There's one additional complication: combining forms.

The à in "Voilà!", Unicode 00e0, can be represented in UTF-8 as the (hexadecimal) bytes c3 a0. But it can also be represented another way, as 61 cc 80. Hexadecimal 61 is a regular ASCII 'a', and cc 80 in UTF-8 is a "combining grave accent" (Unicode 0300). An à is an a followed by a combining grave accent.
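
To see both byte sequences for yourself, here is a small sketch. The variable names are just for illustration; the escapes are the code points mentioned above.

```python
import unicodedata

precomposed = u'\u00e0'    # LATIN SMALL LETTER A WITH GRAVE, one code point
decomposed = u'a\u0300'    # plain 'a' followed by COMBINING GRAVE ACCENT

# The two spellings encode to different UTF-8 byte sequences.
precomposed_bytes = precomposed.encode('utf-8')   # the bytes c3 a0
decomposed_bytes = decomposed.encode('utf-8')     # the bytes 61 cc 80

# They look the same on screen but do not compare equal as strings...
print(precomposed == decomposed)

# ...unless both are normalized to the same form first.
print(unicodedata.normalize('NFC', decomposed) == precomposed)
```

The first comparison prints False and the second prints True, which is why text that mixes the two spellings can fail string comparisons and searches even though it looks identical.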

Confused yet? Try viewing character info for an accented character in gucharmap (Figure 3).

figure 3

Click on the link for U+0300 COMBINING GRAVE ACCENT to see that character (Figure 4).

figure 4

Why worry about combining forms? First, if you work with characters from web pages or email, you will see combining accents fairly frequently. Second, you can use them to get the closest ASCII equivalent of a character. Python's unicodedata.normalize can turn regular accented characters into their combining forms; then encode to ASCII with the error handler set to 'ignore' to throw away the accent part and keep just the letter.

>>> import unicodedata
>>> uni = u'Voil\u00e0!'
>>> normalized = unicodedata.normalize('NFKD', uni)
>>> print(normalized.encode('ascii', 'ignore').decode('ascii'))
Voila!

Note the lack of accent over the a.
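
If you do this often, the two steps can be wrapped in a function. This is only a sketch: to_ascii is a made-up name, not part of unicodedata, and it inherits a limitation of the technique. A character with no decomposition (such as ø, U+00F8) is not split into a letter plus an accent, so the 'ignore' handler drops it entirely.

```python
import unicodedata

def to_ascii(text):
    """Hypothetical helper: decompose accented characters, then drop
    every code point that cannot be encoded as ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii(u'Voil\u00e0!'))   # Voila!
print(to_ascii(u'Bj\u00f8rn'))    # Bjrn  (the o-with-stroke vanishes)
```

For characters like that you would need your own replacement table on top of the normalization step.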

Options for unicodedata.normalize are NFC, NFKC, NFD, and NFKD. Refer to the unicodedata documentation for details on how they differ.
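
As a rough guide, the D forms decompose characters, the C forms recompose them afterward, and the K forms additionally replace "compatibility" characters such as ligatures with their plain equivalents. The sketch below runs all four on a two-character sample chosen for illustration: à followed by the fi ligature (U+FB01). The codepoints helper is a made-up name.

```python
import unicodedata

sample = u'\u00e0\ufb01'   # a-grave followed by the 'fi' ligature

def codepoints(text):
    """Hypothetical helper: show a string as a list of U+XXXX code points."""
    return ' '.join('U+%04X' % ord(ch) for ch in text)

for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    print('%-4s %s' % (form, codepoints(unicodedata.normalize(form, sample))))

# Output:
# NFC  U+00E0 U+FB01
# NFD  U+0061 U+0300 U+FB01
# NFKC U+00E0 U+0066 U+0069
# NFKD U+0061 U+0300 U+0066 U+0069
```

NFKD breaks things down the furthest, which is why it is the form used for the ASCII trick above.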

Akkana Peck is a freelance programmer and writer and author of the book "Beginning GIMP: From Novice to Professional". She also uses primitive tools like mutt and PalmOS PDAs and so spends way too much time thinking about converting fancy character sets to simpler ones. You can see her current python charset converter here: ununicode
