Mastering Characters Sets in Linux (Weird Characters, part 2) - page 3
gucharmap and recode
There's one additional complication: combining forms.
The à in "Voilà!", Unicode 00e0, can be represented in UTF-8 as the (hexadecimal) bytes c3 a0. But it can also be represented another way, as 61 cc 80. Hexadecimal 61 is a regular ASCII 'a', and cc 80 in UTF-8 is a "combining grave accent" (Unicode 0300). In this second form, an à is an a followed by a combining grave accent.
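To check those byte sequences yourself, you can encode both spellings in Python. This is only a minimal sketch, assuming Python 3; the variable names are illustrative and not part of any library.

```python
import unicodedata

precomposed = '\u00e0'   # one code point: LATIN SMALL LETTER A WITH GRAVE
combining = 'a\u0300'    # ASCII 'a' followed by COMBINING GRAVE ACCENT

# Hex dump of the UTF-8 bytes for each spelling
precomposed_hex = ' '.join('%02x' % b for b in precomposed.encode('utf-8'))
combining_hex = ' '.join('%02x' % b for b in combining.encode('utf-8'))
print(precomposed_hex)   # c3 a0
print(combining_hex)     # 61 cc 80

# The two strings look the same on screen but differ as code point sequences
print(precomposed == combining)         # False
print(unicodedata.name(combining[1]))   # COMBINING GRAVE ACCENT
```

Because the two spellings compare as unequal, it is usually worth normalizing text to one form before comparing or searching it.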
Confused yet? Try viewing character info for an accented character in gucharmap (Figure 3).
Click on the link for U+0300 COMBINING GRAVE ACCENT to see that character (Figure 4).
Why worry about combining forms? First, if you work with characters from web pages or email, you will see combining accents fairly frequently. In addition, you can use them to get the ASCII equivalent of a character. Python's unicodedata.normalize can turn regular accented characters into their combining forms; then encode to ASCII with the 'ignore' error handler to throw away the accent part and keep just the letter.
>>> import unicodedata
>>> uni = u'Voil\u00e0!'
>>> normalized = unicodedata.normalize('NFKD', uni)
>>> print(normalized.encode('ascii', 'ignore').decode('ascii'))
Voila!
Note the lack of accent over the a.
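If you need this conversion in more than one place, it can be wrapped in a small function. The sketch below assumes Python 3, and to_ascii is a hypothetical helper name, not something provided by unicodedata. It also shows a limitation: a character with no decomposition, such as ø (U+00F8), is not split into a letter plus an accent, so the 'ignore' handler drops it entirely.

```python
import unicodedata

def to_ascii(text):
    """Hypothetical helper: decompose accented characters, then drop non-ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    # 'ignore' silently discards every code point that ASCII cannot represent
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('Voil\u00e0!'))   # Voila!
print(to_ascii('S\u00f8ren'))    # Sren  (the o-with-stroke has no decomposition)
```

If silently losing letters is a problem, passing 'replace' instead of 'ignore' to encode substitutes a question mark for each character it cannot represent, which at least makes the loss visible.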
Options for unicodedata.normalize are NFC, NFKC, NFD, and NFKD. Refer to the unicodedata documentation for details on how they differ.
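Roughly speaking, the "D" forms decompose characters, the "C" forms decompose and then recompose them, and the "K" forms additionally replace compatibility characters such as ligatures with their plain equivalents. The following sketch, assuming Python 3, prints the code points each form produces for a two-character sample; the codepoints function and the sample string are illustrative choices.

```python
import unicodedata

def codepoints(form, text):
    """Return the hex code points of text after normalizing to the given form."""
    result = unicodedata.normalize(form, text)
    return ' '.join('%04x' % ord(ch) for ch in result)

sample = '\u00e0\ufb01'   # a-grave followed by the "fi" ligature (U+FB01)

for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    print(form, codepoints(form, sample))

# NFC 00e0 fb01
# NFD 0061 0300 fb01
# NFKC 00e0 0066 0069
# NFKD 0061 0300 0066 0069
```

Only NFKD both splits the accent from its letter and expands the ligature into ASCII letters, which is why it suits the accent-stripping trick described above.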
Akkana Peck is a freelance programmer and writer and author of the
book "Beginning GIMP: From Novice to
Professional". She also uses primitive tools like mutt and PalmOS
PDAs and so spends way too much time thinking about converting
fancy character sets to simpler ones. You can see her current
Python charset converter here:
ununicode