Mastering Character Sets in Linux (Weird Characters, part 2)

By: Akkana Peck
Wednesday, November 25, 2009 09:55:05 AM EST
URL: http://www.linuxplanet.com/linuxplanet/tutorials/6912/1/

gucharmap and recode

In the last article I talked about Unicode, character sets and encoding -- how accented and special characters are transferred in email and web pages, and why you sometimes see funny characters when the process goes wrong.

But can you fix it when it does go wrong? And if you're a programmer, how should you be handling all these encodings?

gucharmap

First, when you're testing anything involving character encoding, gucharmap is invaluable (Figure 1).

Figure 1

Every Unicode character is in some category, shown in the list on the left -- in addition to Basic Latin, Latin-1 Supplement (accented characters), Greek, Cyrillic, Katakana etc. there are categories for Braille, Cuneiform, punctuation, mathematics, music and so forth.

The Character Details tab tells you the Unicode, UTF-8, UTF-16 and XML/HTML codes for the character.
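
The same details can be computed in a script. The following is a minimal sketch, assuming Python 3 (where every str is Unicode); char_details is a hypothetical helper name, not part of gucharmap:

```python
import unicodedata

def char_details(ch):
    """Return the kinds of codes gucharmap shows for a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "code point": "U+%04X" % ord(ch),
        "UTF-8": " ".join("%02x" % b for b in ch.encode("utf-8")),
        # Big-endian UTF-16 without a byte order mark
        "UTF-16": " ".join("%02x" % b for b in ch.encode("utf-16-be")),
        "XML/HTML": "&#%d;" % ord(ch),
    }

details = char_details("\u201c")   # left curly quote
for label, value in details.items():
    print("%-10s %s" % (label, value))
```

For the left curly quote this reports U+201C, the UTF-8 bytes e2 80 9c and the XML/HTML reference &#8220;.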

If you have a character from a web page or email and don't know what it is, just paste it into gucharmap's Search->Find field (Figure 2).

Figure 2

recode

You can fix some encoding problems using a simple command-line tool: recode, probably available on your Linux distribution.

To experiment with recode, you'll need some test data. You can make a file containing UTF-8 by pasting something from Firefox, which usually pastes UTF-8 even if you copy from a page with another encoding, like this one.

$ cat >voila-utf8
"Voilà!"       <-- paste this string
^D             <-- Type Ctrl-D on a new line

$ cat >curly-utf8
“Curly quotes” <-- paste this string
^D             <-- Type Ctrl-D on a new line

Once you have test data, run recode like this:

$ recode utf8..iso8859-15 <voila-utf8 >voila-8859
$ 

Now voila-8859 should contain an ISO-8859-15 (Latin-9) version of the original UTF-8 string. Of course, you can go the other way too:

$ recode iso8859-15..utf8 <voila-8859 >voila2-utf8
$ diff voila-utf8 voila2-utf8
$               <-- no differences

You can examine the files and compare them with a binary dump program like od, then use gucharmap to verify which characters are which. I find od output a bit hard to read, so I wrote a Python equivalent, bdump.
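
A dump along those lines takes only a few lines of Python. This is a sketch assuming Python 3, and it is not the author's bdump; the name hexdump and the eight-byte line width are arbitrary choices made for illustration:

```python
def hexdump(data, width=8):
    """Return 'offset  hex bytes  printable text' lines for a bytes object."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hexpart = " ".join("%02x" % b for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append("%04x  %-*s  %s" % (offset, width * 3 - 1, hexpart, text))
    return lines

sample = "Voil\u00e0!\n".encode("utf-8")
for line in hexdump(sample):
    print(line)
```

To dump one of the test files instead, read it in binary mode first, for example hexdump(open('voila-utf8', 'rb').read()).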

recode can even map curly quotes ("smartquotes") to regular quotes, if its output format is one that doesn't include curly quotes, like ASCII or ISO8859-15:

$ recode utf8..iso8859-15 <curly-utf8
"Curly quotes"

recode translates those curly quotes to straight ASCII quotes. Very useful! Of course, that means that the translation has lost information -- you can't go back to the original UTF-8. To prevent that, use recode's -s (strict) option.

Encoded Strings in Python

Recode is useful, but it's not very flexible if you want something you can use in a program. How would you do the same thing in Python?

Python has a simple 8-bit String type: that's what you get if you say

>>> str = "Hello, world!"

Python can also represent Unicode strings: put a u in front of the quotes.

>>> uni = u"Hello, world!"
Unicode strings may be 16 or 32 bits wide, depending on how Python was built.

Both types let you specify characters by hexadecimal codes:

>>> str = "Here are some \xe2\x80\x9ccurly quotes in UTF-8\xe2\x80\x9d."
>>> uni = u"Here are some \u201ccurly quotes in Unicode\u201d."

Use encode and decode to convert between strings and Unicode:

>>> str = "Here are some \xe2\x80\x9ccurly quotes\xe2\x80\x9d."
>>> uni = str.decode('utf-8')
>>> uni
u'Here are some \u201ccurly quotes\u201d.'
You can tell the result is a unicode string because of the u''.

>>> print uni.encode('utf-8')
Here are some “curly quotes”.
>>> voila = u'Voil\u00e0!'
>>> print voila.encode('utf-8')
Voilà!
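
The sessions above are Python 2. As a rough sketch of the same round trip, assuming Python 3 (where 8-bit data is the bytes type and every str is already Unicode, so no u prefix is needed):

```python
# bytes -> str with decode, str -> bytes with encode
raw = b"Here are some \xe2\x80\x9ccurly quotes\xe2\x80\x9d."
text = raw.decode("utf-8")
print(repr(text))
print(text.encode("utf-8") == raw)     # the round trip is lossless

voila = "Voil\u00e0!"
print(voila.encode("utf-8"))           # two bytes for the accented letter
print(voila.encode("iso8859-15"))      # one byte for the accented letter
```

The variable names raw, text and voila are illustrative only.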

Very easy -- unless there's a character in the string that isn't legal in the encoding you're using. Suppose you take the Unicode string with curly quotes and try to represent it as ISO8859-15:

>>> print uni.encode('iso8859-15')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/iso8859_15.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u201c' in position 14: character maps to <undefined>

Those curly quote characters, legal in Unicode and in the UTF-8 encoding, don't exist in ISO-8859-15. So Python's encode function throws an error.

In practice, it's amazingly common to see email messages and web pages that claim to be ISO 8859-15 or some similar encoding, yet include characters like "curly quotes" that aren't part of that encoding. You'll see it in mail from people who paste from their word processors with "smart quotes" enabled, and you'll also see it on professional sites like BBC News. There are also lots of sites that don't specify an encoding at all, like Linux Planet, so software dealing with those sites has to guess based on the characters in the page. So your Python code had better be able to handle encoding errors.
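
The same trouble shows up when decoding incoming bytes. One possible approach, sketched here in Python 3 under some loud assumptions: the mislabeled text is really Windows-1252 (cp1252), an encoding this article doesn't otherwise cover, which puts curly quotes at bytes 0x93 and 0x94; and decode_lenient is a hypothetical helper, not a standard function. Since ISO-8859-15 maps those bytes to invisible control characters rather than raising an error, the sketch looks for such characters in the result.

```python
def decode_lenient(raw, declared="iso8859-15"):
    """Decode bytes, second-guessing the declared encoding. Returns (text, encoding)."""
    # Text that isn't UTF-8 rarely passes strict UTF-8 decoding, so try that first.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    text = raw.decode(declared, "replace")
    # C1 control characters (U+0080 to U+009F) almost never occur in real text.
    # Assumption: finding them means the data is really Windows-1252.
    if any("\x80" <= ch <= "\x9f" for ch in text):
        return raw.decode("cp1252", "replace"), "cp1252"
    return text, declared

print(decode_lenient(b"\x93Curly quotes\x94"))
print(decode_lenient(b"Voil\xe0!"))
```

This is a heuristic, not a reliable detector; it will guess wrong on data in encodings it doesn't know about.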

Python's encode method takes an optional second argument, errors. Options are:

strict
    raise a UnicodeError (the default)
ignore
    remove any character that doesn't fit
replace
    replace unknown characters with '?'
xmlcharrefreplace
    replace with XML numeric character references, like '&#8220;'
backslashreplace
    replace with backslashed numeric escapes, like '\u201c'
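
To see what each non-strict option produces, here is a short sketch assuming Python 3 (so the results are bytes objects); the sample string and the results dictionary are illustrative:

```python
text = "\u201cCurly\u201d"   # the word Curly inside curly quotes

results = {}
for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace"):
    results[handler] = text.encode("iso8859-15", handler)
    print("%-18s %r" % (handler, results[handler]))
```

The four results are Curly with no quotes, ?Curly?, &#8220;Curly&#8221; and \u201cCurly\u201d respectively.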

But strings like '?' or '\u201c' might not be what you want to show to your users. Fortunately, you can trap the error and do something more useful:

while uni:
    try:
        encoded = uni.encode('iso8859-15', 'strict')
        print encoded
        break                            # Everything left encoded cleanly

    except UnicodeEncodeError, e:
        # e.args[2] is the first index where the encoding failed,
        # and e.args[3] is the end point.

        # Encode the first part of the string, up to the error point.
        initial = uni[0:e.args[2]]       # First part, maps correctly
        print initial.encode('iso8859-15', 'strict')

        # Do something with the part that caused the error.
        # do_something_with is a placeholder for your own handler.
        bad = uni[e.args[2]:e.args[3]]
        do_something_with(bad)

        # Loop around and continue encoding the rest of uni
        rest = uni[e.args[3]:]           # The rest of the string
        uni = rest

initial, bad and rest are all Unicode strings, and you can do what you like with them -- set up your own translation table, keep a log of characters that don't fit, or anything you like.
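
As one possible take on the translation-table idea, here is a sketch assuming Python 3. FALLBACKS and encode_with_fallback are hypothetical names, and the table covers only four quote characters for the sake of the example:

```python
# Hypothetical translation table for characters that ISO-8859-15 lacks.
FALLBACKS = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}

def encode_with_fallback(text, encoding="iso8859-15", fallbacks=FALLBACKS):
    """Encode text piece by piece, substituting characters that don't fit."""
    pieces = []
    while text:
        try:
            pieces.append(text.encode(encoding, "strict"))
            break                      # the remainder encoded cleanly
        except UnicodeEncodeError as e:
            # e.start and e.end delimit the characters that failed.
            pieces.append(text[:e.start].encode(encoding, "strict"))
            for ch in text[e.start:e.end]:
                # Assumes the replacement strings themselves fit the encoding.
                pieces.append(fallbacks.get(ch, "?").encode(encoding))
            text = text[e.end:]
    return b"".join(pieces)

print(encode_with_fallback("Here are some \u201ccurly quotes\u201d."))
```

Characters missing from the table fall back to '?', the same choice the built-in replace handler makes.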

Combining Forms

There's one additional complication: combining forms.

The à in "Voilà!", Unicode U+00E0, can be represented in UTF-8 as the (hexadecimal) bytes c3 a0. But the same visible character can also be written another way, as 61 cc 80. Hexadecimal 61 is a regular ASCII 'a', and cc 80 in UTF-8 is a "combining grave accent" (Unicode U+0300). In this second form, an à is an a followed by a combining grave accent.
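
The two byte sequences are easy to check. A small sketch, assuming Python 3; utf8_hex is a throwaway helper defined only for this example:

```python
def utf8_hex(s):
    """Return the UTF-8 bytes of s as space-separated hex."""
    return " ".join("%02x" % b for b in s.encode("utf-8"))

composed = "\u00e0"      # à as a single code point
combining = "a\u0300"    # plain a followed by COMBINING GRAVE ACCENT

print(utf8_hex(composed))               # c3 a0
print(utf8_hex(combining))              # 61 cc 80
print(len(composed), len(combining))    # 1 2
print(composed == combining)            # False
```

Note the last line: although the two forms look identical on screen, Python compares them as different strings.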

Confused yet? Try viewing character info for an accented character in gucharmap (Figure 3).

Figure 3

Click on the link for U+0300 COMBINING GRAVE ACCENT to see that character (Figure 4).

Figure 4

Why worry about combining forms? First, if you work with characters from web pages or email, you will see combining accents fairly frequently. But in addition, you can use them to get the ASCII equivalent of a character. Python's unicodedata.normalize can turn regular accented characters into their combining forms; then encode with errors='ignore' to throw away the accent part and keep just the letter.

>>> import unicodedata
>>> uni = u'Voil\u00e0!'
>>> normalized = unicodedata.normalize('NFKD', uni)
>>> print normalized.encode('ascii', 'ignore')
Voila!
Note the lack of accent over the a.

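Options for unicodedata.normalize are NFC, NFKC, NFD, and NFKD. The D forms decompose characters into base letters plus combining marks, the C forms recompose them afterward, and the K variants also replace compatibility characters (such as ligatures) with their plain equivalents. Refer to the unicodedata documentation for details on how they differ.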
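
To make the four forms concrete, here is a sketch assuming Python 3. The sample string, an "fi" ligature followed by "ancé", was chosen only because it contains both a compatibility character and an accent:

```python
import unicodedata

sample = "\ufb01anc\u00e9"   # U+FB01 LATIN SMALL LIGATURE FI, then a n c é

forms = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    forms[form] = unicodedata.normalize(form, sample)
    codes = " ".join("U+%04X" % ord(c) for c in forms[form])
    print("%-5s %d chars: %s" % (form, len(forms[form]), codes))

# Only the K forms split the ligature, so only they keep the f and i in ASCII.
print(forms["NFD"].encode("ascii", "ignore"))
print(forms["NFKD"].encode("ascii", "ignore"))
```

With this input, NFD followed by an ASCII encode drops the ligature entirely, while NFKD keeps its two letters, which is why NFKD is the usual choice when stripping text down to ASCII.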

Akkana Peck is a freelance programmer and writer and author of the book "Beginning GIMP: From Novice to Professional". She also uses primitive tools like mutt and PalmOS PDAs and so spends way too much time thinking about converting fancy character sets to simpler ones. You can see her current Python charset converter here: ununicode

Copyright Jupitermedia Corp. All Rights Reserved.