March 21, 2019

Mastering Characters Sets in Linux (Weird Characters, part 2) - page 2

gucharmap and recode

  • November 25, 2009
  • By Akkana Peck

Recode is useful, but it's not very flexible if you want something you can use in a program. How would you do the same thing in Python?

Python (version 2, which this article uses throughout) has a simple 8-bit string type: that's what you get if you say
>>> str = "Hello, world!"

Python can also represent Unicode strings: put a u in front of the quotes.

>>> uni = u"Hello, world!"
Each character in a Unicode string is stored as 16 or 32 bits, depending on how your Python was built.

Both types let you specify characters by hexadecimal codes:

>>> str = "Here are some \xe2\x80\x9ccurly quotes in UTF-8\xe2\x80\x9d."
>>> uni = u"Here are some \u201ccurly quotes in Unicode\u201d."
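
As a quick check of the difference between the two types, the sketch below compares lengths: the byte string counts bytes, while the Unicode string counts characters. It is written to run unchanged on Python 2.6+ and Python 3, so it uses an explicit b prefix for the byte string and print() with parentheses; the variable names are only illustrative.

```python
# Byte string: each curly quote is spelled out as three UTF-8 bytes.
raw = b"\xe2\x80\x9ccurly\xe2\x80\x9d"
# Unicode string: each curly quote is a single character.
uni = u"\u201ccurly\u201d"

print(len(raw))                     # counts bytes
print(len(uni))                     # counts characters
print(raw.decode('utf-8') == uni)   # same text once the bytes are decoded
```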

Use encode and decode to convert between strings and Unicode:

>>> str = "Here are some \xe2\x80\x9ccurly quotes\xe2\x80\x9d."
>>> str.decode('utf-8')
u'Here are some \u201ccurly quotes\u201d.'
You can tell the result is a unicode string because of the u''.

>>> uni = str.decode('utf-8')
>>> print uni.encode('utf-8')
Here are some “curly quotes”.
>>> voila = u'Voil\u00e0!'
>>> print voila.encode('utf-8')
Voilà!
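
To see why the choice of encoding matters, here is a small sketch that encodes the same text two ways and then decodes it with the wrong codec on purpose. It assumes nothing beyond the standard codecs, and it is written to run on both Python 2.6+ and Python 3; the variable names are made up for the example.

```python
uni = u'Voil\u00e0!'

as_utf8 = uni.encode('utf-8')          # a-grave becomes two bytes
as_latin9 = uni.encode('iso8859-15')   # a-grave becomes one byte

print(repr(as_utf8))
print(repr(as_latin9))

# Decoding with the codec that produced the bytes gets the text back.
assert as_utf8.decode('utf-8') == uni
assert as_latin9.decode('iso8859-15') == uni

# Decoding UTF-8 bytes as ISO-8859-15 raises no error,
# but it yields two wrong characters in place of the a-grave.
wrong = as_utf8.decode('iso8859-15')
print(repr(wrong))
```

That silent wrong answer is the usual source of the "weird characters" this series is about.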

Very easy -- unless there's a character in the string that isn't legal in the encoding you're using. Suppose you take that Unicode string with curly quotes and try to represent it as ISO-8859-15:

>>> print uni.encode('iso8859-15')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/iso8859_15.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u201c' in position 14: character maps to <undefined>

Those curly quote characters, legal in Unicode and in the UTF-8 encoding, don't exist in ISO-8859-15. So Python's encode method raises a UnicodeEncodeError.
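
The position reported in the traceback can also be read from the exception object itself. The following is a minimal sketch, assuming the same curly-quote string as above; start, end and object are standard attributes of UnicodeEncodeError, while failed_at and bad_piece are names invented for the example. It runs on Python 2.6+ and Python 3.

```python
uni = u"Here are some \u201ccurly quotes\u201d."
failed_at = None
bad_piece = None

try:
    uni.encode('iso8859-15')
except UnicodeEncodeError as err:
    failed_at = err.start                        # index of the first bad character
    bad_piece = err.object[err.start:err.end]    # the text that would not encode
    print("failed at index %d on %r" % (failed_at, bad_piece))
```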

In practice, it's amazingly common to see email messages and web pages that claim to be ISO 8859-15 or some similar encoding, yet include characters like "curly quotes" that aren't part of that encoding. You'll see it in mail from people who paste from their word processors with "smart quotes" enabled, and you'll also see it on professional sites like BBC News. There are also lots of sites that don't specify an encoding at all, like Linux Planet, so software dealing with those sites has to guess based on the characters in the page. So your Python code had better be able to handle encoding errors.
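
On the reading side, guessing might look something like the sketch below. This is only one possible heuristic, not a standard algorithm: the function name guess_decode and the order of candidates (UTF-8 first, then Windows-1252 because that is where word-processor "smart quotes" often come from, then Latin-1 as a last resort) are assumptions made for illustration. It runs on Python 2.6+ and Python 3.

```python
def guess_decode(raw):
    """Decode a byte string, returning (text, encoding_that_worked)."""
    # Try UTF-8 first: text in an 8-bit encoding is rarely valid UTF-8
    # by accident, so a successful strict decode is a strong hint.
    for enc in ('utf-8', 'cp1252'):
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            pass
    # latin-1 maps every possible byte to a character, so it cannot fail.
    return raw.decode('latin-1'), 'latin-1'

text, enc = guess_decode(b'\x93quotes\x94')
print("%s %r" % (enc, text))
```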

Python's encode method takes an optional second argument, errors. Options are:

'strict': raise a UnicodeError (the default)
'ignore': remove any character that doesn't fit
'replace': replace unknown characters with '?'
'xmlcharrefreplace': replace with XML representations for the numeric value of the character, like '&#8220;'
'backslashreplace': replace with backslashed numeric references, like '\u201c'
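
To compare the non-raising options side by side, the short sketch below encodes one small string with each of them. The handler names are the ones built into Python; the sample string and the results dictionary are just for illustration. It runs on Python 2.6+ and Python 3.

```python
uni = u"\u201cHi\u201d"   # "Hi" wrapped in curly quotes

results = {}
for handler in ('ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace'):
    results[handler] = uni.encode('iso8859-15', handler)
    print("%-18s %r" % (handler, results[handler]))
```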

But strings like '?' or '\u201c' might not be what you want to show to your users. Fortunately, you can trap the error and do something more useful:

while uni:
    try:
        encoded = uni.encode('iso8859-15', 'strict')
        print encoded
        break                            # Everything that was left fit; done

    except UnicodeEncodeError, e:
        # The part of the string that failed is uni[e.args[2]:e.args[3]]:
        # e.args[2] is the first index where the encoding failed,
        # and e.args[3] is the end point.

        # Encode the first part of the string, up to the error point.
        initial = uni[0:e.args[2]]       # First part, maps correctly
        print initial.encode('iso8859-15', 'strict')

        # Do something with the part that caused the error
        bad = uni[e.args[2]:e.args[3]]

        # Loop around and continue encoding the rest of uni
        rest = uni[e.args[3]:]           # The rest of the string
        uni = rest

initial, bad and rest are all Unicode strings, and you can do what you like with them -- set up your own translation table, keep a log of characters that don't fit, or anything you like.
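
One way to package the translation-table idea is as a custom error handler registered with codecs.register_error, which is part of the standard library. The sketch below is only an illustration: the FALLBACKS table, the unmapped log and the handler name 'translate' are all made up for the example, and a real table would need many more entries. It runs on Python 2.6+ and Python 3.

```python
import codecs

# Hypothetical table: ASCII stand-ins for characters ISO-8859-15 lacks.
FALLBACKS = {
    u'\u201c': u'"',    # left curly double quote
    u'\u201d': u'"',    # right curly double quote
    u'\u2018': u"'",    # left curly single quote
    u'\u2019': u"'",    # right curly single quote
    u'\u2014': u'--',   # long dash
}
unmapped = []           # log of characters with no entry in FALLBACKS

def translate_errors(err):
    """Error handler: substitute from FALLBACKS, log and '?' anything else."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    pieces = []
    for ch in err.object[err.start:err.end]:
        if ch not in FALLBACKS:
            unmapped.append(ch)
        pieces.append(FALLBACKS.get(ch, u'?'))
    # Return the replacement text and the index where encoding should resume.
    return (u''.join(pieces), err.end)

codecs.register_error('translate', translate_errors)

uni = u"Here are some \u201ccurly quotes\u201d \u2603."
encoded = uni.encode('iso8859-15', 'translate')
print(repr(encoded))
print(repr(unmapped))
```

Once registered, the handler name can be passed as the errors argument anywhere 'strict' or 'replace' would be accepted, so the loop above is not needed at each call site.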
