Mastering Character Sets in Linux (Weird Characters, part 2) - page 2
gucharmap and recode
Recode is useful, but it's not very flexible if you want something you can use in a program. How would you do the same thing in Python?
>>> str = "Hello, world!"
Python can also represent Unicode strings: put a u in front of the quotes.
>>> uni = u"Hello, world!"
Unicode strings may be 16 or 32 bits wide, depending on how Python was built.
Both types let you specify characters by hexadecimal codes:
>>> str = "Here are some \xe2\x80\x9ccurly quotes in UTF-8\xe2\x80\x9d."
>>> uni = u"Here are some \u201ccurly quotes in Unicode\u201d."
Use encode and decode to convert between strings and Unicode:
>>> str = "Here are some \xe2\x80\x9ccurly quotes\xe2\x80\x9d."
>>> str.decode('utf-8')
u'Here are some \u201ccurly quotes\u201d.'
You can tell the result is a Unicode string because of the u''.
>>> print uni.encode('utf-8')
Here are some “curly quotes”.
>>> uni = u'Voil\u00e0!'
>>> print uni.encode('utf-8')
Voilà!
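To convince yourself that the byte-string and Unicode forms really are two representations of the same text, here's a quick round-trip sketch. It uses the b'' byte-literal prefix and print() call form so the same lines run unchanged under Python 2.6+ and Python 3:

```python
# Round trip: UTF-8 bytes -> Unicode string -> UTF-8 bytes again.
raw = b'Here are some \xe2\x80\x9ccurly quotes\xe2\x80\x9d.'
uni = raw.decode('utf-8')

# The three UTF-8 bytes \xe2\x80\x9c became the single code point \u201c.
print(repr(uni))

# Encoding back to UTF-8 reproduces the original bytes exactly.
print(uni.encode('utf-8') == raw)
```

Decoding and re-encoding with the same codec is always lossless; it's only when you switch to a smaller encoding, as below, that characters can fail to fit.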
Very easy -- unless there's a character in the string that isn't legal in the encoding you're using. Suppose you take that Unicode string with the curly quotes and try to represent it as ISO 8859-15:
>>> print uni.encode('iso8859-15')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/iso8859_15.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u201c' in position 14: character maps to <undefined>
Those curly quote characters, legal in Unicode and in the UTF-8 encoding, don't exist in ISO-8859-15. So Python's encode function throws an error.
In practice, it's amazingly common to see email messages and web pages that claim to be ISO 8859-15 or some similar encoding, yet include characters like "curly quotes" that aren't part of that encoding. You'll see it in mail from people who paste from their word processors with "smart quotes" enabled, and you'll also see it on professional sites like BBC News. There are also lots of sites that don't specify an encoding at all, like Linux Planet, so software dealing with those sites has to guess based on the characters in the page. So your Python code had better be able to handle encoding errors.
Python's encode method takes an optional second argument, errors. Options are:
- 'strict' (the default): throw a UnicodeError
- 'ignore': remove any character that doesn't fit
- 'replace': replace unknown characters with '?'
- 'xmlcharrefreplace': replace with XML character references for the numeric value of the character, like '&#8220;'
- 'backslashreplace': replace with backslashed escape sequences, like '\u201c'
But strings like '?' or '\u201c' might not be what you want to show to your users. Fortunately, you can trap the error and do something more useful:
while uni != None:
    try:
        encoded = uni.encode('iso8859-15', 'strict')
        print encoded
        uni = None                # Done: the whole string encoded cleanly
    except UnicodeEncodeError, e:
        # e.start is the first index where the encoding failed,
        # and e.end is the index just past the offending character(s).
        # Encode the first part of the string, up to the error point.
        initial = uni[0:e.start]  # First part, maps correctly
        print initial.encode('iso8859-15', 'strict')
        # Do something with the part that caused the error
        bad = uni[e.start:e.end]
        do_something_with(bad)
        # loop around and continue encoding the rest of uni
        uni = uni[e.end:]         # The rest of the string
initial, bad and the remaining part of uni are all Unicode strings, and you can do whatever you like with them -- set up your own translation table, keep a log of characters that don't fit, or anything else.
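For instance, the loop above can be packaged as a small helper that returns the encodable bytes plus a log of every character that didn't fit. This is just a sketch; encode_lossy and its return shape are my own names, not part of Python, and it uses the "except ... as" spelling so it runs under Python 2.6+ and Python 3:

```python
def encode_lossy(uni, encoding='iso8859-15'):
    """Encode uni, collecting characters that the encoding can't represent.

    Returns (encoded_bytes, bad_chars) where bad_chars is a list of the
    Unicode substrings that were dropped. A sketch, not a standard API.
    """
    out = []
    bad_chars = []
    while uni:
        try:
            out.append(uni.encode(encoding, 'strict'))
            uni = u''                       # Everything left encoded cleanly
        except UnicodeEncodeError as e:
            out.append(uni[:e.start].encode(encoding, 'strict'))
            bad_chars.append(uni[e.start:e.end])  # Log what didn't fit
            uni = uni[e.end:]               # Keep going past the bad spot
    return b''.join(out), bad_chars
```

Here the accented à survives (it exists in ISO 8859-15) while the curly quotes end up in the log: encode_lossy(u'Voil\u00e0! \u201cquotes\u201d') returns the bytes b'Voil\xe0! quotes' along with [u'\u201c', u'\u201d'].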