
Mastering Character Sets in Linux (Weird Characters, part 2)

gucharmap and recode

  • November 25, 2009
  • By Akkana Peck

In the last article I talked about Unicode, character sets and encoding -- how accented and special characters are transferred in email and web pages, and why you sometimes see funny characters when the process goes wrong.

But can you fix it when it does go wrong? And if you're a programmer, how should you be handling all these encodings?

gucharmap

First, when you're testing anything involving character encoding, gucharmap is invaluable (Figure 1).

figure 1

Every Unicode character is in some category, shown in the list on the left -- in addition to Basic Latin, Latin-1 Supplement (accented characters), Greek, Cyrillic, Katakana etc. there are categories for Braille, Cuneiform, punctuation, mathematics, music and so forth.

The Character Details tab tells you the Unicode, UTF-8, UTF-16 and XML/HTML codes for the character.
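
If you'd rather get the same details from a script, here is a minimal Python sketch of roughly what that tab reports for one character. The function name char_details is my own invention for illustration, not part of gucharmap, and the UTF-16 form is shown big-endian without a byte order mark.

```python
def hex_bytes(data):
    """Render bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)


def char_details(ch):
    """Return the code point and some encoded forms of a single character.

    Hypothetical helper for illustration; it is not gucharmap's code.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf8": hex_bytes(ch.encode("utf-8")),
        "utf16": hex_bytes(ch.encode("utf-16-be")),  # big-endian, no BOM
        "xml": f"&#{ord(ch)};",                      # decimal character reference
    }


# "\u00e0" is a lowercase a with a grave accent
for key, value in char_details("\u00e0").items():
    print(f"{key:>10}: {value}")
```

Comparing this output against gucharmap for a few characters is a quick sanity check that you're reading the tab correctly.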

If you have a character from a web page or email and don't know what it is, just paste it into gucharmap's Search->Find field (Figure 2).

figure 2
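
When gucharmap isn't handy, Python's standard unicodedata module can answer the same "what is this character?" question. This is only a sketch: the identify function is a hypothetical helper, and it assumes the mystery text has already been decoded into a Python string.

```python
import unicodedata


def identify(text):
    """List each non-ASCII character in text as (char, code point, name)."""
    found = []
    for ch in text:
        if ord(ch) > 127:
            # Fall back to a placeholder for characters that have no name
            name = unicodedata.name(ch, "<unnamed>")
            found.append((ch, f"U+{ord(ch):04X}", name))
    return found


# Curly quotes (U+201C, U+201D) around a word with an accented letter
sample = "\u201cVoil\u00e0!\u201d"
for ch, code, name in identify(sample):
    print(ch, code, name)
```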

recode

You can fix some encoding problems using a simple command-line tool: recode, probably available on your Linux distribution.

To experiment with recode, you'll need some test data. You can make a file containing UTF-8 by pasting something from Firefox, which usually pastes UTF-8 even if you copy from a page with another encoding, like this one.

$ cat >voila-utf8
"Voilà!"       <-- paste this string
^D             <-- Type Ctrl-D on a new line

$ cat >curly-utf8
“Curly quotes” <-- paste this string
^D             <-- Type Ctrl-D on a new line

Once you have test data, run recode like this:

$ recode utf8..iso8859-15 <voila-utf8 >voila-8859
$ 

Now voila-8859 should contain an ISO 8859-15 (Latin-9) version of the original UTF-8 string. Of course, you can go the other way too:

$ recode iso8859-15..utf8 <voila-8859 >voila2-utf8
$ diff voila-utf8 voila2-utf8
$               <-- no differences
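
If you're a programmer, the same round trip can be sketched in Python with the built-in codecs. This is an illustration of the idea, not of how recode works internally, and the byte string below only stands in for the contents of voila-utf8.

```python
# Stand-in for the contents of voila-utf8 ("\u00e0" is a with grave accent)
original = '"Voil\u00e0!"\n'.encode("utf-8")

# UTF-8 -> ISO 8859-15: decode the bytes to text, then encode in the new set
latin = original.decode("utf-8").encode("iso8859-15")

# ISO 8859-15 -> UTF-8: the same two steps in the other direction
round_trip = latin.decode("iso8859-15").encode("utf-8")

# The accented letter takes two bytes in UTF-8 but one in ISO 8859-15
print(len(original), len(latin))
print("round trip matches:", round_trip == original)
```

The important habit is decoding with the encoding the data is actually in, then encoding explicitly to the one you want, rather than relying on defaults.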

You can examine the files and compare them with a binary dump program like od, then use gucharmap to verify which characters are which. I find od output a bit hard to read, so I wrote a Python equivalent, bdump.
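
For a flavor of what such a tool does, here is a tiny hex dumper in Python. To be clear, this is not bdump itself, just a hypothetical minimal sketch; the function name dump and its row layout are my own choices.

```python
def dump(data, width=8):
    """Format bytes as rows of offset, hex values and printable ASCII."""
    pad = width * 3 - 1  # width of a full row of hex values
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{offset:04x}  {hex_part:<{pad}}  {text_part}")
    return rows


# The two bytes c3 a0 in the output are the UTF-8 form of a with grave accent
for row in dump("Voil\u00e0!".encode("utf-8")):
    print(row)
```

To inspect a real file, read it in binary mode, for example open("voila-utf8", "rb").read(), and pass the result to dump.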

recode can even map curly quotes ("smartquotes") to regular quotes, if its output format is one that doesn't include curly quotes, like ASCII or ISO8859-15:

$ recode utf8..iso8859-15 <curly-utf8

recode translates those curly quotes to straight ASCII quotes. Very useful! Of course, that means that the translation has lost information -- you can't go back to the original UTF-8. To prevent that, use recode's -s (strict) option.
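
The same trade-off between a strict conversion and a lossy one shows up in code. In the Python sketch below, the default encoder refuses curly quotes because ISO 8859-15 has no such characters, while an explicit substitution table swaps in look-alikes first. The STRAIGHT table is my own assumption for illustration and is not recode's actual mapping.

```python
# Curly double quotes are U+201C and U+201D
text = "\u201cCurly quotes\u201d"

# Strict: Python's encoder raises an error for characters the target lacks
try:
    text.encode("iso8859-15")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Lossy: replace curly quotes with straight ones before encoding.
# Hypothetical table for illustration, not recode's real mapping.
STRAIGHT = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
})
lossy = text.translate(STRAIGHT).encode("iso8859-15")

print("strict conversion succeeded:", strict_ok)
print(lossy)
```

Once the quotes have been straightened there is no way to tell which ones were originally curly, which is exactly the information loss described above.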
