Mastering Character Sets in Linux (Weird Characters, part 2)
gucharmap and recode
In the last article I talked about Unicode, character sets and encoding -- how accented and special characters are transferred in email and web pages, and why you sometimes see funny characters when the process goes wrong.
But can you fix it when it does go wrong? And if you're a programmer, how should you be handling all these encodings?
First, when you're testing anything involving character encoding, gucharmap is invaluable (Figure 1).
Every Unicode character is in some category, shown in the list on the left -- in addition to Basic Latin, Latin-1 Supplement (accented characters), Greek, Cyrillic, Katakana etc. there are categories for Braille, Cuneiform, punctuation, mathematics, music and so forth.
The Character Details tab tells you the Unicode, UTF-8, UTF-16 and XML/HTML codes for the character.
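If you'd rather get the same details from a script, here is a minimal Python sketch that prints roughly what the Character Details tab shows for one character. The function name `char_details` is made up for this example, and it reports UTF-16 in big-endian byte order, which is an assumption about how you want the bytes displayed.

```python
def char_details(ch):
    """Return code point, UTF-8 bytes, UTF-16 bytes and an HTML numeric reference."""
    def hex_bytes(data):
        # Render bytes as space-separated uppercase hex pairs, e.g. "C3 A0".
        return " ".join(f"{b:02X}" for b in data)

    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf8": hex_bytes(ch.encode("utf-8")),
        "utf16": hex_bytes(ch.encode("utf-16-be")),  # big-endian, no byte order mark
        "html": f"&#{ord(ch)};",
    }


# "\u00e0" is a lowercase a with a grave accent, as in "Voila" spelled properly.
for key, value in char_details("\u00e0").items():
    print(f"{key:>10}: {value}")
```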
If you have a character from a web page or email and don't know what it is, just paste it into gucharmap's Search->Find field (Figure 2).
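For a whole string of mystery characters, a short Python sketch can do a similar lookup using the standard `unicodedata` module. The helper name `describe` is hypothetical, and skipping plain ASCII is just a choice made here to keep the output short.

```python
import unicodedata


def describe(text):
    """Yield (character, code point, Unicode name) for each non-ASCII character."""
    for ch in text:
        if ord(ch) > 127:
            # The second argument is a fallback for characters without a name.
            yield ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")


# Curly quotes around "Voila!" with an accented a, written with escapes
# so the example does not depend on the encoding of this file.
for ch, code, name in describe("\u201cVoil\u00e0!\u201d"):
    print(code, name)
```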
You can fix some encoding problems using a simple command-line tool: recode, probably available on your Linux distribution.
To experiment with recode, you'll need some test data. You can make a file containing UTF-8 by pasting something from Firefox, which usually pastes UTF-8 even if you copy from a page with another encoding, like this one.
$ cat >voila-utf8
"Voilà!"           <-- paste this string
^D                 <-- Type Ctrl-D on a new line
$ cat >curly-utf8
“Curly quotes”     <-- paste this string
^D                 <-- Type Ctrl-D on a new line
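If pasting into a terminal is unreliable on your setup, the same two test files can be written from Python, where the encoding is stated explicitly. This is only a sketch: the `write_samples` helper is invented for this example, and it writes into a temporary directory rather than your current one.

```python
import tempfile
from pathlib import Path

# The two test strings, written with escapes so the bytes are unambiguous.
SAMPLES = {
    "voila-utf8": '"Voil\u00e0!"\n',
    "curly-utf8": "\u201cCurly quotes\u201d\n",
}


def write_samples(directory):
    """Write each sample string to its own UTF-8 file and return the paths."""
    paths = []
    for name, text in SAMPLES.items():
        path = Path(directory) / name
        path.write_bytes(text.encode("utf-8"))
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    for path in write_samples(tmp):
        print(path.name, path.read_bytes())
```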
Once you have test data, run recode like this:
$ recode utf8..iso8859-15 <voila-utf8 >test-8859
Now test-8859 should contain an ISO 8859-15 (Latin-9) version of the original UTF-8 string. Of course, you can go the other way too:
$ recode iso8859-15..utf8 <test-8859 >voila2-utf8
$ diff voila-utf8 voila2-utf8
$                  <-- no differences
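For programmers, the same round trip can be sketched in Python with the built-in codecs. This is not how recode works internally, just the equivalent decode-then-encode step; `recode_bytes` is a name made up for this example.

```python
def recode_bytes(data, source, target):
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)


original = '"Voil\u00e0!"\n'.encode("utf-8")
latin = recode_bytes(original, "utf-8", "iso8859-15")
round_trip = recode_bytes(latin, "iso8859-15", "utf-8")

print(original)                # the accented a takes two bytes in UTF-8
print(latin)                   # and a single byte in ISO 8859-15
print(round_trip == original)  # the conversion back loses nothing here
```

The accented a survives because it exists in both encodings; a character missing from the target would make `encode` raise an error instead.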
You can examine the files and compare them with a binary dump program like od, then use gucharmap to verify which characters are which. I find od output a bit hard to read, so I wrote a Python equivalent, bdump.
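To give an idea of what such a dump involves, here is a small Python sketch. It is not the bdump script itself, only an illustration under my own assumptions about layout: offset, hex bytes, then printable ASCII with dots for everything else.

```python
def dump(data, width=16):
    """Format bytes as lines of offset, hex values, and printable ASCII."""
    pad = width * 3 - 1  # width of a full row of hex pairs with single spaces
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and anything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {text_part}")
    return lines


for line in dump('"Voil\u00e0!"\n'.encode("utf-8")):
    print(line)
```

Bytes above 0x7f show up as dots in the text column, which makes multi-byte UTF-8 sequences easy to spot before you look them up in gucharmap.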
recode can even map curly quotes ("smartquotes") to regular quotes, if its output format is one that doesn't include curly quotes, like ASCII or ISO8859-15:
$ recode utf8..iso8859-15 <curly-utf8
recode translates those curly quotes to straight ASCII quotes. Very useful! Of course, that means that the translation has lost information -- you can't go back to the original UTF-8. To prevent that, use recode's -s (strict) option.
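The trade-off between a lossy and a strict conversion can be sketched in Python too. This is only a rough analogy to recode's behavior, not its implementation: the `STRAIGHT` table and the `to_latin9` helper are invented for this example and cover just four quote characters.

```python
# Hypothetical mapping for this sketch; recode's own tables cover far more.
STRAIGHT = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def to_latin9(text, strict=False):
    """Encode text as ISO 8859-15, straightening curly quotes unless strict."""
    if not strict:
        text = text.translate(STRAIGHT)
    # encode() raises UnicodeEncodeError for characters the target lacks.
    return text.encode("iso8859-15")


sample = "\u201cCurly quotes\u201d"
print(to_latin9(sample))

try:
    to_latin9(sample, strict=True)
except UnicodeEncodeError as err:
    print("strict mode refused:", err.reason)
```

The lossy call succeeds but the result can no longer be turned back into the original string; the strict call fails loudly instead, so you find out before any information is thrown away.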