March 21, 2019

Character Sets in Linux or: Why do I See Those Weird Characters? - page 2

In the Beginning was ASCII

  • November 12, 2009
  • By Akkana Peck
So how do those funny characters get into mail messages?

Every message comes with an encoding. Use the "full headers" feature of your mail client (Fig 2) and look for the charset= part of the Content-Type header.
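If you'd rather check from a script than click through a mail client, here's a minimal Python sketch using the standard library's email module. The sample message, including its address and body, is made up for illustration; a real message would be read from a file or mailbox.

```python
from email import message_from_string

# A made-up message for illustration; the address and text are hypothetical.
raw = (
    "From: someone@example.com\n"
    "Subject: Hello\n"
    "Content-Type: text/plain; charset=iso-8859-1\n"
    "\n"
    "Body text goes here.\n"
)

msg = message_from_string(raw)

# get_content_charset() returns the charset= parameter of the
# Content-Type header in lowercase, or None if there isn't one.
declared = msg.get_content_charset()
print(declared)  # iso-8859-1
```

Keep in mind this only reports what the headers claim, which may not match the bytes actually in the body.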

figure 2

In this case, the headers claim the message was sent as iso-8859-1, so that's how my mailer displays it. But the body of the message contains a series of three-character sequences like â€™ that clearly weren't what the author intended. What happened?

Smart Quotes are Stupid

Most often, it means that someone pasted text from some other program into their mailer before sending it. It's especially bad when someone composes in a word processor, like MS Word (the likely culprit in this case) or OpenOffice, because of a feature called "smart quotes".

My double quotes in the previous sentence are classic ASCII double quotes. But if I'd typed that phrase into OpenOffice, it would have tried to get clever and assumed I meant something else: “smart quotes”. Notice how the quotes are now curly, and the start and end quotes curl in opposite directions (Fig 3).

figure 3

Looks nicer, right? Except that the quotes are no longer ASCII characters. For instance, the left curly double-quote is Unicode U+201C. In UTF-8, it's three bytes; in hexadecimal, they're e2 80 9c.
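You can check those byte values yourself. Here's a small Python sketch that encodes the left curly double-quote as UTF-8 and compares it with the plain ASCII double quote; the variable names are just mine.

```python
left_curly = "\u201c"   # LEFT DOUBLE QUOTATION MARK
ascii_quote = '"'       # plain ASCII double quote, U+0022

curly_bytes = left_curly.encode("utf-8")
ascii_bytes = ascii_quote.encode("utf-8")

# Format each byte as two hex digits, separated by spaces.
curly_hex = " ".join(f"{b:02x}" for b in curly_bytes)
ascii_hex = " ".join(f"{b:02x}" for b in ascii_bytes)

print(curly_hex)  # e2 80 9c
print(ascii_hex)  # 22
```

The ASCII quote stays a single byte in UTF-8, which is why plain ASCII text survives almost any encoding mix-up while curly quotes don't.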

That's no problem as long as the program displaying it on the other end knows it's UTF-8 and decodes the e2 80 9c correctly. But if you paste those three bytes into a mailer that thinks it's ISO-8859-1, the mailer on the other end decodes them wrong -- and displays something like â€œ.
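Here's a minimal Python sketch of that wrong decoding. One assumption to flag: I'm using Python's cp1252 (Windows-1252) codec for the "wrong" side, because many programs display text labeled ISO-8859-1 using the Windows-1252 table. Strict ISO-8859-1 maps the bytes 80 and 9c to invisible control characters, so there you'd see only the â.

```python
# The UTF-8 bytes for the left curly double-quote, U+201C.
utf8_bytes = b"\xe2\x80\x9c"

right = utf8_bytes.decode("utf-8")    # what the author meant
wrong = utf8_bytes.decode("cp1252")   # what a confused mailer might show

print(len(right))  # 1 character
print(len(wrong))  # 3 characters: one per byte

# Strict ISO-8859-1 also yields three characters, but two of them
# are C1 control characters that most programs don't display.
strict = utf8_bytes.decode("iso-8859-1")
```

Each byte becomes its own character under a single-byte encoding, which explains why one curly quote turns into a three-character sequence.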

Fixing Mail

The good news is that many mailers have a way to fix a message like this. Look for a menu labeled something like "Character encoding" (Fig 4).

figure 4

If you're seeing a lot of two- and three-byte sequences, try UTF-8 and see if that helps. Of course, you can also try other settings and see what works best.
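If you only have the already-garbled text, not the original bytes, the same fix can sometimes be done by hand. The sketch below is one way to do it in Python, assuming the text was really UTF-8 but got displayed as Windows-1252; the function name redecode and its parameters are hypothetical, and the approach fails if the garbling lost any bytes along the way.

```python
def redecode(garbled, shown_as="cp1252", actually="utf-8"):
    """Undo a wrong decoding: turn the text back into the bytes it
    came from, then decode those bytes with the intended encoding."""
    return garbled.encode(shown_as).decode(actually)

# "don" + a-circumflex + euro sign + trademark sign + "t":
# a typical garbled curly apostrophe.
garbled = "don\u00e2\u20ac\u2122t"
fixed = redecode(garbled)

print(fixed == "don\u2019t")  # True
```

If the guess about either encoding is wrong, encode() or decode() will usually raise a UnicodeError, which is a hint to try a different pair.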

What about messages you send out?

First, check your mailer's encoding setting and make sure it's something sensible. Most offer some way to configure it. If you're not sure what encoding you want, I recommend using UTF-8 or ISO-8859-15, which is Latin-1 plus a few extra characters like the Euro character.

Second, if possible, avoid pasting from other apps into your mail program. If you need to paste, consider disabling "smart quotes" in your word processor, and try to make sure that word processor is using the same encoding as your mailer.
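If the smart quotes have already made it into your text, one workaround is to swap them for their ASCII equivalents before sending. Here's a minimal Python sketch; the table covers only the four common curly quote characters, and the names SMART_TO_ASCII and straighten are mine, not part of any library.

```python
# Map the four common curly quotes to plain ASCII quotes.
SMART_TO_ASCII = str.maketrans({
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote (also used as an apostrophe)
})

def straighten(text):
    """Return text with curly quotes replaced by ASCII quotes."""
    return text.translate(SMART_TO_ASCII)

pasted = "\u201csmart quotes\u201d aren\u2019t"
plain = straighten(pasted)
print(plain)  # "smart quotes" aren't
```

The result is pure ASCII, so it reads the same whether the receiving mailer assumes UTF-8, ISO-8859-1, or ISO-8859-15.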

Web Pages

There's one more tricky case: pasting from a web page. Web pages, like mail messages, have encodings. If you copy text from a Windows-1252 page and send it as ISO-8859-15, you may end up with the wrong characters on the other end.
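To see why those two encodings don't mix, here's a short Python sketch comparing how each one treats the same byte values. It uses only Python's built-in cp1252 and iso-8859-15 codecs; the variable names are mine.

```python
# In Windows-1252, byte 0x93 is the left curly double-quote.
as_cp1252 = b"\x93".decode("cp1252")

# In ISO-8859-15, the same byte is a C1 control character,
# which most programs show as nothing or as a placeholder box.
as_latin9 = b"\x93".decode("iso-8859-15")

# The euro sign exists in both encodings, but at different bytes.
euro_cp1252 = "\u20ac".encode("cp1252")
euro_latin9 = "\u20ac".encode("iso-8859-15")

print(as_cp1252 == "\u201c")  # True
print(euro_cp1252.hex())      # 80
print(euro_latin9.hex())      # a4
```

So a curly quote copied as a raw Windows-1252 byte has no printable meaning in ISO-8859-15, and even a character both encodings share can end up at the wrong byte.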

But unlike mail messages, web pages have no "Show full headers" setting. Instead, in Firefox, right-click on a page and choose View Page Info (Fig 5).

figure 5

Sometimes the server doesn't specify an encoding and Firefox has to guess. Occasionally, it may guess wrong. If you see bogus characters on a web page, Firefox's View->Character Encoding menu lets you override the encoding with one of your own.

That's most of what you need to know about character encodings as a user. Next time, I'll talk about ways programmers can filter or translate from one encoding to another.

Akkana Peck is a freelance programmer whose credits include a tour as a Mozilla developer. She's also the author of Beginning GIMP: From Novice to Professional.
