Editor's Note: Nobody Expects the ISO-8859-1 Inquisition!

"Mr Wentworth just told me to come in here and say that there was trouble at the mill..."

September 4, 2001

There are times when you're not only wrong... you're wrong in such a way that the darkness of your ignorance spreads out and touches others. A few weeks ago, I was about this wrong, and wrong over something Linux advocates (me included) have been so self-righteous and unpleasant about that it brings heat to the back of my neck as I type.

One of the unhappy parts of working over on LinuxToday is actually preparing stories for posting on that site. I have to traverse a lot of different corners of the web, created with a variety of tools, to keep LinuxToday moving along. Some fairly prominent sites depend on Microsoft tools for production, and have a hit-and-miss attitude toward the ever-present menace of "smart quotes," a.k.a. character entities #145 through #148, a.k.a. "curly quotes," a.k.a. "those quotes that look like question marks (sometimes) under Linux (depending on the application)."

Perl super-hacker Tom Christiansen calls them something more incisive:

"intentional errors designed to destroy the web by subverting open standards and thus secure Microsoft's hegemony."

In particular, Mr. Christiansen is referring to how the specification for the ISO-8859-1 character set works.

Character codes 128 through 159 aren't supposed to display anything in particular because they're reserved for control characters. According to a bit of searching I did, which unearthed a Usenet post from Jamie Zawinski of Netscape fame, Microsoft created a superset of ISO-8859-1 called "ISO-8859-1-Windows-3.1-Latin-1," which helps itself to some characters ISO-8859-1 doesn't include, the smart quotes among them, by assigning them to that range of codes, where non-displaying control characters are supposed to live. In other words, as noted in Mr. Christiansen's commentary (and by just about every other person ever to comment on smart quotes), Microsoft "embraced and extended" a standard. To add to the soup of character set standards, by the way, a freshly saved HTML document produced in Microsoft Word declares its character set to be windows-1252 (which appears to be a variant on the Western European (1252-c) character set).
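
For anyone who would rather see the divergence than take it on faith, here is a minimal sketch in Python (a language chosen for this illustration, not one any of the sources above used). It decodes the same four byte values, 145 through 148, once as ISO-8859-1 and once as windows-1252, using the `latin-1` and `cp1252` codecs from Python's standard library. The helper name `describe` is made up for the example.

```python
import unicodedata

# Byte values 145-148 (0x91-0x94): the four curly quotes in windows-1252.
raw = bytes([145, 146, 147, 148])

def describe(text):
    """Return (code point, Unicode general category) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]

as_latin1 = describe(raw.decode("latin-1"))  # the ISO-8859-1 reading
as_cp1252 = describe(raw.decode("cp1252"))   # the Microsoft reading

print(as_latin1)
# [('U+0091', 'Cc'), ('U+0092', 'Cc'), ('U+0093', 'Cc'), ('U+0094', 'Cc')]
print(as_cp1252)
# [('U+2018', 'Pi'), ('U+2019', 'Pf'), ('U+201C', 'Pi'), ('U+201D', 'Pf')]
```

Category `Cc` means "control character," which is why a client that takes the ISO-8859-1 label at its word has nothing sensible to draw. `Pi` and `Pf` are opening and closing quotation punctuation.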

So, ISO-8859-1-Windows-3.1-Latin-1 is ISO-8859-1-Latin-1, with the exception of some characters Microsoft chose to store up in the attic. The result of this bit of divergence is a deviation from prescribed practice: Web clients are left to their own devices when it comes to rendering these character codes, and all of Microsoft's clients know to render them as smart quotes. Other platforms and clients may or may not. Up until very recently, Linux clients were reliable in their failure to render smart quotes.

Smart quotes are considered bad manners among many people, even people who wouldn't touch anything besides Microsoft software. Web design pages that address the existence of the smart quote usually include tips on how to turn off smart quotes in assorted HTML-producing or -exporting software so as to avoid an appearance of thoughtlessness or (worse, on the Web) blithe ignorance of the (strict standards compliance|impoverishment) of non-Microsoft clients.

More than bad manners, an attempt to take over the Web, or badges of a content author's ignorance, smart quotes are, or were, the "Microsoft detectors" of the Linux world: liberal sprinklings of question marks throughout a document are, or were, a dead giveaway that a Microsoft product was present somewhere in the production pipeline. The reaction of many who start noticing them around the web once they make the move to Linux is vaguely akin to "Rowdy" Roddy Piper's in John Carpenter's They Live as he dons his special alien-spotting glasses and realizes the world is in the grip of a vast conspiracy of skeletoid monsters. The effect is amplified when they visit a few sites that trumpet independence from Microsoft products but show the tell-tales of the conspiracy to destroy the web right on their own pages.

On a Linux site, the presence of smart quotes is often cause for severe reactions. A site like LinuxToday, which is 90% cut-and-pasted content from all over the Web, has to be especially careful because sites vary wildly even internally when it comes to their use of smart quotes, and it's easy to miss a single tell-tale question mark in the midst of three or four paragraphs of text, especially when you spend all day reading sites that require you to subconsciously substitute the appropriate characters.

Sadly, though, it's time to note that the days of curly quotes and their mis-rendering on a Linux browser as an indicator of OS purity are over, depending on the tool that produced them and depending on your browser.

As an unhappy experience a few weeks ago indicated, a few open source tools (notably AbiWord) now produce Unicode character entities that lie outside ISO-8859-1-Latin1's range altogether (using code points above 255, in compliance with "internationalized HTML" as specified in the HTML 4.0 standard) to provide smart quotes in documents exported as HTML.

An informal trial using character codes in keeping with Microsoft's extended ISO-8859-1 character set also indicates that some open source tools (Mozilla and Konqueror 2.2) and proprietary tools running under Linux (Opera and Netscape 4.7) have thrown up their hands and decided to exercise their option (as provided under the HTML standard) to render character codes 147 and 148 the way Microsoft intends: as smart quotes. Amaya doesn't show anything at all.
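
The leniency described above can be imitated outside a browser. As a rough sketch, Python's standard `html.unescape` function follows the later HTML5 parsing rules, which read numeric references 128 through 159 as windows-1252 characters rather than as control characters. That specification postdates the browsers in this trial, so the example shows the behavior only and says nothing about how Mozilla, Konqueror, Opera, or Netscape implement it. The sample string is made up.

```python
import html

# Made-up sample using the Microsoft-style numeric references 147 and 148.
legacy = "&#147;trouble at the mill&#148;"

decoded = html.unescape(legacy)
opening, closing = decoded[0], decoded[-1]

print(f"U+{ord(opening):04X} U+{ord(closing):04X}")
# U+201C U+201D
```

A strictly literal reading would have produced U+0093 and U+0094, two control characters with no glyph at all.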

I've even provided a screenshot of an AbiWord document (which is compliant with the "internationalized HTML" standard) opened in Opera, Konqueror, Netscape 4.7, Mozilla's latest nightly as of the 30th of August, and the W3C's own testbed browser, Amaya (v 5.0). The example, by the way, uses the "right single quote" character, Unicode #x2019. As the screenshot shows, Konqueror had problems, while all the others did not. Konqueror's problems went away, I should note, once I told it to treat the document's native character set as Unicode (UTF-8), whereupon it promptly picked an illegible font with which to render the page.

The long and short of it? Not all clients you can run under Linux are "pure" anymore when it comes to dealing with some characters. Some, in fact, have decided just to honor the Microsoft version of the convention even if it isn't strictly standards-compliant. And some don't seem to honor the internationalized HTML standard, which does allow for characters over and above the 256 code positions available in ISO-8859-1-Latin1. Some will also forgive a document mis-declaring the character set it uses in its headers as ISO-8859-1 and go ahead and render characters that properly belong to the Unicode (UTF-8) character set.

"Cardinal, Read the Charges."

If it seems like I put too much time into rooting out the issues behind this whole quote mark mess, it's because character mis-renderings provide for a special kind of tyranny familiar to anyone who ever worked a Linux site, made the mistake of leaving a thrown character in something he cut-and-pasted into a buffer on some back-end, and dealt with an inbox full of flame: what I've come to think of as the Tool Taliban, a.k.a. the Platform Purity Squad, a.k.a. "the Same People Who Complain About Cookies Being Intrusive But Think Nothing of Demanding That You Explain What You're Running on Your Computer If They Think They Might Not Like It."

These people probably deserve to be ignored... but they're so excitable one welcomes the opportunity to poke at them through the bars of their cages with a sharp stick, and they need to understand that a.) the software I use to get my job done is none of their business and b.) they may need to check the "purity" (with regard to standards compliance) of their own tools before embarking on jihad against someone who might actually be using a 100% standards-compliant, 100% Linux-based production pipeline.

I also spent some time looking into it because a few weeks ago a bit of AbiWord-produced HTML caused the issue to rear its head here on LinuxPlanet, when a column written in that word processor and exported to HTML produced the Unicode character entities for smart quotes we've been discussing, which some browsers (Mozilla, Netscape, Opera) handled fine and others (Konqueror) did not. The ensuing consternation in LinuxToday's talkbacks led me to ignorantly malign AbiWord:

"One wouldn't suspect that a prominent open source project would cause these sorts of problems, either: I guess it's clear now that Microsoft's approach to this issue is gaining converts."

It doesn't appear that AbiWord actually caused any "problems" at all, with the exception of producing HTML source that used legal (as of the Internationalized HTML standard to be found in HTML 4) character entities that some Linux clients don't know how to read. In fairness, the AbiWord output declared ISO-8859-1 as its character set, which might have confused Konqueror, but even changing that to UTF-8 in the document headers and reloading the document did no good.
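
One way to see why the declared character set arguably should not have mattered: under HTML 4, a numeric character reference names a position in the document character set (ISO 10646, which is to say Unicode), whatever encoding the file's bytes happen to use. The sketch below is in Python and uses a made-up one-line fragment, not AbiWord's actual output. It shows that the reference itself is plain ASCII, so the file's bytes are identical under either declaration, while the raw character could not be stored in ISO-8859-1 at all.

```python
import html

# Hypothetical fragment for illustration; not AbiWord's actual output.
fragment = "<p>It&#8217;s legal.</p>"

# The reference is spelled in ASCII, so both encodings give the same bytes.
same_bytes = fragment.encode("iso-8859-1") == fragment.encode("utf-8")

# Resolving the reference yields U+2019 regardless of the declared charset.
resolved = html.unescape(fragment.encode("iso-8859-1").decode("iso-8859-1"))
has_right_single_quote = "\u2019" in resolved

# The raw character, by contrast, has no ISO-8859-1 byte to live in.
try:
    "\u2019".encode("iso-8859-1")
    raw_quote_fits = True
except UnicodeEncodeError:
    raw_quote_fits = False

print(same_bytes, has_right_single_quote, raw_quote_fits)
# True True False
```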

A reader also suggested use of 'demoroniser' to get the bugs out of the document in question, which produced an interesting result: the venerable sanitizer of Microsoft-mangled HTML is silent on the question of the character entities involved, because they simply aren't "moronic." They're legal. Legal enough, anyhow, that complaining about them is something best left to the real purists, who think it was a terrible mistake to ever cave in to the layout-oriented people as much as HTML 4 did in the first place.
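
To make the distinction concrete, here is a toy sanitizer in Python, written in the general spirit of such tools. It is not demoroniser, and the real program's rules are not reproduced here: the four-entry mapping and the function name `plain_quotes` are assumptions for illustration. It flattens the windows-1252 curly-quote bytes to ASCII and, like the result described above, leaves a Unicode numeric reference alone.

```python
# Hypothetical mapping for illustration only; the real tool's rules differ.
CP1252_QUOTES_TO_ASCII = {
    0x91: "'",  # left single quote
    0x92: "'",  # right single quote
    0x93: '"',  # left double quote
    0x94: '"',  # right double quote
}

def plain_quotes(raw: bytes) -> str:
    """Replace windows-1252 curly-quote bytes with ASCII quotes."""
    # latin-1 maps every byte to the code point of the same value,
    # so bytes 0x91-0x94 become U+0091-U+0094 and match the keys above.
    text = raw.decode("latin-1")
    return text.translate(CP1252_QUOTES_TO_ASCII)

sample = b"\x93smart\x94 quotes, and it&#8217;s legal"
cleaned = plain_quotes(sample)
print(cleaned)
# "smart" quotes, and it&#8217;s legal
```

The bytes in the 0x91 to 0x94 range get rewritten; the `&#8217;` reference passes through untouched because there is nothing non-standard about it to repair.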

What's It Mean?

In brief, I was right: some applications to be found in Linux (including an open source app of some prominence) have capitulated on the "smart quote" issue. Others haven't, but have their own snags when it comes to rendering some Unicode character entities, anyhow. But I was also wrong: when a Linux-using reader complains that they see question marks where there were supposed to be certain quote marks, it isn't necessarily an indication that yet another standard has fallen to the creeping insinuations of Redmond.

The interesting question in all of this is how to confront the issue of apparent compliance with a non-standard on the part of open source developers and their projects: something Mozilla has appeared to go ahead with. That's a question I'll leave to the reader.