HTML Basics
Stanford University Libraries & Academic Information Resources

About character sets

Most of us are familiar with the USASCII character set, which includes all of the letters, numbers, and many of the most commonly used punctuation marks. It also includes a number of special control characters (the first 32 characters), which are 'nonprinting'. USASCII is a 7-bit character set, so it contains 128 characters. That is, you can represent 128 characters with 7 bits, but with 8 bits (one byte) you can represent 256 characters. Thus a byte that holds a USASCII character has 128 potential values that go unused, so individual computer system designers have tended to create their own special 'extended' character sets that use the 'upper' 128 characters for special purposes (this is sometimes called the "right hand side" of the character set). DOS has its own "right hand side", as does the Mac.

Of course, because the world is basically a cruddy place and people would generally rather go to war than talk with each other, each of these right hand sides is different from the others (and, to make matters more complicated, they often vary from font to font even on the same type of computer). So the character (i.e. byte) that represents, say, a £ on a Mac (in some particular font) probably represents something else on a DOS machine (even in the same font).
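To make the arithmetic and the mismatch concrete, here is a minimal Python sketch. It assumes that Python's built-in `mac_roman` and `cp437` codecs are reasonable stand-ins for the Mac and DOS "right hand sides" described above; real systems and fonts may differ. The helper name `decode_byte` is made up for this illustration.

```python
# Number of distinct values representable in 7 bits and in 8 bits.
seven_bit = 2 ** 7                        # size of USASCII
eight_bit = 2 ** 8                        # values available in one byte
unused_by_ascii = eight_bit - seven_bit   # the "right hand side"


def decode_byte(value: int, encoding: str) -> str:
    """Interpret a single byte value under the named Python codec."""
    return bytes([value]).decode(encoding)


# The same byte, 0xA3, read under two different vendor character sets.
mac_char = decode_byte(0xA3, "mac_roman")  # assumed stand-in for "the Mac"
dos_char = decode_byte(0xA3, "cp437")      # assumed stand-in for "DOS"

print(seven_bit, eight_bit, unused_by_ascii)  # 128 256 128
print(repr(mac_char), repr(dos_char))         # '£' 'ú'
```

Under these two codecs the byte that is a £ on the Mac side comes out as ú on the DOS side, which is exactly the kind of disagreement described above.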

Now the International Organization for Standardization (ISO) has defined a number of standard character sets, of which Latin 1 (ISO 8859-1) is just one. It includes all of USASCII as its "left hand side", and its "right hand side" contains a fairly useful set of characters modified by diacritics (though not the diacritic characters themselves) as well as some common symbols such as ®, ±, §, etc. Also, as in the "left hand side", the first 32 characters of the "right hand side" are reserved as "control characters". Both Windows and Macintosh software should be able to deal with ISO Latin 1 characters, but they do not necessarily use the same byte-value-to-character mappings. Also, each uses those 32 control character positions differently. The result is that we can't simply take a document from these systems and display it directly on the Web, which is a 'pure' Latin 1 environment, since those characters will frequently be misrepresented.
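One way to get from a platform character set to Latin 1 is to decode the bytes using the platform's mapping and then re-encode them. The following is only a sketch: the function name `to_latin1` is hypothetical, and it assumes Python's `mac_roman` and `cp1252` codecs approximate the Macintosh and Windows mappings mentioned above.

```python
def to_latin1(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a platform character set into ISO 8859-1.

    Raises UnicodeEncodeError if a character has no Latin 1 equivalent.
    """
    text = data.decode(source_encoding)   # bytes -> characters
    return text.encode("latin-1")         # characters -> Latin 1 bytes


# "café" as stored under the (assumed) Mac mapping: é is byte 0x8E.
mac_bytes = "café".encode("mac_roman")
print(mac_bytes)                           # b'caf\x8e'

# The same word in Latin 1: é is byte 0xE9.
print(to_latin1(mac_bytes, "mac_roman"))   # b'caf\xe9'
```

Characters that live in a platform's version of the 32 control positions, such as the curly quote at byte 0x93 in `cp1252`, have no Latin 1 equivalent, so this sketch raises an error for them rather than guessing at a substitute.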

Note that the control characters (the first 32 characters from the "right hand side" of ISO Latin 1) are not permitted in HTML documents.
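A quick check for these forbidden characters might look like the sketch below. It assumes the document is already a sequence of Latin 1 bytes, so the "first 32 characters of the right hand side" are the byte values 0x80 through 0x9F. The function name `find_forbidden_controls` is invented for this example.

```python
def find_forbidden_controls(data: bytes) -> list[int]:
    """Return the offsets of bytes in the range 0x80-0x9F."""
    return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


# Bytes 0x93 and 0x94 are curly quotes on Windows, but fall in the
# control range of Latin 1.
sample = b"He said \x93hello\x94"
print(find_forbidden_controls(sample))   # [8, 14]
```

An empty list means none of the bytes fall in that range; it does not by itself prove that the rest of the document uses the Latin 1 mappings.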