HTML Basics

This guide is intended to be a quick but rigorous introduction to the HyperText Markup Language (HTML) Version 2.0, the markup scheme used to prepare a wide range of documents for dissemination via the World Wide Web. Although HTML has been around for some time now, it is only recently that it has been formalized and its syntax and semantics set down in a consistent--and machine-verifiable--format.
More specifically, this guide describes a subset of HTML 2.0 intended to be used in the preparation of documents for the WWW pages of Stanford University Libraries & Academic Information Resources and ignores a number of components that are not widely used, or have little application for SUL/AIR.
HTML is the lingua franca of the World Wide Web, a formal set of simple markup conventions that enable an author to describe the structure of the content of documents in a device-independent way. In a sense, HTML is a medium by which two strangers communicate without knowing the details of each other's communication habits. The author uses HTML to describe (encode) the components of a document ('this is a paragraph, this a section heading, etc.'), and the reader (the reader's software) interprets that encoding and renders it in a visual form that the reader (the reader's hardware) can use. By this means, an author can disseminate documents without any knowledge of the computing environment of the reader. In fact, if the document is created properly (is a valid HTML 2.0 document), it can be 'displayed' on alternative output devices such as braille recorders and speech synthesizers, making the material accessible to people for whom conventional computer displays are inadequate.
HTML is an application of SGML, the Standard Generalized Markup Language. SGML is a metalanguage used to define the syntax of a (potentially very, very large) set of markup languages, of which HTML is but one. A discussion of SGML is beyond the scope of this guide; readers are strongly urged to read the superb 'A Gentle Introduction to SGML'. For further information on SGML, see Robin Cover's SGML Web Page.
Although there are a variety of tools available for the creation of HTML documents, nothing more than a text editor or word processor is required. HTML pages are simply ASCII texts in which a rather small set of identifying markers called tags are used to identify the notable components, called elements, of the page. The syntax rules are very simple and can easily be learned in a few hours.[1]
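As a rough illustration of tags and elements, here is a minimal sketch of a complete page that could be typed in any text editor. The title and the sentences are invented for this example; only the tag names come from HTML 2.0.

```html
<!-- "title", "h1" and "p" name element types; the text is invented. -->
<html>
<head>
<title>A Sample Page</title>
</head>
<body>
<h1>A Section Heading</h1>
<p>This is a paragraph. The start tag before it and the end tag
after it mark the limits of one "p" element.</p>
</body>
</html>
```

Each element here consists of a start tag, its content, and an end tag; the tags themselves are never shown to the reader.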
There are only 46 elements in HTML 2.0, and of these more than a dozen are intended for special purposes and may be ignored by most authors. In practice, the vast majority of documents can be marked up using only a very few types of tags:
Add to this set a few highlighting tags if necessary, and the
inline image tag (<img>), and we have a very simple
but fairly complete markup vocabulary.
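The highlighting and inline image tags might be used as in the sketch below. The choice of <em> and <strong> as the highlighting tags, the file name logo.gif, and the wording are all assumptions made for illustration.

```html
<!-- logo.gif is a hypothetical file name; the sentences are invented. -->
<p>Validation is <em>strongly</em> recommended, and the result is
<strong>well worth</strong> the extra effort.</p>
<p><img src="logo.gif" alt="Library logo"> The image appears inline
with the text of this paragraph.</p>
```

The alt attribute supplies text for browsers that cannot show the image, which fits the guide's emphasis on device independence.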
Although the syntax rules are simple, they are rules.
Because the language is so simple, and because browsing software has been very forgiving of syntax errors, it is easy to come to believe that any tag can be used anywhere in the document. This has led to the current state of the Web, in which the vast majority of 'legacy documents' are not valid HTML. The HTML 2.0 specification has a happy solution to this legacy: document authors are expected to produce documents that conform to the specification; browser software is expected to be forgiving of syntax errors. The principle, 'be conservative in what you offer, and generous in what you accept', is intended to help the Web move gracefully into the next stage of its evolution. As information providers, we can ensure that the information we provide today has some hope of surviving into the next stage by preparing it in conformance with the developing standard.
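To make 'an allowable tag in an allowable place' concrete, the sketch below contrasts two common slips with a corrected version. The heading and list text are invented; the invalid lines are kept inside a comment so that the fragment as a whole stays valid.

```html
<!-- Not valid: a list item with no enclosing list, and a
     paragraph placed inside a heading.

<li>First point</li>
<h2><p>Findings</p></h2>
-->

<!-- Valid: the same content with each tag in a permitted place. -->
<h2>Findings</h2>
<ul>
<li>First point</li>
</ul>
```

A forgiving browser may well display both versions in much the same way, which is exactly why such errors tend to go unnoticed.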
Fortunately, because HTML 2.0 is defined in SGML, there are tools
available to validate HTML documents, that is, to test the
document against the formal syntax of the markup language and identify
any syntax errors. While these validation tools can only test the
correctness of the syntax of the document--they can't tell you
if you've called something a <paragraph> when it is
really a <blockquote>--they can at least tell you
whether you've put an allowable tag in an allowable place, and this goes
a long way to making the document more useful in the long run.
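An SGML validating parser generally needs to be told which formal definition to test against, and this is conventionally done with a document type declaration on the first line. The sketch below uses the public identifier given in the HTML 2.0 specification; whether a particular validation tool requires the declaration or supplies a default is an assumption to check against that tool.

```html
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head>
<title>Validation Sample</title>
</head>
<body>
<p>A validating parser reads the declaration on the first line to
learn which syntax to test the rest of the document against.</p>
</body>
</html>
```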
Before the formalization of HTML 2.0, the common method of checking a document was to look at it on one or two browsers, and this is still a good practice. While it doesn't tell you whether the document conforms to the standard syntax, it does help identify some other kinds of errors. If your document is valid HTML (as verified by a 'validating parser') but is rendered by browsers in a dramatically odd fashion, you may have violated the 'meaning' of the HTML tagging rules without violating the syntax rules themselves.
Tony Sanders, author of plexus (one of the more popular pieces of Web server software), eloquently answers the question 'Why validate?'
From the library's perspective the principal values of validation are:
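One hypothetical case of markup that passes validation yet violates the 'meaning' of the tags is an ordinary sentence marked up as a top-level heading. The sentence is invented for this sketch.

```html
<!-- Syntactically valid, but the meaning is wrong: this is an
     ordinary sentence, not a heading. -->
<h1>Please return all borrowed materials to the circulation desk
by the end of the quarter.</h1>

<!-- The same sentence marked up as what it actually is. -->
<p>Please return all borrowed materials to the circulation desk
by the end of the quarter.</p>
```

A validator accepts both forms, but most browsers would render the first in very large type, which is the kind of oddity a quick look in a browser will catch.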
HTML is an evolving standard. Even as work on version 2.0 is being brought to completion, work is proceeding on version 2.1 and version 3.0 (which is a dramatically richer language). Documents that are in strict conformance to the 2.0 specification will be easily converted to the newer versions, as conversion scripts will be developed to smooth the migration. Perhaps more importantly, as the Web evolves past HTML, toward document formats not yet developed, the problem of legacy documents becomes critical. Documents prepared in conformance with the specification will be significantly easier to transform into newer forms of encoding, making the old document available to new uses.
If we look at the issue another way, we can think of proper markup as a means to avoid dealing with a document over and over again. If the document is valid (i.e. syntactically and semantically correct), future modifications (adaptations to a new information environment) can be done by computer programs; by paying a little extra attention early in the life of a document, we can 'touch it once', avoiding repeated rewriting.
HTML 2.0 incorporates features developed by the International Committee for Accessible Document Design which make it possible to present valid HTML documents on devices other than computer screens. ICADD describes itself thus:
The International Committee for Accessible Document Design (ICADD) is dedicated to making printed materials accessible to persons with print disabilities. ICADD is an international nonpartisan consortium of representatives from industry, education, and the disabled community.
We believe that advancing computer based publishing, through adaptive computer technology for persons with disabilities, offers the potential to make printed information accessible simultaneously and at no greater cost than the able bodied community enjoys.
A demo of how this can work is online at UCLA. (To see it in action, try copying the URL for the page you are reading and pasting it into the form at HTML to ICADD Transformation Service.) An application of ICADD that may interest librarians is Journal Citations in ICADD.
There are a number of validation tools available. SUL/AIR offers a validation-by-mail service. If you email your HTML document to html@www-sul.stanford.edu, you will receive a report back by email, telling you what errors (if any) your document contains. NB: The document must be sent as ordinary mail and must be pure ASCII text (no word-processing documents, no documents using 'extended character sets', etc.). Non-ASCII text won't survive the trip through the mail. The SUL/AIR Validation-By-Mail Service will run your document through a rigorous SGML validating parser, which will check the syntax of your document, as well as two other programs that will report on both syntax and stylistic errors, giving line number references to the place in your document where each error was found. This service is still under construction and the messages returned are usually rather cryptic. However, by looking at the complete report and comparing it to the lines identified as errors (and perhaps the lines preceding the error), you should be able to figure out what went wrong. As the program is refined, it is hoped that the reports can be made more human-friendly.
If you just want to test a fragment of a document to see if a particular construction or idiom is legal in HTML, you can use the HTML Validation Service. Be sure to select the radio button labelled 'Strict'. This service can also be used to validate a document that is already mounted on a Web server, by providing a URL that points to the document. (In the case of pages on the SUL/AIR server maintained by the Systems Office, all documents mounted are expected to be valid HTML, so this service shouldn't ever report an error on our material, but feel free to put this to the test, and please report any errors found to webmaster@www-sul.stanford.edu.)
Oh, yeah, one more thing. If you validate your documents, you get to
include this cool logo in your document.
There are several HTML books on the market, but all were written before the current version of HTML was finished and all have serious errors as a result. Because of the slow pace of book publishing (compared with the high rate of change on the Internet), these books were written at a time when writing for the Web was a simpler matter; the rules were less formal and the only real question was 'will this look OK in Mosaic?'. Since then, HTML has matured into what will soon be an Internet standard, with a rigorous formal (SGML) definition, and a large number of browsers have come on stage (with many more waiting in the wings). The only way to ensure that your document is usable by all of them is to adhere to the standard, and the current spate of books (and online tutorials) wasn't designed for this environment.
There are a number of tutorials available on the Net, but every one I've seen suffers from the same problem as the books--they are obsolete (as this one will be, almost as soon as it is completed). There were one or two excellent tutorials that, had they been updated, would have made this one unnecessary, but they've mysteriously disappeared from the Net (probably because the authors have book contracts).
Strictly speaking, you have no need for this tutorial. Everything you could ever need to know about HTML is available on the Net. Specifically, the latest version of the standard itself is always available. For those wishing to delve further into the formal descriptions of HTML, a series of resources has been assembled.