HTML Basics
Stanford University Libraries & Academic Information Resources

Preface


This guide is intended to be a quick but rigorous introduction to the HyperText Markup Language (HTML) Version 2.0, the markup scheme used to prepare a wide range of documents for dissemination via the World Wide Web. Although HTML has been around for some time now, it is only recently that it has been formalized and its syntax and semantics set down in a consistent--and machine-verifiable--format.

More specifically, this guide describes a subset of HTML 2.0 intended to be used in the preparation of documents for the WWW pages of Stanford University Libraries & Academic Information Resources and ignores a number of components that are not widely used, or have little application for SUL/AIR.

What is HTML

HTML is the lingua franca of the World Wide Web, a formal set of simple markup conventions that enable an author to describe the structure of the content of documents in a device-independent way. In a sense, HTML is a medium by which two strangers communicate without knowing the details of each other's communication habits. The author uses HTML to describe (encode) the components of a document ('this is a paragraph, this a section heading, etc.'), and the reader (the reader's software) interprets that encoding and renders it in a visual form that the reader (the reader's hardware) can use. By this means, an author can disseminate documents without any knowledge about the computing environment of the reader. In fact, if the document is created properly (is a valid HTML 2.0 document), it can be 'displayed' on alternative output devices such as braille recorders and speech synthesizers, making the material accessible to people for whom conventional computer displays are inadequate.

HTML is an application of SGML, the Standard Generalized Markup Language. SGML is a metalanguage used to define the syntax of a (potentially very, very large) set of markup languages, of which HTML is but one. A discussion of SGML is beyond the scope of this guide; readers are strongly urged to read the superb A Gentle Introduction to SGML. For further information on SGML, see Robin Cover's SGML Web Page.

HTML is easy

Although there are a variety of tools available for the creation of HTML documents, nothing more than a text editor or word processor is required. HTML pages are simply ASCII texts in which a rather small set of identifying markers called tags are used to identify the notable components, called elements, of the page. The syntax rules are very simple and can easily be learned in a few hours.[1]

There are only 46 elements in HTML 2.0, and of these more than a dozen are intended for special purposes and may be ignored by most authors. In practice, the vast majority of documents can be marked up using only a very few types of tags:

Add to this set a few highlighting tags if necessary, and the inline image tag (<img>), and we have a very simple, but fairly complete markup vocabulary.
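
As a rough illustration of how small that vocabulary is, the sketch below marks up a complete page using a title, a heading, paragraphs, a link, a list, one highlighting tag, and an inline image. The particular choice of elements, the file name logo.gif, and the example.org address are assumptions made for this example, not a list taken from this guide.

```html
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head>
<!-- The title element: start tag, content, end tag. -->
<title>A Sample Page</title>
</head>
<body>
<!-- A section heading. -->
<h1>A Sample Page</h1>
<!-- A paragraph containing a highlighting tag and a link. -->
<p>This is a paragraph with <em>emphasized</em> text and
a <a href="http://www.example.org/">link to another page</a>.</p>
<!-- An unordered list with two items. -->
<ul>
<li>First item</li>
<li>Second item</li>
</ul>
<!-- An inline image; the alt text serves non-graphical readers. -->
<p><img src="logo.gif" alt="Sample logo"></p>
</body>
</html>
```

Each pair of tags, such as <p> and </p>, marks the boundaries of one element; nothing in the markup says how the element should look, only what it is.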

Nothing's that easy

Although the syntax rules are simple, they are rules.

Because the language is so simple, and because browsing software has been very forgiving of syntax errors, it is easy to come to believe that any tag can be used anywhere in the document. This has led to the current state of the Web, in which the vast majority of 'legacy documents' are not valid HTML. The HTML 2.0 specification has a happy solution to this legacy: document authors are expected to produce documents that conform to the specification; browser software is expected to be forgiving of syntax errors. The principle, 'be conservative in what you offer, and generous in what you accept', is intended to help the Web move gracefully into the next stage of its evolution. As information providers, we can ensure that the information we provide today has some hope of surviving into the next stage by preparing it in conformance with the developing standard.

Validation

Fortunately, because HTML 2.0 is defined in SGML, there are tools available to validate HTML documents, that is, to test the document against the formal syntax of the markup language and identify any syntax errors. While these validation tools can only test the correctness of the syntax of the document--they can't tell you if you've called something a <paragraph> when it is really a <blockquote>--they can at least tell you whether you've put an allowable tag in an allowable place, and this goes a long way to making the document more useful in the long run.

Before the formalization of HTML 2.0, the common method of checking a document was to look at it on one or two browsers, and this is still a good practice. While it doesn't tell you whether the document conforms to the standard syntax, it does help identify some other kinds of errors. If your document is valid HTML (as verified by a 'validating parser') but is rendered by browsers in a dramatically odd fashion, you may have violated the 'meaning' of the HTML tagging rules without violating the syntax rules themselves.
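
The difference between the two kinds of error can be sketched with a few fragments. All of them are illustrative assumptions rather than examples from the SUL/AIR pages. In the first, a paragraph is placed inside a heading, which the HTML 2.0 content model for headings does not allow, so a validating parser should flag it. The second should pass validation, yet it uses <blockquote> merely to indent text that is not a quotation, a misuse of meaning that no parser can detect.

```html
<!-- Syntax error: a heading may contain only text-level markup,
     so the <p> element is not allowed inside <h2>. -->
<h2><p>Section Title</p></h2>

<!-- Valid syntax, wrong meaning: this is ordinary body text,
     but it has been tagged as a quotation to get indentation. -->
<blockquote>
<p>This sentence is ordinary body text, not a quotation.</p>
</blockquote>

<!-- Valid syntax, right meaning. -->
<h2>Section Title</h2>
<p>This sentence is ordinary body text, not a quotation.</p>
```

Only the first fragment would be caught by validation; the second is the sort of problem that still calls for a human reader and a look at the page in a browser or two.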

Why validate

Tony Sanders, author of plexus (one of the more popular pieces of Web server software), eloquently answers the question 'Why validate?'

From the library's perspective, the principal values of validation are:

How to validate

There are a number of validation tools available. SUL/AIR offers a validation-by-mail service. If you email your HTML document to html@www-sul.stanford.edu, you will receive a report back by email, telling you what errors (if any) your document contains. NB: The document must be sent as ordinary mail and must be pure ASCII text (no word-processing documents, no documents using 'extended character sets', etc.). Non-ASCII text won't survive the trip through the mail.

The SUL/AIR Validation-By-Mail Service will run your document through a rigorous SGML validating parser, which will check the syntax of your document, as well as two other programs that will report on both syntax and stylistic errors, giving line-number references to the place in your document where each error was found. This service is still under construction and the messages returned are usually rather cryptic. However, by looking at the complete report and comparing it to the lines identified as errors (and perhaps the lines preceding the error), you should be able to figure out what went wrong. As the program is refined, it is hoped that the reports can be made more human-friendly.

If you just want to test a fragment of a document to see if a particular construction or idiom is legal in HTML, you can use the HTML Validation Service. Be sure to select the radio button labelled 'Strict'. This service can also be used to validate a document that is already mounted on a Web server, by providing a URL that points to the document. (In the case of pages on the SUL/AIR server maintained by the Systems Office, all documents mounted are expected to be valid HTML, so this service shouldn't ever report an error on our material, but feel free to put this to the test, and please report any errors found to webmaster@www-sul.stanford.edu.)

 [Valid HTML Logo] Oh, yeah, one more thing. If you validate your documents, you get to include this cool logo in your document.

Why another tutorial

There are several HTML books on the market, but all were written before the current version of HTML was finished and all have serious errors as a result of this. Because of the slow pace of book publishing (compared with the high rate of change on the Internet), these books were written at a time when writing for the Web was a simpler matter; the rules were less formal and the only real question was 'will this look OK in Mosaic?'. Since then, HTML has matured into what will soon be an Internet standard, with a rigorous formal (SGML) definition, and a large number of browsers have come on stage (with many more waiting in the wings). The only way to ensure that your document is usable by all of them is to adhere to the standard, and the current spate of books (and online tutorials) weren't designed for this environment.

There are a number of tutorials available on the net but every one I've seen suffers from the same problem as the books--they are obsolete (as this one will be, almost as soon as it is completed). There were one or two excellent tutorials that, had they been updated, would have made this one unnecessary, but they've mysteriously disappeared from the Net (probably because the authors have book contracts).

Strictly speaking, you have no need for this tutorial. Everything you could ever need to know about HTML is available on the Net. Specifically, the latest version of the standard itself is always available. For those wishing to delve further into the formal descriptions of HTML, a series of resources has been assembled.



Walter Henry
Stanford University Libraries and Academic Information Resources