HTML Basics
Stanford University Libraries & Academic Information Resources

Converting plain text to HTML

When you have a non-HTML document to convert to HTML, a few procedures make the job fairly simple. The tedium of markup can be reduced by using the search and replace functions of a text editor or word processor. If the editor can search for regular expressions, or (as in the case of Microsoft Word) can search for 'special' characters such as newlines, whitespace, soft and hard hyphens, etc., these can be exploited too.

Steps in the conversion process

The following procedures should be carried out in order:

  1. Remove soft (end-of-line) hyphens. Since HTML does not generally respect white space or newlines, the browser will rewrap the text, and hyphenated documents will be rendered as broken text, with hyphens and extra spaces in the middle of lines. Unfortunately, with ASCII text it is impossible to distinguish 'soft' hyphens (those inserted to break a word at a line break) from 'hard' hyphens (those that are really part of the word), so when doing a search/replace, it is necessary to examine each match before accepting the replacement. With a word processor, search for the pattern <hyphen><newline>, deleting each hyphen and newline, and replacing them with a space if necessary. Then repeat the process, searching for <hyphen><whitespace><newline>, because ASCII texts frequently have extraneous whitespace at the end of lines. If your editor supports regular expressions, the two searches may be combined into a search for <hyphen><whitespace>*<newline>.
  2. Convert special characters.
    1. Globally replace '&' with &amp; (this must be done before any other replacements, because the replacements themselves contain '&').
    2. Globally replace '<' with &lt;
    3. Globally replace '>' with &gt;
    4. If your text contains any characters besides alphanumerics and punctuation, replace them with entity references. In the case of the foreign accented characters and other characters for which HTML offers 'named' entity references (viz:
      
      Á, á, Â, â,
      Æ, æ, À, à,
      Å, å, Ã, ã,
      Ä, ä,
      Ç, ç,
      É, é, Ê, ê, È, è,
      Ð, ð,
      Ë, ë,
      Í, í, Î, î, Ì, ì, Ï, ï,
      Ñ, ñ,
      Ó, ó, Ô, ô, Ò, ò, Ø, ø, Õ, õ, Ö, ö,
      ß,
      Þ, þ,
      Ú, ú, Û, û, Ù, ù, Ü, ü,
      Ý, ý, ÿ),

      use the character entity references. In the case of the other (non-alpha) characters, such as
      ¢, £, ¤, ¥
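
Step 1 can be sketched in Python as follows. This is only an illustration, not part of the original procedure: the function name join_line_end_hyphens, the is_soft callback, and the sample word list are hypothetical. The pattern is the combined search <hyphen><whitespace>*<newline> described in step 1, and the callback stands in for examining each match by hand.

```python
import re

# <hyphen><whitespace>*<newline>, with the word fragment before the hyphen
# captured and the fragment after the newline peeked at (not consumed).
CANDIDATE = re.compile(r"(\w+)-[ \t]*\r?\n(?=(\w+))")


def join_line_end_hyphens(text, is_soft):
    """Rejoin words broken across lines.

    is_soft(head, tail) is a hypothetical callback that decides, for each
    match, whether the hyphen was inserted only to break the line (True)
    or is really part of the word (False).
    """
    def fix(match):
        head, tail = match.group(1), match.group(2)
        if is_soft(head, tail):
            return head          # soft: drop hyphen and newline
        return head + "-"        # hard: keep hyphen, drop newline
    return CANDIDATE.sub(fix, text)


# Assumed word list standing in for a human decision on each match.
KNOWN_WORDS = {"conversion"}
sample = "The conver-\nsion of well-  \nknown texts"
joined = join_line_end_hyphens(sample, lambda h, t: (h + t) in KNOWN_WORDS)
print(joined)
```

In practice is_soft might prompt the person doing the conversion rather than consult a word list, since plain text alone cannot settle the question.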

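The replacements in step 2 can be sketched the same way. The function name escape_for_html is hypothetical. The sketch relies on the codepoint2name table in Python's standard html.entities module for the named references; falling back to a numeric reference for any character without a name in that table is a choice made for this sketch, not something the steps above specify.

```python
from html.entities import codepoint2name


def escape_for_html(text):
    """Apply the step 2 replacements in the required order."""
    # '&' must come first, otherwise the '&' introduced by the
    # later replacements would itself be escaped.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")

    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)                              # plain ASCII
        elif code in codepoint2name:
            out.append("&" + codepoint2name[code] + ";")  # named reference
        else:
            out.append("&#%d;" % code)                  # assumed fallback
    return "".join(out)


escaped = escape_for_html("AT&T <caf\u00e9>")
print(escaped)
```

Reversing the order of the first three replacements would turn '<' into &amp;lt;, which is why '&' is handled before anything else.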

Walter Henry
Stanford University Libraries and Academic Information Resources