HTML is the language that powers the Web in many respects, as the lingua franca that Web browsers are expected to be able to render. HTML has had unprecedented levels of success, and the uptake is all the more surprising when you realise that it was only invented in 1990, and few people knew about it before 1993.
In fact, although HTML has changed relatively little since those early days, the history of HTML is rather cloudy. However, with a little detective work on the Web, it is possible to reconstruct most of the events that led to the creation and subsequent deployment and acceptance of HTML.
Tim Berners-Lee first started to come up with code for his WWW project in 1990. The first mention of him working on code for processing HyperText can be found in the original HyperText.m file that Tim worked on, dated 25th September 1990.
From the 27th to the 30th November 1990, Tim and Robert Cailliau attended ECHT '90 - the European Conference on Hypertext. After ECHT '90, it appears that Tim had some more ideas about the (probably as yet unnamed) World Wide Web, and in the last few months of 1990, he started to produce more code, and also the first recorded HTML documents.
In fact, the earliest HTML document on the WWW at the moment dates from 13th November - a couple of weeks before the conference - as evidenced by an HTTP HEAD request, which returns "Last-Modified: Tue, 13 Nov 1990 15:17:00 GMT". The page is still functional in most modern Web browsers, and even contains a functional HyperLink!
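For anyone wanting to repeat this kind of check, here is a minimal Python sketch of reading a Last-Modified header. It is only an illustration: the helper names are my own, and the `host` and `path` arguments are placeholders that the caller must supply, since the address of the page is not given here. Only the header value quoted above is actually parsed.

```python
import http.client
from email.utils import parsedate_to_datetime


def parse_last_modified(value):
    """Parse an HTTP date string (as found in Last-Modified) into a datetime."""
    return parsedate_to_datetime(value)


def head_last_modified(host, path):
    """Send a HEAD request and return the parsed Last-Modified date, or None.

    `host` and `path` are hypothetical placeholders chosen by the caller.
    """
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("HEAD", path)  # HEAD returns headers only, no body
        response = conn.getresponse()
        value = response.getheader("Last-Modified")
    finally:
        conn.close()
    return parse_last_modified(value) if value else None


# Parse the header value quoted in the text above (no network access needed).
stamp = parse_last_modified("Tue, 13 Nov 1990 15:17:00 GMT")
print(stamp.date().isoformat())  # 1990-11-13
```

Bear in mind that a Last-Modified value only reflects what the server reports for the file, so it is evidence of age rather than proof.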
So, what was early HTML actually like? The following is the code used in the oldest HTML document referenced above:-
<title>Hypertext Links</title>
<h1>Links and Anchors</h1>
A link is the connection between one piece of
<a href=WhatIs.html>hypertext</a> and another.
These are the tags and attributes evident from the first five days of Dec 1990:-
h1 ol li a a@href
title h2 p
Incredibly enough, it is still possible to create a decent working HTML document using just these tags.
Many more tags were set out in TimBL's early formatting test case, probably used to test the WorldWideWeb browser application on the NeXT:-
But, why these tags? Was there anything that influenced early HTML? Tim had mentioned that some of the early HTML documents were based on an old SGML language that CERN was already using:-
We have included in HTML some tags from the SGML tagset used at and once supported at CERN [...] The HTML parser will ignore tags which it does not understand, and will ignore attributes which it does not understand of CERN-SGML tags. - http://www.w3.org/Test/test
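The "ignore what you do not understand" rule in that quote can be sketched in a few lines of Python. This is an illustration under my own assumptions, not a reconstruction of the original parser: the class and function names are hypothetical, the set of known tags is simply the list from early December 1990 given above, and Python's standard `html.parser` module stands in for whatever parsing code was really used.

```python
from html.parser import HTMLParser

# Assumption: the tags and attributes listed above for early December 1990.
KNOWN_TAGS = {"title", "h1", "h2", "p", "ol", "li", "a"}
KNOWN_ATTRS = {"a": {"href"}}


class TolerantParser(HTMLParser):
    """Keep known tags and attributes, silently skip the rest, keep all text."""

    def __init__(self):
        super().__init__()
        self.kept = []     # (tag, {attribute: value}) for understood tags
        self.ignored = []  # names of tags that were not understood
        self.text = []     # character data, kept whatever tag surrounds it

    def handle_starttag(self, tag, attrs):
        if tag not in KNOWN_TAGS:
            self.ignored.append(tag)  # unknown tag: drop it, keep parsing
            return
        allowed = KNOWN_ATTRS.get(tag, set())
        # Unknown attributes of a known tag are dropped as well.
        self.kept.append((tag, {k: v for k, v in attrs if k in allowed}))

    def handle_data(self, data):
        self.text.append(data)


def parse(markup):
    parser = TolerantParser()
    parser.feed(markup)
    parser.close()
    return parser.kept, parser.ignored, "".join(parser.text)


kept, ignored, text = parse(
    "<h1>Links</h1><box>A <a href=WhatIs.html colour=red>link</a></box>"
)
print(kept, ignored, text)
```

The point of the design is that the text inside an unknown tag such as box still reaches the reader; only the markup itself is discarded.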
I did not know that HTML had been derived from such a language until I stumbled across an interesting set of documents from the 19th December 1990, that contained the following unusual tags:-
bl bib bib@id hp1
xmp h3
box fn
i1 i1@ix dl dt dd dl@compact
bibref bibref@refid
Looking carefully at these documents, they are actually extracts from a large SGMLguid document (SGMLguid was the SGML language at CERN that Tim referred to) last modified by TimBL on the same date. In other words, they weren't actually HTML as I first thought, but Tim was in the process of converting them over: i.e. hyperlinking them together!
After further study of these documents and others in the same subdirectory, it became apparent that most of the early HTML tags were actually taken from the CERN SGMLguid language, which itself was a variant of AAP (an early SGML language). For example, title, hn, p, ol and so on are all apparently taken from this language. The only radical change was the addition of the all-important anchor (<a>) link, without which the WWW wouldn't have taken off.
It took a while to find out about SGMLguid, and I had originally thought that the SGML in its name stood for "Standard Generalized Markup Language". However, after entering some of the tags from the example document into Google, I found the Waterloo SCRIPT GML User's Guide, dating from October 18 1988, and containing a reference to the language "GMLguide". SGMLguid is most likely a corruption of that, or possibly a pun, since the version used at CERN was truly SGML.
The User's Guide from 1988 mentions the tags ADDRESS, BODY, DL, DT, DD, H[0-6], INDEX, LI, OL, P, Q, TITLE, UL, and XMP. The main missing tag is, of course, A (for Anchor).
In fact, the tags from Script GML are largely taken from GML itself. GML was the predecessor to SGML, and was developed by IBM. There is a GML Starter Set Reference guide on the Web, and from this it appears that the language has been in existence since about 1980. This is as far back as HTML can be traced.
A sidenote of some interest may be Michael Friendly's GMLHTML: A GML to HTML Translator for Waterloo Script/GML.
It is incredible to think that most of HTML was already defined as GML and GMLguide, and that Tim wanted to show how one could HyperLink them together. That's why HTML is so basic: because it's actually just a derived version of Hyperlinked SGMLguid, but leaving out much of the typographic junk, to make it easier to learn and parse - I presume.
As if we need any further convincing of HTML's roots in the GMLguide language, here is an excerpt from the SGMLguid document, with a .sgml extension, last modified (i.e. saved) by TimBL on the 19th December, but with a last revised date of April 1990:-
<BODY>
<H1>Introduction
This manual describes how to build a distributed system using the Remote Procedure Call system developed in the Online Group of the DD Division of CERN, the European Particle Physics Laboratory.
<h2> The system
The remote procedure call product consists of two essential parts: an RPC compiler which is used during development of an application, and the RPC run time system, which is part of the run time code. Target systems supported are
<ul>
<li>VAX/VMS,
<li>Unix (Berkley 4.3 or Ultrix or equivalent)
<li>stand&hyphen.alone M680x0 (MoniCa) systems (Valet&hyphen.Plus, etc)
<li>stand&hyphen.alone M6809 systems
<li>M680x0 systems running RMS68K
<li>M680x0 systems running OS9
<li>The IBM&hyphen.PC running TurboPascal or Turbo-C
<li>The Macintosh running TurboPascal or MPW
</ul>
Does that look familiar (after you add </h1> and </h2> tags)? It was actually first written in 1986, showing that the basis for HTML is probably a lot older than people may think it is.
Still, although some basic features were in place by the start of 1991, many features were yet to be added, and the language really came into its own in the years that followed.
Towards the end of 1991, Dan Connolly came on the scene. Here is TimBL trying to describe the basics of HTML to him in October 1991:-
Re: status. Re: X11 BROWSER for WWW
Here is some discussion about the tags -- where it's not in http://info.cern.ch/hypertext/WWW/MarkUp/Tags.html I have updated that document now.
Most of the tags are just style tags: this goes for the headings H1 to H6, the lists UL and OL with list elements LI, the glossary DL with elements DT and DD.
<TITLE> ..<TITLE> is designed to be used for putting in the top banner of a window, or using as the window name. It also is what you would use in a history list. It shouldn't be displayed in the text itself, as usually there is a <H1> heading atteh top of the text anyway. A difference is that thet title is designed to make sense out of context, whereas the heading is within context. For example, a title might be "Formatting Characters for Printf -- C reference manual" whereas the heading may just be "Formatting characters".
The base address tag is not used, nor is highlighting HP1 etc.
Anchors are used! The REL attribute is NOT used.
<ISINDEX> is sent by servers to indicate that they will accept a search given this document name plus keywords. It turns on a search panel when the document is the main window. An even better implementation would have a keyword field at the bottom of the text window if the document is a searchable index. That would make the document more self-contained as an item in the user's eyes, and reduce screen clutter.
<NEXTID> can be ignored by browsers, only needed for editors.
<XMP> and <LISTING> are used to indicate inserted literal text. To make life easier for those writing documents (and because we don't have entities in the code yet) they are special in that EVERYTHING is litteral text until the closing tag - so one can use XMP for giving examples of HTML for example. (We really need an escaping method - the next parser will have simpl entities like "&lt." for "<".) Within XMP or LISTING, newlines are significant (and mean "new line"!)
<PLAINTEXT> is used to indicate that the rest of the file is in fact just ASCII. It turns off SGML parsing completely. It's a fudge for the moment, until we have the document format negociation.
______________________________________
Structure of documents:
In writing a new generic parser, I wondered whether your text object will store the nested structure of a document. At the moment, the document is a linear sequence of styles: you can't have lists within lists, etc. Ideally, it would be able to handle this - although its more difficult for a human writer to handle when formatting the document. I would in fact prefer, instead of <H1>, <H2> etc for headings [those come from the AAP DTD] to have a nestable <SECTION>..</SECTION> element, and a generic <H>..</H> which at any level within the sections would produce the required level of heading.
For a browser, it is quite satisfactory to flatten the structure back into a sequence of styles, but for an editor it isn't.
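To make Tim's suggestion concrete, here is a small Python sketch of how nestable SECTION elements with a generic H heading could be flattened back into the linear sequence of styles that a browser needs. It is my own illustration and was never part of any HTML implementation that I know of: the class and function names are hypothetical, the heading level is simply taken from the SECTION nesting depth, and the cap at six levels is my assumption, mirroring H1 to H6.

```python
from html.parser import HTMLParser


class SectionFlattener(HTMLParser):
    """Flatten nested <section>/<h> markup into a list of (style, text) pairs."""

    def __init__(self):
        super().__init__()
        self.depth = 0           # current <section> nesting depth
        self.in_heading = False  # True while inside <h>..</h>
        self.styles = []         # the flattened, linear output

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            self.depth += 1
        elif tag == "h":
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag == "section":
            self.depth = max(0, self.depth - 1)
        elif tag == "h":
            self.in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_heading:
            # The heading level follows the nesting depth (assumed cap: 1 to 6).
            style = "h%d" % min(max(self.depth, 1), 6)
        else:
            style = "p"
        self.styles.append((style, text))


def flatten(markup):
    parser = SectionFlattener()
    parser.feed(markup)
    parser.close()
    return parser.styles


sample = (
    "<section><h>Guide</h>Intro text"
    "<section><h>Setup</h>Steps</section></section>"
)
for style, text in flatten(sample):
    print(style, text)
```

An editor would keep the nested structure instead of throwing it away like this, which is exactly the distinction drawn in the message above.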
Tim, Dan, and a few others started to work on standardizing the language, and making it easy to implement. It is at this point that things started to get messy...
With 1992 came (some) stability. See the HTML page.
One of the related files contains a very important principle:-
It is required that HTML be a common language between all platforms. This implies no device-specific markup, or anything which requires control over fonts or colors, for example. This is in keeping with the SGML ideal.
However, HTML suffered greatly from the lack of standardization, and the dodgy parsing techniques allowed by Mosaic (in 1993). If HTML had been precisely defined as having to have an SGML DTD, it may not have become as popular as fast, but it would have been architecturally a lot stronger.
The first official standard for HTML (HTML 2.0) came out in November 1995: way too late!
HTML has been in use by the World Wide Web (WWW) global information initiative since 1990. This specification roughly corresponds to the capabilities of HTML in common use prior to June 1994. HTML is an application of ISO Standard 8879:1986 Information Processing Text and Office Systems; Standard Generalized Markup Language (SGML).
The early work on HTML was 99% forwards compatible, and it has certainly withstood the test of time. I hope that by examining the roots of HTML, I have deepened my own and others' understanding of HTML, and where it is going.
Comments on this document are welcome. Please send comments to the author.