The Semantic Web, Taking Form

  1. Overview
  2. Common Understanding
  3. Formats
  4. Power
  5. Schemas And Ontologies
  6. Concerns
  7. Hype
  8. From Theory To Practice

Overview

The Semantic Web is a conceptual information space in which resources identified by URIs can be processed by machines. It operates on the principles of "partial understanding" and "inference" (being able to infer knowledge about new terms from data that you already understand), and hence of evolution and transformation. Because URIs are used to represent the resources, systems can grow on a globally decentralized basis, much as hypertext documentation systems did on the early WWW.

Once data is given a URI, it can be referred to by anyone else, and so complex and intricate relationships can be built, queried, and processed. At the base of this plan is the hope that people will start publishing their data in RDF, the Resource Description Framework, which is commonly serialized in an XML-based format. The RDF data model is easy to learn: it consists of a series of "triples", each relating a subject to an object via a property, with each part typically identified by a URI, and these can be used to model any relationship between data that you care to think of. Thus, all data stored in the system is easily readable and processable.
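
As a rough sketch (the URIs below are invented purely for illustration), here is a single such triple, written out in the plain-text Notation3 form discussed under Formats below:

    <http://example.org/people#sean> <http://example.org/terms#worksOn> <http://example.org/projects#earl> .

That one line says, in a machine-readable way, that the resource identified by the first URI stands in the "worksOn" relationship to the resource identified by the third.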

Once there is a significant amount of data in RDF, we can start to actually utilise it. We can merge two totally independent sources to come up with aggregated content of a kind that would usually take a tremendous amount of work to produce. It still takes some effort to merge RDF, but tools will make this job trivial. The next level is the inference level. This describes the capability of the data to be more or less self-describing, i.e. grounded in terms that one's processor already knows, so that it may work with new terms as it comes across them. For example, your processor may understand term A, but not term B. Once it comes across a piece of RDF stating that term A is equivalent in some respect to term B, it should be able to use this information to its advantage. Using this, it becomes possible to convert some languages into one another. It also becomes possible to partially update data from version x of a language to version x+1, or vice versa as the case may warrant.
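
For instance - a hypothetical sketch using the DAML+OIL equivalence vocabulary, with the example prefixes invented - a single statement is enough to tell a processor that an unfamiliar property means much the same thing as one it already understands:

    @prefix daml: <http://www.daml.org/2001/03/daml+oil#> .
    @prefix a: <http://example.org/vocabA#> .
    @prefix b: <http://example.org/vocabB#> .

    b:title daml:samePropertyAs a:title .

A processor that already groks a:title can now make some use of data written with b:title, without anyone having to rewrite that data.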

The logic and digital signature steps come after this. They are mostly theoretical at the moment, but it should be possible to authenticate a piece of data and then come to very powerful conclusions based upon that knowledge - access control and bank account details are obvious examples. Proving these facts is also important, so proof languages will have to be developed.

Common Understanding

On the Semantic Web, there are two types of data - primitive and derived. Primitive data is data that is natively understood by processors, i.e. built in, or that has well-published prose definitions that people understand. Inference languages, FOL, proof languages, simple indexing vocabularies, and so forth are all types of primitive data (not primitive as in simple, but primitive as in the fundamental layer). Derived data is data which is inferred from the primitive data. It may also be possible to derive similar primitive data from primitive data, although that is some way off in the future.

So, with the Semantic Web, what we understand as being common terms will eventually collect into distinct groups. No one will have to analyse data; they can simply accept it. Of course, there will be a great amount of localization too - if you already know of a certain set of terms, you may as well use them rather than creating your own. However, if you went ahead and created your own terms, and then learnt that there was a similar but wider vocabulary being produced elsewhere, what would you normally do? Scrap your own code and use the new lot? With the Semantic Web, it should be possible to simply point out the inferences between the two sets of data and use them that way. Your processors should then be able to grok the external data, and the external processors, assuming that they utilise Semantic Web technologies as well, will be able to process your data.

Common understanding is something that will most likely only ever happen on a localized and deliberate scale, rather than a global one. For example, if you use the term "Car" in your language somewhere, there could be thousands of definitions of "Car" that your processor knows about. It should not be expected to recognize them until it has to process them first hand. For these reasons, two axioms of the Semantic Web are: 1) use other people's data when possible, and 2) don't expect to have to convert from your language into several thousand others.

Formats

In general, RDF in its XML serialization is the format of choice for the Semantic Web. However, in order to represent the triples as efficiently as possible, several other formats have been developed, for purposes both internal and external to processors. One such external RDF format is Notation3, a plain-text format devised by Tim Berners-Lee (who also came up with the World Wide Web, and the Semantic Web), which is easy to learn and easy to process.
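
To give a brief taste of it (the document URI and the terms below are invented for the example), a few statements about a document in Notation3 might read:

    @prefix dc: <http://purl.org/dc/elements/1.1/> .
    @prefix : <http://example.org/terms#> .

    <http://example.org/docs/note>
        dc:title "An Example Document" ;
        dc:creator "Sean B. Palmer" ;
        :status "draft" .

The semicolons simply let several properties of the same subject be listed without repeating its URI; the same information could equally be written as three separate triples, or serialized as XML RDF.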

RDF Or Semantic Web?

There seems to be a little confusion about the difference between RDF and the Semantic Web. RDF is simply a data model and format that allows people to create machine-readable data. The Semantic Web will be built on top of this data. Hence, when you publish something in RDF, you aren't necessarily creating something "Semantic Web-ish", but you are possibly making your data available to Semantic Web processors, if it means anything to them.

Already, we are starting to see companies and products claiming "our product gives you the power of the Semantic Web", when in fact they're probably talking about something completely different, and haven't got a clue what is really meant by the term "Semantic Web". That's not necessarily a bad thing, but it means that the public is going to have to be aware of this, and the real Semantic Web developers are going to have to be careful when talking about the Semantic Web.

The principle of the Semantic Web is actually fairly basic: machine-readable data on a global basis. Hence, proprietary data formats and proprietary data don't count as being a part of the Semantic Web. Perhaps we should develop a litmus test...

Power

One of the major points that critics have raised thus far is that the Semantic Web as a whole will be highly incoherent, and that unanswerable searches such as "what is love" could easily occur. This is nonsense. The Semantic Web applications emerging at the moment are all very carefully scoped, with little interconnectivity between these closed worlds. You only need to grok as much RDF as your processor needs to handle for a specific task - no more and no less. For example, current applications include annotation servers and embedding descriptive RDF into pictures.

Admittedly, if and when more RDF is published, there will be a tremendous amount of conflicting data. This is where explicit trust and implicit scoped trust mechanisms come into play. For example, when aggregating content, you will do so only from a trusted set of sites, and not from all of the sites that you can find on the Web; that would be absurd. When the situation becomes more tricky, for example in business applications, you can check digital signatures and so forth.

Hence, the only real requirement for publishing RDF data is to make sure that it parses correctly and requires minimal human intervention. This is a major point if we want to be able to create a machine-readable Web of data!

Schemas And Ontologies

RDF Schema and DAML, the DARPA Agent Markup Language (+OIL), are two very important base-level RDF languages. Between them, they enable people to define new applications on top of RDF in a structured and interoperable manner. However, both languages are still in development, and as such aren't to be considered 100% stable, even if the majority of terms in them most probably are. It is likely that future versions will undergo different layering.

Schema languages are what give Semantic Web applications a basic level of common understanding. Using RDF Schema, you can group terms into classes, and define which classes a property can be applied to (its domain) and which values it can take on (its range). You can also comment and label terms, as well as providing definition arcs.
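
A small hypothetical RDF Schema fragment in Notation3 (the vehicle terms are invented), showing a class, a property, and the sort of constraints described above:

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix : <http://example.org/vehicles#> .

    :Car a rdfs:Class ;
        rdfs:label "Car" ;
        rdfs:comment "A self-propelled road vehicle." .

    :registeredTo a rdf:Property ;
        rdfs:domain :Car ;
        rdfs:range :Person .

Here rdfs:domain and rdfs:range carry the "applied to" and "take on" constraints mentioned above; :Person would of course be declared as a class in the same way.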

For example, the idea that the [EARL Schemata] URIs reveal a schema that somehow fully describes this language and that it is so simple (only two {count 'em 2} possible "statements"), yet looks like the recipe for flying to Mars is a bit daunting. Its very simplicity enables it to evaluate and report on just about anything - from document through language via guidelines! It is a fundamental tool for the Semantic Web in that it gives "power to the people" who can say anything about anything. - William Loughborough

Ontological and inference languages are a step above this, and provide even more power. You can create inverse terms, transitive terms, equivalences, datatypes, unions, intersections, and so forth.
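
As a hypothetical sketch using the DAML+OIL vocabulary (the family terms are invented), inverse and transitive properties take only a statement each:

    @prefix daml: <http://www.daml.org/2001/03/daml+oil#> .
    @prefix : <http://example.org/family#> .

    :hasParent daml:inverseOf :hasChild .
    :ancestorOf a daml:TransitiveProperty .

From the first statement, a processor given "X :hasChild Y" can conclude "Y :hasParent X"; from the second, it can chain :ancestorOf relationships together.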

Concerns

A number of concerns and issues have been raised about the Semantic Web, and about RDF. The RDF issues list contains reams of issues which are currently being addressed by a W3C Working Group, RDF Core. The eventual aim of RDF Core is to update the RDF Model and Syntax specification, and to advance RDF Schema to full Recommendation status. However, the precise layering and evolution of these technologies is not known, which raises the question: will all of the current work be out of date? It is likely that RDF Core will seek to preserve as much backwards compatibility as possible in the newer versions of the core RDF technologies, but it is already evident that current processors and implementations have followed errors in the current RDF specification, and hence do not parse as they should.

Thankfully, we are helped by the fact that RDF is based on namespaces. A namespace is simply a URI associated with a piece of XML vocabulary, so that you can be sure that a certain piece of data belongs to whatever language the URI says it belongs to. Anything with the SVG namespace is SVG, anything with the XHTML namespace is XHTML, and anything with the RDF namespace is RDF. Therefore, anything that is updated in an incompatible fashion must be given a new namespace. This eases the burden on processors, and on the humans that operate them!
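
For example (the second namespace below is entirely hypothetical), a processor can tell the current vocabulary apart from an incompatibly revised one purely by URI:

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    # a hypothetical, incompatibly revised vocabulary would live at a new URI:
    @prefix rdfx: <http://example.org/2003/revised-rdf#> .

Data using terms from the first namespace is governed by the current RDF specification; data using the second would be governed by whatever the new namespace document says, and the two can never be silently confused.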

Hype

There has been a lot of hype about the Semantic Web, and this has not been a good thing. Spurious claims about what the Semantic Web might and might not be able to do have been choking public understanding of it, and adding to the confusion that many people have. There are now probably as many definitions of the Semantic Web as there are people working on it, and that cannot be a sensible state to be in. At the root of this problem has been the lack of hard applications, and the unwillingness of the people who work on applications to explain the very essence of the Semantic Web to the general public, and to defuse some of the ridiculous threads on some of the common RDF lists. Many people have given up, saying that there is no way to explain to everyone what "the Semantic Web" is without some level of misunderstanding or disagreement somewhere along the line. While that may be true, the Semantic Web will start to suffer from this, and indeed probably already has.

Anti-Hype

Some Semantic Web developers are now being very careful to ensure that they don't over-tout the capabilities (if any!) of the Semantic Web. The "slowly, slowly, building one step at a time" credo has become somewhat pervasive amongst the core Semantic Web developers, and is to some extent both a good and a bad thing - good in the sense that it's best to spend time developing the tools rather than the hype, but bad in that these projects aren't being popularized in the wider community as much as they should be, and the misunderstandings are still present even without the hype.

From Theory To Practice

So, what can the Semantic Web achieve for us in practice right now? There are a number of projects in the running, although many of them have more to do with RDF parsing and so forth than with the Semantic Web proper, and many "Semantic Web" applications come from the KR or Ontology people.

Interestingly, many people don't know of the various Semantic Web projects that are in progress, and don't realise that it is practical to do certain things already utilising Semantic Web technologies.

Annotea. Part of the Semantic Web activity, but pretty basic on the RDF side. Annotea is basically meant for humans to annotate files, so it still follows the "machines should process and then get out of the way" credo, although the information is all available in RDF should anyone want to parse it in the future.

Calendaring. This is still at a very basic level of development, but it has great potential as a real-world application that could then be used by many different tools that grok "dates" of some kind.
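
A hypothetical sketch of what such calendaring data might look like in Notation3 (the vocabulary and event are invented; no standard calendar schema is assumed):

    @prefix cal: <http://example.org/calendar#> .
    @prefix : <http://example.org/events#> .

    :rdfMeetup a cal:Event ;
        cal:summary "RDF interest group meeting" ;
        cal:date "2001-07-15" ;
        cal:location "Bristol" .

Any tool that groks the cal: terms - a diary application, an aggregator, a query engine - could then pick the event up without any special arrangement with its publisher.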

EARL. Well, what can I say about EARL? We've tried our best, we've made use of all the tools, we've made the stuff available. We've produced one of the first examples that I'm aware of, aimed at non-SW folk, of converting from one language to another using a FOL inference language, and there's graph merging stuff too. But the main thing that I've learnt personally from EARL is what a pain in the rear end it is to deploy Semantic Web technologies. If it weren't for CWM...
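
The sort of conversion involved is, roughly, a rule stating that whenever data appears in one vocabulary, the corresponding statement in another vocabulary also holds. A hypothetical sketch of such a rule in Notation3 (both result vocabularies are invented):

    @prefix a: <http://example.org/resultsA#> .
    @prefix b: <http://example.org/resultsB#> .

    { ?doc a:passes ?checkpoint } => { ?doc b:conformsTo ?checkpoint } .

An inference engine applying this rule to data written in the a: vocabulary produces the equivalent b: statements, which is exactly the language-to-language conversion described above.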

RDFWeb. Another public triples repository, with some different features and a different aim. This one provides information on people, their publications, who they know, and so on. This is pretty close to what I envisage as a true "SW project", although it may have limited appeal in terms of user scope: if you know about this site, then you're probably fairly likely to know most of the information that it holds already. Queries are running a little thin, but it's one to look out for.

SWAG-D. A fairly independent public triples repository. Why? For people who want to create new terms (and provide links to terms that already exist), and for people who want to register their new terms. Once again, a good project, but intended for humans - although it goes one step further in that the data is made available to humans so that they can write machine-processable code.

The (DanC) SWAD stuff. For example, the N3 => XHTML screen scraping stuff (minutes, URI schemes), and the N3 => DOT stuff (circles and nodes). Not forgetting Telagent and Palmagent, of course. One thing that people might wonder about the screen scraping stuff, for example, is "why bother to maintain it in N3 and then convert to XHTML? Why not just put it in XHTML first?". The answer is that the Notation3 can be repurposed and combined with other information. It would be nice to see some more examples of this, but once again, the potential is there.

The SWAD stuff and so forth is often powered by Tim Berners-Lee's CWM, the Closed World Machine. Utilising the Notation3 format mentioned earlier, CWM is a powerful closed-world query engine, able to fake inferences through the use of filters. It has proven to be an excellent stepping stone for many into the world of Semantic Web processing on top of RDF. Only certain Prolog interpreters come close, as of writing (2001-06).
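
A rough sketch of a typical CWM session (the file names are invented, and the exact options may vary between versions): the rules are applied to the data, and a filter file keeps only the conclusions of interest.

    # invoked along the lines of:
    #   cwm data.n3 rules.n3 --think --filter=report.n3
    # where report.n3 holds a rule whose conclusions form the entire output:

    @prefix : <http://example.org/terms#> .

    { ?x :worksOn ?project . ?project :partOf ?activity }
        => { ?x :contributesTo ?activity } .

Nothing here is inference in the formal sense - the rules are simply applied until no new triples appear - but for a great many tasks that is exactly what is needed.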

Conclusions

As of 2001-06, the Semantic Web is still in a very early stage of development; there are only a handful of tools, a fairly small community, and a great deal of misunderstanding about the term "Semantic Web". However, work is continuing, and more people are getting on board - as TimBL might say: the bobsleigh is starting to glide...

Please address comments about this document to the author, at sean@mysterylights.com.

Sean B. Palmer, Semantic Web Agreement Group