Chapter 1. What is XML?

Table of Contents
1.1. Introduction
1.2. Highlights of XML
1.3. A complete example: The readme DTD

1.1. Introduction

XML (short for Extensible Markup Language) generalizes the idea that text documents are typically structured in sections, sub-sections, paragraphs, and so on. The format of the document is not fixed (as it is, for example, in HTML), but can be declared by a so-called DTD (document type definition). The DTD describes only the rules for how the document may be structured, not how the document is to be processed. For example, if you want to publish a book that uses XML markup, you will need a processor that converts the XML file into a printable format such as PostScript. On the one hand, the structure of XML documents is configurable; on the other hand, there is no longer a canonical interpretation of the elements of the document. For example, one DTD might require that paragraphs are delimited by para tags, while another DTD expects p tags for the same purpose. As a result, every DTD requires its own processor.

Although XML can be used to express structured text documents, it is not limited to this kind of application. For example, XML can also be used to exchange structured data over a network, or simply to store structured data in files. Note that XML documents cannot contain arbitrary binary data because some characters are forbidden; applications that need to carry binary data must encode it as text (e.g. with the Base64 encoding).
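The Base64 remark can be illustrated with a minimal sketch. It uses Python's standard library rather than PXP, and the element name blob and its encoding attribute are assumptions made up for this example, not part of any DTD discussed here:

```python
import base64
import xml.etree.ElementTree as ET

def wrap_binary(payload):
    """Return an XML document that carries the bytes in payload as Base64 text."""
    # "blob" and the "encoding" attribute are hypothetical names for this sketch.
    elem = ET.Element("blob", {"encoding": "base64"})
    elem.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(elem, encoding="unicode")

def unwrap_binary(xml_text):
    """Recover the original bytes from a document produced by wrap_binary."""
    elem = ET.fromstring(xml_text)
    # An empty payload yields an empty element whose text is None.
    return base64.b64decode(elem.text or "")

raw = bytes([0x00, 0x01, 0xFF, 0x10])  # contains bytes that are illegal in XML text
doc = wrap_binary(raw)
print(doc)  # <blob encoding="base64">AAH/EA==</blob>
```

The byte 0x00 could not appear literally in an XML document, but its Base64 form consists only of ordinary ASCII characters.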

1.1.1. The "hello world" example

The following example shows a very simple DTD and a corresponding document instance. The document consists of sections, sections consist of paragraphs, and paragraphs contain plain text:

<!ELEMENT document (section)+>
<!ELEMENT section (paragraph)+>
<!ELEMENT paragraph (#PCDATA)>

The following document is an instance of this DTD:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE document SYSTEM "simple.dtd">
<document>
  <section>
    <paragraph>This is a paragraph of the first section.</paragraph>
    <paragraph>This is another paragraph of the first section.</paragraph>
  </section>
  <section>
    <paragraph>This is the only paragraph of the second section.</paragraph>
  </section>
</document>

As in HTML (and, of course, in its grandfather SGML), the "pieces" of the document are delimited by tags: such a piece begins with <name-of-the-type-of-the-piece> and ends with </name-of-the-type-of-the-piece>. The pieces are called elements. Unlike in HTML and SGML, start tags and end tags (i.e. the delimiters written in angle brackets) can never be left out. For example, HTML calls paragraphs simply p, and because paragraphs never contain paragraphs, a sequence of several paragraphs can be written as:

<p>First paragraph
<p>Second paragraph

This is not possible in XML; continuing our example above, we must always write

<paragraph>First paragraph</paragraph>
<paragraph>Second paragraph</paragraph>

The rationale behind this is (1) to simplify the development of XML parsers (you need not convert the DTD into a deterministic finite automaton, which is required to detect omitted tags), and (2) to make it possible to parse the document independently of whether the DTD is known or not.
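Point (2) can be tried out with any parser that does not read the DTD. The following is a minimal sketch using Python's standard library parser (not PXP); the helper name is_well_formed is an assumption of this example:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML. No DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# HTML-style markup with omitted end tags, wrapped in a section element.
html_style = ("<section><paragraph>First paragraph"
              "<paragraph>Second paragraph</section>")

# The same content with every end tag written out.
xml_style = ("<section><paragraph>First paragraph</paragraph>"
             "<paragraph>Second paragraph</paragraph></section>")

print(is_well_formed(html_style))  # False
print(is_well_formed(xml_style))   # True
```

The parser rejects the first string purely from the tag structure; it never needs to know which elements a paragraph may contain.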

The first line of our sample document,

<?xml version="1.0" encoding="ISO-8859-1"?>

is the so-called XML declaration. It states that the document follows the conventions of XML version 1.0, and that the document is encoded using characters from the ISO-8859-1 character set (often known as "Latin 1", mostly used in Western Europe). Although the XML declaration is not mandatory, it is good style to include it; everybody sees at first glance that the document uses XML markup and not the similar-looking HTML or SGML markup. If you omit the XML declaration, the parser will assume that the document is encoded as UTF-8 or UTF-16 (there is a rule that makes it possible to distinguish between UTF-8 and UTF-16 automatically); these are encodings of Unicode's universal character set. (Note that PXP, unlike its predecessor "Markup", fully supports Unicode.)
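The effect of the encoding declaration can be observed with a small experiment. This is a sketch using Python's standard library parser, not PXP; the sample word and the variable names are assumptions of the example:

```python
import xml.etree.ElementTree as ET

# The byte 0xE9 is "e with acute accent" in ISO-8859-1.
latin1_doc = (b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
              b'<paragraph>Caf\xe9</paragraph>')

# The same text encoded as UTF-8, without any XML declaration.
utf8_doc = "<paragraph>Caf\u00e9</paragraph>".encode("utf-8")

# ISO-8859-1 bytes without a declaration: the parser assumes UTF-8
# and rejects the lone 0xE9 byte.
undeclared_latin1 = b"<paragraph>Caf\xe9</paragraph>"

def parses(data):
    """Return True if the byte string is accepted by the parser."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

print(ET.fromstring(latin1_doc).text == "Caf\u00e9")  # True
print(ET.fromstring(utf8_doc).text == "Caf\u00e9")    # True
print(parses(undeclared_latin1))                      # False
```

With the declaration, the Latin 1 byte is decoded correctly; without it, the same bytes are not a well-formed document.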

The second line,

<!DOCTYPE document SYSTEM "simple.dtd">

names the DTD that is going to be used for the rest of the document. In general, the DTD may consist of two parts, the so-called external and internal subsets. "External" means that the DTD exists as a second file; "internal" means that the DTD is included in the same file. In this example, there is only an external subset, and the system identifier "simple.dtd" specifies where the DTD file can be found. System identifiers are interpreted as URLs; for instance, this would be legal:

<!DOCTYPE document SYSTEM "http://host/location/simple.dtd">

Please note that PXP cannot interpret HTTP identifiers by default, but it is possible to change the interpretation of system identifiers.

The word immediately following DOCTYPE determines which of the declared element types (here "document", "section", and "paragraph") is used for the outermost element, the root element. In this example it is document, and accordingly the outermost element is delimited by <document> and </document>.
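The relationship between the DOCTYPE line and the root element can be inspected programmatically. The sketch below uses Python's xml.dom.minidom (not PXP), which records the DOCTYPE information but, like many non-validating parsers, does not fetch the external subset; the shortened sample document is an assumption of the example:

```python
from xml.dom import minidom

SOURCE = """<!DOCTYPE document SYSTEM "simple.dtd">
<document>
  <section>
    <paragraph>This is a paragraph of the first section.</paragraph>
  </section>
</document>"""

dom = minidom.parseString(SOURCE)

doctype_name = dom.doctype.name           # the word following DOCTYPE
system_id = dom.doctype.systemId          # the system identifier
root_name = dom.documentElement.tagName   # the outermost element

print(doctype_name, system_id, root_name)  # document simple.dtd document
```

A validating parser would reject the document if doctype_name and root_name differed; this sketch only reads both values.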

The DTD consists of three declarations for element types: document, section, and paragraph. Such a declaration has two parts:

<!ELEMENT name content-model>

The content model is a regular expression which describes the possible inner structure of the element. Here, document contains one or more sections, and a section contains one or more paragraphs. Note that these two element types are not allowed to contain arbitrary text. Only the paragraph element type is declared such that parsed character data (indicated by the symbol #PCDATA) is permitted.
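That a content model really is a regular expression can be made concrete by translating it into the regular expression syntax of a programming language. The following sketch is not how PXP works internally: it handles only element content models (names, parentheses, commas, |, *, + and ?), not #PCDATA, and the function names as well as the second sample model with head, para, and list are assumptions of the example:

```python
import re

# One token of a content model: whitespace, a comma, or an element name.
# All other characters, namely ( ) | * + ?, already mean the same in a Python regex.
TOKEN = re.compile(r"\s+|,|[A-Za-z_:][\w:.\-]*")

def content_model_to_regex(model):
    """Translate an element content model into a compiled Python regex.

    The regex matches a string in which every child element name is
    followed by a single space, e.g. "section section ".
    """
    def translate(match):
        token = match.group(0)
        if token == "," or token.isspace():
            return ""  # sequence is plain concatenation in a regex
        return "(?:" + re.escape(token) + " )"
    return re.compile(TOKEN.sub(translate, model))

def children_match(model, child_names):
    """Check a list of child element names against a content model."""
    encoded = "".join(name + " " for name in child_names)
    return content_model_to_regex(model).fullmatch(encoded) is not None

print(children_match("(section)+", ["section", "section"]))              # True
print(children_match("(section)+", []))                                  # False
print(children_match("(head, (para | list)*)", ["head", "para", "list"]))  # True
```

The trailing space after every name keeps names apart, so that a name which is a prefix of another name cannot cause a false match.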

See below for a detailed discussion of content models.

1.1.2. XML parsers and processors

XML documents are human-readable, but this is not the main purpose of the language. XML has been designed such that documents can be read by a program called an XML parser. The parser checks that the document is correctly formatted, and it represents the document as objects of the programming language. The check has two aspects. First, the document must follow some basic syntactic rules, for example that tags are written in angle brackets and that every start tag has a corresponding end tag. A document respecting these rules is well-formed. Second, the document must match the DTD, in which case the document is valid. Many parsers check only well-formedness and ignore the DTD; PXP is designed such that it can also validate the document.
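The two levels of checking can be told apart with a small sketch. It relies on Python's standard library parser for well-formedness and on a hand-written check for validity; the rules of the sample DTD are hard-coded, so this is an illustration under that assumption and not a general validator (PXP derives the rules from the DTD itself). The names MODELS, is_valid, and classify are made up for the example:

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated rules of the sample DTD: a regex over the sequence of
# child element names, each name followed by one space.
MODELS = {
    "document": r"(section )+",
    "section": r"(paragraph )+",
    "paragraph": r"",  # (#PCDATA): text only, no child elements
}

def is_valid(elem):
    """Check one element and, recursively, its children against MODELS."""
    model = MODELS.get(elem.tag)
    if model is None:
        return False  # undeclared element type
    names = "".join(child.tag + " " for child in elem)
    if re.fullmatch(model, names) is None:
        return False
    if elem.tag != "paragraph":
        # Only white space may appear between the children.
        texts = [elem.text] + [child.tail for child in elem]
        if any(text and text.strip() for text in texts):
            return False
    return all(is_valid(child) for child in elem)

def classify(text):
    """Return 'not well-formed', 'well-formed but invalid', or 'valid'."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return "not well-formed"
    if root.tag != "document" or not is_valid(root):
        return "well-formed but invalid"
    return "valid"

print(classify("<document><section><paragraph>x</paragraph></section></document>"))
print(classify("<document><paragraph>x</paragraph></document>"))
print(classify("<document><section>"))
```

The three calls print valid, well-formed but invalid, and not well-formed, in that order.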

A parser alone does not make a useful application; it only reads XML documents. The whole application working with XML-formatted data is called an XML processor. Often XML processors convert documents into another format, such as HTML or PostScript. Sometimes processors extract data from the documents and output the processed data again as XML. The parser can help the application process the document; for example, it can provide means to access the document in a specific manner. In particular, PXP supports an object-oriented access layer.
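A very small processor of the converting kind might look as follows. This sketch is written with Python's standard library, not with PXP, and the mapping it applies (a numbered h2 heading per section, one p element per paragraph) is an assumption chosen for the example; the DTD itself prescribes no such interpretation:

```python
import xml.etree.ElementTree as ET
from html import escape

def to_html(xml_text):
    """Convert a document/section/paragraph instance into simple HTML."""
    root = ET.fromstring(xml_text)  # the parser does the reading
    parts = ["<html><body>"]
    for number, section in enumerate(root.findall("section"), start=1):
        # The heading text is invented here; sections carry no title in the DTD.
        parts.append("<h2>Section %d</h2>" % number)
        for paragraph in section.findall("paragraph"):
            parts.append("<p>%s</p>" % escape(paragraph.text or ""))
    parts.append("</body></html>")
    return "\n".join(parts)

sample = ("<document><section>"
          "<paragraph>a &lt; b</paragraph>"
          "</section></document>")
print(to_html(sample))
```

The parser delivers the character data with the entity already resolved (a < b), so the processor has to escape it again for the HTML output.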

1.1.3. Discussion

As we have seen, there are two levels of description: on the one hand, XML can define rules about the format of a document (the DTD); on the other hand, XML expresses structured documents. There are a number of possible applications:

  • XML can be used to express structured texts. Unlike HTML, there is no canonical interpretation; one would have to write a backend for the DTD that translates the structured texts into a format that existing browsers, printers etc. understand. The advantage of a self-defined document format is that it is possible to design the format in a more problem-oriented way. For example, if the task is to extract reports from a database, one can use a DTD that reflects the structure of the report or the database. A possible approach would be to have an element type for every database table and for every column. Once the DTD has been designed, the report procedure can be split into a part that selects the database rows and outputs them as an XML document according to the DTD, and a part that translates the document into other formats. Of course, the latter part can be solved in a generic way, e.g. there may be configurable backends for all DTDs that follow this approach and have element types for tables and columns.

    XML plays the role of a configurable intermediate format. The database extraction function can be written without having to know the details of typesetting; the backends can be written without having to know the details of the database.

    Of course, there are traditional solutions. One can define an ad hoc intermediate text file format. The disadvantage is that there are no names for the pieces of the format, and that such formats usually lack documentation because of this. Another solution would be to have a binary representation, either as a language-dependent or as a language-independent structure (examples of the latter can be found in RPC implementations). The disadvantage is that it is harder to view such representations; one has to write pretty printers for this purpose. It is also more difficult to enter test data, whereas XML is plain text that can be written using an arbitrary editor (Emacs even has a good XML mode, PSGML). All these alternatives suffer from a missing structure checker, i.e. the programs processing these formats usually do not check the input file or input object in detail. XML parsers check the syntax of the input (the so-called well-formedness check), and advanced parsers like PXP even verify that the structure matches the DTD (the so-called validation).

  • XML can be used as a configurable communication language. A fundamental problem of every communication is that sender and receiver must follow the same conventions about the language. For data exchange, the question is usually which data records and fields are available, how they are syntactically composed, and which values are possible for the various fields. Similar questions arise for text document exchange. XML does not solve these problems completely, but it reduces the number of ambiguities in such conventions: the outlines of the syntax are specified by the DTD (but not necessarily the details), and XML introduces canonical names for the components of documents, so that it is simpler to describe the rest of the syntax and the semantics informally.

  • XML is a data storage format. Currently, every software product tends to use its own way to store data; commercial software often does not document such formats, and it is a pain to integrate such software into a bigger project. XML can help to improve this situation when several applications share the same syntax for data files. DTDs then serve as neutral authorities against which the format of data files can be checked independently of the applications.
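The database report scenario from the first application above can be sketched as follows: the extraction part emits one element per table row and one child element per column, and knows nothing about typesetting. The sketch uses Python with an in-memory SQLite database; the table customer, its columns, the sample rows, and the wrapper element report are all assumptions invented for this example:

```python
import sqlite3
import xml.etree.ElementTree as ET

def export_table(conn, table):
    """Return a <report> element with one <table> child per database row."""
    # The table name is inserted directly into the SQL text, so it must
    # come from trusted code, never from user input.
    cursor = conn.execute("SELECT * FROM " + table + " ORDER BY rowid")
    columns = [description[0] for description in cursor.description]
    report = ET.Element("report")
    for row in cursor:
        row_elem = ET.SubElement(report, table)
        for column, value in zip(columns, row):
            ET.SubElement(row_elem, column).text = str(value)
    return report

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, city TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [("Ada", "London"), ("Kurt", "Vienna")])

xml_text = ET.tostring(export_table(conn, "customer"), encoding="unicode")
print(xml_text)
```

A separate backend can then translate the resulting document into a printable format without any knowledge of the database.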