The PXP user's guide

Chapter 1. What is XML?

Table of Contents
1.1. Introduction
1.2. Highlights of XML
1.3. A complete example: The readme DTD

1.1. Introduction

XML (short for Extensible Markup Language) generalizes the idea that text documents are typically structured in sections, sub-sections, paragraphs, and so on. The format of the document is not fixed (as it is, for example, in HTML), but can be declared by a so-called DTD (document type definition). The DTD describes only the rules for how the document can be structured, not how the document is to be processed. For example, if you want to publish a book that uses XML markup, you will need a processor that converts the XML file into a printable format such as PostScript. On the one hand, the structure of XML documents is configurable; on the other hand, there is no longer a canonical interpretation of the elements of the document. For example, one XML DTD might require that paragraphs are delimited by para tags, while another DTD expects p tags for the same purpose. As a result, every DTD requires its own processor.

Although XML can be used to express structured text documents, it is not limited to this kind of application. For example, XML can also be used to exchange structured data over a network, or simply to store structured data in files. Note that XML documents cannot contain arbitrary binary data because some characters are forbidden; for some applications you need to encode binary data as text (e.g. using the base 64 encoding).
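The base 64 approach mentioned above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not PXP; the element name "blob" and the sample bytes are assumptions made for the example.

```python
import base64
import xml.etree.ElementTree as ET

# Arbitrary binary payload, including bytes that are not legal XML characters.
payload = bytes([0x00, 0x01, 0xFF, 0x10, 0x41])

# Encode the bytes as base 64 text and store them as ordinary character data.
# The element name "blob" is hypothetical.
blob = ET.Element("blob")
blob.text = base64.b64encode(payload).decode("ascii")
xml_text = ET.tostring(blob, encoding="unicode")

# The receiver parses the XML and decodes the text back into bytes.
decoded = base64.b64decode(ET.fromstring(xml_text).text)
```

After the round trip, decoded holds the same bytes as payload, although the XML text itself contains only characters from the base 64 alphabet.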

1.1.1. The "hello world" example

The following example shows a very simple DTD, and a corresponding document instance. The document is structured such that it consists of sections, and that sections consist of paragraphs, and that paragraphs contain plain text:

<!ELEMENT document (section)+>
<!ELEMENT section (paragraph)+>
<!ELEMENT paragraph (#PCDATA)>

The following document is an instance of this DTD:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE document SYSTEM "simple.dtd">
<document>
  <section>
    <paragraph>This is a paragraph of the first section.</paragraph>
    <paragraph>This is another paragraph of the first section.</paragraph>
  </section>
  <section>
    <paragraph>This is the only paragraph of the second section.</paragraph>
  </section>
</document>

As in HTML (and, of course, in its grandfather SGML), the "pieces" of the document are delimited by tags, i.e. such a piece begins with <name-of-the-type-of-the-piece> and ends with </name-of-the-type-of-the-piece>, and the pieces are called elements. Unlike in HTML and SGML, start tags and end tags (i.e. the delimiters written in angle brackets) can never be left out. For example, HTML calls paragraphs simply p, and because paragraphs never contain paragraphs, a sequence of several paragraphs can be written as:

<p>First paragraph
<p>Second paragraph

This is not possible in XML; continuing our example above, we must always write

<paragraph>First paragraph</paragraph>
<paragraph>Second paragraph</paragraph>

The rationale behind this is (1) to simplify the development of XML parsers (you need not convert the DTD into a deterministic finite automaton, which is required to detect omitted tags), and (2) to make it possible to parse the document regardless of whether the DTD is known.
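The effect of this rule can be tried with any XML parser that works without a DTD. The following sketch uses Python's standard xml.etree.ElementTree module rather than PXP; the wrapping element "body" and the helper name is_well_formed are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# HTML-style markup with omitted end tags, wrapped in a hypothetical root.
html_style = "<body><p>First paragraph\n<p>Second paragraph</body>"

# The same content with every end tag written out.
xml_style = (
    "<body><paragraph>First paragraph</paragraph>\n"
    "<paragraph>Second paragraph</paragraph></body>"
)

html_ok = is_well_formed(html_style)
xml_ok = is_well_formed(xml_style)
```

The parser rejects the first string and accepts the second without knowing any declaration for body, p, or paragraph, which is point (2) of the rationale.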

The first line of our sample document,

<?xml version="1.0" encoding="ISO-8859-1"?>

is the so-called XML declaration. It states that the document follows the conventions of XML version 1.0, and that the document is encoded using characters from the ISO-8859-1 character set (often known as "Latin 1", mostly used in Western Europe). Although the XML declaration is not mandatory, it is good style to include it; everybody sees at first glance that the document uses XML markup and not the similar-looking HTML or SGML markup languages. If you omit the XML declaration, the parser will assume that the document is encoded as UTF-8 or UTF-16 (there is a rule that makes it possible to distinguish between UTF-8 and UTF-16 automatically); these are encodings of Unicode's universal character set. (Note that PXP, unlike its predecessor "Markup", fully supports Unicode.)
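The role of the encoding declaration can be illustrated with a short experiment. The sketch below uses Python's standard xml.etree.ElementTree module, not PXP, and the sample word is an assumption chosen because it contains one non-ASCII character.

```python
import xml.etree.ElementTree as ET

# With a declaration: the byte 0xE9 is "e with acute accent" in ISO-8859-1.
latin1_doc = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<paragraph>caf\xe9</paragraph>"
)

# Without a declaration: UTF-8 is assumed, where the same character
# is the two-byte sequence 0xC3 0xA9.
utf8_doc = b"<paragraph>caf\xc3\xa9</paragraph>"

latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

# Latin 1 bytes without a declaration are not valid UTF-8 and are rejected.
try:
    ET.fromstring(b"<paragraph>caf\xe9</paragraph>")
    undeclared_latin1_ok = True
except ET.ParseError:
    undeclared_latin1_ok = False
```

Both declared variants yield the same character string, while the undeclared Latin 1 input fails to parse.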

The second line,

<!DOCTYPE document SYSTEM "simple.dtd">

names the DTD that is going to be used for the rest of the document. In general, the DTD may consist of two parts, the so-called external and internal subsets. "External" means that the DTD exists as a second file; "internal" means that the DTD is included in the same file. In this example, there is only an external subset, and the system identifier "simple.dtd" specifies where the DTD file can be found. System identifiers are interpreted as URLs; for instance, this would be legal:

<!DOCTYPE document SYSTEM "http://host/location/simple.dtd">

Please note that PXP cannot interpret HTTP identifiers by default, but it is possible to change the interpretation of system identifiers.

The word immediately following DOCTYPE determines which of the declared element types (here "document", "section", and "paragraph") is used for the outermost element, the root element. In this example it is document because the outermost element is delimited by <document> and </document>.
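The relationship between the DOCTYPE name and the root element can be inspected programmatically. The following sketch uses Python's standard xml.dom.minidom module, not PXP; it assumes that the parser records the document type declaration without reading the file "simple.dtd", and it shortens the sample document.

```python
from xml.dom import minidom

source = """<?xml version="1.0"?>
<!DOCTYPE document SYSTEM "simple.dtd">
<document>
  <section>
    <paragraph>This is a paragraph of the first section.</paragraph>
  </section>
</document>
"""

dom = minidom.parseString(source)

# Name and system identifier as written in the DOCTYPE line.
doctype_name = dom.doctype.name
system_id = dom.doctype.systemId

# Name of the outermost element actually found in the document.
root_name = dom.documentElement.tagName
```

Here doctype_name and root_name are both "document". A non-validating parser like this one reports the two names but does not enforce that they agree; checking this is part of validation.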

The DTD consists of three declarations for element types: document, section, and paragraph. Such a declaration has two parts:

<!ELEMENT name content-model>

The content model is a regular expression which describes the possible inner structure of the element. Here, document contains one or more sections, and a section contains one or more paragraphs. Note that these two element types are not allowed to contain arbitrary text. Only the paragraph element type is declared such that parsed character data (indicated by the symbol #PCDATA) is permitted.

See below for a detailed discussion of content models.
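The view of a content model as a regular expression over child element names can be sketched in a few lines. This is an illustration in Python, not the way PXP works internally: the regular expressions are translated by hand from the sample DTD, the helper name children_match is hypothetical, and only the order of child elements is checked (character data between children is ignored).

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated regular expressions for the two element-only content models
# of the sample DTD. Each child name is followed by a space so that names
# cannot run together.
CONTENT_MODELS = {
    "document": re.compile(r"(section )+"),
    "section": re.compile(r"(paragraph )+"),
}

def children_match(element):
    """Check one element's child sequence against its content model."""
    model = CONTENT_MODELS[element.tag]
    child_names = "".join(child.tag + " " for child in element)
    return model.fullmatch(child_names) is not None

good = ET.fromstring(
    "<document><section><paragraph>x</paragraph></section></document>"
)
bad = ET.fromstring("<document><paragraph>x</paragraph></document>")
empty = ET.fromstring("<document></document>")
```

children_match accepts good and its section child, but rejects bad (a paragraph directly inside document) and empty (the "+" requires at least one section).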

1.1.2. XML parsers and processors

XML documents are human-readable, but this is not the main purpose of the language. XML has been designed such that documents can be read by a program called an XML parser. The parser checks that the document is correctly formatted, and it represents the document as objects of the programming language. There are two aspects to checking the document: First, the document must follow some basic syntactic rules, such as that tags are written in angle brackets, that for every start tag there must be a corresponding end tag, and so on. A document respecting these rules is well-formed. Second, the document must match the DTD, in which case the document is valid. Many parsers check only for well-formedness and ignore the DTD; PXP is designed such that it can also validate the document.
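What "representing the document as objects" means can be seen with any parser that builds a tree. The sketch below uses Python's standard xml.etree.ElementTree module as a stand-in; it checks well-formedness only and does not validate, and its object model is not the one PXP provides.

```python
import xml.etree.ElementTree as ET

source = """<document>
  <section>
    <paragraph>This is a paragraph of the first section.</paragraph>
    <paragraph>This is another paragraph of the first section.</paragraph>
  </section>
  <section>
    <paragraph>This is the only paragraph of the second section.</paragraph>
  </section>
</document>"""

# The parser returns the root element as an object.
root = ET.fromstring(source)

# Each element object has a tag, text, and a sequence of child elements.
paragraphs_per_section = [
    len(section.findall("paragraph")) for section in root.findall("section")
]
first_text = root[0][0].text
```

The application works with root and its children instead of with angle brackets; for the sample document, paragraphs_per_section is [2, 1].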

A parser alone does not make a useful application; it only reads XML documents. The whole application working with XML-formatted data is called an XML processor. Often XML processors convert documents into another format, such as HTML or PostScript. Sometimes processors extract data from the documents and output the processed data in XML format again. The parser can help the application process the document; for example, it can provide means to access the document in a specific manner. PXP, in particular, supports an object-oriented access layer.
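A very small processor for the sample DTD might look like the following sketch. It is written in Python with the standard library rather than with PXP, and the mapping to HTML (a numbered h2 heading per section, a p element per paragraph) as well as the function name to_html are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

def to_html(xml_text):
    """Convert a 'document' instance into an HTML fragment."""
    root = ET.fromstring(xml_text)  # the parsing step
    lines = []
    # The processing step: walk the tree and emit the target format.
    for number, section in enumerate(root.findall("section"), start=1):
        lines.append(f"<h2>Section {number}</h2>")
        for paragraph in section.findall("paragraph"):
            lines.append(f"<p>{escape(paragraph.text or '')}</p>")
    return "\n".join(lines)

html_output = to_html(
    "<document><section>"
    "<paragraph>Tom &amp; Jerry</paragraph>"
    "</section></document>"
)
```

The parser does the reading; everything after the first line of to_html is the part that is specific to this DTD and this output format.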

1.1.3. Discussion

As we have seen, there are two levels of description: on the one hand, XML can define rules about the format of a document (the DTD); on the other hand, XML expresses structured documents. There are a number of possible applications:

XML can be used to express structured texts. Unlike HTML, there is no canonical interpretation; one has to write a backend for the DTD that translates the structured texts into a format that existing browsers, printers, etc. understand. The advantage of a self-defined document format is that the format can be designed in a more problem-oriented way. For example, if the task is to extract reports from a database, one can use a DTD that reflects the structure of the report or the database. A possible approach would be to have an element type for every database table and for every column. Once the DTD has been designed, the report procedure can be split into a part that selects the database rows and outputs them as an XML document according to the DTD, and a part that translates the document into other formats. Of course, the latter part can be solved in a generic way, e.g. there may be configurable backends for all DTDs that follow this approach and have element types for tables and columns.
XML plays the role of a configurable intermediate format. The database extraction function can be written without having to know the details of typesetting; the backends can be written without having to know the details of the database.
Of course, there are traditional solutions. One can define an ad hoc intermediate text file format. The disadvantage is that there are no names for the pieces of the format, and that such formats usually lack documentation because of this. Another solution would be a binary representation, either as a language-dependent or a language-independent structure (examples of the latter can be found in RPC implementations). The disadvantage is that such representations are harder to view; one has to write pretty printers for this purpose. It is also more difficult to enter test data; XML is plain text that can be written using an arbitrary editor (Emacs even has a good XML mode, PSGML). All these alternatives suffer from a missing structure checker, i.e. the programs processing these formats usually do not check the input file or input object in detail; XML parsers check the syntax of the input (the so-called well-formedness check), and advanced parsers like PXP even verify that the structure matches the DTD (the so-called validation).
XML can be used as a configurable communication language. A fundamental problem of every communication is that sender and receiver must follow the same conventions about the language. For data exchange, the question is usually which data records and fields are available, how they are syntactically composed, and which values are possible for the various fields. Similar questions arise for text document exchange. XML does not solve these problems completely, but it reduces the number of ambiguities in such conventions: the outlines of the syntax are specified by the DTD (but not necessarily the details), and XML introduces canonical names for the components of documents, so that it is simpler to describe the rest of the syntax and the semantics informally.
XML is a data storage format. Currently, every software product tends to use its own way to store data; commercial software often does not document such formats, and it is a pain to integrate such software into a bigger project. XML can help to improve this situation when several applications share the same syntax of data files. DTDs are then neutral instances that check the format of data files independently of applications.
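The database report approach from the first application above (one element type per table and per column) can be sketched as follows. The example is in Python with the standard library, not PXP; the table name "customer", the columns "id" and "name", the wrapping element "report", and the function name rows_to_xml are all hypothetical, and the rows are given as plain dictionaries instead of being read from a real database.

```python
import xml.etree.ElementTree as ET

def rows_to_xml(table_name, rows):
    """Emit one element per row, with one child element per column."""
    report = ET.Element("report")
    for row in rows:
        row_element = ET.SubElement(report, table_name)
        for column, value in row.items():
            ET.SubElement(row_element, column).text = str(value)
    return ET.tostring(report, encoding="unicode")

# Extraction part: knows the data, but nothing about typesetting.
rows = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
]
report_xml = rows_to_xml("customer", rows)

# Backend part: reads the report without knowing about the database.
names = [
    customer.findtext("name")
    for customer in ET.fromstring(report_xml).findall("customer")
]
```

The XML text in report_xml is the only thing the two parts share, which is the sense in which XML acts as an intermediate format here.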