The PXP user's guide

Chapter 1. What is XML?

Table of Contents
1.1. Introduction
1.2. Highlights of XML
1.3. A complete example: The readme DTD

1.1. Introduction

XML (short for Extensible Markup Language) generalizes the idea that text documents are typically structured in sections, sub-sections, paragraphs, and so on. The format of the document is not fixed (as it is, for example, in HTML), but can be declared by a so-called DTD (document type definition). The DTD describes only the rules for how the document can be structured, not how the document is to be processed. For example, if you want to publish a book that uses XML markup, you will need a processor that converts the XML file into a printable format such as PostScript. On the one hand, the structure of XML documents is configurable; on the other hand, there is no longer a canonical interpretation of the elements of the document. For example, one DTD might require that paragraphs are delimited by para tags, while another DTD expects p tags for the same purpose. As a result, every DTD requires a new processor.

Although XML can be used to express structured text documents, it is not limited to this kind of application. For example, XML can also be used to exchange structured data over a network, or simply to store structured data in files. Note that XML documents cannot contain arbitrary binary data because some characters are forbidden; for some applications you need to encode binary data as text (e.g. with the Base64 encoding).
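
The following is a minimal sketch of the Base64 approach mentioned above. It uses Python's standard library rather than PXP, and the element name blob is a made-up example, not part of any DTD in this guide.

```python
import base64
import xml.etree.ElementTree as ET

# Arbitrary binary data, including bytes that are not legal XML characters.
raw = bytes([0, 1, 2, 255, 254])

# Encode the bytes as Base64 text and store that text in an element.
# The element name "blob" is hypothetical.
elem = ET.Element("blob")
elem.text = base64.b64encode(raw).decode("ascii")
xml_text = ET.tostring(elem, encoding="unicode")

# Reading the data back: parse the XML, then decode the Base64 text.
decoded = base64.b64decode(ET.fromstring(xml_text).text)
```

Because the Base64 alphabet consists only of characters that are allowed in XML text, the encoded data passes through any XML parser unchanged.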

1.1.1. The "hello world" example

The following example shows a very simple DTD, and a corresponding document instance. The document is structured such that it consists of sections, and that sections consist of paragraphs, and that paragraphs contain plain text:

<!ELEMENT document (section)+>
<!ELEMENT section (paragraph)+>
<!ELEMENT paragraph (#PCDATA)>

The following document is an instance of this DTD:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE document SYSTEM "simple.dtd">
<document>
  <section>
    <paragraph>This is a paragraph of the first section.</paragraph>
    <paragraph>This is another paragraph of the first section.</paragraph>
  </section>
  <section>
    <paragraph>This is the only paragraph of the second section.</paragraph>
  </section>
</document>

As in HTML (and, of course, in its grandfather SGML), the "pieces" of the document are delimited by element tags, i.e. such a piece begins with <name-of-the-type-of-the-piece> and ends with </name-of-the-type-of-the-piece>, and the pieces are called elements. Unlike in HTML and SGML, neither start tags nor end tags (i.e. the delimiters written in angle brackets) can ever be left out. For example, HTML calls paragraphs simply p, and because paragraphs never contain paragraphs, a sequence of several paragraphs can be written as:

<p>First paragraph
<p>Second paragraph

This is not possible in XML; continuing our example above, we must always write:

<paragraph>First paragraph</paragraph>
<paragraph>Second paragraph</paragraph>

The rationale behind this is (1) to simplify the development of XML parsers (there is no need to convert the DTD into a deterministic finite automaton, which is required to detect omitted tags), and (2) to make it possible to parse the document regardless of whether the DTD is known.
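
The second point can be illustrated with a short sketch. It uses the non-validating parser from Python's standard library as a stand-in (not PXP); the helper name is_well_formed is our own. No DTD is supplied, and yet the parser can decide whether the tags are balanced:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# HTML-style markup with omitted end tags, wrapped in a root element.
omitted = ("<section><paragraph>First paragraph"
           "<paragraph>Second paragraph</section>")

# The same content with every end tag written out.
explicit = ("<section><paragraph>First paragraph</paragraph>"
            "<paragraph>Second paragraph</paragraph></section>")

print(is_well_formed(omitted))   # False
print(is_well_formed(explicit))  # True
```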

The first line of our sample document,

<?xml version="1.0" encoding="ISO-8859-1"?>

is the so-called XML declaration. It states that the document follows the conventions of XML version 1.0, and that the document is encoded using characters from the ISO-8859-1 character set (often known as "Latin 1", mostly used in Western Europe). Although the XML declaration is not mandatory, it is good style to include it; everybody sees at first glance that the document uses XML markup and not the similar-looking HTML or SGML markup. If you omit the XML declaration, the parser will assume that the document is encoded as UTF-8 or UTF-16 (there is a rule that makes it possible to distinguish between UTF-8 and UTF-16 automatically); these are encodings of Unicode's universal character set. (Note that PXP, unlike its predecessor "Markup", fully supports Unicode.)
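
The effect of the encoding declaration can be seen in the following sketch, which again uses the parser from Python's standard library as a stand-in for PXP. The sample text and the variable names are our own.

```python
import xml.etree.ElementTree as ET

# The text "café" encoded as ISO-8859-1: the "é" is the single byte 0xE9.
body = "<paragraph>caf\u00e9</paragraph>".encode("iso-8859-1")

# With the XML declaration, the parser knows how to decode byte 0xE9.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n' + body
text_with_declaration = ET.fromstring(declared).text

# Without the declaration, the parser assumes UTF-8. The byte 0xE9
# followed by "<" is not a valid UTF-8 sequence, so parsing fails.
try:
    ET.fromstring(body)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False
```

A document that contains only ASCII characters would parse either way, which is one reason why a missing declaration often goes unnoticed until the first accented character appears.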

The second line,

<!DOCTYPE document SYSTEM "simple.dtd">

names the DTD that is going to be used for the rest of the document. In general, the DTD may consist of two parts, the so-called external and internal subsets. "External" means that the DTD exists as a second file; "internal" means that the DTD is included in the same file. In this example, there is only an external subset, and the system identifier "simple.dtd" specifies where the DTD file can be found. System identifiers are interpreted as URLs; for instance, this would be legal:

<!DOCTYPE document SYSTEM "http://host/location/simple.dtd">

Please note that PXP cannot interpret HTTP identifiers by default, but it is possible to change the interpretation of system identifiers.
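
For comparison, here is a sketch of the same kind of document with the DTD written as an internal subset, so that no second file is needed. The document is parsed with Python's standard library parser, which reads the declarations but does not validate against them; it merely shows that the file is self-contained.

```python
import xml.etree.ElementTree as ET

# The "hello world" DTD moved into the internal subset: the declarations
# sit between "[" and "]" inside the DOCTYPE.
doc = """<?xml version="1.0"?>
<!DOCTYPE document [
  <!ELEMENT document (section)+>
  <!ELEMENT section (paragraph)+>
  <!ELEMENT paragraph (#PCDATA)>
]>
<document>
  <section>
    <paragraph>This is a paragraph.</paragraph>
  </section>
</document>
"""

root = ET.fromstring(doc)
paragraphs = [p.text for p in root.iter("paragraph")]
```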

The word immediately following DOCTYPE determines which of the declared element types (here "document", "section", and "paragraph") is used for the outermost element, the root element. In this example it is document because the outermost element is delimited by <document> and </document>.

The DTD consists of three declarations for element types: document, section, and paragraph. Such a declaration has two parts:

<!ELEMENT name content-model>

The content model is a regular expression which describes the possible inner structure of the element. Here, document contains one or more sections, and a section contains one or more paragraphs. Note that these two element types are not allowed to contain arbitrary text. Only the paragraph element type is declared such that parsed character data (indicated by the symbol #PCDATA) is permitted.
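
To make the "regular expression" view concrete, the following sketch translates the three content models of the example into Python regular expressions and matches them against the sequence of child element names. This is a deliberately simplified illustration, not the way PXP validates: it ignores attributes, mixed content, EMPTY and ANY, and the names MODELS, model_to_regex and is_valid are our own.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the "hello world" DTD, keyed by element type.
MODELS = {
    "document": "(section)+",
    "section": "(paragraph)+",
    "paragraph": "(#PCDATA)",
}

def model_to_regex(model):
    """Translate an element-only content model into a regex that matches
    a sequence of child names, each name followed by one space."""
    model = re.sub(r"\s+", "", model)
    # Replace every element name by a group matching "name ".
    model = re.sub(r"[A-Za-z_][\w.-]*",
                   lambda m: "(?:%s )" % re.escape(m.group()), model)
    # In a DTD, "," means sequence, which is plain concatenation in a regex.
    return model.replace(",", "")

def is_valid(elem):
    """Check elem and all its descendants against MODELS."""
    model = MODELS.get(elem.tag)
    if model is None:
        return False                       # undeclared element type
    children = list(elem)
    if model == "(#PCDATA)":
        ok = not children                  # text only, no child elements
    else:
        # Element-only content: no text other than whitespace is allowed.
        has_text = bool((elem.text or "").strip()) or any(
            (c.tail or "").strip() for c in children)
        names = "".join(c.tag + " " for c in children)
        ok = (not has_text
              and re.fullmatch(model_to_regex(model), names) is not None)
    return ok and all(is_valid(c) for c in children)

good = ET.fromstring(
    "<document><section><paragraph>Hi</paragraph></section></document>")
bad = ET.fromstring("<document><section>loose text</section></document>")
```

Here is_valid(good) holds, whereas is_valid(bad) does not, because a section must contain at least one paragraph and no text of its own.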

See below for a detailed discussion of content models.

1.1.2. XML parsers and processors

XML documents are human-readable, but this is not the main purpose of the language. XML has been designed such that documents can be read by a program called an XML parser. The parser checks that the document is correctly formatted, and it represents the document as objects of the programming language. There are two aspects to checking the document: First, the document must follow some basic syntactic rules, such as that tags are written in angle brackets, that for every start tag there must be a corresponding end tag, and so on. A document respecting these rules is well-formed. Second, the document must match the DTD, in which case the document is valid. Many parsers check only well-formedness and ignore the DTD; PXP is designed such that it can also validate the document.

A parser alone does not make a useful application; it only reads XML documents. The whole application working with XML-formatted data is called an XML processor. Often XML processors convert documents into another format, such as HTML or PostScript. Sometimes processors extract data from the documents and output the processed data in XML format again. The parser can help the application to process the document; for example, it can provide means to access the document in a specific manner. PXP specifically supports an object-oriented access layer.

1.1.3. Discussion

As we have seen, there are two levels of description: On the one hand, XML can define rules about the format of a document (the DTD); on the other hand, XML expresses structured documents. There are a number of possible applications:

XML can be used to express structured texts. Unlike HTML, there is no canonical interpretation; one would have to write a backend for the DTD that translates the structured texts into a format that existing browsers, printers etc. understand. The advantage of a self-defined document format is that it is possible to design the format in a more problem-oriented way. For example, if the task is to extract reports from a database, one can use a DTD that reflects the structure of the report or the database. A possible approach would be to have an element type for every database table and for every column. Once the DTD has been designed, the report procedure can be split into one part that selects the database rows and outputs them as an XML document according to the DTD, and another part that translates the document into other formats. Of course, the latter part can be solved in a generic way, e.g. there may be configurable backends for all DTDs that follow this approach and have element types for tables and columns.
XML plays the role of a configurable intermediate format. The database extraction function can be written without having to know the details of typesetting; the backends can be written without having to know the details of the database.
Of course, there are traditional solutions. One can define an ad hoc intermediate text file format. The disadvantage is that there are no names for the pieces of the format, and that such formats usually lack documentation because of this. Another solution would be to have a binary representation, either as a language-dependent or as a language-independent structure (examples of the latter can be found in RPC implementations). The disadvantage is that it is harder to view such representations; one has to write pretty printers for this purpose. It is also more difficult to enter test data, whereas XML is plain text that can be written using an arbitrary editor (Emacs even has a good XML mode, PSGML). All these alternatives suffer from a missing structure checker, i.e. the programs processing these formats usually do not check the input file or input object in detail. XML parsers check the syntax of the input (the so-called well-formedness check), and advanced parsers like PXP even verify that the structure matches the DTD (the so-called validation).
XML can be used as a configurable communication language. A fundamental problem of every communication is that sender and receiver must follow the same conventions about the language. For data exchange, the question is usually which data records and fields are available, how they are syntactically composed, and which values are possible for the various fields. Similar questions arise for text document exchange. XML does not solve these problems completely, but it reduces the number of ambiguities in such conventions: The outlines of the syntax are specified by the DTD (but not necessarily the details), and XML introduces canonical names for the components of documents such that it is simpler to describe the rest of the syntax and the semantics informally.
XML is a data storage format. Currently, every software product tends to use its own way to store data; commercial software often does not describe such formats, and it is a pain to integrate such software into a bigger project. XML can help to improve this situation when several applications share the same syntax for data files. DTDs then act as neutral authorities against which the format of data files can be checked independently of the applications.
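
The database report application described above can be sketched in a few lines. The table name, the column names and the rows are hypothetical and stand in for the result of a real database query; the code uses Python's standard library, not PXP. The first function plays the role of the extraction part, the second that of a generic backend that only relies on the table/column structure.

```python
import xml.etree.ElementTree as ET

def rows_to_xml(table, columns, rows):
    """Build a report with one element per row (named after the table)
    and one child element per column."""
    report = ET.Element("report")
    for row in rows:
        row_elem = ET.SubElement(report, table)
        for column, value in zip(columns, row):
            ET.SubElement(row_elem, column).text = str(value)
    return report

def to_plain_text(report):
    """A generic backend: one line per row, without knowing the table."""
    return "\n".join(
        ", ".join(f"{col.tag}={col.text}" for col in row)
        for row in report)

# Hypothetical query result.
columns = ["name", "city"]
rows = [("Alice", "Berlin"), ("Bob", "Paris")]

report = rows_to_xml("customer", columns, rows)
xml_text = ET.tostring(report, encoding="unicode")
plain_text = to_plain_text(report)
```

Neither function needs to know anything about the other; the XML document between them is the only interface.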