The PXP user's guide

1.2. Highlights of XML

This section explains many of the features of XML, but not all of them, and some only briefly. For a complete description, see the XML specification.

1.2.1. The DTD and the instance

The DTD contains various declarations; in general you can only use a feature if you have previously declared it. The document instance file may contain the full DTD, but it is also possible to split the DTD into an internal and an external subset. A document must begin as follows if the full DTD is included:

<?xml version="1.0" encoding="Your encoding"?>
<!DOCTYPE root [
  Declarations
]>

These declarations are called the internal subset. Note that the usage of entities and conditional sections is restricted within the internal subset.

If the declarations are located in a different file, you can refer to this file as follows:

<?xml version="1.0" encoding="Your encoding"?>
<!DOCTYPE root SYSTEM "file name">

The declarations in the file are called the external subset. The file name is called the system identifier. It is also possible to refer to the file by a so-called public identifier, but most XML applications won't use this feature.

You can also specify both internal and external subsets. In this case, the declarations of both subsets are mixed, and if there are conflicts, a declaration in the internal subset overrides the declaration of the same name in the external subset. This looks as follows:

<?xml version="1.0" encoding="Your encoding"?>
<!DOCTYPE root  SYSTEM "file name" [
  Declarations
]>

The XML declaration (the string beginning with <?xml and ending at ?>) should specify the encoding of the file. Common values are UTF-8 and the ISO-8859 series of character sets. Note that every file parsed by the XML processor can begin with an XML declaration and that every file may have its own encoding.

The name of the root element must be mentioned directly after the DOCTYPE string. This means that a full document instance looks like this:

<?xml version="1.0" encoding="Your encoding"?>
<!DOCTYPE root  SYSTEM "file name" [
  Declarations
]>

<root>
  inner contents
</root>
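
As a concrete illustration of this layout, here is a minimal sketch in Python. Python and its standard-library parser are not part of this guide; the parser is non-validating, so it reads the internal subset but does not check the document against it. The encoding ISO-8859-1, the content model, and the sample text are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# A complete document instance: XML declaration, DOCTYPE with an internal
# subset, then the root element. The text is encoded as ISO-8859-1 so that
# the bytes agree with the encoding named in the XML declaration.
DOCUMENT = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<!DOCTYPE root [\n'
    '  <!ELEMENT root (#PCDATA)>\n'
    ']>\n'
    '<root>caf\u00e9</root>\n'
).encode("iso-8859-1")

root = ET.fromstring(DOCUMENT)
root_tag = root.tag      # element type of the root element
root_text = root.text    # character data, decoded to a Python string
```

The parser picks up the encoding from the XML declaration, so the byte 0xE9 is decoded as the character é.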

1.2.2. Reserved characters

Some characters are generally reserved to indicate markup, so they cannot be used directly as character data. These characters are <, >, and &. Furthermore, single and double quotes are sometimes reserved. If you want to include such a character as a literal character, write it as follows:

&lt; instead of <
&gt; instead of >
&amp; instead of &
&apos; instead of '
&quot; instead of "

All other characters are free in the document instance. It is possible to include a character by its position in the Unicode character set:

&#n;

where n is the decimal number of the character. Alternatively, you can specify the character by its hexadecimal number:

&#xn;

In the scope of declarations, the character % is no longer free. To include it as a literal character, you must use the notation &#37; or &#x25;.

Note that besides &lt;, &gt;, &amp;, &apos;, and &quot; there are no predefined character entities. This is different from HTML, which defines a list of characters that can be referenced by name (e.g. &auml; for ä); however, if you prefer named characters, you can declare such entities yourself (see below).
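
The substitutions above can be tried out with a short Python sketch (Python's standard library is used here only for illustration; it is not part of this guide). It parses character data containing predefined entities and numeric character references, and then escapes a string in the opposite direction.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Predefined entities and numeric character references in character data.
# &#228; (decimal) and &#xE4; (hexadecimal) both denote the character ä.
element = ET.fromstring("<p>1 &lt; 2 &amp; 3 &gt; 2: &#228; &#xE4;</p>")
decoded = element.text

# The other direction: escape() replaces &, < and > in character data.
encoded = escape("a < b & c")
```

Here decoded is the string "1 < 2 & 3 > 2: ä ä", and encoded is "a &lt; b &amp; c".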

1.2.3. Elements and ELEMENT declarations

Elements structure the document instance in a hierarchical way. There is a top-level element, the root element, which contains a sequence of inner elements and character sections. The inner elements are structured in the same way. Every element has an element type. The beginning of the element is indicated by a start tag, written

<element-type>

and the element continues until the corresponding end tag is reached:

</element-type>

In XML, it is not allowed to omit start or end tags, even if the DTD would permit this. Note that there are no special rules for interpreting spaces or newlines near start or end tags; all spaces and newlines count.

Every element type must be declared before it can be used. The declaration consists of two parts: the ELEMENT declaration describes the content model, i.e. which inner elements are allowed; the ATTLIST declaration describes the attributes of the element.

An element can simply allow everything as content. This is written:

<!ELEMENT name ANY>

Conversely, an element can be forced to be empty; this is declared by:

<!ELEMENT name EMPTY>

Note that there is an abbreviated notation for empty element instances: <name/>.

There are two more sophisticated forms of declarations: so-called mixed declarations, and regular expressions. An element with mixed content contains character data interspersed with inner elements, and the set of allowed inner elements can be specified. In contrast to this, a regular expression declaration does not allow character data, but the inner elements can be described by the more powerful means of regular expressions.

A declaration for mixed content looks as follows:

<!ELEMENT name (#PCDATA | element_1 | ... | element_n )*>

or if you do not want to allow any inner element, simply

<!ELEMENT name (#PCDATA)>

Example
If element type q is declared as
<!ELEMENT q (#PCDATA | r | s)*>
this is a legal instance:
<q>This is character data<r></r>with <s></s>inner elements</q>
But this is illegal because t has not been enumerated in the declaration:
<q>This is character data<r></r>with <t></t>inner elements</q>
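
The check in this example can be sketched in a few lines of Python. This is only an illustration of the rule, not how a validating parser such as PXP implements it; the function name matches_mixed_model and the set ALLOWED_IN_Q are hypothetical names introduced for the sketch.

```python
import xml.etree.ElementTree as ET

def matches_mixed_model(element, allowed):
    """Check the direct children of element against a mixed content model.

    Character data is always permitted in mixed content, so only the
    element types of the children are compared with the allowed set.
    """
    return all(child.tag in allowed for child in element)

# Taken from <!ELEMENT q (#PCDATA | r | s)*>
ALLOWED_IN_Q = {"r", "s"}

legal = ET.fromstring(
    "<q>This is character data<r></r>with <s></s>inner elements</q>")
illegal = ET.fromstring(
    "<q>This is character data<r></r>with <t></t>inner elements</q>")

legal_ok = matches_mixed_model(legal, ALLOWED_IN_Q)      # True
illegal_ok = matches_mixed_model(illegal, ALLOWED_IN_Q)  # False: t is not allowed
```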

The other form uses a regular expression to describe the possible contents:

<!ELEMENT name regexp>

The following well-known regexp operators are allowed:

element-name
(subexpr_1 , ... , subexpr_n )
(subexpr_1 | ... | subexpr_n )
subexpr*
subexpr+
subexpr?

The , operator indicates a sequence of sub-models, the | operator describes alternative sub-models. The * indicates zero or more repetitions, and + one or more repetitions. Finally, ? can be used for optional sub-models. As atoms the regexp can contain names of elements; note that it is not allowed to include #PCDATA.

The exact syntax of the regular expressions is rather strange. This can be explained best by a list of constraints:

The outermost expression must not be element-name.
Illegal: <!ELEMENT x y>; this must be written as <!ELEMENT x (y)>.
For the unary operators subexpr*, subexpr+, and subexpr?, the subexpr must not itself end in a unary operator.
Illegal: <!ELEMENT x y**>; this must be written as <!ELEMENT x (y*)*>.
Between ) and one of the unary operators *, +, or ?, there must not be whitespace.
Illegal: <!ELEMENT x (y|z) *>; this must be written as <!ELEMENT x (y|z)*>.
There is the additional constraint that the right parenthesis must be contained in the same entity as the left parenthesis; see the section about parsed entities below.

Note that there is another restriction: regular expressions must be deterministic. This means that the parser must be able to see by looking at the next token which alternative is actually used, or whether the repetition stops. The reason for this is simply compatibility with SGML (there is no intrinsic reason for this rule; XML can live without this restriction).

Example
The elements are declared as follows:
<!ELEMENT q (r?, (s | t)+)>
<!ELEMENT r (#PCDATA)>
<!ELEMENT s EMPTY>
<!ELEMENT t (q | r)>
This is a legal instance:
<q><r>Some characters</r><s/></q>
(Note: <s/> is an abbreviation for <s></s>.) It would be illegal to leave <s/> out because at least one instance of s or t must be present. It would be illegal, too, if characters existed outside the r element; the only exception is white space. This is legal, too:
<q><s/><t><q><s/></q></t></q>
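
To see how such a content model constrains the children of q, the model can be rewritten by hand as an ordinary regular expression over the sequence of child element types. The following Python sketch does this for the declaration of q only; the names Q_MODEL and children_match are hypothetical, and the sketch checks neither character data nor the contents of nested elements.

```python
import re
import xml.etree.ElementTree as ET

# <!ELEMENT q (r?, (s | t)+)> rewritten by hand as a Python regular
# expression over the child element types, each followed by a comma.
Q_MODEL = re.compile(r"(r,)?((s|t),)+")

def children_match(element, model):
    """Return True if the sequence of child element types matches model."""
    names = "".join(child.tag + "," for child in element)
    return model.fullmatch(names) is not None

with_s = ET.fromstring("<q><r>Some characters</r><s/></q>")
without_s = ET.fromstring("<q><r>Some characters</r></q>")
nested = ET.fromstring("<q><s/><t><q><s/></q></t></q>")

# Only the direct children of the outermost q are examined here.
results = [children_match(e, Q_MODEL) for e in (with_s, without_s, nested)]
```

The list results is [True, False, True]: the second instance fails because neither s nor t follows r.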

1.2.4. Attribute lists and ATTLIST declarations

Elements may have attributes. These are put into the start tag of an element as follows:

<element-name attribute_1="value_1" ... attribute_n="value_n">

Instead of "value_k" it is also possible to use single quotes as in 'value_k'. Note that you cannot use double quotes literally within the value of the attribute if double quotes are the delimiters; the same applies to single quotes. In general, you cannot use < and & as literal characters in attribute values. It is possible to include the references &lt;, &gt;, &amp;, &apos;, and &quot; (and any other reference to a general entity, as long as the entity is not defined by an external file) as well as &#n;.

Before you can use an attribute you must declare it. An ATTLIST declaration looks as follows:

<!ATTLIST element-name
          attribute-name attribute-type attribute-default
          ...
          attribute-name attribute-type attribute-default
>

There are a lot of types, but the most important are:

CDATA: Every string is allowed as attribute value.
NMTOKEN: Every name token is allowed as attribute value. Name tokens consist (mainly) of letters, digits, ., :, -, _ in arbitrary order.
NMTOKENS: A space-separated list of name tokens is allowed as attribute value.

The most interesting default declarations are:

#REQUIRED: The attribute must be specified.
#IMPLIED: The attribute can be specified but also can be left out. The application can find out whether the attribute was present or not.
"value" or 'value': This particular value is used as default if the attribute is omitted in the element.

Example
This is a valid attribute declaration for element type r:
<!ATTLIST r
          x CDATA    #REQUIRED
          y NMTOKEN  #IMPLIED
          z NMTOKENS "one two three">
This means that x is a required attribute that cannot be left out, while y and z are optional. The XML parser tells the application whether y is present or not, but if z is missing the default value "one two three" is returned automatically.
This is a valid example of these attributes:
<r x="He said: &quot;I don't like quotes!&quot;" y='1'>
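
The effect of the default value can be observed with any parser that reads the internal subset. The following sketch uses Python's standard expat binding, which is an assumption of this example and not the PXP interface; the declaration of r as EMPTY and the empty-element form of the tag are also additions made so that the document is complete.

```python
import xml.parsers.expat

DOCUMENT = """<!DOCTYPE r [
  <!ELEMENT r EMPTY>
  <!ATTLIST r
            x CDATA    #REQUIRED
            y NMTOKEN  #IMPLIED
            z NMTOKENS "one two three">
]>
<r x="He said: &quot;I don't like quotes!&quot;" y='1'/>
"""

seen = {}

def start_element(name, attributes):
    # attributes maps attribute names to values, including values that
    # were not written in the start tag but supplied by an ATTLIST default.
    seen[name] = attributes

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.Parse(DOCUMENT, True)

attributes_of_r = seen["r"]
```

Although z does not occur in the start tag, attributes_of_r contains it with the value "one two three"; the &quot; references in x arrive as literal double quotes.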

1.2.5. Parsed entities

Elements describe the logical structure of the document, while entities determine the physical structure. Entities are the pieces of text the parser operates on, mostly files and macros. Entities may be parsed, in which case the parser reads the text and interprets it as XML markup, or unparsed, which simply means that the data of the entity has a foreign format (e.g. a GIF icon).

If the parsed entity is going to be used as part of the DTD, it is called a parameter entity. You can declare a parameter entity with a fixed text as content by:

<!ENTITY % name "value">

Within the DTD, you can refer to this entity, i.e. read the text of the entity, by:

%name;

Such entities behave like macros, i.e. when they are referred to, the macro text is inserted and read instead of the original text.

Example
For example, you can declare two elements with the same content model by:
<!ENTITY % model "a | b | c">
<!ELEMENT x (%model;)>
<!ELEMENT y (%model;)>

If the contents of the entity are given as a string constant, the entity is called an internal entity. It is also possible to name a file to be used as content (an external entity):

<!ENTITY % name SYSTEM "file name">

There are some restrictions for parameter entities:

If the internal parameter entity contains the first token of a declaration (i.e. <!), it must also contain the last token of the declaration, i.e. the >. This means that the entity either contains a whole number of complete declarations, or some text from the middle of one declaration.
Illegal:
```
<!ENTITY % e "(a | b | c)>">
<!ELEMENT x %e;
```
Because <! is contained in the main entity, and the corresponding > is contained in the entity e.
If the internal parameter entity contains a left parenthesis, it must also contain the corresponding right parenthesis.
Illegal:
```
<!ENTITY % e "(a | b | c">
<!ELEMENT x %e;)>
```
Because ( is contained in the entity e, and the corresponding ) is contained in the main entity.
When reading text from an entity, the parser automatically inserts one space character before the entity text and one space character after the entity text. However, this rule is not applied within the definition of another entity.
Legal:
```
<!ENTITY % suffix "gif">
<!ENTITY iconfile 'icon.%suffix;'>
```
Because %suffix; is referenced within the definition text for iconfile, no additional spaces are added.
Illegal:
```
<!ENTITY % suffix "test">
<!ELEMENT x.%suffix; ANY>
```
Because %suffix; is referenced outside the definition text of another entity, the parser replaces %suffix; by " test " (the entity text with one space before and one after), which cannot be part of an element name.
Illegal:
```
<!ENTITY % e "(a | b | c)">
<!ELEMENT x %e;*>
```
Because the space inserted after the entity text puts whitespace between ) and *, which is illegal.
An external parameter entity must always consist of a whole number of complete declarations.
In the internal subset of the DTD, a reference to a parameter entity (internal or external) is only allowed at positions where a new declaration can start.
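
The space-insertion rule from this list can be made visible with a small macro expander. The following Python sketch is a toy model written for this explanation, not the algorithm of any real parser: the function expand, its parameter in_entity_value, and the simplified name pattern are all assumptions, and none of the other restrictions above are checked.

```python
import re

# Simplified pattern for a parameter entity reference such as %suffix;
PE_REFERENCE = re.compile(r"%([A-Za-z_][\w.-]*);")

def expand(text, entities, in_entity_value=False):
    """Replace %name; references in text by the entity text.

    Outside an entity value, one space is added on each side of the
    replacement text; inside an entity value, nothing is added.
    """
    def replace(match):
        value = entities[match.group(1)]
        return value if in_entity_value else " " + value + " "
    return PE_REFERENCE.sub(replace, text)

# Reference inside the definition text of another entity: no spaces.
in_value = expand("icon.%suffix;", {"suffix": "gif"}, in_entity_value=True)

# Reference in a declaration: the padding splits the intended name.
in_declaration = expand("<!ELEMENT x.%suffix; ANY>", {"suffix": "test"})
```

The first call yields "icon.gif". The second yields "<!ELEMENT x. test  ANY>", which shows why the declaration is rejected: the element name ends after "x." and an unexpected token follows.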

If the parsed entity is going to be used in the document instance, it is called a general entity. Such entities can be used as abbreviations for frequent phrases, or to include external files. Internal general entities are declared as follows:

<!ENTITY name "value">

External general entities are declared this way:

<!ENTITY name SYSTEM "file name">

References to general entities are written as:

&name;

The main difference between parameter and general entities is that the former are only recognized in the DTD and that the latter are only recognized in the document instance. As the DTD is parsed before the document, the parameter entities are expanded first; for example it is possible to use the content of a parameter entity as the name of a general entity: &%name;;[1].

General entities must respect the element hierarchy. This means that there must be an end tag for every start tag in the entity value, and that end tags without corresponding start tags are not allowed.

Example
If the author of a document changes sometimes, it is worthwhile to set up a general entity containing the names of the authors. If the author changes, you only need to change the definition of the entity, and do not need to check all occurrences of authors' names:
<!ENTITY authors "Gerd Stolpmann">
In the document text, you can now refer to the author names by writing &authors;.
Illegal: The following two entities are illegal because the elements in the definition do not nest properly:
<!ENTITY lengthy-tag "<section textcolor='white' background='graphic'>">
<!ENTITY nonsense    "<a></b>">

Earlier in this introduction we explained that there are substitutes for reserved characters: &lt;, &gt;, &amp;, &apos;, and &quot;. These are simply predefined general entities; note that they are the only predefined entities. It is allowed to define these entities again as long as the meaning is unchanged.
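
The authors entity from the example can be tried out with the following Python sketch. The standard-library parser used here is an assumption of this illustration, not the PXP interface, and the root element doc and the sentence around the reference are made up for the purpose.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<!DOCTYPE doc [
  <!ELEMENT doc (#PCDATA)>
  <!ENTITY authors "Gerd Stolpmann">
]>
<doc>Written by &authors;.</doc>
"""

# The parser replaces &authors; by the entity text while reading the
# character data, so the application only sees the expanded string.
text = ET.fromstring(DOCUMENT).text
```

After parsing, text is "Written by Gerd Stolpmann."; changing the entity declaration changes every place that refers to it.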

1.2.6. Notations and unparsed entities

Unparsed entities have a foreign format and thus cannot be read by the XML parser. Unparsed entities are always external. The format of an unparsed entity must have been declared; such a format is called a notation. The entity can then be declared by referring to this notation. As unparsed entities do not contain XML text, it is not possible to include them directly in the document; you can only declare attributes such that names of unparsed entities are acceptable values.

As you can see, unparsed entities are too complicated to be of much practical use. It is almost always better to simply pass the name of the data file as a normal attribute value, and let the application recognize and process the foreign format.
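
For completeness, the declarations described above might look as in the following Python sketch, which again uses the standard expat binding rather than PXP. All names in the DTD (doc, icon, logo, logo.gif, and the notation gif with its identifier image/gif) are made up for this illustration. The parser does not read logo.gif; it only reports the declaration to the application.

```python
import xml.parsers.expat

DOCUMENT = """<!DOCTYPE doc [
  <!ELEMENT doc EMPTY>
  <!NOTATION gif SYSTEM "image/gif">
  <!ENTITY logo SYSTEM "logo.gif" NDATA gif>
  <!ATTLIST doc icon ENTITY #IMPLIED>
]>
<doc icon="logo"/>
"""

unparsed = {}

def unparsed_entity_decl(name, base, system_id, public_id, notation_name):
    # Called once for every entity declared with NDATA.
    unparsed[name] = (system_id, notation_name)

parser = xml.parsers.expat.ParserCreate()
parser.UnparsedEntityDeclHandler = unparsed_entity_decl
parser.Parse(DOCUMENT, True)
```

The application then has to map the attribute value "logo" to the file logo.gif and the notation gif by itself, which is the extra work the preceding paragraph recommends avoiding.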
