4 >Highlights of XML</TITLE
7 CONTENT="Modular DocBook HTML Stylesheet Version 1.46"><LINK
9 TITLE="The PXP user's guide"
10 HREF="index.html"><LINK
18 TITLE="A complete example: The readme DTD"
19 HREF="x468.html"><LINK
22 HREF="markup.css"></HEAD
41 >The PXP user's guide</TH
56 >Chapter 1. What is XML?</TD
76 >1.2. Highlights of XML</A
79 >This section explains many, but not all, of the features of XML, and covers
80 some of them only briefly. For a complete description, see the <A
81 HREF="http://www.w3.org/TR/1998/REC-xml-19980210.html"
92 >1.2.1. The DTD and the instance</A
95 >The DTD contains various declarations; in general you can only use a feature if
96 you have previously declared it. The document instance file may contain the
97 full DTD, but it is also possible to split the DTD into an internal and an
98 external subset. A document must begin as follows if the full DTD is included:
101 CLASS="PROGRAMLISTING"
102 ><?xml version="1.0" encoding="<TT
123 These declarations are called the <I
127 that the usage of entities and conditional sections is restricted within the
130 >If the declarations are located in a different file, you can refer to this file
134 CLASS="PROGRAMLISTING"
135 ><?xml version="1.0" encoding="<TT
154 The declarations in the file are called the <I
158 >. The file name is called the <I
163 It is also possible to refer to the file by a so-called
166 >public identifier</I
167 >, but most XML applications won't use
170 >You can also specify both internal and external subsets. In this case, the
171 declarations of both subsets are merged, and if there are conflicts, a
172 declaration in the internal subset overrides the external declaration with
173 the same name. This looks as follows:
176 CLASS="PROGRAMLISTING"
177 ><?xml version="1.0" encoding="<TT
203 >The XML declaration (the string beginning with <TT
210 >) should specify the encoding of the
211 file. Common values are UTF-8 and the ISO-8859 series of character sets. Note
212 that every file parsed by the XML processor can begin with an XML declaration
213 and that every file may have its own encoding.</P
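><P
>As a minimal sketch of a file with its own encoding, assume a hypothetical
external file sample.dtd that is stored in ISO-8859-1 while the main document
uses UTF-8. The external file can announce its encoding in its first line:
<PRE
CLASS="PROGRAMLISTING"
><?xml version="1.0" encoding="ISO-8859-1"?>
<!ELEMENT sample (#PCDATA)></PRE
>The file name and the element name sample are illustrative assumptions only.</P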
215 >The name of the root element must be mentioned directly after the
219 > string. This means that a full document instance
223 CLASS="PROGRAMLISTING"
224 ><?xml version="1.0" encoding="<TT
275 >1.2.2. Reserved characters</A
278 >Some characters are reserved to indicate markup and therefore cannot
279 be used directly as character data. These characters are <, >, and
280 &. Furthermore, single and double quotes are sometimes reserved. If you
281 want to include such a character literally, write it as follows:
288 STYLE="list-style-type: disc"
296 STYLE="list-style-type: disc"
304 STYLE="list-style-type: disc"
309 > instead of &</P
312 STYLE="list-style-type: disc"
320 STYLE="list-style-type: disc"
330 All other characters are free in the document instance. It is possible to
331 include any character by its Unicode code point:
334 CLASS="PROGRAMLISTING"
348 > is the decimal number of the
349 character. Alternatively, you can specify the character by its hexadecimal
353 CLASS="PROGRAMLISTING"
362 In the scope of declarations, the character % is no longer free. To include it
363 as a literal character, you must use the notations <TT
372 >Note that besides &lt;, &gt;, &amp;,
373 &apos;, and &quot; there are no predefined character entities. This is
374 different from HTML, which defines a list of characters that can be referenced
375 by name (e.g. &auml; for ä); however, if you prefer named characters, you
376 can declare such entities yourself (see below).</P
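><P
>As a minimal sketch of such a declaration, assume you want the HTML-style name
auml for the character ä (Unicode code point 228). A general entity with that
name can be declared in the DTD:
<PRE
CLASS="PROGRAMLISTING"
><!ENTITY auml "&#228;"></PRE
>The document instance can then refer to the character as &auml;. The entity
name is a free choice; it only mirrors the HTML name for convenience.</P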
384 >1.2.3. Elements and ELEMENT declarations</A
387 >Elements structure the document instance in a hierarchical way. There is a
388 top-level element, the <I
392 sequence of inner elements and character sections. The inner elements are
393 structured in the same way. Every element has an <I
397 >. The beginning of the element is indicated by a <I
404 CLASS="PROGRAMLISTING"
413 and the element continues until the corresponding <I
420 CLASS="PROGRAMLISTING"
429 In XML, it is not allowed to omit start or end tags, even if the DTD would
430 permit this. Note that there are no special rules for interpreting spaces or
431 newlines near start or end tags; all spaces and newlines count.</P
433 >Every element type must be declared before it can be used. The declaration
434 consists of two parts: the ELEMENT declaration describes the content model,
435 i.e. which inner elements are allowed; the ATTLIST declaration describes the
436 attributes of the element.</P
438 >An element can simply allow everything as content. This is written:
441 CLASS="PROGRAMLISTING"
450 Conversely, an element can be forced to be empty; this is declared by:
453 CLASS="PROGRAMLISTING"
462 Note that there is an abbreviated notation for empty element instances:
473 >There are two more sophisticated forms of declarations: so-called
476 >mixed declarations</I
481 >. An element with mixed content contains character data
482 interspersed with inner elements, and the set of allowed inner elements can be
483 specified. In contrast to this, a regular expression declaration does not allow
484 character data, but the inner elements can be described by the more powerful
485 means of regular expressions.</P
487 >A declaration for mixed content looks as follows:
490 CLASS="PROGRAMLISTING"
513 or if you do not want to allow any inner element, simply
516 CLASS="PROGRAMLISTING"
537 CLASS="PROGRAMLISTING"
538 ><!ELEMENT q (#PCDATA | r | s)*></PRE
541 this is a legal instance:
544 CLASS="PROGRAMLISTING"
545 ><q>This is character data<r></r>with <s></s>inner elements</q></PRE
548 But this is illegal because <TT
551 > has not been enumerated in the
555 CLASS="PROGRAMLISTING"
556 ><q>This is character data<r></r>with <t></t>inner elements</q></PRE
560 >The other form uses a regular expression to describe the possible contents:
563 CLASS="PROGRAMLISTING"
577 The following well-known regexp operators are allowed:
584 STYLE="list-style-type: disc"
597 STYLE="list-style-type: disc"
622 STYLE="list-style-type: disc"
647 STYLE="list-style-type: disc"
660 STYLE="list-style-type: disc"
673 STYLE="list-style-type: disc"
691 > operator indicates a sequence of sub-models, the
695 > operator describes alternative sub-models. The
699 > indicates zero or more repetitions, and
703 > one or more repetitions. Finally, <TT
707 be used for optional sub-models. As atoms the regexp can contain names of
708 elements; note that it is not allowed to include <TT
713 >The exact syntax of the regular expressions is somewhat unusual. It is
714 best explained by a list of constraints:
721 STYLE="list-style-type: disc"
723 >The outermost expression must not be
740 ><!ELEMENT x y></TT
741 >; this must be written as
744 ><!ELEMENT x (y)></TT
748 STYLE="list-style-type: disc"
750 >For the unary operators <TT
785 > must not be again an
794 ><!ELEMENT x y**></TT
795 >; this must be written as
798 ><!ELEMENT x (y*)*></TT
802 STYLE="list-style-type: disc"
807 > and one of the unary operators
818 not be whitespace.</P
826 ><!ELEMENT x (y|z) *></TT
827 >; this must be written as
830 ><!ELEMENT x (y|z)*></TT
834 STYLE="list-style-type: disc"
836 >There is the additional constraint that the
837 right parenthesis must be contained in the same entity as the left parenthesis;
838 see the section about parsed entities below.</P
843 >Note that there is another restriction: regular expressions must be
844 deterministic. This means that the parser must be able to tell by looking at the
845 next token which alternative is actually used, or whether the repetition
846 stops. The reason for this is simply compatibility with SGML (there is no
847 intrinsic reason for this rule; XML can live without this restriction).</P
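><P
>To illustrate the determinism rule with hypothetical element names a, b, and c:
the first content model below is not deterministic, because on seeing an a the
parser cannot tell which of the two alternatives applies without looking
further ahead. The second model accepts the same sequences and is
deterministic. The two declarations are alternatives, not parts of one DTD:
<PRE
CLASS="PROGRAMLISTING"
><!-- not deterministic: both alternatives start with a -->
<!ELEMENT x ((a, b) | (a, c))>

<!-- deterministic rewrite accepting the same sequences -->
<!ELEMENT x (a, (b | c))></PRE
>Factoring out the common prefix, as done here, is the usual way to repair such
a model.</P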
855 >The elements are declared as follows:
858 CLASS="PROGRAMLISTING"
859 ><!ELEMENT q (r?, (s | t)+)>
860 <!ELEMENT r (#PCDATA)>
861 <!ELEMENT s EMPTY>
862 <!ELEMENT t (q | r)></PRE
865 This is a legal instance:
868 CLASS="PROGRAMLISTING"
869 ><q><r>Some characters</r><s/></q></PRE
875 > is an abbreviation for
878 ><s></s></TT
881 It would be illegal to leave <TT
885 least one instance of <TT
892 present. It would be illegal, too, if characters existed outside the
896 > element; the only exception is white space. -- This is
900 CLASS="PROGRAMLISTING"
901 ><q><s/><t><q><s/></q></t></q></PRE
911 >1.2.4. Attribute lists and ATTLIST declarations</A
914 >Elements may have attributes. These are put into the start tag of an element as
918 CLASS="PROGRAMLISTING"
967 it is also possible to use single quotes as in
979 Note that you cannot use double quotes literally within the value of the
980 attribute if double quotes are the delimiters; the same applies to single
981 quotes. In general, you cannot use < and & as characters in attribute
982 values. It is possible to include the substitutes &lt;, &gt;,
983 &amp;, &apos;, and &quot; (and any other reference to a general
984 entity as long as the entity is not defined by an external file) as well as
992 >Before you can use an attribute you must declare it. An ATTLIST declaration
996 CLASS="PROGRAMLISTING"
1016 >attribute-default</I
1033 >attribute-default</I
1039 There are many types, but the most important are:
1046 STYLE="list-style-type: disc"
1051 >: Every string is allowed as attribute value.</P
1054 STYLE="list-style-type: disc"
1059 >: Every nametoken is allowed as attribute
1060 value. Nametokens consist (mainly) of letters, digits, ., :, -, _ in arbitrary
1064 STYLE="list-style-type: disc"
1069 >: A space-separated list of nametokens is allowed as
1075 The most interesting default declarations are:
1082 STYLE="list-style-type: disc"
1087 >: The attribute must be specified.</P
1090 STYLE="list-style-type: disc"
1095 >: The attribute can be specified but also can be
1096 left out. The application can find out whether the attribute was present or
1100 STYLE="list-style-type: disc"
1119 >: This particular value is
1120 used as default if the attribute is omitted in the element.</P
1131 >This is a valid attribute declaration for element type <TT
1137 CLASS="PROGRAMLISTING"
1141 z NMTOKENS "one two three"></PRE
1147 > is a required attribute that cannot be
1155 XML parser tells the application whether <TT
1162 > is missing the default value
1163 "one two three" is returned automatically. </P
1165 >This is a valid example of these attributes:
1168 CLASS="PROGRAMLISTING"
1169 ><r x="He said: &quot;I don't like quotes!&quot;" y='1'></PRE
1179 >1.2.5. Parsed entities</A
1182 >Elements describe the logical structure of the document, while
1186 > determine the physical structure. Entities are
1187 the pieces of text the parser operates on, mostly files and macros. Entities
1191 > in which case the parser reads the text and
1192 interprets it as XML markup, or <I
1196 means that the data of the entity has a foreign format (e.g. a GIF icon).</P
1198 >If the parsed entity is going to be used as part of the DTD, it
1201 >parameter entity</I
1202 >. You can declare a parameter
1203 entity with a fixed text as content by:
1206 CLASS="PROGRAMLISTING"
1220 Within the DTD, you can <I
1223 > this entity, i.e. read
1224 the text of the entity, by:
1227 CLASS="PROGRAMLISTING"
1236 Such entities behave like macros, i.e. when they are referred to, the
1237 macro text is inserted and read instead of the original text.
1246 >For example, you can declare two elements with the same content model by:
1249 CLASS="PROGRAMLISTING"
1250 ><!ENTITY % model "a | b | c">
1251 <!ELEMENT x (%model;)>
1252 <!ELEMENT y (%model;)></PRE
1257 If the contents of the entity are given as a string constant, the entity is
1261 > entity. It is also possible to name a
1262 file to be used as content (an <I
1268 CLASS="PROGRAMLISTING"
1282 There are some restrictions for parameter entities:
1289 STYLE="list-style-type: disc"
1291 >If the internal parameter entity contains the first token of a declaration
1295 >), it must also contain the last token of the
1296 declaration, i.e. the <TT
1299 >. This means that the entity
1300 either contains a whole number of complete declarations, or some text from the
1301 middle of one declaration.</P
1308 CLASS="PROGRAMLISTING"
1309 ><!ENTITY % e "(a | b | c)>">
1310 <!ELEMENT x %e;</PRE
1314 > is contained in the main
1315 entity, and the corresponding <TT
1318 > is contained in the
1325 STYLE="list-style-type: disc"
1327 >If the internal parameter entity contains a left parenthesis, it must also
1328 contain the corresponding right parenthesis.</P
1335 CLASS="PROGRAMLISTING"
1336 ><!ENTITY % e "(a | b | c">
1337 <!ELEMENT x %e;)></PRE
1341 > is contained in the entity
1345 >, and the corresponding <TT
1349 contained in the main entity.</P
1352 STYLE="list-style-type: disc"
1354 >When reading text from an entity, the parser automatically inserts one space
1355 character before the entity text and one space character after the entity
1356 text. However, this rule is not applied within the definition of another
1364 CLASS="PROGRAMLISTING"
1366 <!ENTITY % suffix "gif">
1367 <!ENTITY iconfile 'icon.%suffix;'></PRE
1371 > is referenced within
1372 the definition text for <TT
1375 >, no additional spaces are
1383 CLASS="PROGRAMLISTING"
1384 ><!ENTITY % suffix "test">
1385 <!ELEMENT x.%suffix; ANY></PRE
1390 > is referenced outside the definition
1391 text of another entity, the parser replaces <TT
1415 CLASS="PROGRAMLISTING"
1416 ><!ENTITY % e "(a | b | c)">
1417 <!ELEMENT x %e;*></PRE
1418 > Because there is whitespace between <TT
1425 >, which is illegal.</P
1428 STYLE="list-style-type: disc"
1430 >An external parameter entity must always consist of a whole number of complete
1434 STYLE="list-style-type: disc"
1436 >In the internal subset of the DTD, a reference to a parameter entity (internal
1437 or external) is only allowed at positions where a new declaration can start.</P
1442 >If the parsed entity is going to be used in the document instance, it is called
1446 >. Such entities can be used as
1447 abbreviations for frequent phrases, or to include external files. Internal
1448 general entities are declared as follows:
1451 CLASS="PROGRAMLISTING"
1465 External general entities are declared this way:
1468 CLASS="PROGRAMLISTING"
1482 References to general entities are written as:
1485 CLASS="PROGRAMLISTING"
1494 The main difference between parameter and general entities is that the former
1495 are only recognized in the DTD and that the latter are only recognized in the
1496 document instance. As the DTD is parsed before the document, the parameter
1497 entities are expanded first; for example it is possible to use the content of a
1498 parameter entity as the name of a general entity:
1501 >&#38;%name;;</TT
1508 >General entities must respect the element hierarchy. This means that there must
1509 be an end tag for every start tag in the entity value, and that end tags
1510 without corresponding start tags are not allowed.</P
1518 >If the authors of a document change from time to time, it is worthwhile to set up a
1519 general entity containing the names of the authors. If the authors change, you
1520 need only change the definition of the entity, and do not need to check all
1521 occurrences of the authors' names:
1524 CLASS="PROGRAMLISTING"
1525 ><!ENTITY authors "Gerd Stolpmann"></PRE
1528 In the document text, you can now refer to the author names by writing
1538 The following two entities are illegal because the elements in the definition
1539 do not nest properly:
1542 CLASS="PROGRAMLISTING"
1543 ><!ENTITY lengthy-tag "<section textcolor='white' background='graphic'>">
1544 <!ENTITY nonsense "<a></b>"></PRE
1548 >Earlier in this introduction we explained that there are substitutes for
1549 reserved characters: &lt;, &gt;, &amp;, &apos;, and
1550 &quot;. These are simply predefined general entities; note that they are
1551 the only predefined entities. You may declare these entities again
1552 as long as their meaning is unchanged.</P
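><P
>As a sketch of what such a redeclaration with unchanged meaning looks like,
the XML 1.0 recommendation gives the following form. Note that the < and &
characters are escaped twice (via &#38;), so that the replacement text is
still a character reference and not markup:
<PRE
CLASS="PROGRAMLISTING"
><!ENTITY lt   "&#38;#60;">
<!ENTITY gt   "&#62;">
<!ENTITY amp  "&#38;#38;">
<!ENTITY apos "&#39;">
<!ENTITY quot "&#34;"></PRE
>Declaring these entities explicitly is mainly useful for interoperability; the
parser recognizes them whether they are declared or not.</P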
1560 >1.2.6. Notations and unparsed entities</A
1563 >Unparsed entities have a foreign format and thus cannot be read by the XML
1564 parser. Unparsed entities are always external. The format of an unparsed entity
1565 must have been declared; such a format is called a
1569 >. The entity can then be declared by referring to
1570 this notation. As unparsed entities do not contain XML text, it is not possible
1571 to include them directly into the document; you can only declare attributes
1572 such that names of unparsed entities are acceptable values.</P
1574 >As you can see, unparsed entities are too complicated to serve much practical
1575 purpose. It is almost always better to simply pass the name of the data file as
1576 a normal attribute value, and let the application recognize and process the
1594 HREF="x107.html#AEN445"
1602 >This construct is only
1603 allowed within the definition of another entity; otherwise extra spaces would
1604 be added (as explained above). Such indirection is not recommended.</P
1608 CLASS="PROGRAMLISTING"
1609 ><!ENTITY % variant "a"> <!-- or "b" -->
1610 <!ENTITY text-a "This is text A.">
1611 <!ENTITY text-b "This is text B.">
1612 <!ENTITY text "&#38;text-%variant;;"></PRE
1614 You can now write <TT
1617 > in the document instance, and
1618 depending on the value of <TT
1685 >A complete example: The <I