<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE readme SYSTEM "readme.dtd" [

<!ENTITY % common SYSTEM "common.xml">
%common;

<!-- Special HTML config: -->
<!ENTITY % readme:html:up '<a href="../..">up</a>'>

<!ENTITY % config SYSTEM "config.xml">
%config;

]>

<readme title="Notes on the XML specification">

  <sect1>
    <title>This document</title>
    <p>Some points in the XML specification are ambiguous.
The following notes discuss these points and describe how this parser
behaves.</p>
  </sect1>

  <sect1>
    <title>Conditional sections and the token ]]&gt;</title>

    <p>It is unclear what happens if an ignored section contains the
token ]]&gt; at places where it is normally allowed, i.e. within string
literals and comments, e.g.

<code>
&lt;![IGNORE[ &lt;!-- ]]&gt; --&gt; ]]&gt;
</code>

On the one hand, the production rule of the XML grammar does not treat such
tokens specially. Following the grammar, the first ]]&gt; already ends
the conditional section

<code>
&lt;![IGNORE[ &lt;!-- ]]&gt;
</code>

and the remaining tokens are included in the DTD.</p>

<p>On the other hand, we can read: "Like the internal and external DTD subsets,
a conditional section may contain one or more complete declarations, comments,
processing instructions, or nested conditional sections, intermingled with
white space" (XML 1.0 spec, section 3.4). Complete declarations and comments
may contain ]]&gt;, so this statement contradicts the grammar.</p>

<p>The intention of conditional sections is to include or exclude the section
depending on the current replacement text of a parameter entity. Such sections
are almost always used as in

<code>
&lt;!ENTITY % want.a.feature.or.not "INCLUDE"&gt;   (or "IGNORE")
&lt;![ %want.a.feature.or.not; [ ... ]]&gt;
</code>

This means that if it is possible to include a section, it must also be
legal to ignore the same section. This is a strong indication that
the token ]]&gt; must not count as the section terminator if it occurs
in a string literal or comment.</p>

<p>This parser implements the latter interpretation: a ]]&gt; within a string
literal or comment does not terminate the conditional section.</p>

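    <p>As a further illustration (a sketch constructed for these notes, not
taken from the XML specification), consider an ignored section whose
declarations contain ]]&gt; in an entity literal as well as in a comment:

<code>
&lt;![IGNORE[
&lt;!ENTITY note "text with ]]&gt; inside a literal"&gt;
&lt;!-- comment with ]]&gt; inside --&gt;
]]&gt;
</code>

Under the interpretation described above, only the ]]&gt; in the last line
should terminate the section. Read strictly by the grammar, the section would
already end inside the string literal.</p>
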
  </sect1>

  <sect1>
    <title>Conditional sections and the inclusion of parameter entities</title>

    <p>It is unclear what happens if an ignored section contains a reference
to a parameter entity. In most cases, this is not problematic because
parameter entities must nest properly with declarations: the
replacement text of a parameter entity must contain either a <em>whole</em>
number of declarations or only the inner material of one declaration. Because
the section is ignored anyway, it almost never matters whether these
references are resolved or not.</p>

    <p>But there is one case which is not explicitly specified: May
the replacement text of an entity contain the end marker ]]&gt;
of an ignored conditional section? Example:

<code>
&lt;!ENTITY % end "]]&gt;"&gt;
&lt;![ IGNORE [ %end;
</code>

The XML spec contains no statement that the ]]&gt; must be contained
in the same entity as the corresponding &lt;![ (as it does for the tokens
&lt;! and &gt; of declarations). So it is possible to conclude that ]]&gt;
may be in another entity.</p>

    <p>Of course, there are many arguments against allowing such constructs: The
resulting code is incomprehensible, and parsing takes longer (especially if the
entities are external). I think the best argument against this kind of XML
is that the XML spec is not detailed enough, as it contains no rules stating
where entity references should be recognized and where not. For example:

<code>
&lt;!ENTITY % y "]]&gt;"&gt;
&lt;!ENTITY % x "&lt;!ENTITY z '&lt;![CDATA[some text%y;'&gt;"&gt;
&lt;![ IGNORE [ %x; ]]&gt;
</code>

Which token ]]&gt; counts? From a logical point of view, the ]]&gt; in the
third line ends the conditional section. As already pointed out, however, the
XML spec permits the interpretation that ]]&gt; is recognized even in string
literals, and this may also be true if it is "imported" from a separate
entity; in that case the first ]]&gt; denotes the end of the section.</p>

    <p>As a practical solution, this parser does not expand parameter entities
in ignored sections. Furthermore, the ending ]]&gt; of an ignored or included
section must be contained in the same entity as the
starting &lt;![ token.</p>
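
    <p>A hypothetical example of the second rule, assuming the behavior
described above: here the closing ]]&gt; of an included section would come
from the replacement text of %end;, i.e. from a different entity than the
opening &lt;![, so this parser is expected to reject the DTD:

<code>
&lt;!ENTITY % end "]]&gt;"&gt;
&lt;![ INCLUDE [
&lt;!ELEMENT a EMPTY&gt;
%end;
</code>

Written with a literal ]]&gt; in place of %end;, the same section is
unproblematic.</p>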
  </sect1>


  <sect1>
    <title>Standalone documents and attribute normalization</title>

    <p>
If a document is declared as stand-alone, a restriction on
attribute normalization takes effect for attributes declared in external
entities. Normally, the parser knows the type of the attribute from
the ATTLIST declaration, and it can normalize attribute values depending
on their types. For example, an NMTOKEN attribute can be written with
leading or trailing spaces, but the parser always returns the nmtoken
without such added spaces; in contrast, a CDATA attribute is
not normalized in this way. For a stand-alone document, the type information
is not available if the ATTLIST declaration is located in an external
entity. Because of this, the XML spec demands that attribute values
be written in their normal form in this case, i.e. without additional
spaces.
</p>
    <p>This parser interprets this restriction as follows. Obviously,
the substitution of character and entity references is not considered
a "change of the value" resulting from normalization, because
these operations are performed identically if the ATTLIST declaration
is not available. The same applies to the substitution of TABs, CRs,
and LFs by space characters. Only the type-dependent removal of spaces
yields a value that differs from the one obtained when the ATTLIST is not
available.
</p>
    <p>This means in detail: CDATA attributes never violate the
stand-alone status. ID, IDREF, NMTOKEN, ENTITY, NOTATION and enumerated
attributes must not be written with leading and/or trailing spaces. IDREFS,
ENTITIES, and NMTOKENS attributes must not be written with extra spaces at the
beginning or at the end of the value, or between the tokens of the list.
</p>
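    <p>A small illustration, with a hypothetical element "doc" and a
hypothetical external DTD "doc.dtd". Assume the external DTD contains:

    <code><![CDATA[
<!ELEMENT doc EMPTY>
<!ATTLIST doc tok  NMTOKEN #IMPLIED
              text CDATA   #IMPLIED>
]]></code>

and the stand-alone document is:

    <code><![CDATA[
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE doc SYSTEM "doc.dtd">
<doc tok=" abc " text=" abc "/>
]]></code>

Following the rules above, the value of "tok" should be reported as a
violation, because the NMTOKEN normalization would remove the surrounding
spaces. The value of "text" is acceptable, because CDATA values are not
stripped. Writing tok="abc" makes the document conform.</p>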
    <p>The whole check is dubious, because the attribute type also expresses a
semantic constraint, not only a syntactic one. This parser, at least,
distinguishes strictly between single-value and list types, and returns the
attribute values differently; the former are represented as Value s (where s is
a string), the latter as Valuelist [s1; s2; ...; sN]. The
internal representation of the value depends on the attribute type, too,
so that even normalized values are processed differently depending on
whether the attribute has a list type or not. For this parser, it still makes a
difference whether a value is normalized and processed as if it were CDATA, or
whether the value is processed according to its declared type.
</p>
    <p>The stand-alone check is included in order to be able to state
whether other parsers, which check only well-formedness, can process the
document. Of course, such parsers always process attributes as CDATA, and the
stand-alone check guarantees that they will always see the normalized values.
</p>
  </sect1>

  <sect1>
    <title>Standalone documents and the restrictions on entity
references</title>
    <p>
Stand-alone documents must not refer to entities which are declared in an
external entity. This parser applies this rule only to general and NDATA
entities that occur in the document body (i.e. not in the DTD), and to
general and NDATA entities occurring in default attribute values declared in the
internal subset of the DTD.
</p>
    <p>
Parameter entities are not considered for the stand-alone property. If the
internal subset contains a reference to a parameter entity that was declared
in an external entity, the parameter entity is unavailable in exactly the same
way as the external entity containing its declaration. Because of this
"equivalence", parameter entity references are not checked for violations of
the stand-alone declaration. It simply does not matter. An illustration:
</p>

    <p>
Main document:

    <code><![CDATA[
<!ENTITY % ext SYSTEM "ext">
%ext;
%ent;
]]></code>

"ext" contains:

    <code><![CDATA[
<!ENTITY % ent "<!ELEMENT el (other*)>">
]]></code>
</p>

    <p>Here, the reference %ent; would be illegal if the standalone
declaration were interpreted strictly. This parser handles the references
%ent; and %ext; equivalently, which means that %ent; is allowed, but the
element type "el" is treated as externally declared.
</p>

    <p>
General entities can occur within the DTD, but only in
the default values of attributes or in the definitions of other general
entities. The latter case can be ignored, because the check will be repeated
when the entities are expanded. General entities occurring in default
attribute values, however, are checked at the moment when the default is
used in an element instance.
</p>
    <p>
General entities occurring in the document body are always checked.</p>
    <p>
NDATA entities can occur in ENTITY attribute values; either in the element
instance or in the default declaration. Both cases are checked.
</p>
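    <p>The following sketch summarizes these cases for general entities. The
names "ext", "text", "doc" and "note" are made up for the illustration. The
external entity "ext" contains:

    <code><![CDATA[
<!ENTITY text "declared externally">
]]></code>

and the stand-alone main document is:

    <code><![CDATA[
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE doc [
<!ENTITY % ext SYSTEM "ext">
%ext;
<!ELEMENT doc (#PCDATA)>
<!ATTLIST doc note CDATA "&text;">
]>
<doc>&text;</doc>
]]></code>

According to the rules above, the reference in the element content is
always a violation. The reference in the default value of "note" is checked
as well, but only because the instance omits the attribute and the default
is therefore used; an instance such as &lt;doc note="x"/&gt; should not
trigger that second check.</p>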
  </sect1>

</readme>