   1 ******************************************************************************
   2 Notes on the XML specification
   3 ******************************************************************************
   4
   5
   6 ==============================================================================
   7 This document
   8 ==============================================================================
   9
  10 There are some points in the XML specification which are ambiguous. The
  11 following notes discuss these points, and describe how this parser behaves.
  12
  13 ==============================================================================
  14 Conditional sections and the token ]]>
  15 ==============================================================================
  16
  17 It is unclear what happens if an ignored section contains the token ]]> at
  18 places where it is normally allowed, i.e. within string literals and comments,
  19 e.g.
  20
  21 <![IGNORE[ <!-- ]]> --> ]]>
  22
  23 On the one hand, the production rule of the XML grammar does not treat such
  24 tokens specially. Following the grammar, the very first ]]> already ends the
  25 conditional section
  26
  27 <![IGNORE[ <!-- ]]>
  28
  29 and the remaining tokens are included in the DTD.
  30
  31 On the other hand, we can read: "Like the internal and external DTD subsets, a
  32 conditional section may contain one or more complete declarations, comments,
  33 processing instructions, or nested conditional sections, intermingled with
  34 white space" (XML 1.0 spec, section 3.4). Complete declarations and comments
  35 may contain ]]>, so this is contradictory to the grammar.
  36
  37 The intention of conditional sections is to include or exclude the section
  38 depending on the current replacement text of a parameter entity. Almost always
  39 such sections are used as in
  40
  41 <!ENTITY % want.a.feature.or.not "INCLUDE">   (or "IGNORE")
  42 <![ %want.a.feature.or.not; [ ... ]]>
  43
  44 This means that if it is possible to include a section it must also be legal to
  45 ignore the same section. This is a strong indication that the token ]]> must
  46 not count as section terminator if it occurs in a string literal or comment.
  47
  48 This parser follows the latter reading: ]]> in a literal or comment is ignored.
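
The difference between the two readings can be sketched in OCaml. This is a minimal illustration, not the parser's actual implementation: the names looking_at, find_from and end_of_ignored_section are hypothetical, and the scanner assumes that every comment and literal inside the section is properly terminated.

```ocaml
(* Hypothetical sketch, not the actual scanner of this parser. *)

(* True if [pat] occurs in [s] at position [i]. *)
let looking_at s i pat =
  let n = String.length pat in
  i + n <= String.length s && String.sub s i n = pat

(* Index of the first occurrence of [pat] at or after [i];
   raises Not_found if there is none. *)
let rec find_from s i pat =
  if i > String.length s then raise Not_found
  else if looking_at s i pat then i
  else find_from s (i + 1) pat

(* [s] is the text following the opening "<![IGNORE[". The result is the
   index just past the "]]>" that closes the section. Comments, string
   literals and nested conditional sections are skipped. *)
let end_of_ignored_section s =
  let rec scan i depth =
    if i >= String.length s then raise Not_found
    else if looking_at s i "<!--" then
      (* Comment: continue just past the closing "-->". *)
      scan (find_from s (i + 4) "-->" + 3) depth
    else if looking_at s i "<![" then
      (* Nested conditional section. *)
      scan (i + 3) (depth + 1)
    else if looking_at s i "]]>" then
      if depth = 0 then i + 3 else scan (i + 3) (depth - 1)
    else if s.[i] = '"' || s.[i] = '\'' then
      (* String literal: continue just past the matching quote. *)
      scan (String.index_from s (i + 1) s.[i] + 1) depth
    else scan (i + 1) depth
  in
  scan 0 0
```

For the text " <!-- ]]> --> ]]>" a purely grammar-driven search for the first ]]> stops inside the comment, while end_of_ignored_section consumes the whole text.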
  49
  50 ==============================================================================
  51 Conditional sections and the inclusion of parameter entities
  52 ==============================================================================
  53
  54 It is unclear what happens if an ignored section contains a reference to a
  55 parameter entity. In most cases, this is not problematic because nesting of
  56 parameter entities must respect declaration braces. The replacement text of
  57 parameter entities must either contain a whole number of declarations or only
  58 inner material of one declaration. Almost always it does not matter whether
  59 these references are resolved or not (the section is ignored).
  60
  61 But there is one case which is not explicitly specified: Is it allowed that the
  62 replacement text of an entity contains the end marker ]]> of an ignored
  63 conditional section? Example:
  64
  65 <!ENTITY % end "]]>">
  66 <![ IGNORE [ %end;
  67
  68 The XML spec contains no statement that the ]]> must be contained in the same
  69 entity as the corresponding <![ (as it does for the tokens <! and > of
  70 declarations). So it is possible to conclude that ]]> may be in another entity.
  71
  72 Of course, there are many arguments against allowing such constructs: The
  73 resulting code is incomprehensible, and parsing takes longer (especially if the
  74 entities are external). I think the best argument against this kind of XML is
  75 that the XML spec is not detailed enough, as it contains no rules about where
  76 entity references should be recognized and where not. For example:
  77
  78 <!ENTITY % y "]]>">
  79 <!ENTITY % x "<!ENTITY z '<![CDATA[some text%y;'>">
  80 <![ IGNORE [ %x; ]]>
  81
  82 Which token ]]> counts? From a logical point of view, the ]]> in the third line
  83 ends the conditional section. As already pointed out, the XML spec permits the
  84 interpretation that ]]> is recognized even in string literals, and this may
  85 also be true if it is "imported" from a separate entity; in that case the first
  86 ]]> denotes the end of the section.
  87
  88 As a practical solution, this parser does not expand parameter entities in
  89 ignored sections. Furthermore, the parser does not allow the ending ]]> of an
  90 ignored or included section to be contained in a different entity than the
  91 starting <![ token.
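
The second restriction can be illustrated with a small OCaml sketch. The token type and the function sections_well_nested are hypothetical and only model the rule; they assume that every section marker remembers the name of the entity it was read from.

```ocaml
(* Hypothetical token model, not the parser's own data structure. *)
type token =
  | Section_start of string   (* "<![" read from the named entity *)
  | Section_end of string     (* "]]>" read from the named entity *)
  | Other                     (* any other token *)

(* True if every "]]>" comes from the same entity as its matching "<![". *)
let sections_well_nested tokens =
  let rec go stack = function
    | [] -> stack = []
    | Section_start e :: rest -> go (e :: stack) rest
    | Section_end e :: rest ->
        (match stack with
         | top :: below when top = e -> go below rest
         | _ -> false)
    | Other :: rest -> go stack rest
  in
  go [] tokens
```

Under this model, a section opened in the main document and closed by a ]]> read from the entity "end" is rejected, because the two markers carry different entity names.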
  92
  93 ==============================================================================
  94 Standalone documents and attribute normalization
  95 ==============================================================================
  96
  97 If a document is declared as stand-alone, a restriction on the effect of
  98 attribute normalization applies to attributes declared in external
  99 entities. Normally, the parser knows the type of the attribute from the ATTLIST
 100 declaration, and it can normalize attribute values depending on their types.
 101 For example, an NMTOKEN attribute can be written with leading or trailing
 102 spaces, but the parser always returns the nmtoken without such added spaces; in
 103 contrast to this, a CDATA attribute is not normalized in this way. For a
 104 stand-alone document, the type information is not available if the ATTLIST
 105 declaration is located in an external entity. Because of this, the XML spec
 106 demands that attribute values be written in their normal form in this
 107 case, i.e. without additional spaces.
 108
 109 This parser interprets this restriction as follows. Obviously, the substitution
 110 of character and entity references is not considered as a "change of the value"
 111 as a result of the normalization, because these operations will be performed
 112 identically if the ATTLIST declaration is not available. The same applies to
 113 the substitution of TABs, CRs, and LFs by space characters. Only the removal of
 114 spaces depending on the type of the attribute changes the value if the ATTLIST
 115 is not available.
 116
 117 In detail, this means: CDATA attributes never violate the stand-alone status.
 118 ID, IDREF, NMTOKEN, ENTITY, NOTATION and enumerated attributes must not be
 119 written with leading and/or trailing spaces. IDREFS, ENTITIES, and NMTOKENS
 120 attributes must not be written with extra spaces at the beginning or at the end
 121 of the value, or between the tokens of the list.
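
The rules above can be written down as a small OCaml check. This is a sketch under assumptions, not the parser's code: the type att_type and the function violates_standalone are hypothetical names, and the value is assumed to have passed the type-independent steps already (references substituted, TABs, CRs and LFs replaced by spaces).

```ocaml
(* Hypothetical names; a sketch of the rule, not the parser's API. *)
type att_type =
  | Cdata
  | Id | Idref | Entity | Nmtoken | Notation | Enumeration   (* single value *)
  | Idrefs | Entities | Nmtokens                             (* list types *)

(* Split a value at spaces and drop the empty pieces. *)
let tokens v =
  List.filter (fun t -> t <> "") (String.split_on_char ' ' v)

(* True if the type-dependent removal of spaces would change [v], i.e. if
   a parser without access to the ATTLIST declaration would see a
   different value. *)
let violates_standalone ty v =
  match ty with
  | Cdata -> false
  | Id | Idref | Entity | Nmtoken | Notation | Enumeration ->
      v <> String.trim v
  | Idrefs | Entities | Nmtokens ->
      v <> String.concat " " (tokens v)
```

For example, violates_standalone Nmtokens "a  b" holds because of the doubled space between the tokens, while the same string is harmless for a CDATA attribute.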
 122
 123 The whole check is dubious, because the attribute type also expresses a
 124 semantic constraint, not only a syntactic one. At least this parser
 125 distinguishes strictly between single-value and list types, and returns the
 126 attribute values differently; the first are represented as Value s (where s is
 127 a string), the latter are represented as Valuelist [s1; s2; ...; sN]. The
 128 internal representation of the value is dependent on the attribute type, too,
 129 such that even normalized values are processed differently depending on whether
 130 the attribute has list type or not. For this parser, it still makes a
 131 difference whether a value is normalized and processed as if it were CDATA, or
 132 whether the value is processed according to its declared type.
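
The following OCaml sketch shows why the declared type still matters for an already normalized value. The constructors Value and Valuelist are taken from the text above; the function represent and its arguments are hypothetical and only serve the illustration.

```ocaml
(* Value and Valuelist as named in the text; [represent] is hypothetical. *)
type att_value =
  | Value of string
  | Valuelist of string list

(* Turn a normalized attribute value into its representation. The result
   depends on whether the declared type is a list type. *)
let represent ~is_list normalized =
  if is_list then
    Valuelist
      (List.filter (fun t -> t <> "") (String.split_on_char ' ' normalized))
  else
    Value normalized
```

The same normalized string "a b" yields Value "a b" for a single-value or CDATA attribute, but Valuelist ["a"; "b"] for a list type such as NMTOKENS.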
 133
 134 The stand-alone check is included so that one can state whether other,
 135 well-formedness-only parsers can process the document. Of course, these parsers
 136 always process attributes as CDATA, and the stand-alone check guarantees that
 137 these parsers will always see the normalized values.
 138
 139 ==============================================================================
 140 Standalone documents and the restrictions on entity
 141 references
 142 ==============================================================================
 143
 144 Stand-alone documents must not refer to entities which are declared in an
 145 external entity. This parser applies this rule only to general and NDATA
 146 entities when they occur in the document body (i.e. not in the DTD), and to
 147 general and NDATA entities occurring in default attribute values declared in the
 148 internal subset of the DTD.
 149
 150 Parameter entities are outside the scope of the stand-alone property. If the
 151 internal subset contains a reference to a parameter entity which was declared
 152 in an external entity, the referenced entity is unavailable in the same way as
 153 the external entity that contains its declaration. Because of this
 154 "equivalence", parameter entity references are not checked for violations of
 155 the stand-alone declaration. It simply does not matter. Illustration:
 156
 157 Main document:
 158
 159 <!ENTITY % ext SYSTEM "ext">
 160 %ext;
 161 %ent;
 162
 163 "ext" contains:
 164
 165 <!ENTITY % ent "<!ELEMENT el (other*)>">
 166
 167
 168
 169 Here, the reference %ent; would be illegal if the standalone declaration were
 170 interpreted strictly. This parser handles the references %ent; and %ext;
 171 equivalently, which means that %ent; is allowed, but the element type "el" is
 172 treated as externally declared.
 173
 174 General entities can occur within the DTD, but they can only be contained in
 175 the default value of attributes, or in the definition of other general
 176 entities. The latter can be ignored, because the check will be performed when
 177 the entities are expanded. However, general entities occurring in default
 178 attribute values are actually checked at the moment when the default is used in
 179 an element instance.
 180
 181 General entities occurring in the document body are always checked.
 182
 183 NDATA entities can occur in ENTITY attribute values; either in the element
 184 instance or in the default declaration. Both cases are checked.
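
The rules of this section can be summarized as a decision function. The OCaml sketch below is an illustration only; the types entity_kind and context and the function violates_standalone_ref are hypothetical and do not correspond to the parser's interface.

```ocaml
(* Hypothetical names; a summary of the rules, not the parser's API. *)
type entity_kind = General | Ndata | Parameter

type context =
  | Document_body            (* reference in an element instance *)
  | Internal_default_value   (* default value declared in the internal subset,
                                checked when the default is actually used *)
  | Other_dtd_context        (* e.g. the definition of another entity *)

(* True if a reference to an entity of the given kind, occurring in the
   given context, violates the stand-alone declaration. *)
let violates_standalone_ref kind ~declared_externally context =
  declared_externally
  && (match kind, context with
      | Parameter, _ -> false
      | (General | Ndata), (Document_body | Internal_default_value) -> true
      | (General | Ndata), Other_dtd_context -> false)
```

In terms of the illustration above, the reference %ent; corresponds to violates_standalone_ref Parameter ~declared_externally:true Other_dtd_context, which is false.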
 185