****************************************************************************** Notes on the XML specification ****************************************************************************** ============================================================================== This document ============================================================================== There are some points in the XML specification which are ambiguous. The following notes discuss these points, and describe how this parser behaves. ============================================================================== Conditional sections and the token ]]> ============================================================================== It is unclear what happens if an ignored section contains the token ]]> at places where it is normally allowed, i.e. within string literals and comments, e.g. --> ]]> On the one hand, the production rule of the XML grammar does not treat such tokens specially. Following the grammar, already the first ]]> ends the conditional section and the other tokens are included into the DTD. On the other hand, we can read: "Like the internal and external DTD subsets, a conditional section may contain one or more complete declarations, comments, processing instructions, or nested conditional sections, intermingled with white space" (XML 1.0 spec, section 3.4). Complete declarations and comments may contain ]]>, so this is contradictory to the grammar. The intention of conditional sections is to include or exclude the section depending on the current replacement text of a parameter entity. Almost always such sections are used as in (or "IGNORE") This means that if it is possible to include a section it must also be legal to ignore the same section. This is a strong indication that the token ]]> must not count as section terminator if it occurs in a string literal or comment. This parser implements the latter. ============================================================================== Conditional sections and the inclusion of parameter entities ============================================================================== It is unclear what happens if an ignored section contains a reference to a parameter entity. In most cases, this is not problematic because nesting of parameter entities must respect declaration braces. The replacement text of parameter entities must either contain a whole number of declarations or only inner material of one declaration. Almost always it does not matter whether these references are resolved or not (the section is ignored). But there is one case which is not explicitly specified: Is it allowed that the replacement text of an entity contains the end marker ]]> of an ignored conditional section? Example: "> must be contained in the same entity as the corresponding of declarations). So it is possible to conclude that ]]> may be in another entity. Of course, there are many arguments not to allow such constructs: The resulting code is incomprehensive, and parsing takes longer (especially if the entities are external). I think the best argument against this kind of XML is that the XML spec is not detailed enough, as it contains no rules where entity references should be recognized and where not. For example: "> "> Which token ]]> counts? From a logical point of view, the ]]> in the third line ends the conditional section. As already pointed out, the XML spec permits the interpretation that ]]> is recognized even in string literals, and this may be also true if it is "imported" from a separate entity; and so the first ]]> denotes the end of the section. As a practical solution, this parser does not expand parameter entities in ignored sections. Furthermore, it is also not allowed that the ending ]]> of ignored or included sections is contained in a different entity than the starting %ext; %ent; "ext" contains: "> Here, the reference %ent; would be illegal if the standalone declaration is strictly interpreted. This parser handles the references %ent; and %ext; equivalently which means that %ent; is allowed, but the element type "el" is treated as externally declared. General entities can occur within the DTD, but they can only be contained in the default value of attributes, or in the definition of other general entities. The latter can be ignored, because the check will be repeated when the entities are expanded. Though, general entities occuring in default attribute values are actually checked at the moment when the default is used in an element instance. General entities occuring in the document body are always checked. NDATA entities can occur in ENTITY attribute values; either in the element instance or in the default declaration. Both cases are checked.