Initial revision

diff --git a/helm/DEVEL/pxp/pxp/doc/SPEC b/helm/DEVEL/pxp/pxp/doc/SPEC

new file mode 100644

index 0000000..28e6914
--- /dev/null
+++ b/helm/DEVEL/pxp/pxp/doc/SPEC
@@ -0,0 +1,185 @@
+******************************************************************************
+Notes on the XML specification
+******************************************************************************
+
+
+==============================================================================
+This document
+==============================================================================
+
+There are some points in the XML specification which are ambiguous. The 
+following notes discuss these points, and describe how this parser behaves.
+
+==============================================================================
+Conditional sections and the token ]]>
+==============================================================================
+
+It is unclear what happens if an ignored section contains the token ]]> at 
+places where it is normally allowed, i.e. within string literals and comments, 
+for example: 
+
+<![IGNORE[ <!-- ]]> --> ]]>
+
+On the one hand, the production rule of the XML grammar does not treat such 
+tokens specially. Following the grammar, the first ]]> already ends the 
+conditional section 
+
+<![IGNORE[ <!-- ]]>
+
+and the remaining tokens are included in the DTD.
+
+On the other hand, we can read: "Like the internal and external DTD subsets, a 
+conditional section may contain one or more complete declarations, comments, 
+processing instructions, or nested conditional sections, intermingled with 
+white space" (XML 1.0 spec, section 3.4). Complete declarations and comments 
+may contain ]]>, so this statement contradicts the grammar.
+
+The intention of conditional sections is to include or exclude the section 
+depending on the current replacement text of a parameter entity. Almost always 
+such sections are used as in 
+
+<!ENTITY % want.a.feature.or.not "INCLUDE">   (or "IGNORE")
+<![ %want.a.feature.or.not; [ ... ]]>
+
+This means that if it is possible to include a section it must also be legal to 
+ignore the same section. This is a strong indication that the token ]]> must 
+not count as section terminator if it occurs in a string literal or comment.
+
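+This parser implements the latter interpretation: a ]]> inside a string 
+literal or comment does not terminate the conditional section.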
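+
+A small constructed illustration (the entity name "e" is hypothetical, and the 
+fragment is assumed to sit in an external subset): 
+
+<![IGNORE[
+<!ENTITY e "a ]]> b">
+<!-- ]]> -->
+]]>
+
+Under the interpretation chosen here, only the ]]> on the last line should end 
+the section. Read strictly by the grammar, the section would already end 
+inside the string literal on the second line.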
+
+==============================================================================
+Conditional sections and the inclusion of parameter entities
+==============================================================================
+
+It is unclear what happens if an ignored section contains a reference to a 
+parameter entity. In most cases, this is not problematic because nesting of 
+parameter entities must respect declaration braces. The replacement text of 
+parameter entities must either contain a whole number of declarations or only 
+inner material of one declaration. Almost always it does not matter whether 
+these references are resolved or not (the section is ignored).
+
+But there is one case which is not explicitly specified: may the replacement 
+text of an entity contain the end marker ]]> of an ignored conditional 
+section? Example: 
+
+<!ENTITY % end "]]>">
+<![ IGNORE [ %end;
+
+The XML spec contains no statement that the ]]> must be contained in the same 
+entity as the corresponding <![ (as it requires for the tokens <! and > of 
+declarations). So it is possible to conclude that ]]> may be in another entity.
+
+Of course, there are many arguments against allowing such constructs: the 
+resulting code is incomprehensible, and parsing takes longer (especially if 
+the entities are external). I think the best argument against this kind of XML 
+is that the XML spec is not detailed enough, as it contains no rules about 
+where entity references should be recognized and where not. For example: 
+
+<!ENTITY % y "]]>">
+<!ENTITY % x "<!ENTITY z '<![CDATA[some text%y;'>">
+<![ IGNORE [ %x; ]]>
+
+Which token ]]> counts? From a logical point of view, the ]]> in the third line 
+ends the conditional section. As already pointed out, the XML spec permits the 
+interpretation that ]]> is recognized even in string literals, and this may 
+also be true if it is "imported" from a separate entity; in that case the 
+first ]]> denotes the end of the section.
+
+As a practical solution, this parser does not expand parameter entities in 
+ignored sections. Furthermore, the ending ]]> of an ignored or included 
+section must be contained in the same entity as the starting <![ token.
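+
+For illustration, a constructed fragment of an external subset (the entity 
+name "close" and the element name "a" are hypothetical): 
+
+<!ENTITY % close "]]>">
+<![INCLUDE[
+<!ELEMENT a EMPTY>
+%close;
+
+Here the section is opened in the enclosing entity but its ]]> comes from the 
+replacement text of %close;, so by the rule above this parser should reject 
+the fragment.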
+
+==============================================================================
+Standalone documents and attribute normalization
+==============================================================================
+
+If a document is declared as stand-alone, a restriction on the effect of 
+attribute normalization takes effect for attributes declared in external 
+entities. Normally, the parser knows the type of the attribute from the ATTLIST 
+declaration, and it can normalize attribute values depending on their types. 
+For example, an NMTOKEN attribute can be written with leading or trailing 
+spaces, but the parser always returns the nmtoken without such added spaces; 
+in contrast, a CDATA attribute is not normalized in this way. For a 
+stand-alone document, the type information is not available if the ATTLIST 
+declaration is located in an external entity. Because of this, the XML spec 
+demands that attribute values must be written in their normal form in this 
+case, i.e. without additional spaces. 
+
+This parser interprets this restriction as follows. Obviously, the substitution 
+of character and entity references is not considered as a "change of the value" 
+as a result of the normalization, because these operations will be performed 
+identically if the ATTLIST declaration is not available. The same applies to 
+the substitution of TABs, CRs, and LFs by space characters. Only the removal of 
+spaces depending on the type of the attribute changes the value if the ATTLIST 
+is not available. 
+
+This means in detail: CDATA attributes never violate the stand-alone status. 
+ID, IDREF, NMTOKEN, ENTITY, NOTATION and enumerated attributes must not be 
+written with leading and/or trailing spaces. IDREFS, ENTITIES, and NMTOKENS 
+attributes must not be written with extra spaces at the beginning or at the end 
+of the value, or between the tokens of the list. 
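+
+A constructed illustration (element and attribute names are hypothetical). 
+Assume a stand-alone document whose external subset declares: 
+
+<!ATTLIST el tok  NMTOKEN  #IMPLIED
+             toks NMTOKENS #IMPLIED
+             txt  CDATA    #IMPLIED>
+
+Following the rules above, this instance should pass the check, because only 
+the CDATA value carries extra spaces: 
+
+<el tok="abc" toks="a b c" txt="  any   spacing  "/>
+
+Each of the following instances, by contrast, should be reported as a 
+violation of the stand-alone declaration (leading space; double space between 
+tokens; trailing space): 
+
+<el tok=" abc"/>
+<el toks="a  b c"/>
+<el toks="a b c "/>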
+
+The whole check is dubious, because the attribute type also expresses a 
+semantic constraint, not only a syntactic one. At least this parser 
+distinguishes strictly between single-value and list types, and returns the 
+attribute values differently; the former are represented as Value s (where s 
+is a string), the latter are represented as Valuelist [s1; s2; ...; sN]. The 
+internal representation of the value depends on the attribute type, too, so 
+that even normalized values are processed differently depending on whether 
+the attribute has list type or not. For this parser, it still makes a 
+difference whether a value is normalized and processed as if it were CDATA, or 
+whether the value is processed according to its declared type. 
+
+The stand-alone check is included in order to be able to state whether other 
+parsers, which only check well-formedness, can process the document. Of 
+course, these parsers always process attributes as CDATA, and the stand-alone 
+check guarantees that they will always see the normalized values. 
+
+==============================================================================
+Standalone documents and the restrictions on entity references
+==============================================================================
+
+Stand-alone documents must not refer to entities which are declared in an 
+external entity. This parser applies this rule only to general and NDATA 
+entities occurring in the document body (i.e. not in the DTD), and to general 
+and NDATA entities occurring in default attribute values declared in the 
+internal subset of the DTD. 
+
+Parameter entities are irrelevant to the stand-alone property. If the internal 
+subset contains a reference to a parameter entity which was declared in an 
+external entity, the parameter entity is unavailable in the same way as the 
+external entity that contains its declaration. Because of this "equivalence", 
+parameter entity references are not checked for violations of the stand-alone 
+declaration. It simply does not matter. Illustration: 
+
+Main document: 
+
+<!ENTITY % ext SYSTEM "ext">
+%ext;
+%ent;
+
+"ext" contains: 
+
+<!ENTITY % ent "<!ELEMENT el (other*)>">
+
+
+
+Here, the reference %ent; would be illegal if the standalone declaration were 
+interpreted strictly. This parser handles the references %ent; and %ext; 
+equivalently, which means that %ent; is allowed, but the element type "el" is 
+treated as externally declared. 
+
+General entities can occur within the DTD, but they can only be contained in 
+the default value of attributes, or in the definition of other general 
+entities. The latter can be ignored, because the check will be repeated when 
+the entities are expanded. General entities occurring in default attribute 
+values, however, are checked at the moment when the default is used in an 
+element instance. 
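+
+A constructed illustration (all names are hypothetical; the ELEMENT 
+declaration is omitted). Suppose the external entity "ext" declares 
+
+<!ENTITY ge "some text">
+
+and the internal subset of a stand-alone document contains 
+
+<!ENTITY % ext SYSTEM "ext">
+%ext;
+<!ATTLIST el a CDATA "&ge;">
+
+Going by the description above, nothing is reported while the ATTLIST 
+declaration itself is read. An instance <el/> uses the default value and 
+should therefore trigger the violation, whereas an instance <el a="x"/> does 
+not use the default, so the reference to "ge" is presumably not checked for 
+that element.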
+
+General entities occurring in the document body are always checked.
+
+NDATA entities can occur in ENTITY attribute values; either in the element 
+instance or in the default declaration. Both cases are checked. 
+