Initial revision

[helm.git] / helm / DEVEL / pxp / pxp / doc / SPEC.xml
diff --git a/helm/DEVEL/pxp/pxp/doc/SPEC.xml b/helm/DEVEL/pxp/pxp/doc/SPEC.xml

new file mode 100644 (file)

index 0000000..906f45a
--- /dev/null
+++ b/helm/DEVEL/pxp/pxp/doc/SPEC.xml
@@ -0,0 +1,226 @@
+<?xml version="1.0" encoding="ISO-8859-1"?>
+<!DOCTYPE readme SYSTEM "readme.dtd" [
+
+<!ENTITY % common SYSTEM "common.xml">
+%common;
+
+<!-- Special HTML config: -->
+<!ENTITY % readme:html:up '<a href="../..">up</a>'>
+
+<!ENTITY % config SYSTEM "config.xml">
+%config;
+
+]>
+
+<readme title="Notes on the XML specification">
+
+  <sect1>
+    <title>This document</title>
+    <p>There are some points in the XML specification which are ambiguous.
+The following notes discuss these points, and describe how this parser
+behaves.</p>
+  </sect1>
+
+  <sect1>
+    <title>Conditional sections and the token ]]&gt;</title>
+
+    <p>It is unclear what happens if an ignored section contains the
+token ]]&gt; at places where it is normally allowed, i.e. within string
+literals and comments, e.g.
+
+<code>
+&lt;![IGNORE[ &lt;!-- ]]&gt; --&gt; ]]&gt;
+</code>
+
+On the one hand, the production rule of the XML grammar does not treat such 
+tokens specially. Following the grammar, already the first ]]&gt; ends
+the conditional section
+
+<code>
+&lt;![IGNORE[ &lt;!-- ]]&gt;
+</code>
+
+and the other tokens are included into the DTD.</p>
+
+<p>On the other hand, we can read: "Like the internal and external DTD subsets,
+a conditional section may contain one or more complete declarations, comments,
+processing instructions, or nested conditional sections, intermingled with
+white space" (XML 1.0 spec, section 3.4). Complete declarations and comments
+may contain ]]&gt;, so this is contradictory to the grammar.</p>
+
+<p>The intention of conditional sections is to include or exclude the section 
+depending on the current replacement text of a parameter entity. Almost
+always such sections are used as in
+
+<code>
+&lt;!ENTITY % want.a.feature.or.not "INCLUDE"&gt;   (or "IGNORE")
+&lt;![ %want.a.feature.or.not; [ ... ]]&gt;
+</code>
+
+This means that if it is possible to include a section it must also be
+legal to ignore the same section. This is a strong indication that 
+the token ]]&gt; must not count as section terminator if it occurs
+in a string literal or comment.</p>
+
+<p>This parser implements the latter.</p>
+
+  </sect1>
+
+  <sect1>
+    <title>Conditional sections and the inclusion of parameter entities</title>
+
+    <p>It is unclear what happens if an ignored section contains a reference
+to a parameter entity. In most cases, this is not problematic because 
+nesting of parameter entities must respect declaration braces. The
+replacement text of parameter entities must either contain a <em>whole</em>
+number of declarations or only inner material of one declaration. Almost always
+it does not matter whether these references are resolved or not
+(the section is ignored).</p>
+
+    <p>But there is one case which is not explicitly specified: Is it allowed
+that the replacement text of an entity contains the end marker ]]&gt; 
+of an ignored conditional section? Example:
+
+<code>
+&lt;!ENTITY % end "]]&gt;"&gt;
+&lt;![ IGNORE [ %end;
+</code>
+
+We do not find the statement in the XML spec that the ]]&gt; must be contained
+in the same entity as the corresponding &lt;![ (as for the tokens &lt;! and
+&gt; of declarations). So it is possible to conclude that ]]&gt; may be in
+another entity.</p>
+
+    <p>Of course, there are many arguments not to allow such constructs: The
+resulting code is incomprehensive, and parsing takes longer (especially if the
+entities are external). I think the best argument against this kind of XML
+is that the XML spec is not detailed enough, as it contains no rules where
+entity references should be recognized and where not. For example:
+
+<code>
+&lt;!ENTITY % y "]]&gt;"&gt;
+&lt;!ENTITY % x "&lt;!ENTITY z '&lt;![CDATA[some text%y;'&gt;"&gt;
+&lt;![ IGNORE [ %x; ]]&gt;
+</code>
+
+Which token ]]&gt; counts? From a logical point of view, the ]]&gt; in the
+third line ends the conditional section. As already pointed out, the XML spec
+permits the interpretation that ]]&gt; is recognized even in string literals,
+and this may be also true if it is "imported" from a separate entity; and so
+the first ]]&gt; denotes the end of the section.</p>
+
+    <p>As a practical solution, this parser does not expand parameter entities
+in ignored sections. Furthermore, it is also not allowed that the ending ]]&gt;
+of ignored or included sections is contained in a different entity than the
+starting &lt;![ token.</p>
+  </sect1>
+
+
+  <sect1>
+    <title>Standalone documents and attribute normalization</title>
+    
+    <p>
+If a document is declared as stand-alone, a restriction on the effect of
+attribute normalization takes effect for attributes declared in external
+entities. Normally, the parser knows the type of the attribute from
+the ATTLIST declaration, and it can normalize attribute values depending
+on their types. For example, an NMTOKEN attribute can be written with
+leading or trailing spaces, but the parser returns always the nmtoken
+without such added spaces; in contrast to this, a CDATA attribute is
+not normalized in this way. For stand-alone document the type information is
+not available if the ATTLIST declaration is located in an external
+entity. Because of this, the XML spec demands that attribute values must
+be written in their normal form in this case, i.e. without additional
+spaces.
+</p>
+    <p>This parser interprets this restriction as follows. Obviously, 
+the substitution of character and entity references is not considered
+as a "change of the value" as a result of the normalization, because
+these operations will be performed identically if the ATTLIST declaration
+is not available. The same applies to the substitution of TABs, CRs, 
+and LFs by space characters. Only the removal of spaces depending on
+the type of the attribute changes the value if the ATTLIST is not
+available.
+</p>
+    <p>This means in detail: CDATA attributes never violate the
+stand-alone status. ID, IDREF, NMTOKEN, ENTITY, NOTATION and enumerator
+attributes must not be written with leading and/or trailing spaces. IDREF,
+ENTITIES, and NMTOKENS attributes must not be written with extra spaces at the
+beginning or at the end of the value, or between the tokens of the list.
+</p>
+    <p>The whole check is dubious, because the attribute type expresses also a
+semantical constraint, not only a syntactical one. At least this parser
+distinguishes strictly between single-value and list types, and returns the
+attribute values differently; the first are represented as Value s (where s is
+a string), the latter are represented as Valuelist [s1; s2; ...; sN]. The
+internal representation of the value is dependent on the attribute type, too,
+such that even normalized values are processed differently depending on
+whether the attribute has list type or not. For this parser, it makes still a
+difference whether a value is normalized and processed as if it were CDATA, or
+whether the value is processed according to its declared type.
+</p>
+    <p>The stand-alone check is included to be able to make a statement 
+whether other, well-formedness parsers can process the document. Of course,
+these parsers always process attributes as CDATA, and the stand-alone check
+guarantees that these parsers will always see the normalized values.
+</p>
+  </sect1>
+
+  <sect1>
+    <title>Standalone documents and the restrictions on entity
+references</title>
+    <p>
+Stand-alone documents must not refer to entities which are declared in an
+external entity. This parser applies this rule only: to general and NDATA
+entities when they occur in the document body (i.e. not in the DTD); and to
+general and NDATA entities occuring in default attribute values declared in the
+internal subset of the DTD.
+</p>
+    <p>
+Parameter entities are out of discussion for the stand-alone property. If there
+is a parameter entity reference in the internal subset which was declared in an
+external entity, it is not available in the same way as the external entity is
+not available that contains its declaration. Because of this "equivalence",
+parameter entity references are not checked on violations against the
+stand-alone declaration. It simply does not matter. - Illustration:
+</p>
+
+    <p>
+Main document:
+
+    <code><![CDATA[
+<!ENTITY % ext SYSTEM "ext">
+%ext;
+%ent;
+]]></code>
+
+"ext" contains:
+
+    <code><![CDATA[
+<!ENTITY % ent "<!ELEMENT el (other*)>">
+]]></code>
+</p>
+
+    <p>Here, the reference %ent; would be illegal if the standalone
+declaration is strictly interpreted. This parser handles the references
+%ent; and %ext; equivalently which means that %ent; is allowed, but the
+element type "el" is treated as externally declared.
+</p>
+
+    <p>
+General entities can occur within the DTD, but they can only be contained in
+the default value of attributes, or in the definition of other general
+entities. The latter can be ignored, because the check will be repeated when
+the entities are expanded. Though, general entities occuring in default
+attribute values are actually checked at the moment when the default is
+used in an element instance.
+</p>
+    <p>
+General entities occuring in the document body are always checked.</p>
+    <p>
+NDATA entities can occur in ENTITY attribute values; either in the element
+instance or in the default declaration. Both cases are checked.
+</p>
+  </sect1>
+
+</readme>