X-Git-Url: http://matita.cs.unibo.it/gitweb/?a=blobdiff_plain;f=helm%2FDEVEL%2Fpxp%2Fpxp%2Fdoc%2Fmanual%2Fhtml%2Fx1496.html;fp=helm%2FDEVEL%2Fpxp%2Fpxp%2Fdoc%2Fmanual%2Fhtml%2Fx1496.html;h=faea39fc62fe30a508da8da4c8aee1975c00b820;hb=c03d2c1fdab8d228cb88aaba5ca0f556318bebc5;hp=0000000000000000000000000000000000000000;hpb=758057e85325f94cd88583feb1fdf6b038e35055;p=helm.git diff --git a/helm/DEVEL/pxp/pxp/doc/manual/html/x1496.html b/helm/DEVEL/pxp/pxp/doc/manual/html/x1496.html new file mode 100644 index 000000000..faea39fc6 --- /dev/null +++ b/helm/DEVEL/pxp/pxp/doc/manual/html/x1496.html @@ -0,0 +1,442 @@ +
If an element declaration does not allow the element to +contain character data, the following rules apply.
If the element must be empty, i.e. it is declared with the +keyword EMPTY, the element instance must be effectively +empty (it must not even contain whitespace characters). The parser guarantees +that an element declared EMPTY never contains a data +node, not even a data node that represents the empty string.
If the element declaration only permits other elements to occur +within that element but not character data, it is still possible to insert +whitespace characters between the subelements. The parser ignores these +characters, too, and does not create data nodes for them.
Example. Consider the following element types: + +
<!ELEMENT x ( #PCDATA | z )* > +<!ELEMENT y ( z )* > +<!ELEMENT z EMPTY>+ +Only x may contain character data; the keyword +#PCDATA indicates this. The other types are character-free.
The XML term + +
<x><z/> <z/></x>+ +will be internally represented by an element node for x +with three subnodes: the first z element, a data node +containing the space character, and the second z element. +In contrast to this, the term + +
<y><z/> <z/></y>+ +is represented by an element node for y with only +two subnodes, the two z elements. There +is no data node for the space character because spaces are ignored in the +character-free element y.
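The difference between the two terms can be sketched with a small, self-contained model. The tree type, `allows_pcdata`, and `strip` below are hypothetical names made up for illustration; they are not part of the parser's API, and the real node objects are considerably richer.

```ocaml
(* Hypothetical, simplified tree type (an assumption for this sketch). *)
type node =
  | Element of string * node list
  | Data of string

(* Content models of the example DTD: only x allows #PCDATA. *)
let allows_pcdata = function
  | "x" -> true
  | _ -> false

(* True if the string consists only of XML whitespace (or is empty). *)
let is_whitespace s =
  let ok = ref true in
  String.iter
    (fun c ->
      if not (c = ' ' || c = '\t' || c = '\n' || c = '\r') then ok := false)
    s;
  !ok

(* Drop whitespace-only data nodes below character-free elements.
   Non-whitespace data in such an element is kept here; a validating
   parser would reject it instead. *)
let rec strip node =
  match node with
  | Data _ -> node
  | Element (name, children) ->
      let keep child =
        match child with
        | Data s -> allows_pcdata name || not (is_whitespace s)
        | Element _ -> true
      in
      Element (name, List.map strip (List.filter keep children))

let child_count = function
  | Element (_, children) -> List.length children
  | Data _ -> 0

let z = Element ("z", [])
let x_tree = strip (Element ("x", [z; Data " "; z]))
let y_tree = strip (Element ("y", [z; Data " "; z]))

let () =
  Printf.printf "x: %d subnodes, y: %d subnodes\n"
    (child_count x_tree) (child_count y_tree)
```

In this model the x element keeps its three subnodes, while the y element ends up with two.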
The XML specification allows all Unicode characters in XML +texts. This parser can be configured such that UTF-8 is used to represent the +characters internally; however, the default character encoding is +ISO-8859-1. (Currently, no other encodings are possible for the internal string +representation; the type Pxp_types.rep_encoding enumerates +the possible encodings. In principle, the parser could use any encoding that is +ASCII-compatible, but there are currently only lexical analyzers for UTF-8 and +ISO-8859-1. It is currently impossible to use UTF-16 or UCS-4 as internal +encodings (or other multibyte encodings which are not ASCII-compatible) unless +major parts of the parser are rewritten, which is unlikely.)
The internal encoding may be different from the external encoding (specified +in the XML declaration <?xml ... encoding="..."?>); in +this case the strings are automatically converted to the internal encoding.
If the internal encoding is ISO-8859-1, it is possible that there are +characters that cannot be represented. In this case, the parser ignores such +characters and prints a warning (to the collect_warning +object that must be passed when the parser is called).
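The following sketch illustrates what dropping unrepresentable characters means in practice. It is not the parser's conversion code: `to_latin1` and its `warn` callback are hypothetical stand-ins, and the input is assumed to be a list of already decoded Unicode code points.

```ocaml
(* Convert Unicode code points to an ISO-8859-1 string. Code points above
   0xFF have no ISO-8859-1 representation; they are dropped and reported
   through [warn], a hypothetical callback standing in for the warning
   collector. *)
let to_latin1 ~warn code_points =
  let buf = Buffer.create 16 in
  List.iter
    (fun cp ->
      if cp >= 0 && cp <= 0xFF then Buffer.add_char buf (Char.chr cp)
      else warn (Printf.sprintf "cannot represent U+%04X in ISO-8859-1" cp))
    code_points;
  Buffer.contents buf

let warnings = ref []

(* 'A', e-acute, euro sign, 'B': only the euro sign is out of range. *)
let result =
  to_latin1
    ~warn:(fun msg -> warnings := msg :: !warnings)
    [0x41; 0xE9; 0x20AC; 0x42]

let () =
  Printf.printf "%d bytes kept, %d warning(s)\n"
    (String.length result) (List.length !warnings)
```

Choosing UTF-8 as the internal encoding avoids this loss, at the price of multi-byte characters in the returned strings.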
The XML specification allows lines to be separated by single LF +characters, by CR LF character sequences, or by single CR +characters. Internally, these separators are always converted to single LF +characters.
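As a minimal sketch of this conversion, the function below rewrites CR LF and single CR to LF. The name `normalize_newlines` is an assumption made for this example and does not refer to anything in the parser.

```ocaml
(* Replace CR LF and lone CR by a single LF; lone LF is left unchanged. *)
let normalize_newlines s =
  let n = String.length s in
  let buf = Buffer.create n in
  let i = ref 0 in
  while !i < n do
    (match s.[!i] with
     | '\r' ->
         Buffer.add_char buf '\n';
         (* Skip the LF of a CR LF pair so that it yields only one LF. *)
         if !i + 1 < n && s.[!i + 1] = '\n' then incr i
     | c -> Buffer.add_char buf c);
    incr i
  done;
  Buffer.contents buf

let () =
  print_string (String.escaped (normalize_newlines "a\r\nb\rc\nd"));
  print_newline ()
```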
The parser guarantees that there are never two adjacent data +nodes; if necessary, data material that would otherwise be represented by +several nodes is collapsed into one node. Note that you can still create node +trees with adjacent data nodes; however, the parser does not return such trees.
Note that CDATA sections are not represented specially; such +sections are added to the current data material that is being collected for the +next data node.
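If you build trees yourself and want the same invariant, adjacent data nodes can be merged after the fact. The sketch below uses a hypothetical, simplified tree type rather than the parser's node objects; text that came from a CDATA section would simply appear as an ordinary `Data` value in this model.

```ocaml
(* Hypothetical, simplified tree type (an assumption for this sketch). *)
type node =
  | Element of string * node list
  | Data of string

(* Merge runs of adjacent data nodes into one node, recursively. *)
let rec merge_data nodes =
  match nodes with
  | Data a :: Data b :: rest -> merge_data (Data (a ^ b) :: rest)
  | Element (name, children) :: rest ->
      Element (name, merge_data children) :: merge_data rest
  | node :: rest -> node :: merge_data rest
  | [] -> []

let merged =
  merge_data
    [Data "a"; Data "b"; Element ("z", []); Data "c"; Data "d"; Data "e"]

let () = Printf.printf "%d nodes after merging\n" (List.length merged)
```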
Entities are not represented within +documents! If the parser finds an entity reference in the document +content, the reference is immediately expanded, and the parser reads the +expansion text instead of the reference.
Attribute +values are composed of Unicode characters as well, so the same problems with the +character encoding arise as for character material. Attribute values are +also converted to the internal encoding; if there are characters that +cannot be represented, these are dropped, and a warning is printed.
Attribute values are normalized before they are returned by +methods like attribute. First, any remaining entity +references are expanded; if necessary, expansion is performed recursively. +Second, newline characters (any of LF, CR LF, or CR characters) are converted +to single space characters. Note that especially the latter action is +prescribed by the XML standard (but the character reference &#10; is not converted, +so it is still possible to include line feeds in attributes).
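The newline rule can be sketched as follows. This is an illustration under strong simplifications, not the parser's implementation: `normalize_attribute` is a hypothetical name, and the only reference it recognizes is the character reference `&#10;`, which the XML standard exempts from the conversion to a space.

```ocaml
(* Normalize a raw attribute value: each literal line break (LF, CR LF,
   or CR) becomes one space, while the character reference "&#10;"
   yields a real line feed. Only this one reference is recognized here;
   a real parser handles all character and entity references. *)
let normalize_attribute raw =
  let n = String.length raw in
  let buf = Buffer.create n in
  let i = ref 0 in
  while !i < n do
    if !i + 5 <= n && String.sub raw !i 5 = "&#10;" then begin
      Buffer.add_char buf '\n';
      i := !i + 5
    end else begin
      (match raw.[!i] with
       | '\r' ->
           Buffer.add_char buf ' ';
           (* A CR LF pair counts as one line break. *)
           if !i + 1 < n && raw.[!i + 1] = '\n' then incr i
       | '\n' -> Buffer.add_char buf ' '
       | c -> Buffer.add_char buf c);
      incr i
    end
  done;
  Buffer.contents buf

let () =
  print_string (String.escaped (normalize_attribute "a\r\nb&#10;c\nd"));
  print_newline ()
```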
Processing instructions are parsed to some extent: The first word of the +PI is called the target, and it is stored separately from the rest of the PI: + +
<?target rest?>+ +The exact location where a PI occurs is not represented (by default). The +parser puts the PI into the object that represents the enclosing construct (an +element, a DTD, or the whole document); that means you can find out which PIs +occur in a certain element, in the DTD, or in the whole document, but you +cannot look up the exact position within the construct.
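The split into target and rest can be illustrated with a few lines of code. `split_pi` is a hypothetical helper written for this example; it assumes its argument is the text between `<?` and `?>`.

```ocaml
(* Split the text between "<?" and "?>" into (target, rest). The target
   is the first word; whitespace separating it from the rest is dropped. *)
let split_pi content =
  let is_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r' in
  let n = String.length content in
  let rec target_end i =
    if i < n && not (is_space content.[i]) then target_end (i + 1) else i
  in
  let rec skip_space i =
    if i < n && is_space content.[i] then skip_space (i + 1) else i
  in
  let t = target_end 0 in
  let r = skip_space t in
  (String.sub content 0 t, String.sub content r (n - r))

let () =
  let target, rest = split_pi "xml-stylesheet href=\"a.css\"" in
  Printf.printf "target=%s rest=%s\n" target rest
```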
If you require the exact location of PIs, it is possible to +create extra nodes for them. This mode is controlled by the option +enable_pinstr_nodes. The additional nodes have the node type +T_pinstr target, and are created +from special exemplars contained in the spec (see +pxp_document.mli).
Normally, comments are not represented; they are dropped by +default. However, if you require them, it is possible to create +T_comment nodes for them. This mode can be specified by the +option enable_comment_nodes. Comment nodes are created from +special exemplars contained in the spec (see +pxp_document.mli). You can access the contents of comments through the +method comment.
These attributes are not supported specially; they are handled +like any other attribute.
Currently, there is no special support for namespaces. +However, the parser allows colons in names, so it is +possible to implement namespaces on top of the current API.
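A first step of such an implementation would be to split names at the colon. The helper below is a hypothetical sketch, not an existing function of the parser; resolving the prefix to a namespace URI through xmlns attributes is left out.

```ocaml
(* Split a name at its first colon into (prefix, local name).
   Names without a colon have no prefix. *)
let split_qname name =
  match String.index_opt name ':' with
  | Some i ->
      (Some (String.sub name 0 i),
       String.sub name (i + 1) (String.length name - i - 1))
  | None -> (None, name)

let () =
  match split_qname "svg:rect" with
  | Some prefix, local -> Printf.printf "prefix=%s local=%s\n" prefix local
  | None, local -> Printf.printf "no prefix, local=%s\n" local
```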
Some future release of PXP will support namespaces as a built-in +feature.