X-Git-Url: http://matita.cs.unibo.it/gitweb/?a=blobdiff_plain;f=helm%2FDEVEL%2Fpxp%2Fpxp%2Fdoc%2Fmanual%2Fhtml%2Fc1567.html;fp=helm%2FDEVEL%2Fpxp%2Fpxp%2Fdoc%2Fmanual%2Fhtml%2Fc1567.html;h=ab88e87bf7a921858f63dcbc69a8763b458c976e;hb=c03d2c1fdab8d228cb88aaba5ca0f556318bebc5;hp=0000000000000000000000000000000000000000;hpb=758057e85325f94cd88583feb1fdf6b038e35055;p=helm.git diff --git a/helm/DEVEL/pxp/pxp/doc/manual/html/c1567.html b/helm/DEVEL/pxp/pxp/doc/manual/html/c1567.html new file mode 100644 index 000000000..ab88e87bf --- /dev/null +++ b/helm/DEVEL/pxp/pxp/doc/manual/html/c1567.html @@ -0,0 +1,434 @@ +
The following main functions invoke the parser (in Pxp_yacc): + +
parse_document_entity: You want to +parse a complete and closed document consisting of a DTD and the document body; +the body is validated against the DTD. This mode is interesting if you have a +file + +
<!DOCTYPE root ... [ ... ] > <root> ... </root>+ +and you can accept any DTD that is included in the file (e.g. because the file +is under your control).
parse_wfdocument_entity: You want to +parse a complete and closed document consisting of a DTD and the document body; +but the body is not validated, only checked for well-formedness. This mode is +preferred if validation costs too much time or if the DTD is missing.
parse_dtd_entity: You only want to +parse an entity (file) containing the external subset of a DTD. Sometimes it is +useful to read such a DTD, for example to compare it with the DTD included +in a document, or to apply the next mode:
parse_content_entity: You only want to +parse an entity (file) containing a fragment of a document body; this fragment +is validated against the DTD you pass to the function. In particular, the fragment +must not have a <!DOCTYPE> clause, and must directly +begin with an element. The element is validated against the DTD. This mode is +useful if you want to check documents against a fixed, immutable DTD.
parse_wfcontent_entity: This function +also parses a single element without DTD, but does not validate it.
extract_dtd_from_document_entity: This +function extracts the DTD from a closed document consisting of a DTD and a +document body. Both the internal and the external subsets are extracted.
In many cases, parse_document_entity is the preferred mode +to parse a document in a validating way, and +parse_wfdocument_entity is the mode of choice to parse a +file while only checking for well-formedness.
There are a number of variations of these modes. One important application of a +parser is to check documents from an untrusted source against a fixed DTD. One +solution is to not allow the <!DOCTYPE> clause in +these documents, and to treat the document like a fragment (using mode +parse_content_entity). This is very simple, but +inflexible; users of such a system cannot even define additional entities to +abbreviate frequent phrases of their text.
It may be necessary to have a more intelligent checker. For example, it is also +possible to fully parse the document to be checked, i.e. together with its DTD, and to compare +this DTD with the prescribed one. To fully parse the document, mode +parse_document_entity is applied; to obtain the prescribed DTD to +compare against, mode parse_dtd_entity can be used.
There is another very important configurable aspect of the parser: the +so-called resolver. The task of the resolver is to locate the contents of an +(external) entity for a given entity name, and to make the contents accessible +as a character stream. (Furthermore, it also normalizes the character set; +but this is a detail we can ignore here.) Suppose you have a file called +"main.xml" containing + +
<!ENTITY % sub SYSTEM "sub/sub.xml"> +%sub;+ +and a file stored in the subdirectory "sub" with name +"sub.xml" containing + +
<!ENTITY % subsub SYSTEM "subsub/subsub.xml"> +%subsub;+ +and a file stored in the subdirectory "subsub" of +"sub" with name "subsub.xml" (the +contents of this file do not matter). Here, the resolver must track that +the second entity subsub is located in the directory +"sub/subsub", i.e. the difficulty is to interpret the +system (file) names of entities relative to the entities containing them, +even if the entities are deeply nested.
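The tracking of nested relative names described above can be illustrated with a minimal sketch. It is not PXP's resolver: the function resolve_relative is a hypothetical helper introduced only for this illustration, and it assumes plain slash-separated names without "." or ".." components.

```ocaml
(* Hypothetical helper, not part of PXP: interpret a system name relative to
   the name of the entity that contains the reference. Assumes "/" as the
   directory separator and no "." or ".." path components. *)
let resolve_relative ~base name =
  if String.length name > 0 && name.[0] = '/' then
    name                                     (* absolute name: keep as is *)
  else
    match String.rindex_opt base '/' with
    | None -> name                           (* base has no directory part *)
    | Some i -> String.sub base 0 (i + 1) ^ name

let () =
  (* "main.xml" refers to "sub/sub.xml" ... *)
  let sub = resolve_relative ~base:"main.xml" "sub/sub.xml" in
  (* ... which in turn refers to "subsub/subsub.xml". *)
  let subsub = resolve_relative ~base:sub "subsub/subsub.xml" in
  Printf.printf "%s\n%s\n" sub subsub
```

The point of the sketch is that each name is resolved against the entity containing it, not against the top-level file, so the base has to be carried along through every level of nesting.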
There is no fixed resolver that already does everything right; resolving entity +names is a task that depends heavily on the environment. The XML specification +only demands that SYSTEM entities are interpreted like URLs +(which is not very precise, as there are lots of URL schemes in use), hoping +that this helps to overcome the local peculiarities of the environment; the idea +is that if you do not know your environment you can refer to other entities by +denoting URLs for them. I think that this interpretation of +SYSTEM names may have some applications on the Internet, but +it is not the first choice in general. Because of this, the resolver is a +separate module of the parser that can be exchanged for another one if +necessary; more precisely, the parser already defines several resolvers.
The following resolvers already exist: + +
Resolvers reading from arbitrary input channels. These +can be configured such that a certain ID is associated with the channel; in +this case inner references to external entities can be resolved. There is also +a special resolver that interprets SYSTEM IDs as URLs; this resolver can +process relative SYSTEM names and determine the corresponding absolute URL.
A resolver that always reads from a given O'Caml +string. This resolver is not able to resolve further names when the string is +not associated with any name, i.e. if the document contained in the string +refers to an external entity, this reference cannot be followed in this +case.
A resolver for file names. The SYSTEM +name is interpreted as a file URL with the slash "/" as separator for +directories. This resolver is derived from the generic URL resolver.
Note that the existing resolvers only interpret SYSTEM +names, not PUBLIC names. If it helps you, it is possible to +define resolvers for PUBLIC names, too; for example, such a +resolver could look up the public name in a hash table, and map it to a system +name which is passed on to the existing resolver for system names. It is +relatively simple to provide such a resolver.
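The hash table lookup for PUBLIC names might look roughly like the following sketch. Everything in it is an assumption made for illustration: the local type ext_id, the table catalog, the function to_system_name, and the sample identifier are hypothetical and do not show PXP's actual resolver interface.

```ocaml
(* Local, hypothetical representation of an external identifier;
   defined here only for this sketch. *)
type ext_id =
  | System of string
  | Public of string * string   (* public name, fallback system name *)

(* Hypothetical catalog mapping PUBLIC names to SYSTEM names. *)
let catalog : (string, string) Hashtbl.t = Hashtbl.create 16

let () =
  Hashtbl.replace catalog "-//EXAMPLE//DTD Sample 1.0//EN" "dtd/sample.dtd"

(* Map an identifier to the system name that would then be handed to a
   resolver for system names. Unknown public names fall back to the
   system name given alongside them. *)
let to_system_name = function
  | System sys -> sys
  | Public (pub, sys) ->
      (match Hashtbl.find_opt catalog pub with
       | Some mapped -> mapped
       | None -> sys)
```

Falling back to the accompanying system name for unknown public names is a design choice of this sketch; a stricter variant could reject them instead.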