Chapter 4. Configuring and calling the parser

Table of Contents
4.1. Overview
4.2. Resolvers and sources
4.3. The DTD classes
4.4. Invoking the parser
4.5. Updates

4.1. Overview

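The main functions that invoke the parser are defined in Pxp_yacc. This section refers to four of them: parse_document_entity, parse_wfdocument_entity, parse_content_entity, and parse_dtd_entity.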

In many cases, parse_document_entity is the preferred mode to parse a document in a validating way, and parse_wfdocument_entity is the mode of choice to parse a file while only checking for well-formedness.

There are a number of variations of these modes. One important application of a parser is to check documents from an untrusted source against a fixed DTD. One solution is to disallow the <!DOCTYPE> clause in these documents, and to treat each document like a fragment (using mode parse_content_entity). This is very simple, but inflexible; users of such a system cannot even define additional entities to abbreviate frequent phrases of their text.

A more intelligent checker may be necessary. For example, it is possible to parse the document under test fully, i.e. together with its DTD, and then to compare this DTD with the prescribed one. The document is parsed fully with mode parse_document_entity; the prescribed DTD to compare against can be obtained with mode parse_dtd_entity.

Another very important configurable aspect of the parser is the so-called resolver. The task of the resolver is to locate the contents of an (external) entity for a given entity name, and to make the contents accessible as a character stream. (Furthermore, it also normalizes the character set; but this is a detail we can ignore here.) Suppose you have a file called "main.xml" containing

<!ENTITY % sub SYSTEM "sub/sub.xml">
%sub;

and a file stored in the subdirectory "sub" with name "sub.xml" containing

<!ENTITY % subsub SYSTEM "subsub/subsub.xml">
%subsub;

and a file stored in the subdirectory "subsub" of "sub" with name "subsub.xml" (the contents of this file do not matter). Here, the resolver must track that the second entity subsub is located in the directory "sub/subsub", i.e. the difficulty is to interpret the system (file) names of entities relative to the entities containing them, even if the entities are deeply nested.
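
The relative interpretation of system names described above can be sketched with the standard Filename module. This is only an illustration under stated assumptions: resolve_relative is a hypothetical helper, not part of the parser, and the sketch assumes Unix-style file paths rather than general URLs.

```ocaml
(* Hypothetical helper, not part of the parser: interpret a SYSTEM name
   relative to the file that contains the reference.
   Assumes Unix-style paths. *)
let resolve_relative ~base name =
  if Filename.is_relative name then
    (* Relative names are looked up in the directory of the containing file. *)
    Filename.concat (Filename.dirname base) name
  else
    (* Absolute names are used unchanged. *)
    name

let () =
  (* main.xml refers to "sub/sub.xml" ... *)
  let sub = resolve_relative ~base:"main.xml" "sub/sub.xml" in
  (* ... and sub.xml in turn refers to "subsub/subsub.xml". *)
  let subsub = resolve_relative ~base:sub "subsub/subsub.xml" in
  print_endline sub;
  print_endline subsub
```

The point of the sketch is that each name is resolved against the entity that contains it, not against "main.xml"; the second call therefore ends up in the directory "sub/subsub".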

There is no fixed resolver that does everything right; resolving entity names is a task that depends heavily on the environment. The XML specification only demands that SYSTEM entities are interpreted like URLs (which is not very precise, as there are lots of URL schemes in use), hoping that this helps to overcome the local peculiarities of the environment; the idea is that if you do not know your environment you can refer to other entities by denoting URLs for them. I think that this interpretation of SYSTEM names may have some applications on the Internet, but it is not the first choice in general. Because of this, the resolver is a separate module of the parser that can be replaced by another one if necessary; more precisely, the parser already defines several resolvers.

The following resolvers do already exist:

The interface a resolver must have is documented, so it is possible to write your own resolver. For example, you could connect the parser with an HTTP client, and resolve URLs of the HTTP namespace. The resolver classes allow several independent resolvers to be combined into one more powerful resolver; thus it is possible to combine a self-written resolver with the already existing resolvers.
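
The idea of combining resolvers can be sketched as follows. This is a simplified model, not the documented resolver interface: the type resolver, the function combine, and the two example resolvers are all hypothetical, and the HTTP resolver is a stub that does not perform any network access.

```ocaml
(* Hypothetical model: a resolver maps an entity name to its contents,
   or returns None if it is not responsible for that name. *)
type resolver = string -> string option

(* Try each resolver in order; the first one that answers wins. *)
let rec combine (resolvers : resolver list) (name : string) : string option =
  match resolvers with
  | [] -> None
  | r :: rest ->
      (match r name with
       | Some _ as found -> found
       | None -> combine rest name)

(* Example resolver backed by an in-memory association list. *)
let memory_resolver table : resolver =
  fun name -> List.assoc_opt name table

(* Stub for a self-written HTTP resolver: it only recognizes the
   "http://" prefix and returns placeholder text instead of fetching. *)
let http_stub : resolver =
  fun name ->
    let prefix = "http://" in
    let n = String.length prefix in
    if String.length name >= n && String.sub name 0 n = prefix
    then Some ("(contents fetched from " ^ name ^ ")")
    else None

let combined : resolver =
  combine [ memory_resolver [ ("sub/sub.xml", "<!-- sub -->") ]; http_stub ]
```

Here combined answers for names known to the in-memory table and for names in the HTTP namespace, and returns None for everything else.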

Note that the existing resolvers only interpret SYSTEM names, not PUBLIC names. If it helps you, it is possible to define resolvers for PUBLIC names, too; for example, such a resolver could look up the public name in a hash table, and map it to a system name which is then passed on to the existing resolver for system names. It is relatively simple to provide such a resolver.
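
The hash table lookup for PUBLIC names might look like the following sketch. All names here are assumptions made for illustration: catalog, resolve_public, and the parameter resolve_system (which stands in for whatever existing resolver handles system names) are hypothetical, and the catalog entry is a made-up example.

```ocaml
(* Hypothetical catalog mapping PUBLIC identifiers to SYSTEM names.
   The entry below is a made-up example. *)
let catalog : (string, string) Hashtbl.t = Hashtbl.create 16

let () =
  Hashtbl.replace catalog "-//EXAMPLE//DTD Sample 1.0//EN" "dtd/sample.dtd"

(* Look up the PUBLIC name; if it is known, hand the mapped SYSTEM name
   to the resolver for system names. Unknown PUBLIC names yield None. *)
let resolve_public ~resolve_system public_id =
  match Hashtbl.find_opt catalog public_id with
  | Some system_name -> resolve_system system_name
  | None -> None
```

Passing the system-name resolver as an argument keeps the PUBLIC lookup independent of how system names are actually opened.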