4 >Configuring and calling the parser</TITLE
7 CONTENT="Modular DocBook HTML Stylesheet Version 1.46"><LINK
9 TITLE="The PXP user's guide"
10 HREF="index.html"><LINK
15 TITLE="Details of the mapping from XML text to the tree representation"
16 HREF="x1496.html"><LINK
18 TITLE="Resolvers and sources"
19 HREF="x1629.html"><LINK
22 HREF="markup.css"></HEAD
41 >The PXP user's guide</TH
75 >Chapter 4. Configuring and calling the parser</A
86 HREF="c1567.html#AEN1569"
92 >Resolvers and sources</A
102 >Invoking the parser</A
120 >The following main functions invoke the parser (all in Pxp_yacc):
127 STYLE="list-style-type: disc"
131 >parse_document_entity:</I
133 parse a complete and closed document consisting of a DTD and the document body;
134 the body is validated against the DTD. This mode is useful if you have a
138 CLASS="PROGRAMLISTING"
139 ><!DOCTYPE root ... [ ... ] > <root> ... </root></PRE
142 and you can accept any DTD that is included in the file (e.g. because the file
143 is under your control).</P
146 STYLE="list-style-type: disc"
150 >parse_wfdocument_entity:</I
152 parse a complete and closed document consisting of a DTD and the document body;
153 but the body is not validated, only checked for well-formedness. This mode is
154 preferred if validation costs too much time or if the DTD is missing.</P
157 STYLE="list-style-type: disc"
161 >parse_dtd_entity:</I
163 parse an entity (file) containing the external subset of a DTD. Sometimes it is
164 useful to read such a DTD, for example to compare it with the DTD included
165 in a document, or to apply the next mode:</P
168 STYLE="list-style-type: disc"
172 >parse_content_entity:</I
174 parse an entity (file) containing a fragment of a document body; this fragment
175 is validated against the DTD you pass to the function. In particular, the fragment
178 > <!DOCTYPE></TT
179 > clause, and must directly
180 begin with an element. The element is validated against the DTD. This mode is
181 useful if you want to check documents against a fixed, immutable DTD.</P
184 STYLE="list-style-type: disc"
188 >parse_wfcontent_entity:</I
190 also parses a single element without DTD, but does not validate it.</P
193 STYLE="list-style-type: disc"
197 >extract_dtd_from_document_entity:</I
199 function extracts the DTD from a closed document consisting of a DTD and a
200 document body. Both the internal and the external subsets are extracted.</P
207 >parse_document_entity</TT
208 > is the preferred mode
209 to parse a document in a validating way, and
212 >parse_wfdocument_entity</TT
213 > is the mode of choice to parse a
214 file while only checking for well-formedness.</P
216 >There are a number of variations of these modes. One important application of a
217 parser is to check documents from an untrusted source against a fixed DTD. One
218 solution is to not allow the <TT
220 ><!DOCTYPE></TT
222 these documents, and treat the document like a fragment (using mode
225 >parse_content_entity</I
226 >). This is very simple, but
227 inflexible; users of such a system cannot even define additional entities to
228 abbreviate frequent phrases of their text.</P
230 >It may be necessary to have a more intelligent checker. For example, it is also
231 possible to fully parse the document to be checked, i.e. including its DTD, and to compare
232 this DTD with the prescribed one. In order to fully parse the document, mode
235 >parse_document_entity</I
236 > is applied, and to get the DTD to
242 >There is another very important configurable aspect of the parser: the
243 so-called resolver. The task of the resolver is to locate the contents of an
244 (external) entity for a given entity name, and to make the contents accessible
245 as a character stream. (Furthermore, it also normalizes the character set;
246 but this is a detail we can ignore here.) Suppose you have a file called
253 CLASS="PROGRAMLISTING"
254 ><!ENTITY % sub SYSTEM "sub/sub.xml">
258 and a file stored in the subdirectory <TT
268 CLASS="PROGRAMLISTING"
269 ><!ENTITY % subsub SYSTEM "subsub/subsub.xml">
273 and a file stored in the subdirectory <TT
284 contents of this file do not matter). Here, the resolver must track that
285 the second entity <TT
288 > is located in the directory
292 >, i.e. the difficulty is to interpret the
293 system (file) names of entities relative to the entities containing them,
294 even if the entities are deeply nested.</P
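><P
>The relative interpretation described above can be sketched in plain O'Caml.
This is only an illustration and does not use the PXP resolver classes: the
helper names <TT>dirname_of</TT> and <TT>resolve_relative</TT> and the top-level
file name <TT>main.xml</TT> are assumptions made for this sketch, and only
slash-separated names are handled.</P
>

```ocaml
(* Hypothetical helpers, not part of PXP. *)

(* Directory part of a slash-separated system name, including the
   trailing slash; empty if the name has no directory part. *)
let dirname_of base =
  match String.rindex_opt base '/' with
  | Some i -> String.sub base 0 (i + 1)
  | None -> ""

(* Interpret [rel] relative to [base], the system name of the entity
   that contains the reference. Absolute names are returned unchanged. *)
let resolve_relative ~base rel =
  if String.length rel > 0 && rel.[0] = '/' then rel
  else dirname_of base ^ rel

let () =
  (* "main.xml" is an assumed name for the top-level file. *)
  let sub = resolve_relative ~base:"main.xml" "sub/sub.xml" in
  let subsub = resolve_relative ~base:sub "subsub/subsub.xml" in
  print_endline sub;      (* prints sub/sub.xml *)
  print_endline subsub    (* prints sub/subsub/subsub.xml *)
```

<P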
296 >No single resolver does everything right: resolving entity
297 names is a task that depends heavily on the environment. The XML specification
298 only demands that <TT
301 > entities are interpreted like URLs
302 (which is not very precise, as there are lots of URL schemes in use), hoping
303 that this helps overcome the local peculiarities of the environment; the idea
304 is that if you do not know your environment you can refer to other entities by
305 giving URLs for them. I think that this interpretation of
309 > names may have some applications on the Internet, but
310 it is not the first choice in general. Because of this, the resolver is a
311 separate module of the parser that can be replaced by another one if
312 necessary; more precisely, the parser already defines several resolvers.</P
314 >The following resolvers already exist:
321 STYLE="list-style-type: disc"
323 >Resolvers reading from arbitrary input channels. These
324 can be configured such that a certain ID is associated with the channel; in
325 this case inner references to external entities can be resolved. There is also
326 a special resolver that interprets SYSTEM IDs as URLs; this resolver can
327 process relative SYSTEM names and determine the corresponding absolute URL.</P
330 STYLE="list-style-type: disc"
332 >A resolver that always reads from a given O'Caml
333 string. This resolver cannot resolve further names if the string is
334 not associated with any name, i.e. if the document contained in the string
335 refers to an external entity, this reference cannot be followed in this
339 STYLE="list-style-type: disc"
341 >A resolver for file names. The <TT
345 name is interpreted as a file URL with the slash "/" as the separator for
346 directories. This resolver is derived from the generic URL resolver.</P
351 The interface a resolver must have is documented, so it is possible to write
352 your own resolver. For example, you could connect the parser with an HTTP
353 client, and resolve URLs of the HTTP namespace. The resolver classes allow
354 several independent resolvers to be combined into one more powerful resolver;
355 thus it is possible to combine a self-written resolver with the already
356 existing resolvers.</P
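><P
>The idea of combining resolvers can be sketched as follows. This is not the
interface of the PXP resolver classes; it assumes, purely for illustration,
that a resolver is a function from a system name to the optional text of the
entity. The names <TT>resolver</TT>, <TT>combine</TT> and
<TT>from_table</TT> are hypothetical.</P
>

```ocaml
(* Assumed model: a resolver maps a system name to the entity text,
   or returns None if it is not responsible for that name. *)
type resolver = string -> string option

(* Try the resolvers in order; the first one that succeeds wins. *)
let rec combine (resolvers : resolver list) : resolver =
  fun id ->
    match resolvers with
    | [] -> None
    | r :: rest ->
        (match r id with
         | Some _ as hit -> hit
         | None -> combine rest id)

(* A toy resolver backed by an association list. *)
let from_table (table : (string * string) list) : resolver =
  fun id -> List.assoc_opt id table

let () =
  let own = from_table [ "mem:greeting", "hello" ] in
  let existing = from_table [ "mem:greeting", "ignored"; "mem:other", "x" ] in
  let r = combine [ own; existing ] in
  assert (r "mem:greeting" = Some "hello");  (* first resolver answers *)
  assert (r "mem:other" = Some "x");         (* falls through to the second *)
  assert (r "mem:unknown" = None)            (* nobody is responsible *)
```

<P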
358 >Note that the existing resolvers only interpret <TT
365 > names. If it helps you, it is possible to
366 define resolvers for <TT
369 > names, too; for example, such a
370 resolver could look up the public name in a hash table, and map it to a system
371 name which is passed on to the existing resolver for system names. It is
372 relatively simple to provide such a resolver.</P
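><P
>A minimal sketch of such a lookup, under stated assumptions: the table
contents, the function name <TT>resolve_public</TT> and the
<TT>resolve_system</TT> argument (which stands in for an existing resolver for
system names) are all hypothetical and not part of PXP.</P
>

```ocaml
(* Hypothetical catalog mapping PUBLIC names to SYSTEM names. *)
let catalog : (string, string) Hashtbl.t = Hashtbl.create 16

let () =
  (* Example entry; both strings are made up for this sketch. *)
  Hashtbl.replace catalog "-//EXAMPLE//DTD Sample 1.0//EN" "dtds/sample.dtd"

(* Look up the public name; if it is known, pass the system name on to
   [resolve_system], otherwise report that nothing was found. *)
let resolve_public ~resolve_system public_id =
  match Hashtbl.find_opt catalog public_id with
  | Some system_id -> resolve_system system_id
  | None -> None

let () =
  (* A stand-in system resolver that only echoes the name it was given. *)
  let resolve_system name = Some ("contents of " ^ name) in
  assert (resolve_public ~resolve_system "-//EXAMPLE//DTD Sample 1.0//EN"
          = Some "contents of dtds/sample.dtd");
  assert (resolve_public ~resolve_system "-//UNKNOWN//EN" = None)
```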
415 >Details of the mapping from XML text to the tree representation</TD
428 >Resolvers and sources</TD