helm/DEVEL/pxp/pxp/doc/manual/html/c1567.html

   1 <HTML
   2 ><HEAD
   3 ><TITLE
   4 >Configuring and calling the parser</TITLE
   5 ><META
   6 NAME="GENERATOR"
   7 CONTENT="Modular DocBook HTML Stylesheet Version 1.46"><LINK
   8 REL="HOME"
   9 TITLE="The PXP user's guide"
  10 HREF="index.html"><LINK
  11 REL="UP"
  12 TITLE="User's guide"
  13 HREF="p34.html"><LINK
  14 REL="PREVIOUS"
  15 TITLE="Details of the mapping from XML text to the tree representation"
  16 HREF="x1496.html"><LINK
  17 REL="NEXT"
  18 TITLE="Resolvers and sources"
  19 HREF="x1629.html"><LINK
  20 REL="STYLESHEET"
  21 TYPE="text/css"
  22 HREF="markup.css"></HEAD
  23 ><BODY
  24 CLASS="CHAPTER"
  25 BGCOLOR="#FFFFFF"
  26 TEXT="#000000"
  27 LINK="#0000FF"
  28 VLINK="#840084"
  29 ALINK="#0000FF"
  30 ><DIV
  31 CLASS="NAVHEADER"
  32 ><TABLE
  33 WIDTH="100%"
  34 BORDER="0"
  35 CELLPADDING="0"
  36 CELLSPACING="0"
  37 ><TR
  38 ><TH
  39 COLSPAN="3"
  40 ALIGN="center"
  41 >The PXP user's guide</TH
  42 ></TR
  43 ><TR
  44 ><TD
  45 WIDTH="10%"
  46 ALIGN="left"
  47 VALIGN="bottom"
  48 ><A
  49 HREF="x1496.html"
  50 >Prev</A
  51 ></TD
  52 ><TD
  53 WIDTH="80%"
  54 ALIGN="center"
  55 VALIGN="bottom"
  56 ></TD
  57 ><TD
  58 WIDTH="10%"
  59 ALIGN="right"
  60 VALIGN="bottom"
  61 ><A
  62 HREF="x1629.html"
  63 >Next</A
  64 ></TD
  65 ></TR
  66 ></TABLE
  67 ><HR
  68 ALIGN="LEFT"
  69 WIDTH="100%"></DIV
  70 ><DIV
  71 CLASS="CHAPTER"
  72 ><H1
  73 ><A
  74 NAME="AEN1567"
  75 >Chapter 4. Configuring and calling the parser</A
  76 ></H1
  77 ><DIV
  78 CLASS="TOC"
  79 ><DL
  80 ><DT
  81 ><B
  82 >Table of Contents</B
  83 ></DT
  84 ><DT
  85 >4.1. <A
  86 HREF="c1567.html#AEN1569"
  87 >Overview</A
  88 ></DT
  89 ><DT
  90 >4.2. <A
  91 HREF="x1629.html"
  92 >Resolvers and sources</A
  93 ></DT
  94 ><DT
  95 >4.3. <A
  96 HREF="x1812.html"
  97 >The DTD classes</A
  98 ></DT
  99 ><DT
 100 >4.4. <A
 101 HREF="x1818.html"
 102 >Invoking the parser</A
 103 ></DT
 104 ><DT
 105 >4.5. <A
 106 HREF="x1965.html"
 107 >Updates</A
 108 ></DT
 109 ></DL
 110 ></DIV
 111 ><DIV
 112 CLASS="SECT1"
 113 ><H1
 114 CLASS="SECT1"
 115 ><A
 116 NAME="AEN1569"
 117 >4.1. Overview</A
 118 ></H1
 119 ><P
  120 >The following main functions invoke the parser (all are defined in Pxp_yacc):
 121
 122           <P
 123 ></P
 124 ><UL
 125 COMPACT="COMPACT"
 126 ><LI
 127 STYLE="list-style-type: disc"
 128 ><P
 129 ><I
 130 CLASS="EMPHASIS"
 131 >parse_document_entity:</I
 132 > You want to
 133 parse a complete and closed document consisting of a DTD and the document body;
  134 the body is validated against the DTD. This mode is useful if you have a
 135 file
 136
 137 <PRE
 138 CLASS="PROGRAMLISTING"
 139 >&#60;!DOCTYPE root ... [ ... ] &#62; &#60;root&#62; ... &#60;/root&#62;</PRE
 140 >
 141
 142 and you can accept any DTD that is included in the file (e.g. because the file
 143 is under your control).</P
 144 ></LI
 145 ><LI
 146 STYLE="list-style-type: disc"
 147 ><P
 148 ><I
 149 CLASS="EMPHASIS"
 150 >parse_wfdocument_entity:</I
 151 > You want to
 152 parse a complete and closed document consisting of a DTD and the document body;
 153 but the body is not validated, only checked for well-formedness. This mode is
 154 preferred if validation costs too much time or if the DTD is missing.</P
 155 ></LI
 156 ><LI
 157 STYLE="list-style-type: disc"
 158 ><P
 159 ><I
 160 CLASS="EMPHASIS"
 161 >parse_dtd_entity:</I
  162 > You only want to
  163 parse an entity (file) containing the external subset of a DTD. Sometimes it is
  164 useful to read such a DTD, for example to compare it with the DTD included
  165 in a document, or to apply the next mode:</P
 166 ></LI
 167 ><LI
 168 STYLE="list-style-type: disc"
 169 ><P
 170 ><I
 171 CLASS="EMPHASIS"
 172 >parse_content_entity:</I
  173 > You only want to
  174 parse an entity (file) containing a fragment of a document body; this fragment
  175 is validated against the DTD you pass to the function. In particular, the fragment
  176 must not have a <TT
  177 CLASS="LITERAL"
  178 >&lt;!DOCTYPE&gt;</TT
  179 > clause, and must directly
  180 begin with an element. This mode is
  181 useful if you want to check documents against a fixed, immutable DTD.</P
 182 ></LI
 183 ><LI
 184 STYLE="list-style-type: disc"
 185 ><P
 186 ><I
 187 CLASS="EMPHASIS"
 188 >parse_wfcontent_entity:</I
 189 > This function
  190 also parses a single element without a DTD, but does not validate it.</P
 191 ></LI
 192 ><LI
 193 STYLE="list-style-type: disc"
 194 ><P
 195 ><I
 196 CLASS="EMPHASIS"
 197 >extract_dtd_from_document_entity:</I
 198 > This
 199 function extracts the DTD from a closed document consisting of a DTD and a
 200 document body. Both the internal and the external subsets are extracted.</P
 201 ></LI
 202 ></UL
 203 ></P
 204 ><P
 205 >In many cases, <TT
 206 CLASS="LITERAL"
 207 >parse_document_entity</TT
 208 > is the preferred mode
 209 to parse a document in a validating way, and
 210 <TT
 211 CLASS="LITERAL"
 212 >parse_wfdocument_entity</TT
 213 > is the mode of choice to parse a
 214 file while only checking for well-formedness.</P
 215 ><P
 216 >There are a number of variations of these modes. One important application of a
  217 parser is to check documents from an untrusted source against a fixed DTD. One
  218 solution is to not allow the <TT
  219 CLASS="LITERAL"
  220 >&lt;!DOCTYPE&gt;</TT
  221 > clause in
  222 these documents, and to treat the document like a fragment (using mode
 223 <I
 224 CLASS="EMPHASIS"
 225 >parse_content_entity</I
 226 >). This is very simple, but
 227 inflexible; users of such a system cannot even define additional entities to
 228 abbreviate frequent phrases of their text.</P
 229 ><P
  230 >It may be necessary to have a more intelligent checker. For example, it is also
  231 possible to parse the document under test in full, i.e. including its DTD, and to
  232 compare this DTD with the prescribed one. To parse the document in full, mode
  233 <I
  234 CLASS="EMPHASIS"
  235 >parse_document_entity</I
  236 > is applied; to obtain the prescribed DTD for the
  237 comparison, mode <I
  238 CLASS="EMPHASIS"
  239 >parse_dtd_entity</I
  240 > can be used.</P
 241 ><P
 242 >There is another very important configurable aspect of the parser: the
 243 so-called resolver. The task of the resolver is to locate the contents of an
 244 (external) entity for a given entity name, and to make the contents accessible
 245 as a character stream. (Furthermore, it also normalizes the character set;
  246 but this is a detail we can ignore here.) Suppose you have a file called
 247 <TT
 248 CLASS="LITERAL"
 249 >"main.xml"</TT
 250 > containing
 251
 252 <PRE
 253 CLASS="PROGRAMLISTING"
 254 >&#60;!ENTITY % sub SYSTEM "sub/sub.xml"&#62;
 255 %sub;</PRE
 256 >
 257
 258 and a file stored in the subdirectory <TT
 259 CLASS="LITERAL"
 260 >"sub"</TT
 261 > with name
 262 <TT
 263 CLASS="LITERAL"
 264 >"sub.xml"</TT
 265 > containing
 266
 267 <PRE
 268 CLASS="PROGRAMLISTING"
 269 >&#60;!ENTITY % subsub SYSTEM "subsub/subsub.xml"&#62;
 270 %subsub;</PRE
 271 >
 272
 273 and a file stored in the subdirectory <TT
 274 CLASS="LITERAL"
 275 >"subsub"</TT
 276 > of
 277 <TT
 278 CLASS="LITERAL"
 279 >"sub"</TT
 280 > with name <TT
 281 CLASS="LITERAL"
 282 >"subsub.xml"</TT
 283 > (the
 284 contents of this file do not matter). Here, the resolver must track that
 285 the second entity <TT
 286 CLASS="LITERAL"
 287 >subsub</TT
 288 > is located in the directory
 289 <TT
 290 CLASS="LITERAL"
 291 >"sub/subsub"</TT
 292 >, i.e. the difficulty is to interpret the
 293 system (file) names of entities relative to the entities containing them,
 294 even if the entities are deeply nested.</P
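><P
>The bookkeeping described above can be sketched in plain O'Caml without any
PXP code. The helper names <TT
CLASS="LITERAL"
>dir_of</TT
> and <TT
CLASS="LITERAL"
>resolve_relative</TT
> are hypothetical and not part of the parser; the sketch assumes that system
names are file paths with the slash "/" as separator.</P
><PRE
CLASS="PROGRAMLISTING"
>
```ocaml
(* Hypothetical sketch, not part of PXP: interpret a SYSTEM name relative
   to the entity that contains the reference. *)

(* Directory part of a "/"-separated name, including the trailing slash;
   the empty string if the name has no directory part. *)
let dir_of name =
  match String.rindex_opt name '/' with
  | Some i -> String.sub name 0 (i + 1)
  | None -> ""

(* Resolve [sysname] relative to [base], the name of the containing
   entity. Names starting with "/" are taken as absolute. *)
let resolve_relative ~base sysname =
  if String.index_opt sysname '/' = Some 0 then sysname
  else dir_of base ^ sysname

let () =
  (* main.xml refers to sub/sub.xml, which refers to subsub/subsub.xml *)
  let main = "main.xml" in
  let sub = resolve_relative ~base:main "sub/sub.xml" in
  let subsub = resolve_relative ~base:sub "subsub/subsub.xml" in
  print_endline sub;     (* prints sub/sub.xml *)
  print_endline subsub   (* prints sub/subsub/subsub.xml *)
```
</PRE
><P
>A real resolver has to do the same for URLs and must also deliver the contents
of the entity; this sketch only shows how the names are composed.</P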
 295 ><P
  296 >There is no fixed resolver that already does everything right; resolving entity
  297 names is a task that depends heavily on the environment. The XML specification
  298 only demands that <TT
  299 CLASS="LITERAL"
  300 >SYSTEM</TT
  301 > entities are interpreted like URLs
  302 (which is not very precise, as there are lots of URL schemes in use), hoping
  303 that this helps to overcome the local peculiarities of the environment; the idea
  304 is that if you do not know your environment you can refer to other entities by
  305 denoting URLs for them. I think that this interpretation of
  306 <TT
  307 CLASS="LITERAL"
  308 >SYSTEM</TT
  309 > names may have some applications on the Internet, but
  310 it is not the first choice in general. Because of this, the resolver is a
  311 separate module of the parser that can be replaced by another one if
  312 necessary; more precisely, the parser already defines several resolvers.</P
 313 ><P
  314 >The following resolvers already exist:
 315
 316           <P
 317 ></P
 318 ><UL
 319 COMPACT="COMPACT"
 320 ><LI
 321 STYLE="list-style-type: disc"
 322 ><P
 323 >Resolvers reading from arbitrary input channels. These
 324 can be configured such that a certain ID is associated with the channel; in
 325 this case inner references to external entities can be resolved. There is also
 326 a special resolver that interprets SYSTEM IDs as URLs; this resolver can
 327 process relative SYSTEM names and determine the corresponding absolute URL.</P
 328 ></LI
 329 ><LI
 330 STYLE="list-style-type: disc"
 331 ><P
  332 >A resolver that always reads from a given O'Caml
  333 string. This resolver is not able to resolve further names if the string is
  334 not associated with any name, i.e. if the document contained in the string
  335 refers to an external entity, this reference cannot be followed in this
  336 case.</P
 337 ></LI
 338 ><LI
 339 STYLE="list-style-type: disc"
 340 ><P
 341 >A resolver for file names. The <TT
 342 CLASS="LITERAL"
 343 >SYSTEM</TT
 344 >
  345 name is interpreted as a file URL with the slash "/" as separator for
  346 directories. This resolver is derived from the generic URL resolver.</P
 347 ></LI
 348 ></UL
 349 >
 350
 351 The interface a resolver must have is documented, so it is possible to write
 352 your own resolver. For example, you could connect the parser with an HTTP
  353 client, and resolve URLs of the HTTP namespace. The resolver classes allow
  354 several independent resolvers to be combined into one more powerful resolver;
 355 thus it is possible to combine a self-written resolver with the already
 356 existing resolvers.</P
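><P
>The idea of combining resolvers can be illustrated with a small model in plain
O'Caml. This is not the interface of the PXP resolver classes: the type <TT
CLASS="LITERAL"
>resolver</TT
> and the functions <TT
CLASS="LITERAL"
>combine</TT
>, <TT
CLASS="LITERAL"
>from_table</TT
> and <TT
CLASS="LITERAL"
>fake_http</TT
> are assumptions made for this sketch, and the example URL is made up.</P
><PRE
CLASS="PROGRAMLISTING"
>
```ocaml
(* Hypothetical model, not the PXP resolver classes: a resolver maps a
   SYSTEM name to the text of the entity, or returns None if it is not
   responsible for that name. *)
type resolver = string -> string option

(* Try the resolvers in order; the first one that answers wins. *)
let rec combine (resolvers : resolver list) : resolver =
  fun name ->
    match resolvers with
    | [] -> None
    | r :: rest ->
        (match r name with
         | Some text -> Some text
         | None -> combine rest name)

(* A resolver that serves entities from a fixed in-memory table. *)
let from_table (table : (string * string) list) : resolver =
  fun name -> List.assoc_opt name table

(* Stand-in for a self-written resolver backed by an HTTP client. It does
   no network access; it only recognizes names with the "http://" prefix. *)
let fake_http : resolver =
  fun name ->
    let prefix = "http://" in
    let n = String.length prefix in
    if String.length name >= n then
      (if String.sub name 0 n = prefix then Some ("fetched " ^ name) else None)
    else None

let () =
  let r = combine [ from_table [ ("a.xml", "text of a") ]; fake_http ] in
  assert (r "a.xml" = Some "text of a");
  assert (r "http://example.org/b.xml" = Some "fetched http://example.org/b.xml");
  assert (r "missing.xml" = None)
```
</PRE
><P
>The order of the list matters in this model: a name that two resolvers could
handle is served by the one listed first.</P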
 357 ><P
 358 >Note that the existing resolvers only interpret <TT
 359 CLASS="LITERAL"
 360 >SYSTEM</TT
 361 >
 362 names, not <TT
 363 CLASS="LITERAL"
 364 >PUBLIC</TT
 365 > names. If it helps you, it is possible to
 366 define resolvers for <TT
 367 CLASS="LITERAL"
 368 >PUBLIC</TT
 369 > names, too; for example, such a
 370 resolver could look up the public name in a hash table, and map it to a system
  371 name which is passed on to the existing resolver for system names. It is
 372 relatively simple to provide such a resolver.</P
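><P
>A minimal sketch of such a lookup in plain O'Caml might look as follows. The
type <TT
CLASS="LITERAL"
>ext_id</TT
>, the table <TT
CLASS="LITERAL"
>catalog</TT
> and the functions <TT
CLASS="LITERAL"
>to_system_name</TT
> and <TT
CLASS="LITERAL"
>resolve</TT
> are hypothetical names chosen for this illustration, and the local path in the
table entry is an assumption; none of this is the interface of the parser.</P
><PRE
CLASS="PROGRAMLISTING"
>
```ocaml
(* Hypothetical sketch, not part of PXP: map PUBLIC names to SYSTEM names
   through a hash table and hand the result to a resolver for system
   names. *)

(* External IDs as they appear in an entity declaration. *)
type ext_id =
  | System of string
  | Public of string * string   (* public name, system name (may be "") *)

(* The lookup table from public names to system names. *)
let catalog : (string, string) Hashtbl.t = Hashtbl.create 16

let () =
  (* Example entry; the local path is an assumption for illustration. *)
  Hashtbl.replace catalog
    "-//W3C//DTD XHTML 1.0 Strict//EN" "dtds/xhtml1-strict.dtd"

(* Reduce any ID to a system name. A public name found in the table takes
   precedence over the system name given alongside it. *)
let to_system_name = function
  | System s -> s
  | Public (p, s) ->
      (match Hashtbl.find_opt catalog p with
       | Some mapped -> mapped
       | None -> s)

(* [resolve_system] stands for an existing resolver for system names. *)
let resolve ~resolve_system id = resolve_system (to_system_name id)
```
</PRE
><P
>In this sketch an unknown public name falls back to the system name that
accompanies it, so documents that use only <TT
CLASS="LITERAL"
>SYSTEM</TT
> names are resolved as before.</P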
 373 ></DIV
 374 ></DIV
 375 ><DIV
 376 CLASS="NAVFOOTER"
 377 ><HR
 378 ALIGN="LEFT"
 379 WIDTH="100%"><TABLE
 380 WIDTH="100%"
 381 BORDER="0"
 382 CELLPADDING="0"
 383 CELLSPACING="0"
 384 ><TR
 385 ><TD
 386 WIDTH="33%"
 387 ALIGN="left"
 388 VALIGN="top"
 389 ><A
 390 HREF="x1496.html"
 391 >Prev</A
 392 ></TD
 393 ><TD
 394 WIDTH="34%"
 395 ALIGN="center"
 396 VALIGN="top"
 397 ><A
 398 HREF="index.html"
 399 >Home</A
 400 ></TD
 401 ><TD
 402 WIDTH="33%"
 403 ALIGN="right"
 404 VALIGN="top"
 405 ><A
 406 HREF="x1629.html"
 407 >Next</A
 408 ></TD
 409 ></TR
 410 ><TR
 411 ><TD
 412 WIDTH="33%"
 413 ALIGN="left"
 414 VALIGN="top"
 415 >Details of the mapping from XML text to the tree representation</TD
 416 ><TD
 417 WIDTH="34%"
 418 ALIGN="center"
 419 VALIGN="top"
 420 ><A
 421 HREF="p34.html"
 422 >Up</A
 423 ></TD
 424 ><TD
 425 WIDTH="33%"
 426 ALIGN="right"
 427 VALIGN="top"
 428 >Resolvers and sources</TD
 429 ></TR
 430 ></TABLE
 431 ></DIV
 432 ></BODY
 433 ></HTML
 434 >