helm/DEVEL/pxp/pxp/doc/manual/html/x550.html

   1 <HTML
   2 ><HEAD
   3 ><TITLE
   4 >How to parse a document from an application</TITLE
   5 ><META
   6 NAME="GENERATOR"
   7 CONTENT="Modular DocBook HTML Stylesheet Version 1.46"><LINK
   8 REL="HOME"
   9 TITLE="The PXP user's guide"
  10 HREF="index.html"><LINK
  11 REL="UP"
  12 TITLE="Using PXP"
  13 HREF="c533.html"><LINK
  14 REL="PREVIOUS"
  15 TITLE="Using PXP"
  16 HREF="c533.html"><LINK
  17 REL="NEXT"
  18 TITLE="Class-based processing of the node tree"
  19 HREF="x675.html"><LINK
  20 REL="STYLESHEET"
  21 TYPE="text/css"
  22 HREF="markup.css"></HEAD
  23 ><BODY
  24 CLASS="SECT1"
  25 BGCOLOR="#FFFFFF"
  26 TEXT="#000000"
  27 LINK="#0000FF"
  28 VLINK="#840084"
  29 ALINK="#0000FF"
  30 ><DIV
  31 CLASS="NAVHEADER"
  32 ><TABLE
  33 WIDTH="100%"
  34 BORDER="0"
  35 CELLPADDING="0"
  36 CELLSPACING="0"
  37 ><TR
  38 ><TH
  39 COLSPAN="3"
  40 ALIGN="center"
  41 >The PXP user's guide</TH
  42 ></TR
  43 ><TR
  44 ><TD
  45 WIDTH="10%"
  46 ALIGN="left"
  47 VALIGN="bottom"
  48 ><A
  49 HREF="c533.html"
  50 >Prev</A
  51 ></TD
  52 ><TD
  53 WIDTH="80%"
  54 ALIGN="center"
  55 VALIGN="bottom"
  56 >Chapter 2. Using <SPAN
  57 CLASS="ACRONYM"
  58 >PXP</SPAN
  59 ></TD
  60 ><TD
  61 WIDTH="10%"
  62 ALIGN="right"
  63 VALIGN="bottom"
  64 ><A
  65 HREF="x675.html"
  66 >Next</A
  67 ></TD
  68 ></TR
  69 ></TABLE
  70 ><HR
  71 ALIGN="LEFT"
  72 WIDTH="100%"></DIV
  73 ><DIV
  74 CLASS="SECT1"
  75 ><H1
  76 CLASS="SECT1"
  77 ><A
  78 NAME="AEN550"
  79 >2.2. How to parse a document from an application</A
  80 ></H1
  81 ><P
  82 >Let me first give a rough overview of the object model of the parser. The
  83 following items are represented by objects:
  84
  85 <P
  86 ></P
  87 ><UL
  88 COMPACT="COMPACT"
  89 ><LI
  90 STYLE="list-style-type: disc"
  91 ><P
  92 ><I
  93 CLASS="EMPHASIS"
  94 >Documents:</I
  95 > The document representation is more or less the
  96 anchor for the application; all accesses to the parsed entities start here. It
  97 is described by the class <TT
  98 CLASS="LITERAL"
  99 >document</TT
 100 > contained in the module
 101 <TT
 102 CLASS="LITERAL"
 103 >Pxp_document</TT
 104 >. You can get some global information, such
 105 as the XML declaration the document begins with, the DTD of the document,
  106 global processing instructions, and, most importantly, the document tree.</P
 107 ></LI
 108 ><LI
 109 STYLE="list-style-type: disc"
 110 ><P
 111 ><I
 112 CLASS="EMPHASIS"
 113 >The contents of documents:</I
 114 > The contents have the structure
 115 of a tree: Elements contain other elements and text<A
 116 NAME="AEN562"
 117 HREF="#FTN.AEN562"
 118 >[1]</A
 119 >.
 120
 121 The common type to represent both kinds of content is <TT
 122 CLASS="LITERAL"
 123 >node</TT
 124 >
 125 which is a class type that unifies the properties of elements and character
 126 data. Every node has a list of children (which is empty if the element is empty
  127 or the node represents text); nodes may have attributes; nodes always have text
 128 contents. There are two implementations of <TT
 129 CLASS="LITERAL"
 130 >node</TT
 131 >, the class
 132 <TT
 133 CLASS="LITERAL"
 134 >element_impl</TT
 135 > for elements, and the class
 136 <TT
 137 CLASS="LITERAL"
 138 >data_impl</TT
 139 > for text data. You find these classes and class
 140 types in the module <TT
 141 CLASS="LITERAL"
 142 >Pxp_document</TT
 143 >, too.</P
 144 ><P
 145 >Note that attribute lists are represented by non-class values.</P
 146 ></LI
 147 ><LI
 148 STYLE="list-style-type: disc"
 149 ><P
 150 ><I
 151 CLASS="EMPHASIS"
 152 >The node extension:</I
 153 > For advanced usage, every node of the
 154 document may have an associated <I
 155 CLASS="EMPHASIS"
 156 >extension</I
 157 > which is simply
 158 a second object. This object must have the three methods
 159 <TT
 160 CLASS="LITERAL"
 161 >clone</TT
 162 >, <TT
 163 CLASS="LITERAL"
 164 >node</TT
 165 >, and
 166 <TT
 167 CLASS="LITERAL"
 168 >set_node</TT
  169 > as a bare minimum, but you are free to add methods as
 170 you want. This is the preferred way to add functionality to the document
 171 tree<A
 172 NAME="AEN582"
 173 HREF="#FTN.AEN582"
 174 >[2]</A
 175 >. The class type <TT
 176 CLASS="LITERAL"
 177 >extension</TT
 178 > is
 179 defined in <TT
 180 CLASS="LITERAL"
 181 >Pxp_document</TT
 182 >, too.</P
 183 ></LI
 184 ><LI
 185 STYLE="list-style-type: disc"
 186 ><P
 187 ><I
 188 CLASS="EMPHASIS"
 189 >The DTD:</I
 190 > Sometimes it is necessary to access the DTD of a
 191 document; the average application does not need this feature. The class
 192 <TT
 193 CLASS="LITERAL"
 194 >dtd</TT
 195 > describes DTDs, and makes it possible to get
 196 representations of element, entity, and notation declarations as well as
 197 processing instructions contained in the DTD. This class, and
 198 <TT
 199 CLASS="LITERAL"
 200 >dtd_element</TT
 201 >, <TT
 202 CLASS="LITERAL"
 203 >dtd_notation</TT
 204 >, and
 205 <TT
 206 CLASS="LITERAL"
 207 >proc_instruction</TT
 208 > can be found in the module
 209 <TT
 210 CLASS="LITERAL"
 211 >Pxp_dtd</TT
 212 >. There are a couple of classes representing
 213 different kinds of entities; these can be found in the module
 214 <TT
 215 CLASS="LITERAL"
 216 >Pxp_entity</TT
 217 >. </P
 218 ></LI
 219 ></UL
 220 >
 221
 222 Additionally, the following modules play a role:
 223
 224 <P
 225 ></P
 226 ><UL
 227 COMPACT="COMPACT"
 228 ><LI
 229 STYLE="list-style-type: disc"
 230 ><P
 231 ><I
 232 CLASS="EMPHASIS"
 233 >Pxp_yacc:</I
 234 > Here the main parsing functions such as
 235 <TT
 236 CLASS="LITERAL"
 237 >parse_document_entity</TT
 238 > are located. Some additional types and
 239 functions allow the parser to be configured in a non-standard way.</P
 240 ></LI
 241 ><LI
 242 STYLE="list-style-type: disc"
 243 ><P
 244 ><I
 245 CLASS="EMPHASIS"
 246 >Pxp_types:</I
 247 > This is a collection of basic types and
 248 exceptions. </P
 249 ></LI
 250 ></UL
 251 >
 252
 253 There are some further modules that are needed internally but are not part of
 254 the API.</P
 255 ><P
 256 >Let the document to be parsed be stored in a file called
 257 <TT
 258 CLASS="LITERAL"
 259 >doc.xml</TT
 260 >. The parsing process is started by calling the
 261 function
 262
 263 <PRE
 264 CLASS="PROGRAMLISTING"
 265 >val parse_document_entity : config -&#62; source -&#62; 'ext spec -&#62; 'ext document</PRE
 266 >
 267
 268 defined in the module <TT
 269 CLASS="LITERAL"
 270 >Pxp_yacc</TT
 271 >. The first argument
 272 specifies some global properties of the parser; it is recommended to start with
 273 the <TT
 274 CLASS="LITERAL"
 275 >default_config</TT
 276 >. The second argument determines where the
 277 document to be parsed comes from; this may be a file, a channel, or an entity
 278 ID. To parse <TT
 279 CLASS="LITERAL"
 280 >doc.xml</TT
 281 >, it is sufficient to pass
 282 <TT
 283 CLASS="LITERAL"
 284 >from_file "doc.xml"</TT
 285 >. </P
 286 ><P
 287 >The third argument passes the object specification to use. Roughly
 288 speaking, it determines which classes implement the node objects of which
 289 element types, and which extensions are to be used. The <TT
 290 CLASS="LITERAL"
 291 >'ext</TT
 292 >
 293 polymorphic variable is the type of the extension. For the moment, let us
 294 simply pass <TT
 295 CLASS="LITERAL"
 296 >default_spec</TT
 297 > as this argument, and ignore it.</P
 298 ><P
 299 >So the following expression parses <TT
 300 CLASS="LITERAL"
 301 >doc.xml</TT
 302 >:
 303
 304 <PRE
 305 CLASS="PROGRAMLISTING"
 306 >open Pxp_yacc
 307 let d = parse_document_entity default_config (from_file "doc.xml") default_spec</PRE
 308 >
 309
 310 Note that <TT
 311 CLASS="LITERAL"
 312 >default_config</TT
 313 > implies that warnings are collected
  314 but not printed. Errors raise one of the exceptions defined in
 315 <TT
 316 CLASS="LITERAL"
 317 >Pxp_types</TT
  318 >; to get readable errors and warnings, catch the
 319 exceptions as follows:
 320
 321 <PRE
 322 CLASS="PROGRAMLISTING"
 323 >class warner =
 324   object
 325     method warn w =
 326       print_endline ("WARNING: " ^ w)
 327   end
 328 ;;
 329
 330 try
 331   let config = { default_config with warner = new warner } in
 332   let d = parse_document_entity config (from_file "doc.xml") default_spec
 333   in
 334     ...
 335 with
 336    e -&#62;
 337      print_endline (Pxp_types.string_of_exn e)</PRE
 338 >
 339
 340 Now <TT
 341 CLASS="LITERAL"
 342 >d</TT
 343 > is an object of the <TT
 344 CLASS="LITERAL"
 345 >document</TT
 346 >
 347 class. If you want the node tree, you can get the root element by
 348
 349 <PRE
 350 CLASS="PROGRAMLISTING"
 351 >let root = d # root</PRE
 352 >
 353
  354 and if you would rather access the DTD, you can get it by
 355
 356 <PRE
 357 CLASS="PROGRAMLISTING"
 358 >let dtd = d # dtd</PRE
 359 >
 360
 361 As it is more interesting, let us investigate the node tree now. Given the root
 362 element, it is possible to recursively traverse the whole tree. The children of
 363 a node <TT
 364 CLASS="LITERAL"
 365 >n</TT
 366 > are returned by the method
 367 <TT
 368 CLASS="LITERAL"
 369 >sub_nodes</TT
 370 >, and the type of a node is returned by
 371 <TT
 372 CLASS="LITERAL"
 373 >node_type</TT
 374 >. This function traverses the tree, and prints the
 375 type of each node:
 376
 377 <PRE
 378 CLASS="PROGRAMLISTING"
 379 >let rec print_structure n =
 380   let ntype = n # node_type in
 381   match ntype with
 382     T_element name -&#62;
 383       print_endline ("Element of type " ^ name);
 384       let children = n # sub_nodes in
 385       List.iter print_structure children
 386   | T_data -&#62;
 387       print_endline "Data"
 388   | _ -&#62;
 389       (* Other node types are not possible unless the parser is configured
 390          differently.
 391        *)
 392       assert false</PRE
 393 >
 394
 395 You can call this function by
 396
 397 <PRE
 398 CLASS="PROGRAMLISTING"
 399 >print_structure root</PRE
 400 >
 401
 402 The type returned by <TT
 403 CLASS="LITERAL"
 404 >node_type</TT
 405 > is either <TT
 406 CLASS="LITERAL"
 407 >T_element
 408 name</TT
 409 > or <TT
 410 CLASS="LITERAL"
 411 >T_data</TT
 412 >. The <TT
 413 CLASS="LITERAL"
 414 >name</TT
 415 > of the
 416 element type is the string included in the angle brackets. Note that only
 417 elements have children; data nodes are always leaves of the tree.</P
 418 ><P
  419 >There are some more methods for accessing a parsed node tree:
 420
 421 <P
 422 ></P
 423 ><UL
 424 COMPACT="COMPACT"
 425 ><LI
 426 STYLE="list-style-type: disc"
 427 ><P
 428 ><TT
 429 CLASS="LITERAL"
 430 >n # parent</TT
 431 >: Returns the parent node, or raises
 432 <TT
 433 CLASS="LITERAL"
 434 >Not_found</TT
  435 > if the node is already the root.</P
 436 ></LI
 437 ><LI
 438 STYLE="list-style-type: disc"
 439 ><P
 440 ><TT
 441 CLASS="LITERAL"
 442 >n # root</TT
 443 >: Returns the root of the node tree. </P
 444 ></LI
 445 ><LI
 446 STYLE="list-style-type: disc"
 447 ><P
 448 ><TT
 449 CLASS="LITERAL"
 450 >n # attribute a</TT
 451 >: Returns the value of the attribute with
 452 name <TT
 453 CLASS="LITERAL"
 454 >a</TT
 455 >. The method returns a value for every
 456 <I
 457 CLASS="EMPHASIS"
 458 >declared</I
 459 > attribute, independently of whether the attribute
 460 instance is defined or not. If the attribute is not declared,
 461 <TT
 462 CLASS="LITERAL"
 463 >Not_found</TT
 464 > will be raised. (In well-formedness mode, every
 465 attribute is considered as being implicitly declared with type
 466 <TT
 467 CLASS="LITERAL"
 468 >CDATA</TT
 469 >.) </P
 470 ><P
 471 >The following return values are possible: <TT
 472 CLASS="LITERAL"
 473 >Value s</TT
 474 >,
 475 <TT
 476 CLASS="LITERAL"
 477 >Valuelist sl</TT
  478 >, and <TT
 479 CLASS="LITERAL"
 480 >Implied_value</TT
 481 >.
 482 The first two value types indicate that the attribute value is available,
 483 either because there is a definition
 484 <TT
 485 CLASS="LITERAL"
 486 ><TT
 487 CLASS="REPLACEABLE"
 488 ><I
 489 >a</I
 490 ></TT
 491 >="<TT
 492 CLASS="REPLACEABLE"
 493 ><I
 494 >value</I
 495 ></TT
 496 >"</TT
 497 >
 498 in the XML text, or because there is a default value (declared in the
  499 DTD). The third value, <TT
  500 CLASS="LITERAL"
  501 >Implied_value</TT
  502 >, is returned only if both the instance definition and the default
  503 declaration are missing.</P
 504 ><P
 505 >In the DTD, every attribute is typed. There are single-value types (CDATA, ID,
 506 IDREF, ENTITY, NMTOKEN, enumerations), in which case the method passes
 507 <TT
 508 CLASS="LITERAL"
 509 >Value s</TT
 510 > back, where <TT
 511 CLASS="LITERAL"
 512 >s</TT
 513 > is the normalized
 514 string value of the attribute. The other types (IDREFS, ENTITIES, NMTOKENS)
 515 represent list values, and the parser splits the XML literal into several
 516 tokens and returns these tokens as <TT
 517 CLASS="LITERAL"
 518 >Valuelist sl</TT
 519 >.</P
 520 ><P
 521 >Normalization means that entity references (the
 522 <TT
 523 CLASS="LITERAL"
 524 >&amp;<TT
 525 CLASS="REPLACEABLE"
 526 ><I
 527 >name</I
 528 ></TT
 529 >;</TT
 530 > tokens) and
 531 character references
 532 (<TT
 533 CLASS="LITERAL"
 534 >&amp;#<TT
 535 CLASS="REPLACEABLE"
 536 ><I
 537 >number</I
 538 ></TT
 539 >;</TT
 540 >) are replaced
 541 by the text they represent, and that white space characters are converted into
 542 plain spaces.</P
 543 ></LI
 544 ><LI
 545 STYLE="list-style-type: disc"
 546 ><P
 547 ><TT
 548 CLASS="LITERAL"
 549 >n # data</TT
 550 >: Returns the character data contained in the
 551 node. For data nodes, the meaning is obvious as this is the main content of
 552 data nodes. For element nodes, this method returns the concatenated contents of
 553 all inner data nodes.</P
 554 ><P
 555 >Note that entity references included in the text are resolved while they are
 556 being parsed; for example the text "a &#38;lt;&#38;gt; b" will be returned
 557 as "a &#60;&#62; b" by this method. Spaces of data nodes are always
 558 preserved. Newlines are preserved, but always converted to \n characters even
 559 if newlines are encoded as \r\n or \r. Normally you will never see two adjacent
 560 data nodes because the parser collapses all data material at one location into
 561 one node. (However, if you create your own tree or transform the parsed tree,
 562 it is possible to have adjacent data nodes.)</P
 563 ><P
 564 >Note that elements that do <I
 565 CLASS="EMPHASIS"
 566 >not</I
 567 > allow #PCDATA as content
 568 will not have data nodes as children. This means that spaces and newlines, the
 569 only character material allowed for such elements, are silently dropped.</P
 570 ></LI
 571 ></UL
 572 >
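
The three possible results of <TT
CLASS="LITERAL"
>n # attribute a</TT
> described above can be distinguished with a single "match". The following is
only a minimal sketch and is not taken from the PXP sources: the type
<TT
CLASS="LITERAL"
>att_value</TT
> is declared locally as a stand-in so that the fragment runs without PXP (the
constructor names are those used in this section; the type name and the helper
<TT
CLASS="LITERAL"
>string_of_att_value</TT
> are assumptions made for illustration).

<PRE
CLASS="PROGRAMLISTING"
>(* Local stand-in for the attribute value type. In a real program the
   constructors come from PXP, and this declaration must be dropped.
 *)
type att_value =
    Value of string
  | Valuelist of string list
  | Implied_value

(* Hypothetical helper: render an attribute value as a printable string *)
let string_of_att_value v =
  match v with
    Value s -&#62;
      s
  | Valuelist sl -&#62;
      (* List-valued attributes: join the tokens with single spaces *)
      String.concat " " sl
  | Implied_value -&#62;
      (* Neither an instance definition nor a default value exists *)
      "(implied)"

let () =
  List.iter
    (fun v -&#62; print_endline (string_of_att_value v))
    [ Value "1"; Valuelist ["a"; "b"]; Implied_value ]</PRE
>

With PXP, such a helper would presumably be applied to the result of
<TT
CLASS="LITERAL"
>n # attribute a</TT
>, inside a handler for <TT
CLASS="LITERAL"
>Not_found</TT
> if the attribute may be undeclared.
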
 573
 574 For example, if the task is to print all contents of elements with type
 575 "valuable" whose attribute "priority" is "1", this function can help:
 576
 577 <PRE
 578 CLASS="PROGRAMLISTING"
 579 >let rec print_valuable_prio1 n =
 580   let ntype = n # node_type in
 581   match ntype with
 582     T_element "valuable" when n # attribute "priority" = Value "1" -&#62;
  583       print_endline "Valuable node with priority 1 found:";
 584       print_endline (n # data)
 585   | (T_element _ | T_data) -&#62;
 586       let children = n # sub_nodes in
 587       List.iter print_valuable_prio1 children
 588   | _ -&#62;
 589       assert false</PRE
 590 >
 591
 592 You can call this function by:
 593
 594 <PRE
 595 CLASS="PROGRAMLISTING"
 596 >print_valuable_prio1 root</PRE
 597 >
 598
  599 If you prefer a DSSSL-like style, you can make the function
 600 <TT
 601 CLASS="LITERAL"
 602 >process_children</TT
 603 > explicit:
 604
 605 <PRE
 606 CLASS="PROGRAMLISTING"
 607 >let rec print_valuable_prio1 n =
 608
 609   let process_children n =
 610     let children = n # sub_nodes in
 611     List.iter print_valuable_prio1 children
 612   in
 613
 614   let ntype = n # node_type in
 615   match ntype with
 616     T_element "valuable" when n # attribute "priority" = Value "1" -&#62;
 617       print_endline "Valuable node with priority 1 found:";
 618       print_endline (n # data)
 619   | (T_element _ | T_data) -&#62;
 620       process_children n
 621   | _ -&#62;
 622       assert false</PRE
 623 >
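
If the matching nodes are needed as a value rather than as printed output, the
same recursion can return a list instead. The sketch below is self-contained
and rests on stated assumptions: the class <TT
CLASS="LITERAL"
>toy_node</TT
> and the two variant types are local stand-ins, not PXP classes, and mimic
only the four methods used in this section; the names
<TT
CLASS="LITERAL"
>collect_valuable_prio1</TT
>, <TT
CLASS="LITERAL"
>text</TT
>, and <TT
CLASS="LITERAL"
>elem</TT
> are likewise hypothetical.

<PRE
CLASS="PROGRAMLISTING"
>(* Local stand-ins so that the sketch runs without PXP *)
type node_type = T_element of string | T_data
type att_value = Value of string | Valuelist of string list | Implied_value

(* toy_node is NOT a PXP class; it only mimics the methods used here *)
class toy_node (ntype : node_type) (atts : (string * att_value) list)
               (text : string) (children : toy_node list) =
  object
    method node_type = ntype
    method sub_nodes = children
    (* Raises Not_found for attributes that are not in the list *)
    method attribute name = List.assoc name atts
    (* Data nodes return their text; elements concatenate their children *)
    method data =
      match ntype with
        T_data -&#62;
          text
      | T_element _ -&#62;
          String.concat "" (List.map (fun c -&#62; c # data) children)
  end

(* Return the data of all "valuable" elements with priority 1,
   in document order, instead of printing them.
 *)
let rec collect_valuable_prio1 (n : toy_node) =
  match n # node_type with
    T_element "valuable" when n # attribute "priority" = Value "1" -&#62;
      [ n # data ]
  | T_element _ | T_data -&#62;
      List.concat (List.map collect_valuable_prio1 (n # sub_nodes))

(* Hypothetical constructors for building a small test tree *)
let text s = new toy_node T_data [] s []
let elem name atts children = new toy_node (T_element name) atts "" children

let root =
  elem "list" []
    [ elem "valuable" ["priority", Value "1"] [ text "gold" ];
      elem "valuable" ["priority", Value "2"] [ text "tin" ] ]

let () = List.iter print_endline (collect_valuable_prio1 root)</PRE
>

Returning a list keeps the traversal free of side effects, so the caller
decides what to do with the matches. Only the body of
<TT
CLASS="LITERAL"
>collect_valuable_prio1</TT
> is meant to carry over to real PXP nodes; the type annotation on
<TT
CLASS="LITERAL"
>n</TT
> would have to be dropped or changed.
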
 624
  625 Used in this way, O'Caml already serves as a simple "style-sheet language": you
  626 can form a big "match" expression to distinguish between all significant cases,
  627 and provide different reactions to different conditions. But this technique has
  628 limitations; the "match" expression tends to get larger and larger, and it is
  629 difficult to store intermediate values as there is only one big
  630 recursion. Alternatively, it is also possible to represent the various cases as
  631 classes, and to use dynamic method lookup to find the appropriate class. The
  632 next section explains this technique in detail.</P
 633 ></DIV
 634 ><H3
 635 CLASS="FOOTNOTES"
 636 >Notes</H3
 637 ><TABLE
 638 BORDER="0"
 639 CLASS="FOOTNOTES"
 640 WIDTH="100%"
 641 ><TR
 642 ><TD
 643 ALIGN="LEFT"
 644 VALIGN="TOP"
 645 WIDTH="5%"
 646 ><A
 647 NAME="FTN.AEN562"
 648 HREF="x550.html#AEN562"
 649 >[1]</A
 650 ></TD
 651 ><TD
 652 ALIGN="LEFT"
 653 VALIGN="TOP"
 654 WIDTH="95%"
 655 ><P
 656 >Elements may
 657 also contain processing instructions. Unlike other document models, <SPAN
 658 CLASS="ACRONYM"
 659 >PXP</SPAN
 660 >
 661 separates processing instructions from the rest of the text and provides a
 662 second interface to access them (method <TT
 663 CLASS="LITERAL"
 664 >pinstr</TT
 665 >). However,
 666 there is a parser option (<TT
 667 CLASS="LITERAL"
 668 >enable_pinstr_nodes</TT
 669 >) which changes
 670 the behaviour of the parser such that extra nodes for processing instructions
 671 are included into the tree.</P
 672 ><P
  673 >Furthermore, the tree normally does not contain nodes for XML comments;
 674 they are ignored by default. Again, there is an option
 675 (<TT
 676 CLASS="LITERAL"
 677 >enable_comment_nodes</TT
 678 >) changing this.</P
 679 ></TD
 680 ></TR
 681 ><TR
 682 ><TD
 683 ALIGN="LEFT"
 684 VALIGN="TOP"
 685 WIDTH="5%"
 686 ><A
 687 NAME="FTN.AEN582"
 688 HREF="x550.html#AEN582"
 689 >[2]</A
 690 ></TD
 691 ><TD
 692 ALIGN="LEFT"
 693 VALIGN="TOP"
 694 WIDTH="95%"
 695 ><P
 696 >Due to the typing system it is more or less impossible to
 697 derive recursive classes in O'Caml. To get around this, it is common practice
 698 to put the modifiable or extensible part of recursive objects into parallel
 699 objects.</P
 700 ></TD
 701 ></TR
 702 ></TABLE
 703 ><DIV
 704 CLASS="NAVFOOTER"
 705 ><HR
 706 ALIGN="LEFT"
 707 WIDTH="100%"><TABLE
 708 WIDTH="100%"
 709 BORDER="0"
 710 CELLPADDING="0"
 711 CELLSPACING="0"
 712 ><TR
 713 ><TD
 714 WIDTH="33%"
 715 ALIGN="left"
 716 VALIGN="top"
 717 ><A
 718 HREF="c533.html"
 719 >Prev</A
 720 ></TD
 721 ><TD
 722 WIDTH="34%"
 723 ALIGN="center"
 724 VALIGN="top"
 725 ><A
 726 HREF="index.html"
 727 >Home</A
 728 ></TD
 729 ><TD
 730 WIDTH="33%"
 731 ALIGN="right"
 732 VALIGN="top"
 733 ><A
 734 HREF="x675.html"
 735 >Next</A
 736 ></TD
 737 ></TR
 738 ><TR
 739 ><TD
 740 WIDTH="33%"
 741 ALIGN="left"
 742 VALIGN="top"
 743 >Using <SPAN
 744 CLASS="ACRONYM"
 745 >PXP</SPAN
 746 ></TD
 747 ><TD
 748 WIDTH="34%"
 749 ALIGN="center"
 750 VALIGN="top"
 751 ><A
 752 HREF="c533.html"
 753 >Up</A
 754 ></TD
 755 ><TD
 756 WIDTH="33%"
 757 ALIGN="right"
 758 VALIGN="top"
 759 >Class-based processing of the node tree</TD
 760 ></TR
 761 ></TABLE
 762 ></DIV
 763 ></BODY
 764 ></HTML
 765 >