helm/DEVEL/pxp/pxp/doc/manual/html/c36.html

   1 <HTML
   2 ><HEAD
   3 ><TITLE
   4 >What is XML?</TITLE
   5 ><META
   6 NAME="GENERATOR"
   7 CONTENT="Modular DocBook HTML Stylesheet Version 1.46"><LINK
   8 REL="HOME"
   9 TITLE="The PXP user's guide"
  10 HREF="index.html"><LINK
  11 REL="UP"
  12 TITLE="User's guide"
  13 HREF="p34.html"><LINK
  14 REL="PREVIOUS"
  15 TITLE="User's guide"
  16 HREF="p34.html"><LINK
  17 REL="NEXT"
  18 TITLE="Highlights of XML"
  19 HREF="x107.html"><LINK
  20 REL="STYLESHEET"
  21 TYPE="text/css"
  22 HREF="markup.css"></HEAD
  23 ><BODY
  24 CLASS="CHAPTER"
  25 BGCOLOR="#FFFFFF"
  26 TEXT="#000000"
  27 LINK="#0000FF"
  28 VLINK="#840084"
  29 ALINK="#0000FF"
  30 ><DIV
  31 CLASS="NAVHEADER"
  32 ><TABLE
  33 WIDTH="100%"
  34 BORDER="0"
  35 CELLPADDING="0"
  36 CELLSPACING="0"
  37 ><TR
  38 ><TH
  39 COLSPAN="3"
  40 ALIGN="center"
  41 >The PXP user's guide</TH
  42 ></TR
  43 ><TR
  44 ><TD
  45 WIDTH="10%"
  46 ALIGN="left"
  47 VALIGN="bottom"
  48 ><A
  49 HREF="p34.html"
  50 >Prev</A
  51 ></TD
  52 ><TD
  53 WIDTH="80%"
  54 ALIGN="center"
  55 VALIGN="bottom"
  56 ></TD
  57 ><TD
  58 WIDTH="10%"
  59 ALIGN="right"
  60 VALIGN="bottom"
  61 ><A
  62 HREF="x107.html"
  63 >Next</A
  64 ></TD
  65 ></TR
  66 ></TABLE
  67 ><HR
  68 ALIGN="LEFT"
  69 WIDTH="100%"></DIV
  70 ><DIV
  71 CLASS="CHAPTER"
  72 ><H1
  73 ><A
  74 NAME="AEN36"
  75 >Chapter 1. What is XML?</A
  76 ></H1
  77 ><DIV
  78 CLASS="TOC"
  79 ><DL
  80 ><DT
  81 ><B
  82 >Table of Contents</B
  83 ></DT
  84 ><DT
  85 >1.1. <A
  86 HREF="c36.html#AEN38"
  87 >Introduction</A
  88 ></DT
  89 ><DT
  90 >1.2. <A
  91 HREF="x107.html"
  92 >Highlights of XML</A
  93 ></DT
  94 ><DT
  95 >1.3. <A
  96 HREF="x468.html"
  97 >A complete example: The <I
  98 CLASS="EMPHASIS"
  99 >readme</I
 100 > DTD</A
 101 ></DT
 102 ></DL
 103 ></DIV
 104 ><DIV
 105 CLASS="SECT1"
 106 ><H1
 107 CLASS="SECT1"
 108 ><A
 109 NAME="AEN38"
 110 >1.1. Introduction</A
 111 ></H1
 112 ><P
 113 >XML (short for <I
 114 CLASS="EMPHASIS"
 115 >Extensible Markup Language</I
 116 >)
 117 generalizes the idea that text documents are typically structured in sections,
 118 sub-sections, paragraphs, and so on. The format of the document is not fixed
 119 (as, for example, in HTML), but can be declared by a so-called DTD (document
 120 type definition). The DTD describes only the rules for how the document may be
 121 structured, but not how the document is to be processed. For example, if you want
 122 to publish a book that uses XML markup, you will need a processor that converts
 123 the XML file into a printable format such as PostScript. On the one hand, the
 124 structure of XML documents is configurable; on the other hand, there is no
 125 longer a canonical interpretation of the elements of the document. For example,
 126 one DTD might require that paragraphs be delimited by
 127 <TT
 128 CLASS="LITERAL"
 129 >para</TT
 130 > tags, and another DTD expects <TT
 131 CLASS="LITERAL"
 132 >p</TT
 133 > tags
 134 for the same purpose. As a result, for every DTD a new processor is required.</P
 135 ><P
 136 >Although XML can be used to express structured text documents, it is not limited
 137 to this kind of application. For example, XML can also be used to exchange
 138 structured data over a network, or simply to store structured data in
 139 files. Note that XML documents cannot contain arbitrary binary data because
 140 some characters are forbidden; for some applications you need to encode binary
 141 data as text (e.g. using the Base64 encoding).</P
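><P
>As a small illustration of the last point, a binary payload might be carried as
Base64 text inside an element. This is only a sketch: the element name
<TT
CLASS="LITERAL"
>attachment</TT
> and its <TT
CLASS="LITERAL"
>encoding</TT
> attribute are hypothetical and do not belong to any DTD in this guide. The
content is the Base64 form of the five ASCII bytes of "Hello":</P
><PRE
CLASS="PROGRAMLISTING"
>&#60;attachment encoding="base64"&#62;SGVsbG8=&#60;/attachment&#62;</PRE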
 142 ><DIV
 143 CLASS="SECT2"
 144 ><H2
 145 CLASS="SECT2"
 146 ><A
 147 NAME="AEN45"
 148 >1.1.1. The "hello world" example</A
 149 ></H2
 150 ><P
 151 >The following example shows a very simple DTD, and a corresponding document
 152 instance. The document is structured such that it consists of sections, and
 153 that sections consist of paragraphs, and that paragraphs contain plain text:</P
 154 ><PRE
 155 CLASS="PROGRAMLISTING"
 156 >&#60;!ELEMENT document (section)+&#62;
 157 &#60;!ELEMENT section (paragraph)+&#62;
 158 &#60;!ELEMENT paragraph (#PCDATA)&#62;</PRE
 159 ><P
 160 >The following document is an instance of this DTD:</P
 161 ><PRE
 162 CLASS="PROGRAMLISTING"
 163 >&#60;?xml version="1.0" encoding="ISO-8859-1"?&#62;
 164 &#60;!DOCTYPE document SYSTEM "simple.dtd"&#62;
 165 &#60;document&#62;
 166   &#60;section&#62;
 167     &#60;paragraph&#62;This is a paragraph of the first section.&#60;/paragraph&#62;
 168     &#60;paragraph&#62;This is another paragraph of the first section.&#60;/paragraph&#62;
 169   &#60;/section&#62;
 170   &#60;section&#62;
 171     &#60;paragraph&#62;This is the only paragraph of the second section.&#60;/paragraph&#62;
 172   &#60;/section&#62;
 173 &#60;/document&#62;</PRE
 174 ><P
 175 >As in HTML (and, of course, in its ancestor SGML), the "pieces" of
 176 the document are delimited by tags, i.e. such a piece begins with
 177 <TT
 178 CLASS="LITERAL"
 179 >&lt;name-of-the-type-of-the-piece&gt;</TT
 180 > and ends with
 181 <TT
 182 CLASS="LITERAL"
 183 >&lt;/name-of-the-type-of-the-piece&gt;</TT
 184 >, and the pieces are
 185 called <I
 186 CLASS="EMPHASIS"
 187 >elements</I
 188 >. Unlike in HTML and SGML, neither start tags nor
 189 end tags (i.e. the delimiters written in angle brackets) may ever be left
 190 out. For example, HTML names its paragraph element simply <TT
 191 CLASS="LITERAL"
 192 >p</TT
 193 >, and
 194 because paragraphs never contain paragraphs, a sequence of several paragraphs
 195 can be written as:
 196
 197 <PRE
 198 CLASS="PROGRAMLISTING"
 199 >&#60;p&#62;First paragraph
 200 &#60;p&#62;Second paragraph</PRE
 201 >
 202
 203 This is not possible in XML; continuing our example above we must always write
 204
 205 <PRE
 206 CLASS="PROGRAMLISTING"
 207 >&#60;paragraph&#62;First paragraph&#60;/paragraph&#62;
 208 &#60;paragraph&#62;Second paragraph&#60;/paragraph&#62;</PRE
 209 >
 210
 211 The rationale behind this is (1) to simplify the development of XML parsers
 212 (you need not convert the DTD into a deterministic finite automaton, which is
 213 required to detect omitted tags), and (2) to make it possible to parse the
 214 document independently of whether the DTD is known or not.</P
 215 ><P
 216 >The first line of our sample document,
 217
 218 <PRE
 219 CLASS="PROGRAMLISTING"
 220 >&#60;?xml version="1.0" encoding="ISO-8859-1"?&#62;</PRE
 221 >
 222
 223 is the so-called <I
 224 CLASS="EMPHASIS"
 225 >XML declaration</I
 226 >. It expresses that the
 227 document follows the conventions of XML version 1.0, and that the document is
 228 encoded using characters from the ISO-8859-1 character set (often known as
 229 "Latin 1", mostly used in Western Europe). Although the XML declaration is not
 230 mandatory, it is good style to include it; everybody can see at first glance
 231 that the document uses XML markup and not the similar-looking HTML and SGML
 232 markup languages. If you omit the XML declaration, the parser will assume
 233 that the document is encoded as UTF-8 or UTF-16 (there is a rule that makes
 234 it possible to distinguish between UTF-8 and UTF-16 automatically); these
 235 are encodings of Unicode's universal character set. (Note that <SPAN
 236 CLASS="ACRONYM"
 237 >PXP</SPAN
 238 >, unlike its
 239 predecessor "Markup", fully supports Unicode.)</P
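><P
>For comparison, here is a sketch of a variant of the sample document that declares
UTF-8 explicitly. It assumes that the file is really saved as UTF-8; the
paragraph text is made up for illustration. Leaving out the first line should
give the same result, because UTF-8 is one of the encodings a parser assumes
when there is no XML declaration:</P
><PRE
CLASS="PROGRAMLISTING"
>&#60;?xml version="1.0" encoding="UTF-8"?&#62;
&#60;!DOCTYPE document SYSTEM "simple.dtd"&#62;
&#60;document&#62;
  &#60;section&#62;
    &#60;paragraph&#62;This file is declared to be encoded as UTF-8.&#60;/paragraph&#62;
  &#60;/section&#62;
&#60;/document&#62;</PRE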
 240 ><P
 241 >The second line,
 242
 243 <PRE
 244 CLASS="PROGRAMLISTING"
 245 >&#60;!DOCTYPE document SYSTEM "simple.dtd"&#62;</PRE
 246 >
 247
 248 names the DTD that is going to be used for the rest of the document. In
 249 general, it is possible that the DTD consists of two parts, the so-called
 250 external and the internal subset. "External" means that the DTD exists as a
 251 second file; "internal" means that the DTD is included in the same file. In
 252 this example, there is only an external subset, and the system identifier
 253 "simple.dtd" specifies where the DTD file can be found. System identifiers are
 254 interpreted as URLs; for instance this would be legal:
 255
 256 <PRE
 257 CLASS="PROGRAMLISTING"
 258 >&#60;!DOCTYPE document SYSTEM "http://host/location/simple.dtd"&#62;</PRE
 259 >
 260
 261 Please note that <SPAN
 262 CLASS="ACRONYM"
 263 >PXP</SPAN
 264 > cannot interpret HTTP identifiers by default, but it is
 265 possible to change the interpretation of system identifiers.</P
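><P
>To illustrate the internal subset, the sample document could also carry its
declarations inside the <TT
CLASS="LITERAL"
>DOCTYPE</TT
> clause instead of referring to a second file. The following is a minimal sketch
that reuses the three declarations of the example DTD; the paragraph text is
made up for illustration:</P
><PRE
CLASS="PROGRAMLISTING"
>&#60;?xml version="1.0" encoding="ISO-8859-1"?&#62;
&#60;!DOCTYPE document [
  &#60;!ELEMENT document (section)+&#62;
  &#60;!ELEMENT section (paragraph)+&#62;
  &#60;!ELEMENT paragraph (#PCDATA)&#62;
]&#62;
&#60;document&#62;
  &#60;section&#62;
    &#60;paragraph&#62;The DTD of this document is an internal subset.&#60;/paragraph&#62;
  &#60;/section&#62;
&#60;/document&#62;</PRE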
 266 ><P
 267 >The word immediately following <TT
 268 CLASS="LITERAL"
 269 >DOCTYPE</TT
 270 > determines which of
 271 the declared element types (here "document", "section", and "paragraph") is
 272 used for the outermost element, the <I
 273 CLASS="EMPHASIS"
 274 >root element</I
 275 >. In this
 276 example it is <TT
 277 CLASS="LITERAL"
 278 >document</TT
 279 > because the outermost element is
 280 delimited by <TT
 281 CLASS="LITERAL"
 282 >&lt;document&gt;</TT
 283 > and
 284 <TT
 285 CLASS="LITERAL"
 286 >&lt;/document&gt;</TT
 287 >. </P
 288 ><P
 289 >The DTD consists of three declarations for element types:
 290 <TT
 291 CLASS="LITERAL"
 292 >document</TT
 293 >, <TT
 294 CLASS="LITERAL"
 295 >section</TT
 296 >, and
 297 <TT
 298 CLASS="LITERAL"
 299 >paragraph</TT
 300 >. Such a declaration has two parts:
 301
 302 <PRE
 303 CLASS="PROGRAMLISTING"
 304 >&lt;!ELEMENT <TT
 305 CLASS="REPLACEABLE"
 306 ><I
 307 >name</I
 308 ></TT
 309 > <TT
 310 CLASS="REPLACEABLE"
 311 ><I
 312 >content-model</I
 313 ></TT
 314 >&gt;</PRE
 315 >
 316
 317 The content model is a regular expression which describes the possible inner
 318 structure of the element. Here, <TT
 319 CLASS="LITERAL"
 320 >document</TT
 321 > contains one or
 322 more sections, and a <TT
 323 CLASS="LITERAL"
 324 >section</TT
 325 > contains one or more
 326 paragraphs. Note that these two element types are not allowed to contain
 327 arbitrary text. Only the <TT
 328 CLASS="LITERAL"
 329 >paragraph</TT
 330 > element type is declared
 331 such that parsed character data (indicated by the symbol
 332 <TT
 333 CLASS="LITERAL"
 334 >#PCDATA</TT
 335 >) is permitted.</P
 336 ><P
 337 >See below for a detailed discussion of content models. </P
 338 ></DIV
 339 ><DIV
 340 CLASS="SECT2"
 341 ><H2
 342 CLASS="SECT2"
 343 ><A
 344 NAME="AEN84"
 345 >1.1.2. XML parsers and processors</A
 346 ></H2
 347 ><P
 348 >XML documents are human-readable, but this is not the main purpose of this
 349 language. XML has been designed such that documents can be read by a program
 350 called an <I
 351 CLASS="EMPHASIS"
 352 >XML parser</I
 353 >. The parser checks that the document
 354 is correctly formatted, and it represents the document as objects of the programming
 355 language. There are two aspects to checking the document: First, the document
 356 must follow some basic syntactic rules, such as that tags are written in angle
 357 brackets, that for every start tag there must be a corresponding end tag, and so
 358 on. A document respecting these rules is
 359 <I
 360 CLASS="EMPHASIS"
 361 >well-formed</I
 362 >. Second, the document must match the DTD in
 363 which case the document is <I
 364 CLASS="EMPHASIS"
 365 >valid</I
 366 >. Many parsers check only
 367 for well-formedness and ignore the DTD; <SPAN
 368 CLASS="ACRONYM"
 369 >PXP</SPAN
 370 > is designed such that it can
 371 even validate the document.</P
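><P
>The difference between the two checks can be sketched with two separate files,
both written for illustration and both meant to be checked against the DTD of
the "hello world" example. The first one is well-formed but not valid; the
second one is not even well-formed, so a parser has to reject it whether or not
it knows the DTD:</P
><PRE
CLASS="PROGRAMLISTING"
>&#60;!-- File 1: well-formed, but not valid. A section may contain only
     paragraph elements, so the bare text violates the content model. --&#62;
&#60;!DOCTYPE document SYSTEM "simple.dtd"&#62;
&#60;document&#62;
  &#60;section&#62;Text directly inside a section.&#60;/section&#62;
&#60;/document&#62;

&#60;!-- File 2: not well-formed. The paragraph start tag has no
     corresponding end tag. --&#62;
&#60;document&#62;
  &#60;section&#62;
    &#60;paragraph&#62;This paragraph is never closed.
  &#60;/section&#62;
&#60;/document&#62;</PRE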
 372 ><P
 373 >A parser alone does not make a useful application; it only reads XML
 374 documents. The whole application working with XML-formatted data is called an
 375 <I
 376 CLASS="EMPHASIS"
 377 >XML processor</I
 378 >. Often XML processors convert documents into
 379 another format, such as HTML or PostScript. Sometimes processors extract data
 380 from the documents and output the processed data as XML again. The parser
 381 can help the application processing the document; for example it can provide
 382 means to access the document in a specific manner. <SPAN
 383 CLASS="ACRONYM"
 384 >PXP</SPAN
 385 > in particular supports an
 386 object-oriented access layer.</P
 387 ></DIV
 388 ><DIV
 389 CLASS="SECT2"
 390 ><H2
 391 CLASS="SECT2"
 392 ><A
 393 NAME="AEN94"
 394 >1.1.3. Discussion</A
 395 ></H2
 396 ><P
 397 >As we have seen, there are two levels of description: On the one hand, XML can
 398 define rules about the format of a document (the DTD), on the other hand, XML
 399 expresses structured documents. There are a number of possible applications:</P
 400 ><P
 401 ></P
 402 ><UL
 403 COMPACT="COMPACT"
 404 ><LI
 405 STYLE="list-style-type: disc"
 406 ><P
 407 >XML can be used to express structured texts. Unlike HTML, there is no canonical
 408 interpretation; one would have to write a backend for the DTD that translates
 409 the structured texts into a format that existing browsers, printers
 410 etc. understand. The advantage of a self-defined document format is that it is
 411 possible to design the format in a more problem-oriented way. For example, if
 412 the task is to extract reports from a database, one can use a DTD that reflects
 413 the structure of the report or the database. A possible approach would be to
 414 have an element type for every database table and for every column. Once the
 415 DTD has been designed, the report procedure can be split into a part that
 416 selects the database rows and outputs them as an XML document according to the
 417 DTD, and a part that translates the document into other formats. Of course,
 418 the latter part can be solved in a generic way, e.g. there may be configurable
 419 backends for all DTDs that follow the approach and have element types for
 420 tables and columns.</P
 421 ><P
 422 >XML plays the role of a configurable intermediate format. The database
 423 extraction function can be written without having to know the details of
 424 typesetting; the backends can be written without having to know the details of
 425 the database.</P
 426 ><P
 427 >Of course, there are traditional solutions. One can define an ad hoc
 428 intermediate text file format. The disadvantage is that there are no names for
 429 the pieces of the format, and that such formats usually lack documentation
 430 because of this. Another solution would be to have a binary representation,
 431 either as a language-dependent or a language-independent structure (examples of the
 432 latter can be found in RPC implementations). The disadvantage is that it is
 433 harder to view such representations; one has to write pretty printers for this
 434 purpose. It is also more difficult to enter test data; XML is plain text that
 435 can be written using an arbitrary editor (Emacs even has a good XML mode,
 436 PSGML). All these alternatives suffer from the lack of a structure checker,
 437 i.e. the programs processing these formats usually do not check the input file
 438 or input object in detail; XML parsers check the syntax of the input (the
 439 so-called well-formedness check), and the advanced parsers like <SPAN
 440 CLASS="ACRONYM"
 441 >PXP</SPAN
 442 > even
 443 verify that the structure matches the DTD (the so-called validation).</P
 444 ></LI
 445 ><LI
 446 STYLE="list-style-type: disc"
 447 ><P
 448 >XML can be used as a configurable communication language. A fundamental problem
 449 of every communication is that sender and receiver must follow the same
 450 conventions about the language. For data exchange, the question is usually
 451 which data records and fields are available, how they are syntactically
 452 composed, and which values are possible for the various fields. Similar
 453 questions arise for text document exchange. XML does not solve these problems
 454 completely, but it reduces the number of ambiguities for such conventions: The
 455 outlines of the syntax are specified by the DTD (but not necessarily the
 456 details), and XML introduces canonical names for the components of documents
 457 such that it is simpler to describe the rest of the syntax and the semantics
 458 informally.</P
 459 ></LI
 460 ><LI
 461 STYLE="list-style-type: disc"
 462 ><P
 463 >XML is a data storage format. Currently, every software product tends to use
 464 its own way to store data; commercial software often does not describe such
 465 formats, and it is a pain to integrate such software into a bigger project.
 466 XML can help to improve this situation when several applications share the same
 467 syntax of data files. DTDs then serve as a neutral reference against which the format of
 468 data files can be checked independently of applications. </P
 469 ></LI
 470 ></UL
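><P
>The database report approach from the first item can be sketched as follows. The
table <TT
CLASS="LITERAL"
>customer</TT
> and its columns <TT
CLASS="LITERAL"
>name</TT
> and <TT
CLASS="LITERAL"
>city</TT
> are invented for illustration, as is the enclosing <TT
CLASS="LITERAL"
>report</TT
> element; a real report DTD would follow the actual database schema. There is one
element type for the table and one for every column:</P
><PRE
CLASS="PROGRAMLISTING"
>&#60;!ELEMENT report (customer)*&#62;
&#60;!ELEMENT customer (name, city)&#62;
&#60;!ELEMENT name (#PCDATA)&#62;
&#60;!ELEMENT city (#PCDATA)&#62;</PRE
><P
>The extraction part of the report procedure would then output one
<TT
CLASS="LITERAL"
>customer</TT
> element per selected row, for example:</P
><PRE
CLASS="PROGRAMLISTING"
>&#60;report&#62;
  &#60;customer&#62;
    &#60;name&#62;Example Ltd.&#60;/name&#62;
    &#60;city&#62;Springfield&#60;/city&#62;
  &#60;/customer&#62;
&#60;/report&#62;</PRE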
 471 ></DIV
 472 ></DIV
 473 ></DIV
 474 ><DIV
 475 CLASS="NAVFOOTER"
 476 ><HR
 477 ALIGN="LEFT"
 478 WIDTH="100%"><TABLE
 479 WIDTH="100%"
 480 BORDER="0"
 481 CELLPADDING="0"
 482 CELLSPACING="0"
 483 ><TR
 484 ><TD
 485 WIDTH="33%"
 486 ALIGN="left"
 487 VALIGN="top"
 488 ><A
 489 HREF="p34.html"
 490 >Prev</A
 491 ></TD
 492 ><TD
 493 WIDTH="34%"
 494 ALIGN="center"
 495 VALIGN="top"
 496 ><A
 497 HREF="index.html"
 498 >Home</A
 499 ></TD
 500 ><TD
 501 WIDTH="33%"
 502 ALIGN="right"
 503 VALIGN="top"
 504 ><A
 505 HREF="x107.html"
 506 >Next</A
 507 ></TD
 508 ></TR
 509 ><TR
 510 ><TD
 511 WIDTH="33%"
 512 ALIGN="left"
 513 VALIGN="top"
 514 >User's guide</TD
 515 ><TD
 516 WIDTH="34%"
 517 ALIGN="center"
 518 VALIGN="top"
 519 ><A
 520 HREF="p34.html"
 521 >Up</A
 522 ></TD
 523 ><TD
 524 WIDTH="33%"
 525 ALIGN="right"
 526 VALIGN="top"
 527 >Highlights of XML</TD
 528 ></TR
 529 ></TABLE
 530 ></DIV
 531 ></BODY
 532 ></HTML
 533 >