helm/mathql/doc/mathql_overview.tex

   1 \section{Overview}
   2
   3 {\MathQL}%
   4 \footnote{See \CURI{http://helm.cs.unibo.it/mathql}.}
   5 is a query language for {\RDF} \cite{RDF,RDFS} databases, developed in the
   6 context of the {\HELM}%
   7 \footnote{See \CURI{http://helm.cs.unibo.it}.}
   8 project \cite{APSCGS03}.
   9 Its name suggests that it is meant to be the first of a group of query
  10 languages for retrieving information from distributed digital libraries of
  11 formal mathematical knowledge by means of content-aware requests.
  12 So far, however, {\MathQL} is the only language of this proposal that has been
  13 implemented, and it is not Mathematics-oriented, so its name is a bit misleading.
  14 This proposal has several domains of application: it may be useful for
  15 reviewers of databases or on-line libraries, for proof assistants or
  16 proof-checking systems, and also for learning environments, because these
  17 applications require features for classifying, searching and browsing
  18 mathematical information in a semantically meaningful way.
  19
  20 Other languages to be defined in the context of the MathQL proposal may be
  21 suitable for queries about the semantic structure of mathematical data:
  22 this includes content-based pattern-matching and possibly other forms of
  23 formal matching involving for instance isomorphism, unification and
  24 $\delta$-expansion%
  25 \footnote{By $\delta$-expansion we mean the expansion of definitions.}.
  26 In this perspective the role of a query on metadata is that of producing a
  27 filtered knowledge base containing relevant information for subsequent queries
  28 of other kinds (see \cite{GSC03} for a more detailed description of this
  29 approach).
  30
  31 {\MathQL} is designed to make up for two limitations that seem to
  32 characterize several current implementations and proposals of {\RDF}-oriented
  33 query languages, namely insufficient compliance with the most requested
  34 features and poor attention paid to query result management.
  35 Thus the language has the following design goals:
  36
  37 \begin{enumerate}
  38
  39 \item
  40 compliance with the main requirements stated by the {\RDF} community;
  41
  42 \item
  43 native support for post-processing the query results;
  44
  45 \item
  46 {\HELM}-independent implementation of the query engine.
  47
  48 \end{enumerate}
  49
  50 We will briefly analyze these features in the remaining part of this
  51 section.
  52
  53 \subsubsection*{The main requirements from the RDF community}
  54
  55 As a query language for {\RDF} databases, {\MathQL} has a well-conceived
  56 semantics, defined in terms of an abstract metadata model, according to which
  57 queries return exhaustive solutions.
  58 The language provides facilities for imposing query constraints based on
  59 {\RDFS} \cite{RDFS} and for the traversal of compound values of properties.
  60 It also provides a full set of Boolean operators to compose the query
  61 constraints and facilities for selecting resources or literals by means of
  62 {\POSIX} regular expressions.
  63 Moreover the language allows customizing the query results by specifying which
  64 part of a solution should be preserved, and supports a machine-processable
  65 {\XML} \cite{XML} syntax as well as a human-readable textual syntax to achieve
  66 the best usability.
  67 Both syntaxes cover queries as well as results, making {\MathQL} usable in
  68 a distributed environment where query engines are implemented as stand-alone
  69 components: in this setting both queries and query results
  70 must be exchanged by the system's components and thus need to be encoded in
  71 a clearly defined format.
  72
  73 {\MathQL} provides graph-oriented access to the {\RDF} metadata, based on
  74 tree instantiation.
  75 This approach has the advantage of providing an abstraction over the
  76 concrete representation of the {\RDF} database (which can consist of {\RDF}
  77 triples and {\XML} files simultaneously) at the user level, which is
  78 desirable especially in a distributed context.
  79
  80 {\MathQL} query results are meant to capture the structure of trees coming
  81 from an {\RDF} graph, and for this purpose a standard $1$- or $2$-dimensional
  82 organization (as provided by most {\RDF}-oriented query languages) is not
  83 satisfactory. The {\MathQL} approach is to use a $4$-dimensional organization
  84 for its query results.
  85
  86 \subsubsection*{Post-processing and code generation capabilities}
  87
  88 The {\MathQL} query engine, which is written in {\CAML}%
  89 \footnote{See \CURI{http://caml.inria.fr}.}
  90 for easy integration with the {\HELM} software, provides two ways of
  91 processing the query results: on the {\CAML} side and natively.
  92
  93 On the {\CAML} side, an application issues a query by calling a function of the
  94 engine and manipulates the result either by operating directly on its internal
  95 representation (through a low-level interface), or by using a set of dedicated
  96 functions specifically designed to manage the query results.
  97 This set of functions includes a basic library but is extensible, depending
  98 on the {\CAML} modules included in the engine at compile-time. In this way
  99 an expert user can write a {\CAML} module with new dedicated functions and
 100 include it in the engine by recompiling the engine.
 101
 102 {\MathQL} supports native post-processing of the query results through the
 103 standard constructions of an imperative Turing-complete programming language.
 104 This language is not meant to be all-purpose (the user can work on the
 105 {\CAML} side for that), but to be optimized for the management of the
 106 query results.
 107 In this context an {\SQL}-like ``select-from-where'' construction is provided
 108 (as required by the {\RDF} community) as well as a mechanism for accessing the
 109 dedicated post-processing functions available to the engine.
 110
 111 Moreover the language provides access to an extensible set of code-generating
 112 functions (also available on the {\CAML} side) that the expert user can define
 113 by writing suitable {\CAML} modules for the engine.
 114 Note that the generated code is always {\MathQL} code.
 115
 116 The code generation features make it possible to build complex queries
 117 incrementally and automatically, as required by the needs of the {\HELM} project.
 118 Using the native programming language, instead, queries can include the
 119 post-processing algorithms on their results, so the querying code and the
 120 subsequent processing code (if needed) are treated together as a
 121 self-contained object that can be computed by a single engine.
 122 In this sense the alternative of performing a complex query on a remote
 123 component by issuing some {\MathQL} querying code followed by some {\CAML}
 124 post-processing code is infeasible in a distributed context.
 125
 126 \subsubsection*{Physical organization of the RDF database}
 127
 128 The implementation of the {\MathQL} query engine does not depend on any
 129 software developed within the {\HELM} project, nor does it depend on the {\HELM}
 130 metadata model in any way.
 131
 132 However the engine does make a few assumptions on the way metadata are
 133 physically organized and needs some user-provided knowledge about the concrete
 134 metadata representation.
 135 Metadata stored as {\RDF} triples are accessed through a {\MySQL}%
 136 \footnote{See \CURI{http://www.mysql.com}.}
 137 or a {\PostgreSQL}%
 138 \footnote{See \CURI{http://www.postgresql.org}.}
 139 engine, while metadata stored as {\RDF}/{\XML} files are accessed through a
 140 {\Galax}%
 141 \footnote{See \CURI{http://db.bell-labs.com/galax/}.}
 142 {\XQuery} \cite{XQuery} engine.