helm/mathql/doc/mathql_overview.tex

   1 \section{Overview}
   2
   3 {\MathQL}%
   4 \footnote{See \CURI{http://helm.cs.unibo.it/mathql}.}
   5 is a query language for {\RDF} \cite{RDF,RDFS} databases, developed in the
   6 context of the {\HELM}%
   7 \footnote{See \CURI{http://helm.cs.unibo.it}.}
   8 project \cite{APSCGS03}.
   9 Its name suggests that it is the first of a group of query languages for
  10 retrieving information from distributed digital libraries of formal
  11 mathematical knowledge by means of content-aware requests. However, no other
  12 language of this proposal has been implemented yet, and {\MathQL} itself is
  13 not Mathematics-oriented, so the name is somewhat misleading.
  14 This proposal has several domains of application: it may be useful for
  15 reviewers of databases or on-line libraries, for proof assistants or
  16 proof-checking systems, and also for learning environments, because these
  17 applications require features for classifying, searching and browsing
  18 mathematical information in a semantically meaningful way.
  19 Other languages to be defined in the context of the MathQL proposal may be
  20 suitable for queries about the semantic structure of mathematical data:
  21 this includes content-based pattern-matching and possibly other forms of
  22 formal matching involving, for instance, isomorphism, unification and
  23 $\delta$-expansion%
  24 \footnote{By $\delta$-expansion we mean the expansion of definitions.}.
  25 In this perspective the role of a query on metadata is that of producing a
  26 filtered knowledge base containing relevant information for subsequent queries
  27 of other kinds (see \cite{GSC03} for a more detailed description of this
  28 approach).
  29
  30 {\MathQL} is designed to make up for two limitations that seem to
  31 characterize several implementations and proposals of current {\RDF}-oriented
  32 query languages, namely the insufficient compliance with the most requested
  33 features and the poor attention paid to query result management.
  34 Thus the language has the following design goals:
  35
  36 \begin{enumerate}
  37
  38 \item
  39 compliance with the main requirements stated by the {\RDF} community;
  40
  41 \item
  42 native support for post-processing the query results;
  43
  44 \item
  45 {\HELM}-independent implementation of the query engine.
  46
  47 \end{enumerate}
  48
  49 We will briefly analyze these features in the remaining part of this
  50 section.
  51
  52 \vspace{-1pc}
  53
  54 \subsubsection*{The main requirements from the RDF community}
  55
  56 As a query language for {\RDF} databases, {\MathQL} has a well-conceived
  57 semantics, defined in terms of an abstract metadata model, according to which
  58 queries return exhaustive solutions.
  59 The language provides facilities for imposing query constraints based on
  60 {\RDFS} \cite{RDFS} and for the traversal of compound values of properties.
  61 It also provides a full set of Boolean operators to compose the query
  62 constraints and facilities for selecting resources or literals by means of
  63 {\POSIX} regular expressions.
  64 Moreover the language allows customizing the query results by specifying which
  65 part of a solution should be preserved, and supports a machine-processable
  66 {\XML} \cite{XML} syntax as well as a human-readable textual syntax to achieve
  67 the best usability.
  68 The two syntaxes concern both queries and results, making {\MathQL} usable in
  69 a distributed environment where query engines are implemented as stand-alone
  70 components. In this setting, in fact, both the queries and their results must
  71 be exchanged between the system's components and thus need to be clearly encoded.
  72
  73 {\MathQL} provides graph-oriented access to the {\RDF} metadata, based on
  74 tree instantiation.
  75 This approach has the advantage of providing an abstraction over the
  76 concrete representation of the {\RDF} database (which can consist of {\RDF}
  77 triples and {\XML} files simultaneously) at the user level, and this is
  78 definitely desirable, especially in a distributed context.
  79
  80 {\MathQL} query results are meant to capture the structure of trees coming
  81 from an {\RDF} graph, and for this purpose a standard $1$- or $2$-dimensional
  82 organization (as provided by most {\RDF}-oriented query languages) is not
  83 satisfactory. The {\MathQL} approach is to use a $4$-dimensional organization
  84 for its query results.
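
As an illustration only, a four-level result can be pictured as a set of
resources, each carrying a list of attribute groups, each group holding named
attributes, and each attribute holding a list of values.
This particular layering is an assumption made for the sketch (the four
dimensions are not spelled out in this section), the type and function names
are hypothetical, and Python is used for brevity although the engine itself is
written in {\CAML}.

```python
# Hypothetical sketch: a query result organized in four nested levels.
# Level 1: resources (URIs); level 2: attribute groups of a resource;
# level 3: named attributes of a group; level 4: values of an attribute.
from typing import Dict, List, Tuple

Values = List[str]                  # level 4
Group = Dict[str, Values]           # level 3: attribute name -> values
Groups = List[Group]                # level 2
Result = Dict[str, Groups]          # level 1: resource URI -> groups


def flatten(result: Result) -> List[Tuple[str, int, str, str]]:
    """Flatten a nested result into (resource, group index, attribute, value) rows."""
    rows = []
    for resource, groups in sorted(result.items()):
        for index, group in enumerate(groups):
            for name, values in sorted(group.items()):
                for value in values:
                    rows.append((resource, index, name, value))
    return rows


# Sample data invented for the example.
example: Result = {
    "urn:example:thm1": [
        {"author": ["A"], "depends": ["urn:example:def1", "urn:example:def2"]},
        {"author": ["B"], "depends": []},
    ],
}
rows = flatten(example)
```

Flattening to rows shows what a $2$-dimensional table would have to do: the
grouping survives only as an explicit index column, and an attribute with no
values (as in the second group) leaves no row at all.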
  85
  86 \vspace{-1pc}
  87
  88 \subsubsection*{Post-processing and code generation capabilities}
  89
  90 The {\MathQL} query engine, which is written in {\CAML}%
  91 \footnote{See \CURI{http://caml.inria.fr}.}
  92 for easy integration with the {\HELM} software, provides two ways of
  93 processing the query results: on the {\CAML} side and natively.
  94
  95 On the {\CAML} side, an application issues a query by calling a function of the
  96 engine and manipulates the result either by operating directly on its internal
  97 representation (through a low-level interface), or by using a set of dedicated
  98 functions specifically designed to manage the query results.
  99 This set of functions includes a basic library but is extensible depending
 100 on the {\CAML} modules included in the engine at compile time. In this way
 101 an expert user can write a {\CAML} module with new dedicated functions and can
 102 include it in the engine by recompiling it.
 103
 104 {\MathQL} supports native post-processing of the query results, including the
 105 standard constructions of an imperative Turing-complete programming language,
 106 whose aim is definitely not that of being all-purpose (the user can work on
 107 the {\CAML} side for that), but of being optimized for the management of the
 108 query results.
 109 In this context an {\SQL}-like ``select-from-where'' construction is provided
 110 (as required by the {\RDF} community) as well as a mechanism for accessing the
 111 post-processing dedicated functions available to the engine.
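
Very loosely, a ``select-from-where'' step can be pictured as filtering the
resources of an earlier result with a Boolean condition.
The sketch below is written in Python rather than {\MathQL}; the function name
\texttt{select\_from\_where} and the sample data are hypothetical, and the code
does not reproduce the actual {\MathQL} syntax or semantics.

```python
# Hypothetical sketch of a select-from-where step over a query result.
from typing import Callable, Dict, List

Attributes = Dict[str, List[str]]   # attribute name -> values
Result = Dict[str, Attributes]      # resource URI -> attributes


def select_from_where(source: Result,
                      where: Callable[[str, Attributes], bool]) -> Result:
    """Keep the resources of `source` for which `where` holds."""
    return {uri: attrs for uri, attrs in source.items() if where(uri, attrs)}


# Sample data invented for the example.
library: Result = {
    "urn:example:thm1": {"kind": ["theorem"], "author": ["A"]},
    "urn:example:def1": {"kind": ["definition"], "author": ["A"]},
    "urn:example:thm2": {"kind": ["theorem"], "author": ["B"]},
}

# Select the theorems written by author "A".
theorems_by_a = select_from_where(
    library,
    lambda uri, attrs: "theorem" in attrs.get("kind", [])
    and "A" in attrs.get("author", []),
)
```

Because the output has the same shape as the input, such steps can be chained,
which is the sense in which the post-processing stays close to the query
results.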
 112
 113 Moreover the language provides access to an extensible set of code-generating
 114 functions (also available on the {\CAML} side) that the expert user can define
 115 by writing suitable {\CAML} modules for the engine.
 116 Note that the generated code is always {\MathQL} code.
 117 The code generation features allow building complex queries incrementally and
 118 in an automatic manner, as required by the needs of the {\HELM} project.
 119 Using the native programming language, instead, queries can include the
 120 post-processing algorithms on their results, so the querying code and the
 121 subsequent processing code (if needed) are treated together as a
 122 self-contained object that can be computed by a single engine.
 123 In this sense the alternative of performing a complex query on a remote
 124 component by issuing some {\MathQL} querying code followed by some {\CAML}
 125 post-processing code is really infeasible in a distributed context.
 126
 127 \vspace{-1pc}
 128
 129 \subsubsection*{Physical organization of the RDF database}
 130
 131 The implementation of the {\MathQL} query engine does not depend on any
 132 software developed within the {\HELM} project, nor does it depend on the {\HELM}
 133 metadata model in any way.
 134
 135 However the engine does make a few assumptions about the way metadata are
 136 physically organized and needs some user-provided knowledge about the concrete
 137 metadata representation.
 138 Metadata stored as {\RDF} triples are accessed through a {\MySQL}%
 139 \footnote{See \CURI{http://www.mysql.com}.}
 140 or a {\PostgreSQL}%
 141 \footnote{See \CURI{http://www.postgresql.org}.}
 142 engine, while metadata stored as {\RDF}/{\XML} files are accessed through a
 143 {\Galax}%
 144 \footnote{See \CURI{http://db.bell-labs.com/galax/}.}
 145 {\XQuery} \cite{XQuery} engine.
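
To make the triple-based storage concrete, the following minimal sketch looks
up the subjects that carry a given property value in a relational table.
It rests on several assumptions: an in-memory SQLite database stands in for
{\MySQL} or {\PostgreSQL}, the table layout and the property name
\texttt{refersTo} are hypothetical, and nothing here reflects the schema
actually used by the engine.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL/PostgreSQL, and the
# (subject, predicate, object) table layout is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("urn:example:thm1", "refersTo", "urn:example:def1"),
        ("urn:example:thm1", "refersTo", "urn:example:def2"),
        ("urn:example:thm2", "refersTo", "urn:example:def1"),
    ],
)


def subjects_with(connection, predicate, obj):
    """Return the subjects that have the given (predicate, object) pair."""
    cursor = connection.execute(
        "SELECT subject FROM triples WHERE predicate = ? AND object = ? "
        "ORDER BY subject",
        (predicate, obj),
    )
    return [row[0] for row in cursor]


# Which resources refer to the first definition?
users_of_def1 = subjects_with(conn, "refersTo", "urn:example:def1")
```

The user-provided knowledge mentioned above would, in such a setting, amount
to telling the engine which tables and columns hold which properties.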