1 \documentclass[a4paper]{llncs}
4 \usepackage{amssymb,amsmath}
8 \newcommand{\whelp}{Whelp}
10 %\newcommand{\logo}[3]{
11 %\parpic(0cm,0cm)(#2,#3)[l]{\includegraphics[width=#1]{whelp-bw}}
14 \newcommand{\coq}{Coq}
15 \newcommand{\mowgli}{MoWGLI}
16 \newcommand{\IR}{\ensuremath{\mathbb{R}}}
17 \newcommand{\IN}{\ensuremath{\mathbb{N}}}
18 \newcommand{\instance}{\textsc{Instance}}
19 \newcommand{\auto}{\textsc{Auto}}
20 \newcommand{\hint}{\textsc{Hint}}
21 \newcommand{\locate}{\textsc{Locate}}
22 \newcommand{\elim}{\textsc{Elim}}
23 \newcommand{\match}{\textsc{Match}}
24 \newcommand{\texmacro}[1]{\texttt{\char92 #1}}
25 \newcommand{\REF}[3]{\ensuremath{\mathit{Ref}_{#1}(#2,#3)}}
26 \newcommand{\NAT}{\ensuremath{\mathit{nat}}}
27 \newcommand{\Prop}{\mathit{Prop}}
28 \newcommand{\natind}{\mathit{nat\_ind}}
29 \newcommand{\METAHEADING}{Symbol & Position \\ \hline\hline}
30 \newcommand{\LIBXSLT}{LibXSLT}
31 \newcommand{\OCAML}{OCaml}
32 \newcommand{\UWOBO}{UWOBO}
34 \title{A content based mathematical search engine: \whelp}
35 \author{Andrea Asperti, Ferruccio Guidi, Claudio Sacerdoti Coen,\\
36 Enrico Tassi, and Stefano Zacchiroli}
37 \institute{Department of Computer Science, University of Bologna\\
38 Mura Anteo Zamboni, 7 --- 40127 Bologna, ITALY\\
39 \email{$\{$asperti,fguidi,sacerdot,tassi,zacchiro$\}$@cs.unibo.it}}
45 %\Large{\textbf{A content based mathematical search engine:\\
46 %\logo{0.8cm}{-1.4cm}{0.6cm}\whelp}}
47 %\author{Andrea Asperti, Ferruccio Guidi, Claudio Sacerdoti Coen,\\
48 % Enrico Tassi, and Stefano Zacchiroli}
51 %Department of Computer Science, University of Bologna\\
52 % Mura Anteo Zamboni, 7 -- 40127 Bologna, ITALY\\
53 % \texttt{$\{$asperti,fguidi,sacerdot,tassi,zacchiro$\}$@cs.unibo.it}}
57 The prototype of a content based search engine
58 for mathematical knowledge supporting a small set of queries requiring
59 matching and/or typing operations is described.
60 The prototype --- called \whelp{} --- exploits a metadata approach
61 for indexing the information that looks far more flexible than traditional
62 indexing techniques for structured expressions like substitution,
63 discrimination, or context trees. The prototype has been instantiated to the
64 standard library of the \coq{} proof assistant extended with many
68 \section{Introduction}
71 The paper describes the prototype of a content based search engine
72 for mathematical knowledge --- called \whelp{} --- developed inside
73 the European Project IST-2001-33562 MoWGLI~\cite{pechino}.
74 \whelp{} has been mostly
75 tested to search notions inside the library of formal mathematical
76 knowledge of the \coq{} proof assistant~\cite{coq}. Due to its dimension (about
77 40,000 theorems), this library was adopted by \mowgli{} as a main
78 example of repository of structured mathematical information. However,
79 \whelp{} --- better, its filtering phase --- only works on a small set
80 of metadata automatically extracted
81 from the structured sources, and is thus largely independent from
82 the actual syntax (and semantics) of the information.
83 Metadata also offer a higher flexibility with respect to more canonical
84 indexing techniques such as discrimination trees~\cite{McCune},
85 substitution trees~\cite{Graf} or context trees~\cite{Ganzinger} since
86 all these approaches are optimized for the single operation of
87 (forward) matching, and are difficult to adapt or tune with
88 additional constraints (such as global constraints on the
89 signature of the term, just to make a simple but significant example).
91 \whelp{} is the final output of a three-year research work inside \mowgli{}
92 which consisted in exporting the \coq{} library into XML, defining
93 a suitable set of metadata for indexing the information, implementing
94 the actual indexing tools, and finally designing and developing
95 the search engine. \whelp{} itself is the result of a complete
96 architectural re-visitation of a first prototype described in~\cite{metadata1},
97 integrated with the efficient retrieval
98 mechanisms described in~\cite{metadata2} (further improved
99 as described in Sect.~\ref{sec:hint}), and integrated with syntactic facilities
100 borrowed from the disambiguating parser of~\cite{disambiguation}.
101 Since the prototype version described in~\cite{metadata1}, also
102 the Web interface has been completely rewritten and simplified,
103 exploiting most of the publishing techniques
104 developed for the hypertextual rendering of the \coq{} library
105 (see \url{http://helm.cs.unibo.it/}) and described in
106 Sect.~\ref{sec:interface}.
108 The \whelp{} search engine is located at \url{http://helm.cs.unibo.it/whelp}.
113 \whelp{} interacts with the user as a classical World Wide Web search engine, it
114 expects single line queries and returns a list of results matching it. \whelp{}
115 currently supports four different kinds of queries, addressing different
116 user-requirements emerged in \mowgli: \match, \hint, \elim, and \locate{}
117 (described in Sect.~\ref{sec:queries}). The list is not meant to be exhaustive
118 and is likely to grow in the future.
120 The most typical of these queries (\match{} and \hint) require
121 the user to input a term of the Calculus of (co-)Inductive Constructions --- CIC
122 --- (the underlying calculus of \coq), supporting different kinds of pattern
123 based queries. Nevertheless, the concrete syntax we chose for writing the
124 input term is not bound to any specific logical system: it has been designed to
125 be as similar as possible to ordinary mathematics formulae, in their \TeX{}
126 encoding (see Table~\ref{tab:syntax}).
130 \begin{tabular}{lcl@{\hspace{1em}}l}
131 \emph{term} & ::= & \emph{identifier} & \\
132 & $|$ & \emph{number} & \\
133 & $|$ & \texttt{Prop} $~|~$ \texttt{Type} $~|~$ \texttt{Set} & sort \\
134 & $|$ & \texttt{?} & placeholder \\
135 & $|$ & \emph{term} \emph{term} & application \\
136 & $|$ & \emph{binder} \emph{vars} \texttt{.} \emph{term} & abstraction \\
137 & $|$ & \emph{term} \texmacro{to} \emph{term} & arrow type \\
138 & $|$ & \texttt{(} \emph{term} \texttt{)} & grouping \\
139 & $|$ & \emph{term} \emph{binop} \emph{term} & binary operator \\
140 & $|$ & \emph{unop} \emph{term} & unary operator \\
141 \emph{binder} & ::= & \texmacro{forall} $~|~$ \texmacro{exists} $~|~$
142 % \texmacro{Pi} $~|~$
144 \emph{vars} & ::= & \emph{names} & variables \\
145 & $|$ & \emph{names} \texttt{:} \emph{term} & typed variables \\
146 \emph{names} & ::= & \emph{identifier} $~|~$ \emph{identifier} \emph{names}
148 \emph{binop} & ::= & \texttt{+} $~|~$ \texttt{-} $~|~$ \texttt{*} $~|~$
149 \texttt{/} $~|~$ \texttt{\^} & arithmetic operators \\
150 & $|$ & \texttt{<} $~|~$ \texttt{>} $~|~$ \texmacro{leq} $~|~$
151 \texmacro{geq} $~|~$ \texttt{=} $~|~$ \texmacro{neq}
152 & comparison operators \\
153 & $|$ & \texmacro{lor} $~|~$ \texmacro{land} &
155 \emph{unop} & ::= & \texttt{-} & unary minus \\
156 & $|$ & \texmacro{lnot} & logical negation
159 \caption{\whelp's term syntax\label{tab:syntax}}
162 As a consequence of the generality of syntax, user
163 provided terms do not usually have a unique direct mapping to CIC term,
165 be suitably interpreted
166 in order to solve \emph{ambiguities}.
167 Consider for example the following term input:
168 \texttt{\texmacro{forall} x. 1*x = x}. in order to find the
169 corresponding CIC term we need to know the possible meanings of the
170 number \texttt{1} and the symbol \texttt{*}.\footnote{Note that
171 \texttt{x} is not undetermined, since it is a
172 bound variable.}%, number \texttt{1}, and symbol \texttt{+}.
174 The typical processing of a user query (depicted in Fig.~\ref{fig:engine}) is
175 therefore a pipeline made of four distinct phases: \emph{parsing} (canonical
176 transformation from concrete textual syntax to Abstract Syntax Trees, ASTs for
177 short), \emph{disambiguation} (described in next section), \emph{metadata
178 extraction} (described in Sect.~\ref{sec:metadata}), and the actual
179 \emph{query} (described in Sect.~\ref{sec:queries}).
181 It is worth observing that while the preliminary phases (disambiguation
182 and metadata extraction) of \whelp{} are not logic independent, the query
187 \includegraphics[width=0.9\textwidth]{engine_new}
188 \caption{\whelp's processing\label{fig:engine}}
193 \section{Disambiguation}
194 \label{sec:disambiguation}
196 The disambiguation phase builds CIC terms from ASTs of user inputs (also called
197 \emph{ambiguous terms}). Ambiguous terms may carry three different \emph{sources
198 of ambiguity}: unbound identifiers, literal numbers, and literal symbols.
199 \emph{Unbound identifiers} are sources of ambiguity since the same name
200 could have been used to represent different objects. For example, three
201 different theorems of the \coq{} library share the name $\mathit{plus\_assoc}$
202 (locating them is an exercise for the interested reader. Hint: use \whelp's
205 \emph{Numbers} are ambiguous since several different encodings of them could be
206 provided in logical systems. In the \coq{} standard library for example we found
207 naturals (in their unary encoding), positives (binary encoding), integers
208 (signed positives), and reals. Finally, \emph{symbols} (instances of the
209 \emph{binop} and \emph{unop} syntactic categories of Table~\ref{tab:syntax}) are
210 ambiguous as well: infix \texttt{+} for example is overloaded to represent
211 addition over the four different kinds of numbers available in the \coq{}
212 standard library. Note that given a term with more than one sources of
213 ambiguity, not all possible disambiguation choices are valid: for example, given
214 the input \texttt{1+1} we must choose an interpretation of \texttt{+} which is
215 typable in CIC according to the chosen interpretation for \texttt{1}; choosing
216 as \texttt{+} the addition over natural numbers and as \texttt{1} the real
217 number $1$ will lead to a type error.
219 A \emph{disambiguation algorithm} takes as input an ambiguous term and return a
220 fully determined CIC term. The \emph{naive disambiguation algorithm} takes as
221 input an ambiguous term $t$ and proceeds as follows:
225 \item Create disambiguation domains $\{D_i | i\in\mathit{Dom}(t)\}$, where
226 $\mathit{Dom}(t)$ is the set of ambiguity sources of $t$. Each $D_i$ is a set
229 \item Let $\Phi = \{\phi_i | {i\in\mathit{Dom}(t)},\phi_i\in D_i\}$
230 % such that $\forall i\in\mathit{Dom}(t),\exists\phi_j\in\Phi,i=j$
231 be an interpretation for $t$. Given $t$ and an interpretation $\Phi$, a CIC
232 term is fully determined. Iterate over all possible interpretations of $t$ and
233 type-check them, keep only typable interpretations (i.e. interpretations that
234 determine typable terms).
236 \item Let $n$ be the number of interpretations who survived step 2. If $n=0$
237 signal a type error. If $n=1$ we have found exactly one CIC term
238 corresponding to $t$, returns it as output of the disambiguation phase.
239 If $n>1$ let the user choose one of the $n$ interpretations and returns the
244 The above algorithm is highly inefficient since the number of possible
245 interpretations $\Phi$ grows exponentially with the number of ambiguity sources.
246 The actual algorithm used in \whelp{} is far more efficient being, in the
247 average case, linear in the number of ambiguity sources.
249 The efficient algorithm can be applied if the logic can be extended with
250 metavariables and a refiner can be implemented. This is the case for CIC and
251 several other logics.
252 \emph{Metavariables}~\cite{munoz} are typed, non linear placeholders that can
253 occur in terms; $?_i$ usually denotes the $i$-th metavariable, while $?$ denotes
254 a freshly created metavariable. A \emph{refiner}~\cite{mcbride} is a
255 function whose input is a term with placeholders and whose output is either a
256 new term obtained instantiating some placeholder or $\epsilon$, meaning that no
257 well typed instantiation could be found for the placeholders occurring in
258 the term (type error).
260 The efficient algorithm starts with an interpretation $\Phi_0 = \{\phi_i |
261 \phi_i = ?, i\in\mathit{Dom}(t)\}$,
262 which associates a fresh metavariable to each
263 source of ambiguity. Then it iterates refining the current CIC term (i.e. the
264 term obtained interpreting $t$ with $\Phi_i$). If the refinement succeeds the
265 next interpretation $\Phi_{i+1}$ will be created \emph{making a choice}, that is
266 replacing a placeholder with one of the possible choice from the corresponding
267 disambiguation domain. The placeholder to be replaced is chosen following a
268 preorder visit of the ambiguous term. If the refinement fails the current set of
269 choices cannot lead to a well-typed term and backtracking is attempted.
270 Once an unambiguous correct interpretation is found (i.e. $\Phi_i$ does no
271 longer contain any placeholder), backtracking is attempted
272 anyway to find the other correct interpretations.
274 The intuition which explain why this algorithm is more efficient is that as soon
275 as a term containing placeholders is not typable, no further instantiation of
276 its placeholders could lead to a typable term. For example, during the
277 disambiguation of user input \texttt{\texmacro{forall} x. x*0 = 0}, an
278 interpretation $\Phi_i$ is encountered which associates $?$ to the instance
279 of \texttt{0} on the right, the real number $0$ to the instance of \texttt{0} on
280 the left, and the multiplication over natural numbers (\texttt{mult} for short)
281 to \texttt{*}. The refiner will fail, since \texttt{mult} require a natural
282 argument, and no further instantiation of the placeholder will be tried.
284 If, at the end of the disambiguation, more than one possible interpretations are
285 possible, the user will be asked to choose the intended one (see
286 Fig.~\ref{fig:disambiguation}).
289 \centerline{\includegraphics[width=0.9\textwidth]{disambiguation-1}}
290 \caption{\label{fig:disambiguation} Disambiguation: interpretation choice}
293 Details of the disambiguation algorithm of \whelp{} can
294 be found in~\cite{disambiguation}, where an equivalent algorithm
295 that avoids backtracking is also presented.
297 %\example{Pattern matching query.} When fed with input \texttt{\texmacro{forall}
298 %n,m:nat. (S ?) = ? \texmacro{to} n = m}, \match{} return the following 3
302 % \item[\url{cic:/Coq/Init/Peano/eq_add_S.con}]
303 % \item[\url{cic:/Eindhoven/POCKLINGTON/lemmas/S_inj.con}]
304 % \item[\url{cic:/Rocq/TreeAutomata/bases/Sn_eq_Sm_n_eq_m.con}]
307 %all having type $\forall n,m\in\IN.(S~n)=(S~m)\to n=m$, where first placeholder
308 %has been instantiated with $n$ and second with $(S~m)$.
315 We use a logic-independent metadata model for indexing mathematical notions.
316 The model is essentially based on a single ternary relation \REF{p}{s}{t}
317 stating that an object $s$ refers an object $t$ at a given position
318 $p$. We use a minimal set of positions discriminating the hypotheses
319 (H), from the conclusion (C) and the proof (P) of a theorem (respectively,
320 the type of the input parameters, the type of the result, and
321 the body of a definition). Moreover, in the hypothesis and in the
322 conclusion we also distinguish the root position (MH and MC, respectively)
323 from deeper positions (that, in a first order setting, essentially amounts
324 to distinguish relational symbols from functional ones).
325 Extending the set of positions we could improve the granularity
326 and the precision of our indexing technique but so far, apart from a simple
327 extension discussed below, we never felt this need.
330 Consider the statement:
331 \[\forall m,n:\NAT. m \le n \to m < (S~n)\]
332 its metadata are described by the following table:
334 \begin{tabular}{|l|l|}
343 All occurrences of bound variables in position MH or MC are collapsed under
344 a unique reserved name \emph{Rel}, forgetting the actual variable name.
345 The occurrences in other positions are not considered.
346 See Sect.~\ref{sec:elim} for an example of use.
349 %Note that bound variables are not indexed: it would require extra work to
350 %take care of alpha conversion and also in this case we do not feel the
351 %need to index them to improve accuracy; however for the \elim{}
352 %query (Sect.~\ref{sec:elim}) we will index the fact that \emph{a} bound variable
353 %occurs in root position (without recording which variable).
354 %Looking for the statement in the library (that is the match
355 %operation of section ...) amounts
356 %to revert the indexing, searching for some some term $c$ such that
357 %\[Ref_{MH}(c,nat) \wedge Ref_{MH}(c,\le) \wedge Ref_{MC}(c,<)
358 %\wedge Ref_{C}(c,S) \]
359 %In a relational database, this is a simple and efficient join operation.
362 The accuracy of metadata for discriminating the statements of the library
363 is remarkable. We computed (see~\cite{edinburgh})
364 that the average number of mathematical notions
365 in the Coq library sharing the same metadata set is close to the actual number
366 of \emph{duplicates} (i.e. metadata almost precisely identify statements).
368 If more accuracy is needed, further filtering phases can be appended to the
369 \whelp{} pipeline of Fig.~\ref{fig:engine} to prune out false matches.
370 For instance, since the number of results of the query phase is usually
371 small, very accurate yet slow filters can be exploited.
373 According to the type as proposition analogy, the metadata above may
374 be also used to index the type of functions.
375 For instance, functions from $\NAT$ to $R$ (reals) would be identified by the
378 \begin{tabular}{|l|l|}
385 in this case, however, the metadata model is a bit too rough, since for instance
386 functions of type $\NAT\to\NAT$, $\NAT\to\NAT\to\NAT$,
387 $(\NAT\to\NAT)\to\NAT\to\NAT$ and so on would all share the following metadata
390 \begin{tabular}{|l|l|}
397 To improve this situation, we add an integer to MC (MH),
398 expressing the number of parameters of the term (respectively,
399 of the given hypothesis).
400 We call \emph{depth} this value since in the case of CIC is equal to
402 dependent products\footnote{Recall that in type theory, the function space is
403 just a degenerate case
404 of dependent product.} along the spine relative to the given position.
405 For instance, the three types above would now have the
406 following metadata sets:
410 \begin{tabular}[t]{|l|l|}
411 \multicolumn{2}{c}{$\NAT\to\NAT$} \\ \hline
413 $\NAT$ & MH(0) \\ \hline
414 $\NAT$ & MC(1) \\ \hline
417 \begin{tabular}[t]{|l|l|}
418 \multicolumn{2}{c}{$\NAT\to\NAT\to\NAT$} \\ \hline
420 $\NAT$ & MH(0) \\ \hline
421 $\NAT$ & MC(2) \\ \hline
424 \begin{tabular}[t]{|l|l|}
425 \multicolumn{2}{c}{$(\NAT\to\NAT)\to\NAT\to\NAT$} \\ \hline
427 $\NAT$ & MH(1) \\ \hline
429 $\NAT$ & MH(0) \\ \hline
430 $\NAT$ & MC(2) \\ \hline
435 The depth is a technical improvement that is particularly important
436 for retrieving functions from their types (we shall also see a use in the
437 \elim{} query, Sect.~\ref{sec:elim}), but is otherwise of minor relevance. In
438 the following examples,
439 we shall always list it among the metadata
440 for the sake of completeness, but it may be usually neglected by the reader.
442 \section{\whelp{} Queries}
447 Not all mathematical results have a canonical name or a
448 set of keywords which could easily identify them.
449 For this reason, it is extremely useful to be able to search the library
450 by means of the explicit statement. More generally, exploiting the
451 well-known types-as-formulae analogy of Curry-Howard,
452 \whelp's \match{} operation takes as input a type and returns a list
453 of objects (definition or proofs) inhabiting it.
455 \example{Find a proof of the distributivity of times over plus on natural
457 In order to retrieve those statements, \whelp{} need to be fed with the
458 distributivity law as input: \texttt{\texmacro{forall}
459 x,y,z:nat. x * (y+z) = x*y + x*z}. The \match{} query will return 4 results:
463 \item \url{cic:/Coq/Arith/Mult/mult_plus_distr_l.con}
464 \item \url{cic:/Coq/Arith/Mult/mult_plus_distr_r.con}
465 \item \url{cic:/Rocq/SUBST/comparith/mult_plus_distr_r.con}
466 \item \url{cic:/Sophia-Antipolis/HARDWARE/GENE/Arith_compl/mult_plus_distr2.con}
469 Each result locates a theorem in the Coq library that is organized in
470 a hiearchical fashion.
471 For example (4) identifies the \texttt{mult\_plus\_distr2}
472 theorem in the \texttt{Arith\_compl.v} file that is part of a contribution
473 on hardware circuits developed at the university of Sophia-Antipolis.
475 (1), (3), and (4) have types which are $\alpha$-convertible with the
476 user query; (2) is an interesting ``false match'' returned by \whelp{} having
477 type $\forall n,m,p\in\IN.(n+m)*p=n*p+m*p$, i.e. it is the
478 symmetric version of the distributivity proposition we were looking for.
480 The match operation simply amounts
481 to revert the indexing operation, looking for terms matching
482 the metadata set computed from the input. For instance, the term
483 \texttt{\texmacro{forall}
484 x,y,z:nat. x * (y+z) = x*y + x*z} has the following metadata:
486 \begin{tabular}{|l|l|}
489 $\NAT$ & MH(0) \\\hline
496 Note that $\NAT$ occurs in conclusion as an hidden parameter of equality;
497 the indexed term is the term after disambiguation, not the user input.
499 Searching for the distributivity law then amounts to look for a term
502 %\[\REF{\mathit{MH}(0)}{c}{nat} \wedge \REF{\mathit{MC}(3)}{c}{\le} \wedge
503 %\REF{C}{c}{nat} \wedge \REF{C}{c}{+} \wedge \REF{C}{c}{*}\]
504 \[\REF{\mathit{MH}(0)}{s}{nat} \land \REF{\mathit{MC}(3)}{s}{=} \land
505 \REF{C}{s}{nat} \land \REF{C}{s}{*} \land \REF{C}{s}{+}\]
506 In a relational database, this is a simple and efficient join operation.
508 \example{Suppose we are interested in looking for a definition of
509 summation for series of natural numbers}.
510 The type of such an object
511 is something of the kind $(\NAT\to\NAT)\to\NAT\to\NAT\to\NAT$, taking
512 the series, two natural numbers expressing summation lower and upper bound, and
513 giving back the resulting sum.
514 Feeding \whelp's \match{} query
515 with such a type does give back four results:
518 \item \url{cic:/Coq/Reals/Rfunctions/sum_nat_f.con}
519 \item \url{cic:/Sophia-Antipolis/Bertrand/Product/prod_nm.con}
520 \item \url{cic:/Sophia-Antipolis/Bertrand/Summation/sum_nm.con}
521 \item \url{cic:/Sophia-Antipolis/Rsa/Binomials/sum_nm.con}
523 Although we have a definition for summation in the standard library,
524 namely $\mathit{sum\_nat\_f}$, its theory is very underdeveloped. Luckily we
525 have a much more complete development for $\mathit{sum\_nm}$ in a contribution
528 \[\mathit{sum\_nm} \;n \;m \;f = \sum_{x=n}^{n+m}f(x)\]
530 Having discovered the name for summation, we may then
531 inquire about the proofs of some of its properties; for instance, considering
532 the semantics of $\mathit{sum\_nm}$, we may wonder if the following statement is
533 already in the library:
534 \[ \forall m,n,c:nat.(\mathit{sum\_nm}~n~m~\lambda x:nat.c) = (S~m)*c\]
535 Matching the previous theorem actually succeed, returning the following:\\
536 {\small\url{cic:/Sophia-Antipolis/Bertrand/Summation/sum_nm_c.con}}.
538 \subsubsection{Matching Incomplete Patterns}
539 \whelp{} also support matching with partial patterns, i.e. patterns
540 with placeholders denoted by $?$. The approach is essentially
541 identical to the previous one: we compute all the constants $A$ appearing
542 in the pattern, and look for all terms referring at least the set
543 of constants in $A$, at suitable positions.
545 Suppose for instance that you are interested in looking for all known facts
546 about the computation of $\mathit{sin}$ on given reals. You may just ask
547 \whelp{} to \match{} $\mathit{sin}~?=\,\,?$, that would result in the following
548 list (plus a couple of spurious results due to the fact that \coq{} variables
549 are not indexed, at present):
553 \item \url{cic:/Coq/Reals/Rtrigo/sin_2PI.con}
554 \item \url{cic:/Coq/Reals/Rtrigo/sin_PI.con}
555 \item \url{cic:/Coq/Reals/Rtrigo/sin_PI2.con}
556 \item \url{cic:/Coq/Reals/Rtrigo_calc/sin3PI4.con}
557 \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_2PI3.con}
558 \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_3PI2.con}
559 \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_5PI4.con}
560 \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_PI3.con}
561 \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_PI3_cos_PI6.con}
562 \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_PI4.con}
563 \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_PI6.con}
564 \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_PI6_cos_PI3.con}
565 \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_cos5PI4.con}
566 \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_cos_PI4.con}
567 \item \url{cic:/Coq/Reals/Rtrigo_def/sin_0.con}
570 Previous statements semantics is reasonably clear by their names; \whelp,
571 however, also performs in-line expansion of the statements, and provides
572 hyperlinks to the corresponding proofs (see Fig.~\ref{fig:match}).
575 \centerline{\includegraphics[width=0.9\textwidth]{match_sin}}
576 \caption{\label{fig:match} \match{} results.}
581 In a process of backward construction of a proof, typical of
582 proof assistants, one is often interested in knowing what theorems
583 can be applied to derive the current goal. The \hint{} operation of
584 \whelp{} is exactly meant to this purpose.
587 given a goal $g$ and a theorem
588 $t_1\to t_2\to\cdots\to t_n\to t$, the problem consists in checking
589 if there exists a substitution $\theta$ such that:
596 A necessary condition for (\ref{eq:hint1}),
597 which provides a very efficient filtering of the solution space,
598 is that the set of constants in $t$ must be a subset of those in
599 $g$. In terms of our metadata model, the problem consists
600 to find all s such as
603 \{x | Ref(s,x) \} \subseteq A
607 where $A$ is the set of constants in $g$.
608 This is not a simple operation: the naive
609 approach would require to iterate, for every possible source $s$, the
610 computation of its forward references, i.e. of $\{x | Ref(s,x) \}$,
611 followed by a set comparison with $A$.
613 The solution of~\cite{metadata2} is based on the following remarks.
614 Let us call $Card(s)$ the cardinality of $ \{x | Ref(s,x) \}$, which
615 can be pre-computed for every $s$. Then, (\ref{eq:hint2}) holds if and only if
616 there is a subset $A'$ of $A$ such that $A' = \{x | Ref(s,x) \}$, or
618 \[ A' \subseteq \{x | Ref(s,x) \} \wedge |A'| = Card(s) \]
620 \[ \bigwedge_{a \in A'} Ref(s,a) \wedge |A'| = Card(s) \]
621 The last one is a simple join that can be efficiently computed by any
622 relational database. So the actual cost is essentially bounded
623 by the computation of all subsets of $A$, and $A$, being the signature
624 of a formula, is never very large (and often quite small).
626 The problem of matching against a large library of heterogeneous
627 notions is however different and far more complex than in a
628 traditional theorem proving setting, where one typically works with
629 respect to a given theory with a fixed, and usually small signature.
630 If e.g. we look for theorems whose conclusion matches some kind
631 of equation like $e_1 = e_2$ we shall eventually find in the library
632 \emph{a lot} of injectivity results relative to operators we are
633 surely not interested in: in a library of 40,000 theorems like the
634 one of Coq we would get back about 3,000 of such \emph{silly
635 matches}. Stated in other words, canonical indexing techniques
636 specifically tailored on the matching problem such as
637 discrimination trees~\cite{McCune},
638 substitution trees~\cite{Graf} or context trees~\cite{Ganzinger}
639 are eventually doomed to fail in a mathematical knowledge management
640 context where one cannot assume a preliminary knowledge on term signatures.
642 On the other side, the metadata approach is much more flexible,
643 allowing a simple integration of matching operation with additional
644 and different constraints. For instance in the version of
645 \emph{hint} described in~\cite{metadata1} the problem of reducing
646 the number of \emph{silly matches} was solved by requiring at
647 least a minimal intersection between the signatures
648 of the two matching terms. However, this approach did sometimes
649 rule out some interesting answers. In the current version the
650 problem has been solved imposing further constraints of the full
651 signature of the term (in particular on the hypothesis),
652 essentially filtering out all solutions that would extend the
653 signature of the goal. The actual implementation of this approach
654 requires a more or less trivial extension to hypothesis of the
655 methodology described in~\cite{metadata2}.
660 Most statements in the \coq{} knowledge base concern properties of functions and
661 relations over algebraic types. Proofs of such statements are carried out
662 by structural induction over these types. In particular, to prove a goal by
663 induction over the type $t$, one needs to apply a lemma stating an induction
664 principle over $t$ (an \emph{eliminator} of $t$ \cite{mcbride}) to that
665 goal. Since many different eliminators can be provided for the same type $t$
666 (either automatically generated from $t$, or set up by the user), it is
667 convenient to have a way of retrieving all the eliminators of a given type. The
668 \elim{} query of \whelp{} does this job.\footnote{Of course it is possible to
669 let the author state what are the elimination principles, storing this
670 information as metadata accessible to \whelp{}. The technique we present
671 consists in automatically guessing the intended usage of a theorem from
672 its shape. Moreover, this is the only possible technique for legacy libraries
673 that lack classification information.}
674 To understand how it works, let's
675 take the case of the algebraic type of the natural numbers: one feeds \whelp{}
676 with the identifier $\NAT$, which denotes this type in the knowledge base, and
677 expects to find at least the well-known induction principle $\natind$:
679 \[ \forall P:\NAT \to \Prop.(P~0) \to (\forall n:\NAT.(P~n) \to (P~(S~n))) \to
680 \forall n:\NAT.(P~n) \]
682 A fairly good approximation of this statement is based on the following
683 observations: the first premise ($\NAT \to \Prop$) has an antecedent, a
684 reference to $\Prop$ in its root and a reference to $\NAT$; the forth premise
685 has no antecedents and a reference to $\NAT$ in its root; the conclusion
686 contains a bound variable in its root (i.e. $P$).
687 Notice that we choose not to approximate the major premises of $\natind$ (the
688 second and the third) because they depend on the structure of $\NAT$ and
689 discriminate the different induction principles over this type.
691 Thus, a set of constraints approximating $\natind$ is the following
692 (recall that \emph{Rel} stands for an arbitrary bound variable):
694 \begin{tabular}{|l|l|}
697 $\Prop$ & MH(1) \\\hline
699 $\NAT$ & MH(0) \\\hline
700 \emph{Rel} & MC \\\hline
704 The \elim{} query of \whelp{} simply generalizes this scheme substituting
705 $\NAT$ for a given type $t$ and retrieving any statement $c$ such that:
706 \[ \REF{\mathit{MH}(1)}{c}{\Prop} \land \REF{\mathit{H}}{c}{t} \land
707 \REF{\mathit{MH}(0)}{c}{t} \land
708 \REF{\mathit{MC}}{c}{Rel}
711 In the case of $\NAT$, \elim{} returns $47$ statements ordered by the
712 frequency of their use in the library (as expected $\natind$
715 %More sophisticated approximations of $c$ are possible, but they require a
716 %greater amount of computation over the metadata set indexing the knowledge
722 \whelp's \locate{} query implements a simple ``lookup by name'' for library
723 notions. Once fed with an identifier $i$, \locate{} returns the list of all
724 objects whose name is $i$. Intuitively, the \emph{name} of an object contained
725 in library is the name chosen for it by its author.\footnote{In the current
726 implementation object names correspond to the last fragment of object URIs,
728 %For example, the name of the definition of natural numbers
729 %\url{cic:/Coq/Init/Datatypes/nat.ind} is \texttt{nat}.
732 This list is obtained querying the specific relational metadata
733 $\mathit{Name}(c,i)$ that binds each unit of knowledge $c$ to an identifier $i$
734 (its \emph{name}). Unix-shell-like wildcards can be used to specify an
735 incomplete identifier: all objects whose name matches the incomplete identifier
738 Even if not based on the metadata model described in Sect.~\ref{sec:metadata},
739 \locate{} turns out to be really useful to browse the library since quite often
740 one remembers the name of an object, but not the corresponding contribution.
743 By entering the name $\mathit{gcd}$, \locate{} returns four different versions
744 of the ``greatest common divisor'':
748 \item \url{cic:/Orsay/Maths/gcd/gcd.ind#xpointer(1/1)}
749 \item \url{cic:/Eindhoven/POCKLINGTON/gcd/gcd.con}
750 \item \url{cic:/Sophia-Antipolis/Bertrand/Gcd/gcd.con}
751 \item \url{cic:/Sophia-Antipolis/Rsa/Divides/gcd.con}
754 %These are four versions of the ``greatest common divisor'':
755 %the first refers to a relation, the second is a function taking two integer
756 %numbers, and the last two refer to functions taking two natural numbers.
759 \section{Web Interface}
760 \label{sec:interface}
762 The result of \whelp{} is, for all queries, a
763 list of URIs (unique identifiers) for notions in the library. This list is not
764 particularly informative for the user, who would like to have
765 hyperlinks or, even better, in-line expansion of the notions.
767 In the \mowgli{} project we developed a service to render on the fly,
768 via a complex chain of XSLT transformations, mathematical objects encoded
769 in XML (those objects of the library of Coq that \whelp{} is indexing).
770 XSLT is the standard presentational language for XML documents and an
771 XSLT transformation (or \emph{stylesheet}) is a purely functional program
772 written in XSLT that describes a simple transformation between two XML formats.
774 The service can also render \emph{views} (misleadingly called ``theories''
775 in~\cite{annals}), that is an arbitrary, structured collection of mathematical
776 objects, suitably assembled (by an author or some mechanical tool) for
777 presentational purposes.
778 In a view, definitions, theorems, and so on may be intermixed
779 with explanatory text or figures, and statements are expanded
780 without proofs: a link to the corresponding proof objects
781 allows the user to inspect proofs, if desired.
783 Providing \whelp{} with an
784 appealing user interface for presenting the answers (see Fig.~\ref{fig:match})
785 has been as simple as making it generate a view and pipelining \whelp{}
786 with \UWOBO,\footnote{\url{http://helm.cs.unibo.it/software/uwobo/}}
787 the component of our architecture that implements the rendering service.
789 \UWOBO{} is a stylesheet manager implemented in
790 \OCAML\footnote{\url{http://caml.inria.fr/}} and based on \LIBXSLT
792 %\UWOBO{} has largely evolved in recent years: the version described
793 %in~\cite{annals} was implemented in Java and based on Xalan! \LIBXSLT{}
794 %(\url{http://xmlsoft.org/XSLT/}) is now the fastest available implementation of
796 , whose main functionality is the application of a list of stylesheets
797 (each one with the respective list of parameters) to a document. The
798 stylesheets are pre-compiled to improve performance. Both stylesheets and the
799 document are identified using HTTP URLs and can reside on any host. \UWOBO{} is
800 both a Web server and a Web client, accepting processing requests and asking for
801 the document to be processed. \whelp{} is a Web server, accepting queries as
802 processing requests and returning views to the client.
804 The \whelp{} interface is thus simply organized as a HTTP pipeline (see
805 Fig.~\ref{fig:http}).
808 \centerline{\includegraphics[width=0.7\textwidth]{architecture_new}}
809 \caption{\label{fig:http} \whelp's HTTP pipeline.}
813 \section{Conclusions}
814 \label{sec:conclusions}
816 \whelp{} is the Web searching helper developed at the University
817 of Bologna as a part of the European Project IST-2001-33562 \mowgli.
818 HELP is also the acronym of the four operations currently supported
819 by the system: Hint, Elim, Locate and Pattern-matching.
821 Much work remains to be done, spanning from relatively simple
822 technical improvements, to more complex architectural re-visitations
823 concerning the indexing technique and the design and implementation
826 Among the main technical improvements which we plan to support in
827 a near future there are:
829 \item the possibility to confine the search to
830 sub-libraries, and in particular to the standard library alone (this
831 is easy due to the paths of names);
832 \item skipping the annoying dialog phase with the user during
833 disambiguation for the choice of the intended interpretation, replacing it
834 with a direct investigation of all possibilities;
835 \item interactive support for advanced queries, allowing the user
836 to directly manipulate the metadata constraints (very powerful,
840 The current indexing politics has some evident limitations, resulting
841 in unexpected results of queries.
843 The most annoying problem is due to the current management of \coq{} variables.
844 Roughly, in \coq, \emph{variables} are meant for \emph{declarations}, while
845 \emph{constants} are meant for \emph{definitions}.
846 The current XML-exportation module of \coq{} \cite{exportation}
847 does not discharge section
848 variables, replacing this operation with an explicit substitution mechanism;
849 in particular, variables may be instantiated and their status look more
850 similar to local variables than to constants. For this reason, variables
851 have not been indexed; that currently looks as a mistake.
853 A second problem is due to coercions. The lack of an explicit
854 mechanism for composition of coercions tends to clutter the terms
855 with long chains of coercions, which in case of big algebraic developments
856 as e.g. C-Corn \cite{C-Corn}, can easily reach about ten elements. The fact
857 that an Object $r$ refers a coercion $c$ contains very little information,
858 especially if coercions typically come in a row, as in Coq. In the future,
859 we plan to skip coercions during indexing.
861 Notice that the two previous problems, both logic dependent, only affect
862 metadata extraction and not the metadata model that is logic independent.
864 The final set of improvements concerns the queries. A major issue, for
865 all kinds of content based operations, is to take care, at some extent,
866 of delta reduction. For instance, in Coq, the elementary order relations over
867 natural numbers are defined in terms of the \emph{less or equal} relation,
868 that is a suitable inductive type. Every query concerning such relations
869 could be thus reduced to a similar one about the \emph{less or equal} relation
870 by delta reduction. Even more appealing it looks the possibility to
871 extend the queries up to equational rewriting of the input (via
872 equations available in the library).\footnote{The possibility of
873 considering search up to isomorphism (see \cite{iso1,iso2})
874 looks instead less interesting, because
875 our indexing policy is an interesting surrogate that works very
876 well in practice while being much simpler than search up to isomorphisms.}
878 Similarly, the \hint{} operation, could and should be improved by suitably
879 tuning the current politics for computing the \emph{intended} signature
880 of the search space (for instance, ``closing'' it by delta reduction or
881 rewriting, adding constructors of inductive types, and so on).
883 Different kind of queries could be designed as well. An obvious
884 generalization of \hint{} is a \auto{}, automatically attempting
885 to solve a goal by repeated applications of theorems of the library
886 (a deeper exploration of the search space could be also useful
887 for a better rating of the \hint{} results).
889 A more interesting example of content-based query that exploits the higher
890 order nature of the input syntax is to look for all mathematical objects
891 providing examples, or instances, of a given notion. For instance
892 we may define the following property, asserting the commutativity of
894 \[is\_commutative := \lambda A:Set.\lambda f:A\to A \to A.
895 \forall x,y:A. (f\;x\;y) = (f\;y\;x)\]
896 Then, an \instance{} query should be able to retrieve from the library
897 all commutative operations which have been defined.\footnote{\match{} is
898 in fact a particular case of \instance{} where the initial sequence of
899 lambda abstractions is empty.}
901 To conclude, we remark that the only two components of \whelp{} that are
902 dependent on CIC, the logic of the Coq system, are the disambiguator for the
903 user input and the metadata extractor.
904 Moreover, the algorithm used for disambiguation depends only on the existence of
905 a refiner for mathematical formulae extended with placeholders and the metadata
906 model is logic independent. Thus \whelp{}
907 can be easily modified to index and search other mathematical libraries provided
908 that the statements of the theorems can be easily parsed (to extract the
909 metadata) and that there exists a service to render the results of the queries.
910 In particular, the Mizar development team has just released version 7.3.01 of
911 the Mizar proof assistant that provides both a native XML format and XSLT
912 stylesheets to render the proofs. Thus it is now possible to instantiate
913 \whelp{} to work on the library of Mizar, soon making possible a direct
914 comparison on the field between \whelp{} and MML Query, the new search engine
915 for Mizar articles described in \cite{mizarbrowsing,mizarsearchengine}.
916 Another interesting comparison is with the approach described in
917 \cite{cairns}, which has been tested on the Mizar library.
919 \begin{thebibliography}{}
921 \bibitem{edinburgh} A.~Asperti, F.~Guidi, L.~Padovani, C.~Sacerdoti Coen,
922 I.~Schena. \emph{The Science of Equality: some statistical considerations on
923 the Coq library}. Mathematical Knowledge Management Symposium, 25-29 November
924 2003, Heriot-Watt University, Edinburgh, Scotland.
926 \bibitem{annals} A.~Asperti, F.~Guidi, L.~Padovani, C.~Sacerdoti Coen,
927 I.~Schena. \emph{Mathematical Knowledge Management in HELM}. Annals of
928 Mathematics and Artificial Intelligence, 38(1): 27--46; May 2003.
930 \bibitem{metadata2} A. Asperti, M. Selmi. \emph{Efficient Retrieval of
931 Mathematical Statements}. In Proceeding of the Third International Conference
932 on Mathematical Knowledge Management, MKM 2004. Bialowieza, Poland. LNCS 3119.
934 \bibitem{pechino} A.Asperti, B.Wegner. \emph{An Approach to
935 Machine-Understandable Representation of the Mathematical Information in
936 Digital Documents}. In: Fengshai Bai and Bernd Wegner (eds.): Electronic
937 Information and Communication in Mathematics, LNCS vol. 2730, pp. 14--23, 2003
939 \bibitem{mizarsearchengine} G. Bancerek, P. Rudnicki. \emph{Information
940 Retrieval in MML}. In A.Asperti, B.Buchberger,J.Davenport (eds),
941 Proceedings of the Second International Conference on
942 Mathematical Knowledge Management, MKM 2003. LNCS, 2594.
944 \bibitem{mizarbrowsing} G. Bancerek, J.Urban. \emph{Integrated Semantic
945 Browsing of the Mizar Mathematical Repository}. In: A.Asperti,
946 G.Bancerek, A.Trybulec (eds.), Proceeding of the Third
947 International Conference on Mathematical Knowledge
948 Management, Springer LNCS 3119.
950 \bibitem{cairns} P.Cairns. \emph{Informalising Formal Mathematics: Searching
951 the Mizar Library with Latent Semantics}. In Proceeding of the Third
952 International Conference on Mathematical Knowledge Management, MKM 2004.
953 Bialowieza, Poland. LNCS 3119.
955 \bibitem{coq} The Coq proof-assistant, \url{http://coq.inria.fr}
957 \bibitem{C-Corn} L. Cruz-Filipe, H. Geuvers, F. Wiedijk, ``C-CoRN, the
958 Constructive Coq Repository at Nijmegen''. In: A.Asperti,
959 G.Bancerek, A.Trybulec (eds.), Proceeding of the Third
960 International Conference on Mathematical Knowledge
961 Management, Springer LNCS 3119, 88-103, 2004.
964 D.Delahaye, R. Di Cosmo. Information Retrieval in a Coq Proof Library using
965 Type Isomorphisms. In Proceedings of TYPES 99, L\"okeberg. Springer-Verlag
969 R. Di Cosmo. \emph{Isomorphisms of Types: from Lambda Calculus to
970 Information Retrieval and Language Design}, Birkhauser, 1995,
973 %\bibitem{diffeq} D. Draheim, W. Neun, D. Suliman. \emph{Classifying
974 % Differential Equations on the Web}. In Proceeding of the Third
975 % International Conference on Mathematical Knowledge Management, MKM 2004. LNCS,
978 \bibitem{Ganzinger} H. Ganzinger, R. Nieuwehuis, P. Nivela. \emph{Fast Term
979 Indexing with Coded Context Trees. Journal of Automated Reasoning}. To
982 \bibitem{Graf} P.Graf. Substitution Tree Indexing. Proceedings of the 6th RTA
983 Conference, Springer-Verlag LNCS 914, pp. 117-131, Kaiserlautern, Germany,
986 \bibitem{metadata1} F. Guidi, C. Sacerdoti Coen. \emph{Querying Distributed
987 Digital Libraries of Mathematics}. In Proceedings of Calculemus 2003, 11th
988 Symposium on the Integration of Symbolic Computation and Mechanized Reasoning.
991 \bibitem{mcbride} C. McBride. Dependently Typed Functional Programs and their
992 Proofs. Ph.D. thesis, University of Edinburgh, 1999.
994 \bibitem{McCune} W. McCune. \emph{Experiments with discrimination tree indexing
995 and path indexing for term retrieval}. Journal of Automated Reasoning,
996 9(2):147-167, October 1992.
998 \bibitem{munoz}C. Munoz. A Calculus of Substitutions for Incomplete-Proof
999 Representation in Type Theory. Ph.D. thesis, INRIA, 1997.
1001 \bibitem{exportation} C. Sacerdoti Coen. \emph{From Proof-Assistans to
1002 Distributed Libraries of Mathematics: Tips and Pitfalls}.
1003 In Proc. Mathematical Knowledge Management 2003, Lecture Notes in Computer
1004 Science, Vol. 2594, pp. 30--44, Springer-Verlag.
1006 \bibitem{disambiguation} C. Sacerdoti Coen, S. Zacchiroli. \emph{Efficient
1007 Ambiguous Parsing of Mathematical Formulae}. In Proceedings of the Third
1008 International Conference on Mathematical Knowledge Management, MKM 2004. LNCS,
1011 \end{thebibliography}