X-Git-Url: http://matita.cs.unibo.it/gitweb/?a=blobdiff_plain;f=helm%2Fpapers%2Fwhelp%2Fmoogle.tex;fp=helm%2Fpapers%2Fwhelp%2Fmoogle.tex;h=0000000000000000000000000000000000000000;hb=85747dc6d0578b484544bb8120aad7aa89813f27;hp=060fd54162809e010a8f14ffc6bb7fbb2000e8f1;hpb=c1986639552e01334a05db4236627a6c1ffacf21;p=helm.git diff --git a/helm/papers/whelp/moogle.tex b/helm/papers/whelp/moogle.tex deleted file mode 100644 index 060fd5416..000000000 --- a/helm/papers/whelp/moogle.tex +++ /dev/null @@ -1,1014 +0,0 @@ -\documentclass[a4paper]{llncs} -\pagestyle{headings} -\usepackage{graphicx} -\usepackage{amssymb,amsmath} -\usepackage{hyperref} -\usepackage{picins} - -\newcommand{\whelp}{Whelp} - -%\newcommand{\logo}[3]{ -%\parpic(0cm,0cm)(#2,#3)[l]{\includegraphics[width=#1]{whelp-bw}} -%} - -\newcommand{\coq}{Coq} -\newcommand{\mowgli}{MoWGLI} -\newcommand{\IR}{\ensuremath{\mathbb{R}}} -\newcommand{\IN}{\ensuremath{\mathbb{N}}} -\newcommand{\instance}{\textsc{Instance}} -\newcommand{\auto}{\textsc{Auto}} -\newcommand{\hint}{\textsc{Hint}} -\newcommand{\locate}{\textsc{Locate}} -\newcommand{\elim}{\textsc{Elim}} -\newcommand{\match}{\textsc{Match}} -\newcommand{\texmacro}[1]{\texttt{\char92 #1}} -\newcommand{\REF}[3]{\ensuremath{\mathit{Ref}_{#1}(#2,#3)}} -\newcommand{\NAT}{\ensuremath{\mathit{nat}}} -\newcommand{\Prop}{\mathit{Prop}} -\newcommand{\natind}{\mathit{nat\_ind}} -\newcommand{\METAHEADING}{Symbol & Position \\ \hline\hline} -\newcommand{\LIBXSLT}{LibXSLT} -\newcommand{\OCAML}{OCaml} -\newcommand{\UWOBO}{UWOBO} - -\title{A content based mathematical search engine: \whelp} -\author{Andrea Asperti, Ferruccio Guidi, Claudio Sacerdoti Coen,\\ - Enrico Tassi, and Stefano Zacchiroli} -\institute{Department of Computer Science, University of Bologna\\ - Mura Anteo Zamboni, 7 --- 40127 Bologna, ITALY\\ - \email{$\{$asperti,fguidi,sacerdot,tassi,zacchiro$\}$@cs.unibo.it}} - -\begin{document} -\maketitle - -%\begin{center} -%\Large{\textbf{A content based mathematical search engine:\\ -%\logo{0.8cm}{-1.4cm}{0.6cm}\whelp}} -%\author{Andrea Asperti, Ferruccio Guidi, Claudio Sacerdoti Coen,\\ -% Enrico Tassi, and Stefano Zacchiroli} -%\vspace{0.4cm} -%{\footnotesize -%Department of Computer Science, University of Bologna\\ -% Mura Anteo Zamboni, 7 -- 40127 Bologna, ITALY\\ -% \texttt{$\{$asperti,fguidi,sacerdot,tassi,zacchiro$\}$@cs.unibo.it}} -%\end{center} - -\begin{abstract} - The prototype of a content based search engine - for mathematical knowledge supporting a small set of queries requiring - matching and/or typing operations is described. - The prototype --- called \whelp{} --- exploits a metadata approach - for indexing the information that looks far more flexible than traditional - indexing techniques for structured expressions like substitution, - discrimination, or context trees. The prototype has been instantiated to the - standard library of the \coq{} proof assistant extended with many - user contributions. -\end{abstract} - -\section{Introduction} -\label{sec:intro} - -The paper describes the prototype of a content based search engine -for mathematical knowledge --- called \whelp{} --- developed inside -the European Project IST-2001-33562 MoWGLI~\cite{pechino}. -\whelp{} has been mostly -tested to search notions inside the library of formal mathematical -knowledge of the \coq{} proof assistant~\cite{coq}. Due to its dimension (about -40,000 theorems), this library was adopted by \mowgli{} as a main -example of repository of structured mathematical information. However, -\whelp{} --- better, its filtering phase --- only works on a small set -of metadata automatically extracted -from the structured sources, and is thus largely independent from -the actual syntax (and semantics) of the information. -Metadata also offer a higher flexibility with respect to more canonical -indexing techniques such as discrimination trees~\cite{McCune}, -substitution trees~\cite{Graf} or context trees~\cite{Ganzinger} since -all these approaches are optimized for the single operation of -(forward) matching, and are difficult to adapt or tune with -additional constraints (such as global constraints on the -signature of the term, just to make a simple but significant example). - -\whelp{} is the final output of a three-year research work inside \mowgli{} -which consisted in exporting the \coq{} library into XML, defining -a suitable set of metadata for indexing the information, implementing -the actual indexing tools, and finally designing and developing -the search engine. \whelp{} itself is the result of a complete -architectural re-visitation of a first prototype described in~\cite{metadata1}, -integrated with the efficient retrieval -mechanisms described in~\cite{metadata2} (further improved -as described in Sect.~\ref{sec:hint}), and integrated with syntactic facilities -borrowed from the disambiguating parser of~\cite{disambiguation}. -Since the prototype version described in~\cite{metadata1}, also -the Web interface has been completely rewritten and simplified, -exploiting most of the publishing techniques -developed for the hypertextual rendering of the \coq{} library -(see \url{http://helm.cs.unibo.it/}) and described in -Sect.~\ref{sec:interface}. - -The \whelp{} search engine is located at \url{http://helm.cs.unibo.it/whelp}. - -\section{Syntax} -\label{sec:syntax} - -\whelp{} interacts with the user as a classical World Wide Web search engine, it -expects single line queries and returns a list of results matching it. \whelp{} -currently supports four different kinds of queries, addressing different -user-requirements emerged in \mowgli: \match, \hint, \elim, and \locate{} -(described in Sect.~\ref{sec:queries}). The list is not meant to be exhaustive -and is likely to grow in the future. - -The most typical of these queries (\match{} and \hint) require -the user to input a term of the Calculus of (co-)Inductive Constructions --- CIC ---- (the underlying calculus of \coq), supporting different kinds of pattern -based queries. Nevertheless, the concrete syntax we chose for writing the -input term is not bound to any specific logical system: it has been designed to -be as similar as possible to ordinary mathematics formulae, in their \TeX{} -encoding (see Table~\ref{tab:syntax}). - -\begin{table}[ht] - \begin{center} - \begin{tabular}{lcl@{\hspace{1em}}l} - \emph{term} & ::= & \emph{identifier} & \\ - & $|$ & \emph{number} & \\ - & $|$ & \texttt{Prop} $~|~$ \texttt{Type} $~|~$ \texttt{Set} & sort \\ - & $|$ & \texttt{?} & placeholder \\ - & $|$ & \emph{term} \emph{term} & application \\ - & $|$ & \emph{binder} \emph{vars} \texttt{.} \emph{term} & abstraction \\ - & $|$ & \emph{term} \texmacro{to} \emph{term} & arrow type \\ - & $|$ & \texttt{(} \emph{term} \texttt{)} & grouping \\ - & $|$ & \emph{term} \emph{binop} \emph{term} & binary operator \\ - & $|$ & \emph{unop} \emph{term} & unary operator \\ - \emph{binder} & ::= & \texmacro{forall} $~|~$ \texmacro{exists} $~|~$ -% \texmacro{Pi} $~|~$ - \texmacro{lambda} \\ - \emph{vars} & ::= & \emph{names} & variables \\ - & $|$ & \emph{names} \texttt{:} \emph{term} & typed variables \\ - \emph{names} & ::= & \emph{identifier} $~|~$ \emph{identifier} \emph{names} - \\ - \emph{binop} & ::= & \texttt{+} $~|~$ \texttt{-} $~|~$ \texttt{*} $~|~$ - \texttt{/} $~|~$ \texttt{\^} & arithmetic operators \\ - & $|$ & \texttt{<} $~|~$ \texttt{>} $~|~$ \texmacro{leq} $~|~$ - \texmacro{geq} $~|~$ \texttt{=} $~|~$ \texmacro{neq} - & comparison operators \\ - & $|$ & \texmacro{lor} $~|~$ \texmacro{land} & - logical operators \\ - \emph{unop} & ::= & \texttt{-} & unary minus \\ - & $|$ & \texmacro{lnot} & logical negation - \end{tabular} - \end{center} - \caption{\whelp's term syntax\label{tab:syntax}} -\end{table} - -As a consequence of the generality of syntax, user -provided terms do not usually have a unique direct mapping to CIC term, -but must -be suitably interpreted -in order to solve \emph{ambiguities}. -Consider for example the following term input: -\texttt{\texmacro{forall} x. 1*x = x}. in order to find the -corresponding CIC term we need to know the possible meanings of the -number \texttt{1} and the symbol \texttt{*}.\footnote{Note that -\texttt{x} is not undetermined, since it is a -bound variable.}%, number \texttt{1}, and symbol \texttt{+}. - -The typical processing of a user query (depicted in Fig.~\ref{fig:engine}) is -therefore a pipeline made of four distinct phases: \emph{parsing} (canonical -transformation from concrete textual syntax to Abstract Syntax Trees, ASTs for -short), \emph{disambiguation} (described in next section), \emph{metadata -extraction} (described in Sect.~\ref{sec:metadata}), and the actual -\emph{query} (described in Sect.~\ref{sec:queries}). - -It is worth observing that while the preliminary phases (disambiguation -and metadata extraction) of \whelp{} are not logic independent, the query -engine is. - -\begin{figure}[ht] - \begin{center} - \includegraphics[width=0.9\textwidth]{engine_new} - \caption{\whelp's processing\label{fig:engine}} - \end{center} -\end{figure} - - -\section{Disambiguation} -\label{sec:disambiguation} - -The disambiguation phase builds CIC terms from ASTs of user inputs (also called -\emph{ambiguous terms}). Ambiguous terms may carry three different \emph{sources -of ambiguity}: unbound identifiers, literal numbers, and literal symbols. -\emph{Unbound identifiers} are sources of ambiguity since the same name -could have been used to represent different objects. For example, three -different theorems of the \coq{} library share the name $\mathit{plus\_assoc}$ -(locating them is an exercise for the interested reader. Hint: use \whelp's -\locate{} query). - -\emph{Numbers} are ambiguous since several different encodings of them could be -provided in logical systems. In the \coq{} standard library for example we found -naturals (in their unary encoding), positives (binary encoding), integers -(signed positives), and reals. Finally, \emph{symbols} (instances of the -\emph{binop} and \emph{unop} syntactic categories of Table~\ref{tab:syntax}) are -ambiguous as well: infix \texttt{+} for example is overloaded to represent -addition over the four different kinds of numbers available in the \coq{} -standard library. Note that given a term with more than one sources of -ambiguity, not all possible disambiguation choices are valid: for example, given -the input \texttt{1+1} we must choose an interpretation of \texttt{+} which is -typable in CIC according to the chosen interpretation for \texttt{1}; choosing -as \texttt{+} the addition over natural numbers and as \texttt{1} the real -number $1$ will lead to a type error. - -A \emph{disambiguation algorithm} takes as input an ambiguous term and return a -fully determined CIC term. The \emph{naive disambiguation algorithm} takes as -input an ambiguous term $t$ and proceeds as follows: - -\begin{enumerate} - - \item Create disambiguation domains $\{D_i | i\in\mathit{Dom}(t)\}$, where - $\mathit{Dom}(t)$ is the set of ambiguity sources of $t$. Each $D_i$ is a set - of CIC terms. - - \item Let $\Phi = \{\phi_i | {i\in\mathit{Dom}(t)},\phi_i\in D_i\}$ -% such that $\forall i\in\mathit{Dom}(t),\exists\phi_j\in\Phi,i=j$ - be an interpretation for $t$. Given $t$ and an interpretation $\Phi$, a CIC - term is fully determined. Iterate over all possible interpretations of $t$ and - type-check them, keep only typable interpretations (i.e. interpretations that - determine typable terms). - - \item Let $n$ be the number of interpretations who survived step 2. If $n=0$ - signal a type error. If $n=1$ we have found exactly one CIC term - corresponding to $t$, returns it as output of the disambiguation phase. - If $n>1$ let the user choose one of the $n$ interpretations and returns the - corresponding term. - -\end{enumerate} - -The above algorithm is highly inefficient since the number of possible -interpretations $\Phi$ grows exponentially with the number of ambiguity sources. -The actual algorithm used in \whelp{} is far more efficient being, in the -average case, linear in the number of ambiguity sources. - -The efficient algorithm can be applied if the logic can be extended with -metavariables and a refiner can be implemented. This is the case for CIC and -several other logics. -\emph{Metavariables}~\cite{munoz} are typed, non linear placeholders that can -occur in terms; $?_i$ usually denotes the $i$-th metavariable, while $?$ denotes -a freshly created metavariable. A \emph{refiner}~\cite{mcbride} is a -function whose input is a term with placeholders and whose output is either a -new term obtained instantiating some placeholder or $\epsilon$, meaning that no -well typed instantiation could be found for the placeholders occurring in -the term (type error). - -The efficient algorithm starts with an interpretation $\Phi_0 = \{\phi_i | -\phi_i = ?, i\in\mathit{Dom}(t)\}$, -which associates a fresh metavariable to each -source of ambiguity. Then it iterates refining the current CIC term (i.e. the -term obtained interpreting $t$ with $\Phi_i$). If the refinement succeeds the -next interpretation $\Phi_{i+1}$ will be created \emph{making a choice}, that is -replacing a placeholder with one of the possible choice from the corresponding -disambiguation domain. The placeholder to be replaced is chosen following a -preorder visit of the ambiguous term. If the refinement fails the current set of -choices cannot lead to a well-typed term and backtracking is attempted. -Once an unambiguous correct interpretation is found (i.e. $\Phi_i$ does no -longer contain any placeholder), backtracking is attempted -anyway to find the other correct interpretations. - -The intuition which explain why this algorithm is more efficient is that as soon -as a term containing placeholders is not typable, no further instantiation of -its placeholders could lead to a typable term. For example, during the -disambiguation of user input \texttt{\texmacro{forall} x. x*0 = 0}, an -interpretation $\Phi_i$ is encountered which associates $?$ to the instance -of \texttt{0} on the right, the real number $0$ to the instance of \texttt{0} on -the left, and the multiplication over natural numbers (\texttt{mult} for short) -to \texttt{*}. The refiner will fail, since \texttt{mult} require a natural -argument, and no further instantiation of the placeholder will be tried. - -If, at the end of the disambiguation, more than one possible interpretations are -possible, the user will be asked to choose the intended one (see -Fig.~\ref{fig:disambiguation}). - -\begin{figure}[htb] - \centerline{\includegraphics[width=0.9\textwidth]{disambiguation-1}} - \caption{\label{fig:disambiguation} Disambiguation: interpretation choice} -\end{figure} - -Details of the disambiguation algorithm of \whelp{} can -be found in~\cite{disambiguation}, where an equivalent algorithm -that avoids backtracking is also presented. - -%\example{Pattern matching query.} When fed with input \texttt{\texmacro{forall} -%n,m:nat. (S ?) = ? \texmacro{to} n = m}, \match{} return the following 3 -%results: -% -%\begin{description} -% \item[\url{cic:/Coq/Init/Peano/eq_add_S.con}] -% \item[\url{cic:/Eindhoven/POCKLINGTON/lemmas/S_inj.con}] -% \item[\url{cic:/Rocq/TreeAutomata/bases/Sn_eq_Sm_n_eq_m.con}] -%\end{description} -% -%all having type $\forall n,m\in\IN.(S~n)=(S~m)\to n=m$, where first placeholder -%has been instantiated with $n$ and second with $(S~m)$. -% - - -\section{Metadata} -\label{sec:metadata} - -We use a logic-independent metadata model for indexing mathematical notions. -The model is essentially based on a single ternary relation \REF{p}{s}{t} -stating that an object $s$ refers an object $t$ at a given position -$p$. We use a minimal set of positions discriminating the hypotheses -(H), from the conclusion (C) and the proof (P) of a theorem (respectively, -the type of the input parameters, the type of the result, and -the body of a definition). Moreover, in the hypothesis and in the -conclusion we also distinguish the root position (MH and MC, respectively) -from deeper positions (that, in a first order setting, essentially amounts -to distinguish relational symbols from functional ones). -Extending the set of positions we could improve the granularity -and the precision of our indexing technique but so far, apart from a simple -extension discussed below, we never felt this need. - -\begin{example} -Consider the statement: -\[\forall m,n:\NAT. m \le n \to m < (S~n)\] -its metadata are described by the following table: -\begin{center} -\begin{tabular}{|l|l|} -\hline -\METAHEADING -$\NAT$ & MH \\\hline -$\le$ & MH \\\hline -$<$ & MC \\\hline -$S$ & C \\\hline -\end{tabular} -\end{center} -All occurrences of bound variables in position MH or MC are collapsed under -a unique reserved name \emph{Rel}, forgetting the actual variable name. -The occurrences in other positions are not considered. -See Sect.~\ref{sec:elim} for an example of use. -% -% -%Note that bound variables are not indexed: it would require extra work to -%take care of alpha conversion and also in this case we do not feel the -%need to index them to improve accuracy; however for the \elim{} -%query (Sect.~\ref{sec:elim}) we will index the fact that \emph{a} bound variable -%occurs in root position (without recording which variable). -%Looking for the statement in the library (that is the match -%operation of section ...) amounts -%to revert the indexing, searching for some some term $c$ such that -%\[Ref_{MH}(c,nat) \wedge Ref_{MH}(c,\le) \wedge Ref_{MC}(c,<) -%\wedge Ref_{C}(c,S) \] -%In a relational database, this is a simple and efficient join operation. -\end{example} - -The accuracy of metadata for discriminating the statements of the library -is remarkable. We computed (see~\cite{edinburgh}) -that the average number of mathematical notions -in the Coq library sharing the same metadata set is close to the actual number -of \emph{duplicates} (i.e. metadata almost precisely identify statements). - -If more accuracy is needed, further filtering phases can be appended to the -\whelp{} pipeline of Fig.~\ref{fig:engine} to prune out false matches. -For instance, since the number of results of the query phase is usually -small, very accurate yet slow filters can be exploited. - -According to the type as proposition analogy, the metadata above may -be also used to index the type of functions. -For instance, functions from $\NAT$ to $R$ (reals) would be identified by the -following metadata: -\begin{center} -\begin{tabular}{|l|l|} -\hline -\METAHEADING -$\NAT$ & MH \\\hline -$R$ & MC \\\hline -\end{tabular} -\end{center} -in this case, however, the metadata model is a bit too rough, since for instance -functions of type $\NAT\to\NAT$, $\NAT\to\NAT\to\NAT$, -$(\NAT\to\NAT)\to\NAT\to\NAT$ and so on would all share the following metadata -set: -\begin{center} -\begin{tabular}{|l|l|} -\hline -\METAHEADING -$\NAT$ & MH \\\hline -$\NAT$ & MC \\\hline -\end{tabular} -\end{center} -To improve this situation, we add an integer to MC (MH), -expressing the number of parameters of the term (respectively, -of the given hypothesis). -We call \emph{depth} this value since in the case of CIC is equal to -the nesting depth of -dependent products\footnote{Recall that in type theory, the function space is -just a degenerate case -of dependent product.} along the spine relative to the given position. -For instance, the three types above would now have the -following metadata sets: - -\begin{center} - \hspace*{\fill} - \begin{tabular}[t]{|l|l|} - \multicolumn{2}{c}{$\NAT\to\NAT$} \\ \hline - \METAHEADING - $\NAT$ & MH(0) \\ \hline - $\NAT$ & MC(1) \\ \hline - \end{tabular} - \hfill - \begin{tabular}[t]{|l|l|} - \multicolumn{2}{c}{$\NAT\to\NAT\to\NAT$} \\ \hline - \METAHEADING - $\NAT$ & MH(0) \\ \hline - $\NAT$ & MC(2) \\ \hline - \end{tabular} - \hfill - \begin{tabular}[t]{|l|l|} - \multicolumn{2}{c}{$(\NAT\to\NAT)\to\NAT\to\NAT$} \\ \hline - \METAHEADING - $\NAT$ & MH(1) \\ \hline - $\NAT$ & H \\ \hline - $\NAT$ & MH(0) \\ \hline - $\NAT$ & MC(2) \\ \hline - \end{tabular} - \hspace*{\fill} -\end{center} - -The depth is a technical improvement that is particularly important -for retrieving functions from their types (we shall also see a use in the -\elim{} query, Sect.~\ref{sec:elim}), but is otherwise of minor relevance. In -the following examples, -we shall always list it among the metadata -for the sake of completeness, but it may be usually neglected by the reader. - -\section{\whelp{} Queries} -\label{sec:queries} - -\subsection{\match} -\label{sec:match} -Not all mathematical results have a canonical name or a -set of keywords which could easily identify them. -For this reason, it is extremely useful to be able to search the library -by means of the explicit statement. More generally, exploiting the -well-known types-as-formulae analogy of Curry-Howard, -\whelp's \match{} operation takes as input a type and returns a list -of objects (definition or proofs) inhabiting it. - -\example{Find a proof of the distributivity of times over plus on natural -numbers.} -In order to retrieve those statements, \whelp{} need to be fed with the -distributivity law as input: \texttt{\texmacro{forall} -x,y,z:nat. x * (y+z) = x*y + x*z}. The \match{} query will return 4 results: - -\begin{enumerate} - \small - \item \url{cic:/Coq/Arith/Mult/mult_plus_distr_l.con} - \item \url{cic:/Coq/Arith/Mult/mult_plus_distr_r.con} - \item \url{cic:/Rocq/SUBST/comparith/mult_plus_distr_r.con} - \item \url{cic:/Sophia-Antipolis/HARDWARE/GENE/Arith_compl/mult_plus_distr2.con} -\end{enumerate} - -Each result locates a theorem in the Coq library that is organized in -a hiearchical fashion. -For example (4) identifies the \texttt{mult\_plus\_distr2} -theorem in the \texttt{Arith\_compl.v} file that is part of a contribution -on hardware circuits developed at the university of Sophia-Antipolis. - -(1), (3), and (4) have types which are $\alpha$-convertible with the -user query; (2) is an interesting ``false match'' returned by \whelp{} having -type $\forall n,m,p\in\IN.(n+m)*p=n*p+m*p$, i.e. it is the -symmetric version of the distributivity proposition we were looking for. - -The match operation simply amounts -to revert the indexing operation, looking for terms matching -the metadata set computed from the input. For instance, the term -\texttt{\texmacro{forall} -x,y,z:nat. x * (y+z) = x*y + x*z} has the following metadata: -\begin{center} -\begin{tabular}{|l|l|} -\hline -\METAHEADING -$\NAT$ & MH(0) \\\hline -$=$ & MC(3) \\\hline -$\NAT$ & C \\\hline -$*$ & C \\\hline -$+$ & C \\\hline -\end{tabular} -\end{center} -Note that $\NAT$ occurs in conclusion as an hidden parameter of equality; -the indexed term is the term after disambiguation, not the user input. - -Searching for the distributivity law then amounts to look for a term -$s$ -such that : -%\[\REF{\mathit{MH}(0)}{c}{nat} \wedge \REF{\mathit{MC}(3)}{c}{\le} \wedge -%\REF{C}{c}{nat} \wedge \REF{C}{c}{+} \wedge \REF{C}{c}{*}\] -\[\REF{\mathit{MH}(0)}{s}{nat} \land \REF{\mathit{MC}(3)}{s}{=} \land -\REF{C}{s}{nat} \land \REF{C}{s}{*} \land \REF{C}{s}{+}\] -In a relational database, this is a simple and efficient join operation. - -\example{Suppose we are interested in looking for a definition of -summation for series of natural numbers}. -The type of such an object -is something of the kind $(\NAT\to\NAT)\to\NAT\to\NAT\to\NAT$, taking -the series, two natural numbers expressing summation lower and upper bound, and -giving back the resulting sum. -Feeding \whelp's \match{} query -with such a type does give back four results: -\begin{enumerate} - \small - \item \url{cic:/Coq/Reals/Rfunctions/sum_nat_f.con} - \item \url{cic:/Sophia-Antipolis/Bertrand/Product/prod_nm.con} - \item \url{cic:/Sophia-Antipolis/Bertrand/Summation/sum_nm.con} - \item \url{cic:/Sophia-Antipolis/Rsa/Binomials/sum_nm.con} -\end{enumerate} -Although we have a definition for summation in the standard library, -namely $\mathit{sum\_nat\_f}$, its theory is very underdeveloped. Luckily we -have a much more complete development for $\mathit{sum\_nm}$ in a contribution -from Sophia, where: - -\[\mathit{sum\_nm} \;n \;m \;f = \sum_{x=n}^{n+m}f(x)\] - -Having discovered the name for summation, we may then -inquire about the proofs of some of its properties; for instance, considering -the semantics of $\mathit{sum\_nm}$, we may wonder if the following statement is -already in the library: -\[ \forall m,n,c:nat.(\mathit{sum\_nm}~n~m~\lambda x:nat.c) = (S~m)*c\] -Matching the previous theorem actually succeed, returning the following:\\ -{\small\url{cic:/Sophia-Antipolis/Bertrand/Summation/sum_nm_c.con}}. - -\subsubsection{Matching Incomplete Patterns} -\whelp{} also support matching with partial patterns, i.e. patterns -with placeholders denoted by $?$. The approach is essentially -identical to the previous one: we compute all the constants $A$ appearing -in the pattern, and look for all terms referring at least the set -of constants in $A$, at suitable positions. - -Suppose for instance that you are interested in looking for all known facts -about the computation of $\mathit{sin}$ on given reals. You may just ask -\whelp{} to \match{} $\mathit{sin}~?=\,\,?$, that would result in the following -list (plus a couple of spurious results due to the fact that \coq{} variables -are not indexed, at present): - -\begin{enumerate} - \small - \item \url{cic:/Coq/Reals/Rtrigo/sin_2PI.con} - \item \url{cic:/Coq/Reals/Rtrigo/sin_PI.con} - \item \url{cic:/Coq/Reals/Rtrigo/sin_PI2.con} - \item \url{cic:/Coq/Reals/Rtrigo_calc/sin3PI4.con} - \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_2PI3.con} - \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_3PI2.con} - \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_5PI4.con} - \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_PI3.con} - \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_PI3_cos_PI6.con} - \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_PI4.con} - \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_PI6.con} - \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_PI6_cos_PI3.con} - \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_cos5PI4.con} - \item \url{cic:/Coq/Reals/Rtrigo_calc/sin_cos_PI4.con} - \item \url{cic:/Coq/Reals/Rtrigo_def/sin_0.con} -\end{enumerate} - -Previous statements semantics is reasonably clear by their names; \whelp, -however, also performs in-line expansion of the statements, and provides -hyperlinks to the corresponding proofs (see Fig.~\ref{fig:match}). - -\begin{figure}[tb] - \centerline{\includegraphics[width=0.9\textwidth]{match_sin}} - \caption{\label{fig:match} \match{} results.} -\end{figure} - -\subsection{\hint} -\label{sec:hint} -In a process of backward construction of a proof, typical of -proof assistants, one is often interested in knowing what theorems -can be applied to derive the current goal. The \hint{} operation of -\whelp{} is exactly meant to this purpose. - -Formally, -given a goal $g$ and a theorem -$t_1\to t_2\to\cdots\to t_n\to t$, the problem consists in checking -if there exists a substitution $\theta$ such that: - -\begin{equation} - t\theta = g - \label{eq:hint1} -\end{equation} - -A necessary condition for (\ref{eq:hint1}), -which provides a very efficient filtering of the solution space, -is that the set of constants in $t$ must be a subset of those in -$g$. In terms of our metadata model, the problem consists -to find all s such as - -\begin{equation} - \{x | Ref(s,x) \} \subseteq A - \label{eq:hint2} -\end{equation} - -where $A$ is the set of constants in $g$. -This is not a simple operation: the naive -approach would require to iterate, for every possible source $s$, the -computation of its forward references, i.e. of $\{x | Ref(s,x) \}$, -followed by a set comparison with $A$. - -The solution of~\cite{metadata2} is based on the following remarks. -Let us call $Card(s)$ the cardinality of $ \{x | Ref(s,x) \}$, which -can be pre-computed for every $s$. Then, (\ref{eq:hint2}) holds if and only if -there is a subset $A'$ of $A$ such that $A' = \{x | Ref(s,x) \}$, or -equivalently: -\[ A' \subseteq \{x | Ref(s,x) \} \wedge |A'| = Card(s) \] -and finally: -\[ \bigwedge_{a \in A'} Ref(s,a) \wedge |A'| = Card(s) \] -The last one is a simple join that can be efficiently computed by any -relational database. So the actual cost is essentially bounded -by the computation of all subsets of $A$, and $A$, being the signature -of a formula, is never very large (and often quite small). - -The problem of matching against a large library of heterogeneous -notions is however different and far more complex than in a -traditional theorem proving setting, where one typically works with -respect to a given theory with a fixed, and usually small signature. -If e.g. we look for theorems whose conclusion matches some kind -of equation like $e_1 = e_2$ we shall eventually find in the library -\emph{a lot} of injectivity results relative to operators we are -surely not interested in: in a library of 40,000 theorems like the -one of Coq we would get back about 3,000 of such \emph{silly -matches}. Stated in other words, canonical indexing techniques -specifically tailored on the matching problem such as -discrimination trees~\cite{McCune}, -substitution trees~\cite{Graf} or context trees~\cite{Ganzinger} -are eventually doomed to fail in a mathematical knowledge management -context where one cannot assume a preliminary knowledge on term signatures. - -On the other side, the metadata approach is much more flexible, -allowing a simple integration of matching operation with additional -and different constraints. For instance in the version of -\emph{hint} described in~\cite{metadata1} the problem of reducing -the number of \emph{silly matches} was solved by requiring at -least a minimal intersection between the signatures -of the two matching terms. However, this approach did sometimes -rule out some interesting answers. In the current version the -problem has been solved imposing further constraints of the full -signature of the term (in particular on the hypothesis), -essentially filtering out all solutions that would extend the -signature of the goal. The actual implementation of this approach -requires a more or less trivial extension to hypothesis of the -methodology described in~\cite{metadata2}. - -\subsection{\elim} -\label{sec:elim} - -Most statements in the \coq{} knowledge base concern properties of functions and -relations over algebraic types. Proofs of such statements are carried out -by structural induction over these types. In particular, to prove a goal by -induction over the type $t$, one needs to apply a lemma stating an induction -principle over $t$ (an \emph{eliminator} of $t$ \cite{mcbride}) to that -goal. Since many different eliminators can be provided for the same type $t$ -(either automatically generated from $t$, or set up by the user), it is -convenient to have a way of retrieving all the eliminators of a given type. The -\elim{} query of \whelp{} does this job.\footnote{Of course it is possible to -let the author state what are the elimination principles, storing this -information as metadata accessible to \whelp{}. The technique we present -consists in automatically guessing the intended usage of a theorem from -its shape. Moreover, this is the only possible technique for legacy libraries -that lack classification information.} -To understand how it works, let's -take the case of the algebraic type of the natural numbers: one feeds \whelp{} -with the identifier $\NAT$, which denotes this type in the knowledge base, and -expects to find at least the well-known induction principle $\natind$: - -\[ \forall P:\NAT \to \Prop.(P~0) \to (\forall n:\NAT.(P~n) \to (P~(S~n))) \to - \forall n:\NAT.(P~n) \] - -A fairly good approximation of this statement is based on the following -observations: the first premise ($\NAT \to \Prop$) has an antecedent, a -reference to $\Prop$ in its root and a reference to $\NAT$; the forth premise -has no antecedents and a reference to $\NAT$ in its root; the conclusion -contains a bound variable in its root (i.e. $P$). -Notice that we choose not to approximate the major premises of $\natind$ (the -second and the third) because they depend on the structure of $\NAT$ and -discriminate the different induction principles over this type. - -Thus, a set of constraints approximating $\natind$ is the following -(recall that \emph{Rel} stands for an arbitrary bound variable): -\begin{center} -\begin{tabular}{|l|l|} -\hline -\METAHEADING -$\Prop$ & MH(1) \\\hline -$\NAT$ & H \\\hline -$\NAT$ & MH(0) \\\hline -\emph{Rel} & MC \\\hline -\end{tabular} -\end{center} - -The \elim{} query of \whelp{} simply generalizes this scheme substituting -$\NAT$ for a given type $t$ and retrieving any statement $c$ such that: -\[ \REF{\mathit{MH}(1)}{c}{\Prop} \land \REF{\mathit{H}}{c}{t} \land - \REF{\mathit{MH}(0)}{c}{t} \land - \REF{\mathit{MC}}{c}{Rel} -\] - -In the case of $\NAT$, \elim{} returns $47$ statements ordered by the -frequency of their use in the library (as expected $\natind$ -is the first one). - -%More sophisticated approximations of $c$ are possible, but they require a -%greater amount of computation over the metadata set indexing the knowledge -%base. - -\subsection{\locate} -\label{sec:locate} - -\whelp's \locate{} query implements a simple ``lookup by name'' for library -notions. Once fed with an identifier $i$, \locate{} returns the list of all -objects whose name is $i$. Intuitively, the \emph{name} of an object contained -in library is the name chosen for it by its author.\footnote{In the current -implementation object names correspond to the last fragment of object URIs, -without extension. -%For example, the name of the definition of natural numbers -%\url{cic:/Coq/Init/Datatypes/nat.ind} is \texttt{nat}. -} - -This list is obtained querying the specific relational metadata -$\mathit{Name}(c,i)$ that binds each unit of knowledge $c$ to an identifier $i$ -(its \emph{name}). Unix-shell-like wildcards can be used to specify an -incomplete identifier: all objects whose name matches the incomplete identifier -are returned. - -Even if not based on the metadata model described in Sect.~\ref{sec:metadata}, -\locate{} turns out to be really useful to browse the library since quite often -one remembers the name of an object, but not the corresponding contribution. - -\begin{example} -By entering the name $\mathit{gcd}$, \locate{} returns four different versions -of the ``greatest common divisor'': - -\begin{enumerate} - \small - \item \url{cic:/Orsay/Maths/gcd/gcd.ind#xpointer(1/1)} - \item \url{cic:/Eindhoven/POCKLINGTON/gcd/gcd.con} - \item \url{cic:/Sophia-Antipolis/Bertrand/Gcd/gcd.con} - \item \url{cic:/Sophia-Antipolis/Rsa/Divides/gcd.con} -\end{enumerate} - -%These are four versions of the ``greatest common divisor'': -%the first refers to a relation, the second is a function taking two integer -%numbers, and the last two refer to functions taking two natural numbers. -\end{example} - -\section{Web Interface} -\label{sec:interface} - -The result of \whelp{} is, for all queries, a -list of URIs (unique identifiers) for notions in the library. This list is not -particularly informative for the user, who would like to have -hyperlinks or, even better, in-line expansion of the notions. - -In the \mowgli{} project we developed a service to render on the fly, -via a complex chain of XSLT transformations, mathematical objects encoded -in XML (those objects of the library of Coq that \whelp{} is indexing). -XSLT is the standard presentational language for XML documents and an -XSLT transformation (or \emph{stylesheet}) is a purely functional program -written in XSLT that describes a simple transformation between two XML formats. - -The service can also render \emph{views} (misleadingly called ``theories'' -in~\cite{annals}), that is an arbitrary, structured collection of mathematical -objects, suitably assembled (by an author or some mechanical tool) for -presentational purposes. -In a view, definitions, theorems, and so on may be intermixed -with explanatory text or figures, and statements are expanded -without proofs: a link to the corresponding proof objects -allows the user to inspect proofs, if desired. - -Providing \whelp{} with an -appealing user interface for presenting the answers (see Fig.~\ref{fig:match}) -has been as simple as making it generate a view and pipelining \whelp{} -with \UWOBO,\footnote{\url{http://helm.cs.unibo.it/software/uwobo/}} -the component of our architecture that implements the rendering service. - -\UWOBO{} is a stylesheet manager implemented in -\OCAML\footnote{\url{http://caml.inria.fr/}} and based on \LIBXSLT -%\footnote{Even -%\UWOBO{} has largely evolved in recent years: the version described -%in~\cite{annals} was implemented in Java and based on Xalan! \LIBXSLT{} -%(\url{http://xmlsoft.org/XSLT/}) is now the fastest available implementation of -%XSLT.} -, whose main functionality is the application of a list of stylesheets -(each one with the respective list of parameters) to a document. The -stylesheets are pre-compiled to improve performance. Both stylesheets and the -document are identified using HTTP URLs and can reside on any host. \UWOBO{} is -both a Web server and a Web client, accepting processing requests and asking for -the document to be processed. \whelp{} is a Web server, accepting queries as -processing requests and returning views to the client. - -The \whelp{} interface is thus simply organized as a HTTP pipeline (see -Fig.~\ref{fig:http}). - -\begin{figure}[!ht] - \centerline{\includegraphics[width=0.7\textwidth]{architecture_new}} - \caption{\label{fig:http} \whelp's HTTP pipeline.} -\end{figure} - - -\section{Conclusions} -\label{sec:conclusions} - -\whelp{} is the Web searching helper developed at the University -of Bologna as a part of the European Project IST-2001-33562 \mowgli. -HELP is also the acronym of the four operations currently supported -by the system: Hint, Elim, Locate and Pattern-matching. - -Much work remains to be done, spanning from relatively simple -technical improvements, to more complex architectural re-visitations -concerning the indexing technique and the design and implementation -of the queries. - -Among the main technical improvements which we plan to support in -a near future there are: -\begin{enumerate} -\item the possibility to confine the search to -sub-libraries, and in particular to the standard library alone (this -is easy due to the paths of names); -\item skipping the annoying dialog phase with the user during -disambiguation for the choice of the intended interpretation, replacing it -with a direct investigation of all possibilities; -\item interactive support for advanced queries, allowing the user -to directly manipulate the metadata constraints (very powerful, -if properly used). -\end{enumerate} - -The current indexing politics has some evident limitations, resulting -in unexpected results of queries. - -The most annoying problem is due to the current management of \coq{} variables. -Roughly, in \coq, \emph{variables} are meant for \emph{declarations}, while -\emph{constants} are meant for \emph{definitions}. -The current XML-exportation module of \coq{} \cite{exportation} -does not discharge section -variables, replacing this operation with an explicit substitution mechanism; -in particular, variables may be instantiated and their status look more -similar to local variables than to constants. For this reason, variables -have not been indexed; that currently looks as a mistake. - -A second problem is due to coercions. The lack of an explicit -mechanism for composition of coercions tends to clutter the terms -with long chains of coercions, which in case of big algebraic developments -as e.g. C-Corn \cite{C-Corn}, can easily reach about ten elements. The fact -that an Object $r$ refers a coercion $c$ contains very little information, -especially if coercions typically come in a row, as in Coq. In the future, -we plan to skip coercions during indexing. - -Notice that the two previous problems, both logic dependent, only affect -metadata extraction and not the metadata model that is logic independent. - -The final set of improvements concerns the queries. A major issue, for -all kinds of content based operations, is to take care, at some extent, -of delta reduction. For instance, in Coq, the elementary order relations over -natural numbers are defined in terms of the \emph{less or equal} relation, -that is a suitable inductive type. Every query concerning such relations -could be thus reduced to a similar one about the \emph{less or equal} relation -by delta reduction. Even more appealing it looks the possibility to -extend the queries up to equational rewriting of the input (via -equations available in the library).\footnote{The possibility of -considering search up to isomorphism (see \cite{iso1,iso2}) -looks instead less interesting, because -our indexing policy is an interesting surrogate that works very -well in practice while being much simpler than search up to isomorphisms.} - -Similarly, the \hint{} operation, could and should be improved by suitably -tuning the current politics for computing the \emph{intended} signature -of the search space (for instance, ``closing'' it by delta reduction or -rewriting, adding constructors of inductive types, and so on). - -Different kind of queries could be designed as well. An obvious -generalization of \hint{} is a \auto{}, automatically attempting -to solve a goal by repeated applications of theorems of the library -(a deeper exploration of the search space could be also useful -for a better rating of the \hint{} results). - -A more interesting example of content-based query that exploits the higher -order nature of the input syntax is to look for all mathematical objects -providing examples, or instances, of a given notion. For instance -we may define the following property, asserting the commutativity of -a generic function f -\[is\_commutative := \lambda A:Set.\lambda f:A\to A \to A. -\forall x,y:A. (f\;x\;y) = (f\;y\;x)\] -Then, an \instance{} query should be able to retrieve from the library -all commutative operations which have been defined.\footnote{\match{} is -in fact a particular case of \instance{} where the initial sequence of -lambda abstractions is empty.} - -To conclude, we remark that the only two components of \whelp{} that are -dependent on CIC, the logic of the Coq system, are the disambiguator for the -user input and the metadata extractor. -Moreover, the algorithm used for disambiguation depends only on the existence of -a refiner for mathematical formulae extended with placeholders and the metadata -model is logic independent. Thus \whelp{} -can be easily modified to index and search other mathematical libraries provided -that the statements of the theorems can be easily parsed (to extract the -metadata) and that there exists a service to render the results of the queries. -In particular, the Mizar development team has just released version 7.3.01 of -the Mizar proof assistant that provides both a native XML format and XSLT -stylesheets to render the proofs. Thus it is now possible to instantiate -\whelp{} to work on the library of Mizar, soon making possible a direct -comparison on the field between \whelp{} and MML Query, the new search engine -for Mizar articles described in \cite{mizarbrowsing,mizarsearchengine}. -Another interesting comparison is with the approach described in -\cite{cairns}, which has been tested on the Mizar library. - -\begin{thebibliography}{} - - \bibitem{edinburgh} A.~Asperti, F.~Guidi, L.~Padovani, C.~Sacerdoti Coen, - I.~Schena. \emph{The Science of Equality: some statistical considerations on - the Coq library}. Mathematical Knowledge Management Symposium, 25-29 November - 2003, Heriot-Watt University, Edinburgh, Scotland. - - \bibitem{annals} A.~Asperti, F.~Guidi, L.~Padovani, C.~Sacerdoti Coen, - I.~Schena. \emph{Mathematical Knowledge Management in HELM}. Annals of - Mathematics and Artificial Intelligence, 38(1): 27--46; May 2003. - - \bibitem{metadata2} A. Asperti, M. Selmi. \emph{Efficient Retrieval of - Mathematical Statements}. In Proceeding of the Third International Conference - on Mathematical Knowledge Management, MKM 2004. Bialowieza, Poland. LNCS 3119. - - \bibitem{pechino} A.Asperti, B.Wegner. \emph{An Approach to - Machine-Understandable Representation of the Mathematical Information in - Digital Documents}. In: Fengshai Bai and Bernd Wegner (eds.): Electronic - Information and Communication in Mathematics, LNCS vol. 2730, pp. 14--23, 2003 - - \bibitem{mizarsearchengine} G. Bancerek, P. Rudnicki. \emph{Information - Retrieval in MML}. In A.Asperti, B.Buchberger,J.Davenport (eds), - Proceedings of the Second International Conference on - Mathematical Knowledge Management, MKM 2003. LNCS, 2594. - - \bibitem{mizarbrowsing} G. Bancerek, J.Urban. \emph{Integrated Semantic - Browsing of the Mizar Mathematical Repository}. In: A.Asperti, - G.Bancerek, A.Trybulec (eds.), Proceeding of the Third - International Conference on Mathematical Knowledge - Management, Springer LNCS 3119. - - \bibitem{cairns} P.Cairns. \emph{Informalising Formal Mathematics: Searching - the Mizar Library with Latent Semantics}. In Proceeding of the Third - International Conference on Mathematical Knowledge Management, MKM 2004. - Bialowieza, Poland. LNCS 3119. - - \bibitem{coq} The Coq proof-assistant, \url{http://coq.inria.fr} - - \bibitem{C-Corn} L. Cruz-Filipe, H. Geuvers, F. Wiedijk, ``C-CoRN, the - Constructive Coq Repository at Nijmegen''. In: A.Asperti, - G.Bancerek, A.Trybulec (eds.), Proceeding of the Third - International Conference on Mathematical Knowledge - Management, Springer LNCS 3119, 88-103, 2004. - - \bibitem{iso1} - D.Delahaye, R. Di Cosmo. Information Retrieval in a Coq Proof Library using - Type Isomorphisms. In Proceedings of TYPES 99, L\"okeberg. Springer-Verlag - LNCS, 1999. - - \bibitem{iso2} - R. Di Cosmo. \emph{Isomorphisms of Types: from Lambda Calculus to - Information Retrieval and Language Design}, Birkhauser, 1995, - IBSN-0-8176-3763-X. - - %\bibitem{diffeq} D. Draheim, W. Neun, D. Suliman. \emph{Classifying - % Differential Equations on the Web}. In Proceeding of the Third - % International Conference on Mathematical Knowledge Management, MKM 2004. LNCS, - % 3119. - - \bibitem{Ganzinger} H. Ganzinger, R. Nieuwehuis, P. Nivela. \emph{Fast Term - Indexing with Coded Context Trees. Journal of Automated Reasoning}. To - appear. - - \bibitem{Graf} P.Graf. Substitution Tree Indexing. Proceedings of the 6th RTA - Conference, Springer-Verlag LNCS 914, pp. 117-131, Kaiserlautern, Germany, - April 4-7, 1995. - - \bibitem{metadata1} F. Guidi, C. Sacerdoti Coen. \emph{Querying Distributed - Digital Libraries of Mathematics}. In Proceedings of Calculemus 2003, 11th - Symposium on the Integration of Symbolic Computation and Mechanized Reasoning. - Aracne Editrice. - - \bibitem{mcbride} C. McBride. Dependently Typed Functional Programs and their - Proofs. Ph.D. thesis, University of Edinburgh, 1999. - - \bibitem{McCune} W. McCune. \emph{Experiments with discrimination tree indexing - and path indexing for term retrieval}. Journal of Automated Reasoning, - 9(2):147-167, October 1992. - - \bibitem{munoz}C. Munoz. A Calculus of Substitutions for Incomplete-Proof - Representation in Type Theory. Ph.D. thesis, INRIA, 1997. - - \bibitem{exportation} C. Sacerdoti Coen. \emph{From Proof-Assistans to - Distributed Libraries of Mathematics: Tips and Pitfalls}. - In Proc. Mathematical Knowledge Management 2003, Lecture Notes in Computer - Science, Vol. 2594, pp. 30--44, Springer-Verlag. - - \bibitem{disambiguation} C. Sacerdoti Coen, S. Zacchiroli. \emph{Efficient - Ambiguous Parsing of Mathematical Formulae}. In Proceedings of the Third - International Conference on Mathematical Knowledge Management, MKM 2004. LNCS, - 3119. - -\end{thebibliography} - -\end{document} -