added content level ref

[helm.git] / helm / papers / matita / matita.tex
diff --git a/helm/papers/matita/matita.tex b/helm/papers/matita/matita.tex

index b5c27f1bfd0eed62ce3beaffa2342fdb78938a81..09b1f0b79a80a379f473ac2a8153e28e518f135e 100644 (file)
--- a/helm/papers/matita/matita.tex
+++ b/helm/papers/matita/matita.tex
@@ -25,6 +25,7 @@
  \newcommand{\MOWGLI}{MoWGLI}
  \newcommand{\NAT}{\ensuremath{\mathit{nat}}}
  \newcommand{\NATIND}{\mathit{nat\_ind}}
+\newcommand{\NUPRL}{NuPRL}
  \newcommand{\OCAML}{OCaml}
  \newcommand{\PROP}{\mathit{Prop}}
  \newcommand{\REF}[3]{\ensuremath{\mathit{Ref}_{#1}(#2,#3)}}
@@ -33,6 +34,8 @@
  \newcommand{\WHELP}{Whelp}
  
  \newcommand{\ASSIGNEDTO}[1]{\textbf{Assigned to:} #1}
+\newcommand{\NOTE}[1]{\marginpar{\scriptsize #1}}
+\newcommand{\NT}[1]{\langle\mathit{#1}\rangle}
  
  \title{The Matita proof assistant}
  \author{Andrea Asperti, Claudio Sacerdoti Coen, Enrico Tassi
@@ -40,6 +43,7 @@
  \institute{Department of Computer Science, University of Bologna\\
   Mura Anteo Zamboni, 7 --- 40127 Bologna, ITALY\\
   \email{$\{$asperti,sacerdot,tassi,zacchiro$\}$@cs.unibo.it}}
+\bibliographystyle{plain}
  
  \begin{document}
  \maketitle
@@ -49,44 +53,44 @@
  
  \section{Introduction}
  \label{sec:intro}
-{\em Matita} is the proof assistant under development by the Helm team
-\cite{annals} at the University of Bologna, under the direction of 
-Prof.Asperti. 
+{\em Matita} is the proof assistant under development by the \HELM{} team
+\cite{mkm-helm} at the University of Bologna, under the direction of 
+Prof.~Asperti. 
  The origin of the system goes back to 1999. At the time we were mostly 
  interested to develop tools and techniques to enhance the accessibility
  via web of formal libraries of mathematics. Due to its dimension, the
-library of the coq proof assistant (of the order of 35000 theorems) 
+library of the \COQ{} proof assistant (of the order of 35'000 theorems) 
  was choosed as a privileged test bench for our work, although experiments
-have been also conducted with other systems, and notably with Nuprl.
+have been also conducted with other systems, and notably with \NUPRL{}.
  The work, mostly performed in the framework of the recently concluded 
-European project IST-33562-Mowgli \cite{pechino}, mainly consisted in the 
+European project IST-33562 \MOWGLI{}~\cite{pechino}, mainly consisted in the 
  following teps:
  \begin{itemize}
  \item exporting the information from the internal representation of
-Coq to a system and platform independent format. Since XML was at the 
+ \COQ{} to a system and platform independent format. Since XML was at the 
  time an emerging standard, we naturally adopted this technology, fostering
  a content-based architecture for future system, where the documents
  of the library were the the main components around which everything else 
  has to be build;
  \item developing indexing and searching techniques supporting semantic
-queries to the library; these efforts gave birth to our {\em whelp}
-search engine, described in \cite{whelp};
+ queries to the library; these efforts gave birth to our \WHELP{}
+search engine, described in~\cite{whelp};
  \item developing languages and tools for a high-quality notational 
  rendering of mathematical information; in particular, we have been 
  active in the MathML Working group since 1999, and developed inside
-Helm a MathML-compliant widget for the GTK graphical environment
+\HELM{} a MathML-compliant widget for the GTK graphical environment
  which can be integrated in any application.
  \end{itemize}
-The exportation issue, extensively discussed in \cite{exportation},
+The exportation issue, extensively discussed in \cite{exportation-module},
  has several major implications worth to be discussed. 
  
  The first
-point concern the kind of content information to be exported. In a
-proof assistant like coq, proofs are represented in at least three clearly
-distinguishable formats: scripts (i.e. sequences of commands issued by the
-user to the system during an interactive session of proof), proof objects
+point concerns the kind of content information to be exported. In a
+proof assistant like \COQ{}, proofs are represented in at least three clearly
+distinguishable formats: \emph{scripts} (i.e. sequences of commands issued by the
+user to the system during an interactive session of proof), \emph{proof objects}
  (which is the low-level representation of proofs in the form of
-lambda-terms readable to and checked by kernel) and proof-trees (which
+lambda-terms readable to and checked by kernel) and \emph{proof-trees} (which
  is a kind of intermediate representation, vaguely inspired by a sequent
  like notation, that inherits most of the defects but essentially
  none of the advantages of the previous representations). 
@@ -97,14 +101,14 @@ proof objects correspond to individual mathemtical items.
  
  In our case, the choice of the content encoding was eventually dictated
  by the methodological assumption of offering the information in a
-stable and system independent format. The language of scripts is too
-oriented to Coq, and it changes too rapidly to be of any interest
+stable and system-independent format. The language of scripts is too
+oriented to \COQ, and it changes too rapidly to be of any interest
  to third parties. On the other side, the language of proof objects 
  merely depend on
  the logical framework (the Calculus of Inductive Constructions, in
-the case of Coq), is grammatically simple, semantically clear and, 
-especially, is very stable (as the kernel of the proof assistants 
-often is). 
+the case of \COQ), is grammatically simple, semantically clear and, 
+especially, is very stable (as kernels of proof assistants 
+often are). 
  So the granularity of the library is at the level of individual 
  objects, that also justifies from another point of view the need
  for efficient searching techniques for retrieving individual 
@@ -114,20 +118,20 @@ The main (possibly only) problem with proof objects is that they are
  difficult to read and do not directly correspond to what the user typed
  in. An analogy frequently made in the proof assistant community is that of
  comparing the vernacular language of scripts to a high level source language
-and lambda terms to the assembly language they are compiled in, We do not
+and lambda terms to the assembly language they are compiled in. We do not
  share this view and prefer to look at scripts as an imperative language, 
  and to lambda terms as their denotational semantics; still, however,
  denotational semantics is possibly more formal but surely not more readable 
  than the imperative source.
  
-For all the previous reasons, a huge amount of work inside Mowgli has
+For all the previous reasons, a huge amount of work inside \MOWGLI{} has
  been devoted to automatic reconstruction of proofs in natural language
  from lambda terms. Since lambda terms are in close connection 
  with natural deduction 
-(thay is still the most natural logical language discovered so far)
+(that is still the most natural logical language discovered so far)
  the work is not hopeless as it may seem, especially if rendering
  is combined, as in our case, with dynamic features supporting 
-in-line expansions or contraction of subproofs. The final 
+in-line expansions or contractions of subproofs. The final 
  rendering is probably not entirely satisfactory (see \cite{ida} for a
  discussion), but surely
  readable (the actual quality largely depends by the way the lambda 
@@ -140,7 +144,7 @@ with tools for parsing and saving mathematical objects in such a format;
  \item metadata specifications and tools for indexing and querying the
  XML knowledge base;
  \item a proof checker (i.e. the {\em kernel} of a proof assistant), 
-implemented to check that we exported form the coq library all the 
+ implemented to check that we exported form the \COQ{} library all the 
  logically relevant content;
  \item a sophisticated parser (used by the search engine), able to deal 
  with potentially ambiguous and incomplete information, typical of the 
@@ -151,31 +155,31 @@ existential variables, used by the disambiguating parser;
  language;
  \item an innovative rendering widget, supporting high-quality bidimensional
  rendering, and semantic selection, i.e. the possibility to select semantically
-meaningfull rendering expressions, and to past the respective contebt into
+meaningfull rendering expressions, and to past the respective content into
  a different text area.
+\NOTE{il widget\\ non ha sel\\ semantica}
  \end{itemize}
  Starting from all this, the further step of developing our own 
  proof assistant was too
  small and too tempting to be neglected. Essentially, we ``just'' had to
  add an authoring interface, and a set of functionalities for the
  overall management of the library, integrating everything into a
-single system. {\em Matita} is the result of this effort. 
+single system. \MATITA{} is the result of this effort. 
  
-At first sight, Matita looks as (and partly is) a Coq clone. This is
+At first sight, \MATITA{} looks as (and partly is) a \COQ{} clone. This is
  more the effect of the circumstances of its creation described 
  above than the result of a deliberate design. In particular, we
-(essentially) share the same foundational dialect of Coq (the
+(essentially) share the same foundational dialect of \COQ{} (the
  Calculus of Inductive Constructions), the same implementative
-language (Ocaml), and the same (script based) authoring philosophy.
+language (\OCAML{}), and the same (script based) authoring philosophy.
  However, as we shall see, the analogy essentially stops here. 
  
-In a sense; we like to think of Matita as the way Coq would 
+In a sense; we like to think of \MATITA{} as the way \COQ{} would 
  look like if entirely rerwritten from scratch: just to give an
-idea, although Matita currently supports almost all functionalities of
-Coq, it links 60000 lins of coaml code, against ... of Coq (and
+idea, although \MATITA{} currently supports almost all functionalities of
+\COQ{}, it links 60'000 lins of \OCAML{} code, against ... of \COQ{} (and
  we are convinced that, starting from scratch againg, we could furtherly
-reduce our code in sensible way).   
-
+reduce our code in sensible way).\NOTE{righe\\\COQ{}}
  
  \begin{itemize}
   \item scelta del sistema fondazionale
@@ -201,10 +205,225 @@ reduce our code in sensible way).
  \subsection{tatticali}
  \ASSIGNEDTO{gares}
  
-\subsection{disambiguazione}
+\subsection{Disambiguation}
+\label{sec:disambiguation}
  \ASSIGNEDTO{zack}
  
+\begin{table}
+ \caption{\label{tab:termsyn} Concrete syntax of CIC terms: built-in notation\strut}
+\hrule
+\[
+\begin{array}{@{}rcll@{}}
+  \NT{term} & ::= & & \mbox{\bf terms} \\
+    &     & x & \mbox{(identifier)} \\
+    &  |  & n & \mbox{(number)} \\
+    &  |  & s & \mbox{(symbol)} \\
+    &  |  & \mathrm{URI} & \mbox{(URI)} \\
+    &  |  & \verb+?+ & \mbox{(implicit)} \\
+    &  |  & \verb+?+n~[\verb+[+~\{\NT{subst}\}~\verb+]+] & \mbox{(meta)} \\
+    &  |  & \verb+let+~\NT{ptname}~\verb+\def+~\NT{term}~\verb+in+~\NT{term} \\
+    &  |  & \verb+let+~\NT{kind}~\NT{defs}~\verb+in+~\NT{term} \\
+    &  |  & \NT{binder}~\{\NT{ptnames}\}^{+}~\verb+.+~\NT{term} \\
+    &  |  & \NT{term}~\NT{term} & \mbox{(application)} \\
+    &  |  & \verb+Prop+ \mid \verb+Set+ \mid \verb+Type+ \mid \verb+CProp+ & \mbox{(sort)} \\
+    &  |  & \verb+match+~\NT{term}~ & \mbox{(pattern matching)} \\
+    &     & ~ ~ [\verb+[+~\verb+in+~x~\verb+]+]
+             ~ [\verb+[+~\verb+return+~\NT{term}~\verb+]+] \\
+    &     & ~ ~ \verb+with [+~[\NT{rule}~\{\verb+|+~\NT{rule}\}]~\verb+]+ & \\
+    &  |  & \verb+(+~\NT{term}~\verb+:+~\NT{term}~\verb+)+ & \mbox{(cast)} \\
+    &  |  & \verb+(+~\NT{term}~\verb+)+ \\
+  \NT{defs}  & ::= & & \mbox{\bf mutual definitions} \\
+    &     & \NT{fun}~\{\verb+and+~\NT{fun}\} \\
+  \NT{fun} & ::= & & \mbox{\bf functions} \\
+    &     & \NT{arg}~\{\NT{ptnames}\}^{+}~[\verb+on+~x]~\verb+\def+~\NT{term} \\
+  \NT{binder} & ::= & & \mbox{\bf binders} \\
+    &     & \verb+\forall+ \mid \verb+\lambda+ \\
+  \NT{arg} & ::= & & \mbox{\bf single argument} \\
+    &     & \verb+_+ \mid x \\
+  \NT{ptname} & ::= & & \mbox{\bf possibly typed name} \\
+    &     & \NT{arg} \\
+    &  |  & \verb+(+~\NT{arg}~\verb+:+~\NT{term}~\verb+)+ \\
+  \NT{ptnames} & ::= & & \mbox{\bf bound variables} \\
+    &     & \NT{arg} \\
+    &  |  & \verb+(+~\NT{arg}~\{\verb+,+~\NT{arg}\}~[\verb+:+~\NT{term}]~\verb+)+ \\
+  \NT{kind} & ::= & & \mbox{\bf induction kind} \\
+    &     & \verb+rec+ \mid \verb+corec+ \\
+  \NT{rule} & ::= & & \mbox{\bf rules} \\
+    &     & x~\{\NT{ptname}\}~\verb+\Rightarrow+~\NT{term}
+\end{array}
+\]
+\hrule
+\end{table}
+
+\subsubsection{Term input}
+
+The primary form of user interaction employed by \MATITA{} is textual script
+editing: the user can modifies it and evaluate step by step its composing
+\emph{statements}. Examples of statements are inductive type definitions,
+theorem declarations, LCF-style tacticals, and macros (e.g. \texttt{Check} can
+be used to ask the system to refine a given term and pretty print the result).
+Since many statements refer to terms of the underlying calculus, \MATITA{} needs
+a concrete syntax able to encode terms of the Calculus of Inductive
+Constructions.
+
+Two of the requirements in the design of such a syntax are apparently in
+contrast:
+\begin{enumerate}
+ \item the syntax should be as close as possible to common mathematical practice
+  and implement widespread mathematical notions;
+ \item each term described by the syntax should be non-ambiguous meaning that it
+  should exists a function which associates to each term of the syntax a CIC
+  term.
+\end{enumerate}
+
+These two requirements are addressed in \MATITA{} by the mean of two mechanisms
+which work together: \emph{term disambiguation} and \emph{extensible notation}.
+Their interaction is visible in the architecture of the \MATITA{} input phase,
+depicted in Fig.~\ref{fig:inputphase}. The architecture is articulated as a
+pipline of three levels: the concrete syntax level (level 0) is the one the user
+has to deal with when inserting CIC terms; the abstract syntax level (level 2)
+is an internal representation which intuitively encodes mathematical formulae at
+the content level~\cite{adams}~\cite{mkm-structure}; the formal mathematics
+level (level 3) is the CIC encoding of terms.
+
+Requirement (1) is addressed by a built-in concrete syntax for terms, described
+in Tab.~\ref{tab:termsyn}, and the extensible notation mechanisms which offers a
+way for extending available mathematical notations. Extensible notation, which
+is also in charge of providing a parsing function mapping concrete syntax terms
+to content level terms, is described in Sect.~\ref{sec:notation}.  Requirement
+(2) is addressed by the conjunct action of that parsing function and
+disambiguation which provides a function from content level terms to CIC terms.
+
+\subsubsection{Sources of ambiguity}
+
+The translation from content level terms to CIC terms is not straightforward
+because some nodes of the content encoding admit more that one CIC encoding,
+invalidating requirement (2).
+
+\begin{example}
+
+ Consider the term \texttt{\TEXMACRO{forall} x. x + ln 1 = x}, the type of a
+ lemma the user may want to prove. Assuming that both \texttt{+} and \texttt{=}
+ are parsed as infix operators, all the following questions are legitimate and
+ must be answered before obtaining a CIC term from its content level encoding
+ (Fig.~\ref{fig:inputphase}(b)):
+
+ \begin{enumerate}
+
+  \item Since \texttt{ln} is an unbound identifier, which CIC constants does it
+   represent? Many different theorems in the library may share its (rather
+   short) name \dots
+
+  \item Which kind of number (\IN, \IR, \dots) the \texttt{1} literal stand for?
+   Which encoding is used in CIC to represent it? E.g., assuming $1\in\IN$, is
+   it an unary or a binary encoding?
+
+  \item Which kind of equality the ``='' node represents? Is it Leibniz's
+   polymorhpic equality? Is it a decidable equality over \IN, \IR, \dots?
+
+ \end{enumerate}
+
+\end{example}
+
+In \MATITA, three \emph{sources of ambiguity} are admitted for content level
+terms: unbound identifiers, literal numbers, and literal symbols.
+
+\emph{Unbound identifiers} (question 1) are sources of ambiguity since the same
+name could have been used in the proof assistant library to represent different
+objects. \emph{Numbers} (question 2) are ambiguous since several different
+encodings of them could be provided in the calculus. Finally, \emph{symbols}
+(question 3) are ambiguous as well, since they may be used in an overloaded
+fashion to represent the application of different objects.
+
+\textbf{FINQUI, il resto \`e copy and paste dal Whelp paper \dots}
+
+Note that given a content level term with more than one sources of ambiguity,
+not all possible disambiguation choices are valid: for example, given the input
+\texttt{1+1} we must choose an interpretation of \texttt{+} which is typable in
+CIC according to the chosen interpretation for \texttt{1}; choosing as
+\texttt{+} the addition over natural numbers and as \texttt{1} the real number
+$1$ will lead to a type error.
+
+A \emph{disambiguation algorithm} takes as input an ambiguous term and return a
+fully determined CIC term. The \emph{naive disambiguation algorithm} takes as
+input an ambiguous term $t$ and proceeds as follows:
+
+\begin{enumerate}
+
+ \item Create disambiguation domains $\{D_i | i\in\mathit{Dom}(t)\}$, where
+  $\mathit{Dom}(t)$ is the set of ambiguity sources of $t$. Each $D_i$ is a set
+  of CIC terms.
+
+ \item Let $\Phi = \{\phi_i | {i\in\mathit{Dom}(t)},\phi_i\in D_i\}$
+%  such that $\forall i\in\mathit{Dom}(t),\exists\phi_j\in\Phi,i=j$
+  be an interpretation for $t$. Given $t$ and an interpretation $\Phi$, a CIC
+  term is fully determined. Iterate over all possible interpretations of $t$ and
+  type-check them, keep only typable interpretations (i.e. interpretations that
+  determine typable terms).
+
+ \item Let $n$ be the number of interpretations who survived step 2. If $n=0$
+  signal a type error. If $n=1$ we have found exactly one CIC term
+  corresponding to $t$, returns it as output of the disambiguation phase.
+  If $n>1$ let the user choose one of the $n$ interpretations and returns the
+  corresponding term.
+
+\end{enumerate}
+
+The above algorithm is highly inefficient since the number of possible
+interpretations $\Phi$ grows exponentially with the number of ambiguity sources.
+The actual algorithm used in \WHELP{} is far more efficient being, in the
+average case, linear in the number of ambiguity sources.
+
+The efficient algorithm can be applied if the logic can be extended with
+metavariables and a refiner can be implemented. This is the case for CIC and
+several other logics.
+\emph{Metavariables}~\cite{munoz} are typed, non linear placeholders that can
+occur in terms; $?_i$ usually denotes the $i$-th metavariable, while $?$ denotes
+a freshly created metavariable. A \emph{refiner}~\cite{McBride} is a
+function whose input is a term with placeholders and whose output is either a
+new term obtained instantiating some placeholder or $\epsilon$, meaning that no
+well typed instantiation could be found for the placeholders occurring in
+the term (type error).
+
+The efficient algorithm starts with an interpretation $\Phi_0 = \{\phi_i |
+\phi_i = ?, i\in\mathit{Dom}(t)\}$,
+which associates a fresh metavariable to each
+source of ambiguity. Then it iterates refining the current CIC term (i.e. the
+term obtained interpreting $t$ with $\Phi_i$). If the refinement succeeds the
+next interpretation $\Phi_{i+1}$ will be created \emph{making a choice}, that is
+replacing a placeholder with one of the possible choice from the corresponding
+disambiguation domain. The placeholder to be replaced is chosen following a
+preorder visit of the ambiguous term. If the refinement fails the current set of
+choices cannot lead to a well-typed term and backtracking is attempted.
+Once an unambiguous correct interpretation is found (i.e. $\Phi_i$ does no
+longer contain any placeholder), backtracking is attempted
+anyway to find the other correct interpretations.
+
+The intuition which explain why this algorithm is more efficient is that as soon
+as a term containing placeholders is not typable, no further instantiation of
+its placeholders could lead to a typable term. For example, during the
+disambiguation of user input \texttt{\TEXMACRO{forall} x. x*0 = 0}, an
+interpretation $\Phi_i$ is encountered which associates $?$ to the instance
+of \texttt{0} on the right, the real number $0$ to the instance of \texttt{0} on
+the left, and the multiplication over natural numbers (\texttt{mult} for short)
+to \texttt{*}. The refiner will fail, since \texttt{mult} require a natural
+argument, and no further instantiation of the placeholder will be tried.
+
+If, at the end of the disambiguation, more than one possible interpretations are
+possible, the user will be asked to choose the intended one (see
+Fig.~\ref{fig:disambiguation}).
+
+\begin{figure}[htb]
+%   \centerline{\includegraphics[width=0.9\textwidth]{disambiguation-1}}
+  \caption{\label{fig:disambiguation} Disambiguation: interpretation choice}
+\end{figure}
+
+Details of the disambiguation algorithm of \WHELP{} can
+be found in~\cite{disambiguation}, where an equivalent algorithm
+that avoids backtracking is also presented.
+
  \subsection{notazione}
+\label{sec:notation}
  \ASSIGNEDTO{zack}
  
  \subsection{libreria tutta visibile}
@@ -252,51 +471,7 @@ the development of Matita, and in particular
  A.Griggio, F.Guidi, P. Di Lena, L.Padovani, I.Schena, M.Selmi, 
  V.Tamburrelli.
  
-
-\begin{thebibliography}{}
-
- \bibitem{ida}A.Asperti, H.Geuvers, I.Loeb, L.E.Mamane, C.Sacerdoti Coen.
-  An Interactive Algebra Course with Formalised Proofs and Definitions. 
-  Post-Proceedings of the Fourth International Conference on
-  Mathemtical Knowledge Management. Bremen, Germany, July 2005. LNCS, 
-  to appear.
-
- \bibitem{annals} A.~Asperti, F.~Guidi, L.~Padovani, C.~Sacerdoti Coen,
-  I.~Schena. \emph{Mathematical Knowledge Management in HELM}. Annals of
-  Mathematics and Artificial Intelligence, 38(1): 27--46; May 2003.
-
- \bibitem{whelp} A.~Asperti, F.~Guidi, C.~Sacerdoti Coen,
-  E.Tassi, S.Zacchiroli. \emph{A content based mathematical search
-  engine: whelp}. Post-proceedings of the Types 2004 International 
-  Conference, Jouy-en-Josas, France, December 2004. LNCS (to appear). 
-
- \bibitem{metadata2} A. Asperti, M. Selmi. \emph{Efficient Retrieval of
-  Mathematical Statements}. In Proceeding of the Third International Conference
-  on Mathematical Knowledge Management, MKM 2004. Bialowieza, Poland. LNCS 3119.
- \bibitem{pechino} A.Asperti, B.Wegner. \emph{An Approach to
-  Machine-Understandable Representation of the Mathematical Information in
-  Digital Documents}.  In: Fengshai Bai and Bernd Wegner (eds.): Electronic
-  Information and Communication in Mathematics, LNCS vol. 2730, 
-  pp. 14--23, 2003
-
- \bibitem{coq} The Coq proof-assistant, \url{http://coq.inria.fr}
-
- \bibitem{metadata1} F. Guidi, C. Sacerdoti Coen. \emph{Querying Distributed
-  Digital Libraries of Mathematics}. In Proceedings of Calculemus 2003, 11th
-  Symposium on the Integration of Symbolic Computation and Mechanized 
-  Reasoning. Aracne Editrice.
- 
- \bibitem{exportation} C. Sacerdoti Coen. \emph{From Proof-Assistans to
-  Distributed Libraries of Mathematics: Tips and Pitfalls}.
-  In Proc. Mathematical Knowledge Management 2003, Lecture Notes in Computer
-  Science, Vol. 2594, pp. 30--44, Springer-Verlag.
-
- \bibitem{disambiguation} C. Sacerdoti Coen, S. Zacchiroli. \emph{Efficient
-  Ambiguous Parsing of Mathematical Formulae}. In Proceedings of the Third
-  International Conference on Mathematical Knowledge Management, MKM 2004. 
-  LNCS,3119.
-
-\end{thebibliography}
+\bibliography{matita}
  
  \end{document}