\newcommand{\MOWGLI}{MoWGLI}
\newcommand{\NAT}{\ensuremath{\mathit{nat}}}
\newcommand{\NATIND}{\mathit{nat\_ind}}
\newcommand{\NUPRL}{NuPRL}
\newcommand{\OCAML}{OCaml}
\newcommand{\PROP}{\mathit{Prop}}
\newcommand{\REF}[3]{\ensuremath{\mathit{Ref}_{#1}(#2,#3)}}
\newcommand{\WHELP}{Whelp}
\newcommand{\ASSIGNEDTO}[1]{\textbf{Assigned to:} #1}
\newcommand{\NOTE}[1]{\marginpar{\scriptsize #1}}
\title{The Matita proof assistant}
\author{Andrea Asperti, Claudio Sacerdoti Coen, Enrico Tassi}
\section{Introduction}
\label{sec:intro}
\MATITA{} is the proof assistant under development by the \HELM{} team
\cite{annals} at the University of Bologna, under the direction of
Prof.~Asperti.
The origin of the system goes back to 1999. At the time we were mostly
interested in developing tools and techniques to enhance the accessibility
via the Web of formal libraries of mathematics. Due to its size, the
library of the \COQ{} proof assistant (of the order of 35'000 theorems)
was chosen as a privileged test bench for our work, although experiments
have also been conducted with other systems, and notably with \NUPRL{}.
The work, mostly performed in the framework of the recently concluded
European project IST-33562 \MOWGLI{}~\cite{pechino}, mainly consisted in the
following steps:
\begin{itemize}
\item exporting the information from the internal representation of
 \COQ{} to a system- and platform-independent format. Since XML was at the
time an emerging standard, we naturally adopted this technology, fostering
a content-based architecture for future systems, where the documents
of the library are the main components around which everything else
has to be built;
\item developing indexing and searching techniques supporting semantic
 queries to the library; these efforts gave birth to our \WHELP{}
search engine, described in~\cite{whelp};
\item developing languages and tools for a high-quality notational
rendering of mathematical information; in particular, we have been
active in the MathML Working Group since 1999, and developed inside
\HELM{} a MathML-compliant widget for the GTK graphical environment
which can be integrated in any application.
\end{itemize}
The exportation issue, extensively discussed in \cite{exportation},
has several major implications worth examining.
The first
point concerns the kind of content information to be exported. In a
proof assistant like \COQ{}, proofs are represented in at least three clearly
distinguishable formats: \emph{scripts} (i.e. sequences of commands issued by the
user to the system during an interactive proof session), \emph{proof objects}
(the low-level representation of proofs in the form of
lambda-terms, readable to and checked by the kernel) and \emph{proof-trees} (a
kind of intermediate representation, vaguely inspired by a sequent-like
notation, that inherits most of the defects but essentially
none of the advantages of the previous representations).
In our case, the choice of the content encoding was eventually dictated
by the methodological assumption of offering the information in a
stable and system-independent format. The language of scripts is too
oriented to \COQ, and it changes too rapidly to be of any interest
to third parties. On the other hand, the language of proof objects
merely depends on
the logical framework (the Calculus of Inductive Constructions, in
the case of \COQ), is grammatically simple, semantically clear and,
especially, very stable (as kernels of proof assistants
often are).
So the granularity of the library is at the level of individual
objects, which also justifies from another point of view the need
for efficient searching techniques for retrieving individual
results. Unfortunately, proof objects are notoriously
difficult to read and do not directly correspond to what the user typed
in. An analogy frequently made in the proof assistant community is that of
comparing the vernacular language of scripts to a high-level source language
and lambda terms to the assembly language they are compiled into. We do not
share this view and prefer to look at scripts as an imperative language,
and to lambda terms as their denotational semantics; still, however,
denotational semantics is possibly more formal but surely not more readable
than the imperative source.
For all the previous reasons, a huge amount of work inside \MOWGLI{} has
been devoted to automatic reconstruction of proofs in natural language
from lambda terms. Since lambda terms are in close connection
with natural deduction
(which is still the most natural logical language discovered so far)
the work is not as hopeless as it may seem, especially if rendering
is combined, as in our case, with dynamic features supporting
in-line expansions or contractions of subproofs. The final
rendering is probably not entirely satisfactory (see \cite{ida} for a
discussion), but surely
readable (the actual quality largely depends on the way the lambda
terms are generated).

Summing up, by the end of the \MOWGLI{} project we had developed:
\begin{itemize}
\item metadata specifications and tools for indexing and querying the
XML knowledge base;
\item a proof checker (i.e. the {\em kernel} of a proof assistant),
 implemented to check that we exported from the \COQ{} library all the
logically relevant content;
\item a sophisticated parser (used by the search engine), able to deal
with potentially ambiguous and incomplete information, typical of
mathematical language;
\item an innovative rendering widget, supporting high-quality bidimensional
rendering, and semantic selection, i.e. the possibility to select semantically
meaningful rendering expressions, and to paste the respective content into
a different text area.
\NOTE{the widget\\ has no semantic\\ selection}
\end{itemize}
Starting from all this, the further step of developing our own
proof assistant was too
small and too tempting to be neglected. Essentially, we ``just'' had to
add an authoring interface, and a set of functionalities for the
overall management of the library, integrating everything into a
single system. \MATITA{} is the result of this effort.

At first sight, \MATITA{} looks like (and partly is) a \COQ{} clone. This is
more the effect of the circumstances of its creation described
above than the result of a deliberate design. In particular, we
(essentially) share the same foundational dialect of \COQ{} (the
Calculus of Inductive Constructions), the same implementative
language (\OCAML{}), and the same (script-based) authoring philosophy.
However, as we shall see, the analogy essentially stops here.
In a sense, we like to think of \MATITA{} as the way \COQ{} would
look if entirely rewritten from scratch: just to give an
idea, although \MATITA{} currently supports almost all functionalities of
\COQ{}, it links 60'000 lines of \OCAML{} code, against ... of \COQ{} (and
we are convinced that, starting from scratch again, we could further
reduce our code in a sensible way).\NOTE{lines of\\\COQ{}}

\begin{itemize}
\item choice of the foundational system
\end{itemize}
\subsection{Tacticals}
\ASSIGNEDTO{gares}
\subsection{Disambiguation}
\label{sec:disambiguation}
\ASSIGNEDTO{zack}
The disambiguation phase builds CIC terms from the ASTs of user inputs (also called
\emph{ambiguous terms}). Ambiguous terms may carry three different \emph{sources
of ambiguity}: unbound identifiers, literal numbers, and literal symbols.
\emph{Unbound identifiers} are sources of ambiguity since the same name
could have been used to represent different objects. For example, three
different theorems of the \COQ{} library share the name $\mathit{plus\_assoc}$
(locating them is an exercise for the interested reader. Hint: use \WHELP's
\LOCATE{} query).

\emph{Numbers} are ambiguous since several different encodings of them could be
provided in logical systems. In the \COQ{} standard library, for example, we find
naturals (in their unary encoding), positives (binary encoding), integers
(signed positives), and reals. Finally, \emph{symbols} (instances of the
\emph{binop} and \emph{unop} syntactic categories of Table~\ref{tab:syntax}) are
ambiguous as well: infix \texttt{+}, for example, is overloaded to represent
addition over the four different kinds of numbers available in the \COQ{}
standard library. Note that, given a term with more than one source of
ambiguity, not all possible disambiguation choices are valid: for example, given
the input \texttt{1+1} we must choose an interpretation of \texttt{+} which is
typable in CIC according to the chosen interpretation for \texttt{1}; choosing
as \texttt{+} the addition over natural numbers and as \texttt{1} the real
number $1$ will lead to a type error.
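
To make the combinatorics concrete, these are plausible disambiguation
domains for the sources of ambiguity occurring in \texttt{1+1} (the
\COQ{} names are used for illustration only):
\[
D_{\texttt{+}} = \{\mathit{plus},\ \mathit{Pplus},\ \mathit{Zplus},\ \mathit{Rplus}\}
\qquad
D_{\texttt{1}} = \{S\;O,\ \mathit{xH},\ \mathit{Zpos}\;\mathit{xH},\ \mathit{R1}\}
\]
Since each occurrence of \texttt{1} is interpreted independently, the
input has $4 \times 4 \times 4 = 64$ interpretations, of which (in the
absence of coercions) only the $4$ ``uniform'' ones are typable.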

A \emph{disambiguation algorithm} takes as input an ambiguous term and returns a
fully determined CIC term. The \emph{naive disambiguation algorithm} takes as
input an ambiguous term $t$ and proceeds as follows:

\begin{enumerate}

 \item Create the disambiguation domains $\{D_i | i\in\mathit{Dom}(t)\}$, where
 $\mathit{Dom}(t)$ is the set of ambiguity sources of $t$. Each $D_i$ is a set
 of CIC terms.

 \item Let $\Phi = \{\phi_i | {i\in\mathit{Dom}(t)},\phi_i\in D_i\}$
 be an interpretation for $t$. Given $t$ and an interpretation $\Phi$, a CIC
 term is fully determined. Iterate over all possible interpretations of $t$,
 type-check them, and keep only the typable interpretations (i.e. interpretations
 that determine typable terms).

 \item Let $n$ be the number of interpretations that survived step 2. If $n=0$,
 signal a type error. If $n=1$, we have found exactly one CIC term
 corresponding to $t$: return it as the output of the disambiguation phase.
 If $n>1$, let the user choose one of the $n$ interpretations and return the
 corresponding term.

\end{enumerate}
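
A minimal \OCAML{} sketch of the naive algorithm may help fixing the
ideas. It is not the \MATITA{} implementation: ambiguity sources and CIC
constants are approximated by strings, and \texttt{domain} and
\texttt{typechecks} are hypothetical stand-ins for the computation of
the disambiguation domains and for the type checker.

\begin{verbatim}
(* Hypothetical disambiguation domains: the CIC constants that may
   interpret each ambiguity source (strings stand in for terms). *)
let domain = function
  | "+" -> ["plus"; "Pplus"; "Zplus"; "Rplus"]
  | "1" -> ["S O"; "xH"; "Zpos xH"; "R1"]
  | _   -> []

(* Hypothetical type checker: accepts or rejects the term determined
   by a complete interpretation, here a (source, constant) list. *)
let typechecks (phi : (string * string) list) = ignore phi; true

(* Steps 1-2: enumerate every interpretation of the given sources
   and keep the typable ones.  Step 3 (error / unique answer / user
   choice) is left to the caller. *)
let naive_disambiguate sources =
  let rec all_phis = function
    | [] -> [[]]
    | s :: rest ->
        List.concat_map
          (fun phi -> List.map (fun c -> (s, c) :: phi) (domain s))
          (all_phis rest)
  in
  List.filter typechecks (all_phis sources)

(* naive_disambiguate ["1"; "+"; "1"] type-checks all 64
   interpretations of the input 1+1, however hopeless. *)
\end{verbatim}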

The above algorithm is highly inefficient since the number of possible
interpretations $\Phi$ grows exponentially with the number of ambiguity sources.
The actual algorithm used in \WHELP{} is far more efficient, being, in the
average case, linear in the number of ambiguity sources.

The efficient algorithm can be applied if the logic can be extended with
metavariables and a refiner can be implemented. This is the case for CIC and
several other logics.
\emph{Metavariables}~\cite{munoz} are typed, non-linear placeholders that can
occur in terms; $?_i$ usually denotes the $i$-th metavariable, while $?$ denotes
a freshly created metavariable. A \emph{refiner}~\cite{mcbride} is a
function whose input is a term with placeholders and whose output is either a
new term obtained by instantiating some placeholder, or $\epsilon$, meaning that no
well-typed instantiation could be found for the placeholders occurring in
the term (a type error).
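
The machinery assumed by the algorithm can be summed up by the
following illustrative \OCAML{} signature (not the interface actually
exposed by our code):

\begin{verbatim}
module type REFINER = sig
  type term                      (* CIC terms with metavariables *)

  (* A freshly created metavariable: the "?" of the text. *)
  val fresh_meta : unit -> term

  (* Instantiate some placeholders of the input term, or return
     None (the epsilon above) when no well-typed instantiation of
     its placeholders exists. *)
  val refine : term -> term option
end
\end{verbatim}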

The efficient algorithm starts with the interpretation $\Phi_0 = \{\phi_i |
\phi_i = ?, i\in\mathit{Dom}(t)\}$,
which associates a fresh metavariable to each
source of ambiguity. Then it iteratively refines the current CIC term (i.e. the
term obtained by interpreting $t$ with $\Phi_i$). If the refinement succeeds, the
next interpretation $\Phi_{i+1}$ is created by \emph{making a choice}, that is,
by replacing a placeholder with one of the possible choices from the corresponding
disambiguation domain. The placeholder to be replaced is chosen following a
preorder visit of the ambiguous term. If the refinement fails, the current set of
choices cannot lead to a well-typed term and backtracking is attempted.
Once an unambiguous correct interpretation is found (i.e. $\Phi_i$ no
longer contains any placeholder), backtracking is attempted
anyway to find the other correct interpretations.
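
As an illustration of the technique only (types, names and the trivial
refiner below are hypothetical), the refine-and-backtrack loop can be
sketched in \OCAML{} as follows; note how a refinement failure prunes
at once the whole subtree of choices below the current partial
interpretation:

\begin{verbatim}
(* The same hypothetical domains as in the previous sketch. *)
let domain = function
  | "+" -> ["plus"; "Pplus"; "Zplus"; "Rplus"]
  | "1" -> ["S O"; "xH"; "Zpos xH"; "R1"]
  | _   -> []

(* Hypothetical refiner over partial interpretations: sources not
   yet in phi are treated as metavariables, and false plays the
   role of epsilon (a type error).  This stub accepts everything. *)
let refine (phi : (string * string) list) = ignore phi; true

(* Sources are resolved in list order (the preorder visit of the
   ambiguous term).  Every accepted complete interpretation is
   collected, backtracking both on failure and after each success. *)
let rec disambiguate phi = function
  | [] -> if refine phi then [List.rev phi] else []
  | s :: rest ->
      List.concat_map
        (fun c ->
           let phi' = (s, c) :: phi in
           if refine phi'
           then disambiguate phi' rest  (* keep this choice *)
           else [])                     (* type error: prune branch *)
        (domain s)

(* disambiguate [] ["1"; "+"; "1"] never explores the choices lying
   below a partial interpretation already rejected by the refiner. *)
\end{verbatim}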

The intuition which explains why this algorithm is more efficient is that, as soon
as a term containing placeholders is not typable, no further instantiation of
its placeholders can lead to a typable term. For example, during the
disambiguation of the user input \texttt{\TEXMACRO{forall} x. x*0 = 0}, an
interpretation $\Phi_i$ is encountered which associates $?$ to the instance
of \texttt{0} on the right, the real number $0$ to the instance of \texttt{0} on
the left, and the multiplication over natural numbers (\texttt{mult} for short)
to \texttt{*}. The refiner will fail, since \texttt{mult} requires natural
arguments, and no further instantiation of the placeholder will be tried.

If, at the end of the disambiguation, more than one interpretation is still
possible, the user is asked to choose the intended one (see
Fig.~\ref{fig:disambiguation}).

\begin{figure}[htb]
% \centerline{\includegraphics[width=0.9\textwidth]{disambiguation-1}}
 \caption{\label{fig:disambiguation} Disambiguation: interpretation choice}
\end{figure}

Details of the disambiguation algorithm of \WHELP{} can
be found in~\cite{disambiguation}, where an equivalent algorithm
that avoids backtracking is also presented.

\subsection{Notation}
\ASSIGNEDTO{zack}