+The first
+point concerns the kind of content information to be exported. In a
+proof assistant like \COQ{}, proofs are represented in at least three clearly
+distinguishable formats: \emph{scripts} (i.e. the sequences of commands issued by the
+user to the system during an interactive proof session), \emph{proof objects}
+(the low-level representation of proofs in the form of
+lambda-terms, readable by and checked by the kernel) and \emph{proof-trees} (a
+kind of intermediate representation, vaguely inspired by a sequent-like
+notation, that inherits most of the defects but essentially
+none of the advantages of the previous representations).
+Partially related to this problem is the
+issue of the {\em granularity} of the library: scripts usually comprise
+small developments with many definitions and theorems, while
+proof objects correspond to individual mathematical items.
+
+In our case, the choice of the content encoding was eventually dictated
+by the methodological assumption of offering the information in a
+stable and system-independent format. The language of scripts is too
+\COQ{}-specific, and it changes too rapidly to be of any interest
+to third parties. On the other hand, the language of proof objects
+depends only on
+the logical framework (the Calculus of Inductive Constructions, in
+the case of \COQ), is grammatically simple, semantically clear and,
+above all, very stable (as kernels of proof assistants
+often are).
+So the granularity of the library is at the level of individual
+objects, which also justifies, from another point of view, the need
+for efficient searching techniques for retrieving individual
+logical items from the repository.
+
+The main (possibly only) problem with proof objects is that they are
+difficult to read and do not directly correspond to what the user typed
+in. An analogy frequently made in the proof assistant community
+compares the vernacular language of scripts to a high-level source language
+and lambda terms to the assembly language they are compiled into. We do not
+share this view and prefer to look at scripts as an imperative language,
+and at lambda terms as their denotational semantics; the denotational
+semantics, however, is possibly more formal but surely not more readable
+than the imperative source.
+
+For all the previous reasons, a huge amount of work inside \MOWGLI{} has
+been devoted to automatic reconstruction of proofs in natural language
+from lambda terms. Since lambda terms are in close connection
+with natural deduction
+(which is still the most natural logical language discovered so far),
+the work is not as hopeless as it may seem, especially if rendering
+is combined, as in our case, with dynamic features supporting
+in-line expansion and contraction of subproofs. The final
+rendering is probably not entirely satisfactory (see \cite{ida} for a
+discussion), but surely
+readable (the actual quality largely depends on the way the lambda
+term is written).
+
+Summing up, we already had the following tools/techniques at our disposal:
+\begin{itemize}
+\item XML specifications for the Calculus of Inductive Constructions,
+with tools for parsing and saving mathematical objects in such a format;
+\item metadata specifications and tools for indexing and querying the
+XML knowledge base;
+\item a proof checker (i.e. the {\em kernel} of a proof assistant),
+ implemented to check that we exported from the \COQ{} library all the
+logically relevant content;
+\item a sophisticated parser (used by the search engine), able to deal
+with the potentially ambiguous and incomplete information typical of
+mathematical notation \cite{};
+\item a {\em refiner}, i.e. a type inference system, based on complex
+existential variables, used by the disambiguating parser;
+\item complex transformation algorithms for proof rendering in natural
+language;
+\item an innovative rendering widget, supporting high-quality bidimensional
+rendering and semantic selection, i.e. the possibility to select semantically
+meaningful rendered expressions and to paste the respective content into
+a different text area.
+\NOTE{the widget does not have semantic selection}
+\end{itemize}
+Starting from all this, the further step of developing our own
+proof assistant seemed too
+small and too tempting to be neglected. Essentially, we ``just'' had to
+add an authoring interface, and a set of functionalities for the
+overall management of the library, integrating everything into a
+single system. \MATITA{} is the result of this effort.
+
+At first sight, \MATITA{} looks like (and partly is) a \COQ{} clone. This is
+more an effect of the circumstances of its creation, described
+above, than the result of a deliberate design. In particular, we
+(essentially) share the same foundational dialect as \COQ{} (the
+Calculus of Inductive Constructions), the same implementation
+language (\OCAML{}), and the same (script-based) authoring philosophy.
+However, as we shall see, the analogy essentially stops here.
+
+In a sense, we like to think of \MATITA{} as the way \COQ{} would
+look if entirely rewritten from scratch: just to give an
+idea, although \MATITA{} currently supports almost all functionalities of
+\COQ{}, it amounts to 60,000 lines of \OCAML{} code, against ... of \COQ{} (and
+we are convinced that, starting from scratch again, we could further
+reduce our code considerably).\NOTE{\COQ{} line count}
+
+\begin{itemize}
+ \item choice of the foundational system
+ \item a system independent (from Coq)
+ \begin{itemize}
+ \item possibility to experiment (with architectural, logical,
+ implementation solutions, \dots)
+ \item compatibility with legacy systems
+ \end{itemize}
+\end{itemize}
+
+\section{\HELM{} library(??)}
+
+\subsection{A fully visible library}
+\ASSIGNEDTO{csc}
+\NOTE{I assume the content-centric approach has already been discussed}
+Our commitment to the content-centric view of the architecture of the system
+has important consequences on the user's experience and on the functionalities
+of several components of \MATITA. In the content-centric view, the library
+of mathematical knowledge is an already existing and completely autonomous
+entity that we are allowed to exploit and augment using \MATITA. Thus, in
+principle, when the user starts to prove a new theorem she has complete
+visibility of the library and can refer to every definition and lemma,
+also using the mathematical notation already developed. In a similar way,
+every form of automation of the system must be able to analyze and possibly
+exploit every notion in the library.
+
+The benefits of this approach amply justify the non-negligible price to pay
+in the development of several components. We now analyse a few of the causes
+of this additional complexity.
+
+\subsubsection{Ambiguity}
+A rich mathematical library includes equivalent definitions and representations
+of the same notion. Moreover, mathematical notation inside a rich library is
+inevitably highly overloaded. As a consequence, every mathematical expression
+the user provides is ambiguous, since all the definitions,
+representations and special notations are available at once.
+
+The usual solution to the problem, adopted for instance in Coq, is to
+restrict the user's scope to just one interpretation for each definition,
+representation or notation. In this way much of the ambiguity is removed,
+burdening the user, who must somehow declare what is in scope and must
+use special syntax whenever she needs to refer to something not in scope.
+
+Even with this approach ambiguity cannot be completely removed since implicit
+coercions can be arbitrarily inserted by the system to ``change the type''
+of subterms that do not have the expected type. Usually implicit coercions
+are used to overcome the absence of subtyping that should mimic the subset
+relation found in set theory. For instance, the expression
+$\forall n \in nat. 2 * n * \pi \equiv_\pi 0$ is correct in set theory since
+the set of natural numbers is a subset of that of real numbers; the
+corresponding expression $\forall n:nat. 2*n*\pi \equiv_\pi 0$ is not well typed
+and requires the automatic insertion of the coercion
+$\mathit{real\_of\_nat}: nat \to R$
+either around both $2$ and $n$ (to make both products be on real numbers) or
+around the product $2*n$. The usual approach consists in either rejecting the
+ambiguous term or arbitrarily choosing one of the interpretations. For instance,
+to avoid making several terms highly ambiguous, Coq rejects the declaration of
+coercions that have alternatives
+(i.e. already declared coercions with the same domain and codomain)
+or that can be obtained by composing other coercions. Coq also arbitrarily
+chooses how to insert coercions in
+terms to make them well typed when there is more than one possibility (as in
+the previous example).
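To fix intuitions, the enumeration of alternative coercion placements can be sketched as follows. This is an illustrative OCaml fragment under our own simplified term representation; the constructor names and the coercion $\mathit{real\_of\_nat}$ placement rules are ours, not Coq's or \MATITA{}'s actual implementation.

```ocaml
(* Toy sketch of coercion-insertion ambiguity: for a nat-typed product
   used in a real-valued context, the coercion (here real_of_nat) can be
   inserted either around the whole product or around each factor.
   All names and types are illustrative. *)

type term =
  | Nat of string              (* a nat-typed leaf, e.g. "2" or "n" *)
  | Coerce of term             (* application of real_of_nat *)
  | Mult of term * term        (* a product *)

(* Enumerate every placement of coercions that turns a nat-typed term
   into a real-typed one: coerce the whole term, or (for a product)
   coerce within its factors recursively. *)
let rec insertions = function
  | Nat x -> [ Coerce (Nat x) ]
  | Coerce t -> [ Coerce t ]
  | Mult (a, b) ->
      Coerce (Mult (a, b))
      :: List.concat_map
           (fun a' -> List.map (fun b' -> Mult (a', b')) (insertions b))
           (insertions a)

(* For 2 * n there are exactly the two alternatives of the example:
   real_of_nat (2 * n)  and  (real_of_nat 2) * (real_of_nat n). *)
let alternatives = insertions (Mult (Nat "2", Nat "n"))
```

The ambiguity discussed above is precisely the fact that `alternatives` has more than one element, and a system must either reject the term or pick one element arbitrarily.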
+
+The approach we are following is radically different. It consists in dealing
+with ambiguous expressions instead of avoiding them. As a last resort,
+when the system is unable to disambiguate the input, the user is interactively
+asked to provide more information, which is recorded to avoid asking the
+same question again in subsequent processing of the same input.
+More details on our approach can be found in Sect.~\ref{sec:disambiguation}.
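The strategy just described can be summarized by a minimal sketch: keep only the well-typed interpretations, ask the user as a last resort, and record the answer for later runs on the same input. All names here are illustrative, not \MATITA{}'s actual API.

```ocaml
(* Disambiguation sketch: filter interpretations by well-typedness,
   ask the user only when several survive, and memoize the answer so
   the same question is never asked twice for the same input. *)

let recorded : (string, string) Hashtbl.t = Hashtbl.create 17

let disambiguate ~input ~interpretations ~well_typed ~ask_user =
  match Hashtbl.find_opt recorded input with
  | Some choice -> choice                        (* reuse a recorded answer *)
  | None ->
      match List.filter well_typed interpretations with
      | [ only ] -> only                         (* unambiguous after typing *)
      | [] -> failwith "no well-typed interpretation"
      | several ->
          let choice = ask_user several in       (* interactive last resort *)
          Hashtbl.add recorded input choice;
          choice
```

Note that the interactive step is reached only when type checking alone cannot discriminate, which matches the "last resort" policy above.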
+
+\subsubsection{Consistency}
+A large mathematical library is likely to be logically inconsistent.
+It may contain incompatible axioms or alternative conjectures and it may
+even use definitions in incompatible ways. To clarify this last point,
+consider two identical definitions of a set of elements of a given type
+(or of a category of objects of a given type). Let us call the two definitions
+\emph{A-Set} and \emph{B-Set} (or \emph{A-Category} and \emph{B-Category}).
+It is perfectly legitimate to either form the \emph{A-Set} of every \emph{B-Set}
+or the \emph{B-Set} of every \emph{A-Set} (the same for categories). This just corresponds
+to assuming that a \emph{B-Set} (respectively an \emph{A-Set}) is a small set, whereas
+an \emph{A-Set} (respectively a \emph{B-Set}) is a big set (possibly of small sets).
+However, if one part of the library assumes \emph{A-Set}s to be the small ones
+and another part of the library assumes \emph{B-Set}s to be the small ones, the
+library as a whole will be logically inconsistent.
+
+Logical inconsistency has never been a problem in the daily work of a
+mathematician. The mathematician simply imposes on himself the discipline of
+restricting himself to consistent subsets of the mathematical knowledge.
+However, in doing so he does not choose the subset in advance, forgetting
+the rest of his knowledge. On the contrary, he may proceed with a sort of
+top-down strategy: he may always inspect or use any part of his knowledge, but
+when he actually does so he should check recursively that inconsistencies are
+not exploited.
+
+Contrary to mathematical practice, the usual tendency in the world of
+proof assistants is to build a logical environment (a consistent
+subset of the library) in a bottom-up way, checking the consistency of a
+new axiom or theorem as soon as it is added to the environment. No lemma
+or definition outside the environment can be used until it is added to the
+environment after every notion it depends on. Moreover, very often the logical
+environment is the only part of the library that can be inspected,
+that can be searched for lemmas, and that can be exploited by automatic tactics.
+
+Moving notions one by one from the library to the environment is a costly
+operation, since it involves re-checking the correctness of each notion.
+As a consequence, mathematical notions are packaged into theories that must
+be added to the environment as a whole. The consistency problem is then
+raised only at the level of theories: theories must be imported in a bottom-up
+way and the system must check that no inconsistency arises.
+
+The practice of limiting the scope on the library to the logical environment
+is contrary to our commitment to fully exploiting as much as possible
+of the library at any given time. To reconcile consistency and visibility
+we have departed from the traditional implementation of an environment,
+allowing environments to be built on demand in a top-down way. The new
+implementation is based on a modified meta-theory that changes the way the
+convertibility, type checking, unification and refinement judgements are defined.
+The modified meta-theory is fully described in \cite{libraryenvironments}.
+Here we just remark how a strong commitment on the way the user interacts
+with the library has led to modifications of the logical core of the proof
+assistant. This is evidence that breakthroughs in user interfaces
+and in the way the user interacts with the tools and with the library can
+be achieved only by means of strong architectural changes.
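The operational difference with respect to bottom-up environments can be conveyed by a hedged sketch: a notion is checked, together with its transitive dependencies, only when it is first used. The dependency function and the checker below are illustrative placeholders, not the modified meta-theory itself.

```ocaml
(* On-demand, top-down environment building: when a library notion is
   first required, recursively check its dependencies, then the notion,
   memoizing everything that has already entered the environment. *)

let environment : (string, unit) Hashtbl.t = Hashtbl.create 17

let rec require ~deps ~check name =
  if not (Hashtbl.mem environment name) then begin
    List.iter (require ~deps ~check) (deps name);  (* dependencies first *)
    check name;                                    (* then the notion itself *)
    Hashtbl.add environment name ()
  end
```

The whole library stays visible at all times; only the notions actually used (and their ancestors) pay the cost of checking, and nothing is re-checked twice.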
+
+\subsubsection{Accessibility}
+A large library that is completely in scope needs effective indexing and
+searching methods to make the user productive. Libraries of formal results
+are particularly critical, since they hold a large percentage of technical
+lemmas that do not have a significant name and that must be retrieved
+using advanced methods based on matching, unification, generalization and
+instantiation.
+
+Searching inside the library becomes a critical operation
+when automatic tactics exploit the library during proof search. In this
+scenario the tactics must retrieve a set of candidates for backward or
+forward reasoning in a few milliseconds.
+
+Several proof assistants choose to use ad-hoc data structures,
+such as context trees, to index all the terms currently in scope. These
+data structures are especially designed to quickly retrieve terms up
+to matching, instantiation and generalization. All these data structures
+try to maximize the sharing of identical subterms, so that matching can be
+reduced to a visit of the tree (or DAG) that holds all the maximally shared
+terms together.
+
+Since the terms to be retrieved (or at least their initial prefix)
+are stored (actually ``melted'') in the data structure, these data structures
+must collect all the terms in a single location. In other words, adopting
+such data structures means centralizing the library.
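A toy version of such an index can illustrate why the terms end up "melted" into one structure. Here terms are flattened to symbol paths and shared in a trie; a `"*"` query symbol stands for a variable and, in this deliberately simplified version, matches any one symbol (real context trees skip whole subterms using arities). Everything below is our own illustration, not an actual context-tree implementation.

```ocaml
(* Prefix trie over flattened terms: identical prefixes are shared, so
   retrieval up to (simplified) matching is a single visit of the trie. *)

module SMap = Map.Make (String)

type trie = { terms : string list; children : trie SMap.t }

let empty = { terms = []; children = SMap.empty }

(* Insert a term, given as its symbol path, under a name. *)
let rec insert path name t =
  match path with
  | [] -> { t with terms = name :: t.terms }
  | s :: rest ->
      let child =
        match SMap.find_opt s t.children with Some c -> c | None -> empty
      in
      { t with children = SMap.add s (insert rest name child) t.children }

(* Retrieve the names of all indexed terms matching the query path,
   where "*" matches any single symbol. *)
let rec retrieve query t =
  match query with
  | [] -> t.terms
  | "*" :: rest ->
      SMap.fold (fun _ c acc -> retrieve rest c @ acc) t.children []
  | s :: rest ->
      (match SMap.find_opt s t.children with
       | Some c -> retrieve rest c
       | None -> [])
```

The point of the sketch is the centralization argument above: the indexed terms live inside the trie itself, so the index only works if all of them are collected in a single location.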
+
+In the \MOWGLI{} project we have tried to follow an alternative approach
+that consists in keeping the library fully distributed and indexing it
+by means of spiders that collect metadata and store them in a database.
+The challenge is to collect as small as possible a set
+of metadata that still provides enough information to approximate the matching
+operation. A matching operation is then performed in two steps. The first
+step is a query to the remote search engine that stores the metadata, in
+order to detect a (hopefully small) complete set of candidates that could
+match. Completeness means that no term that matches may be excluded from
+the set of candidates. The second step consists in retrieving all the
+candidates from the distributed library and attempting the actual matching.
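The first, approximate step can be sketched under a simplifying assumption of ours: the metadata of a statement is just the set of constants occurring in it. A statement can be an instance of the query pattern only if it mentions at least the query's constants, so filtering on metadata is complete in the sense above; actual matching is then attempted only on the retrieved candidates. The names below are illustrative, not the actual metadata schema.

```ocaml
(* Step 1 of the two-step matching: complete but approximate filtering
   on metadata. metadata_db maps statement names to the constants they
   mention; a candidate must mention every constant of the query. *)

module SSet = Set.Make (String)

let candidates ~metadata_db query_constants =
  let q = SSet.of_list query_constants in
  List.filter_map
    (fun (name, constants) ->
       if SSet.subset q (SSet.of_list constants) then Some name else None)
    metadata_db
```

Unlike a context tree, this filter needs only the metadata, which can be harvested by spiders and stored remotely, leaving the library itself fully distributed.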
+
+In the last few years we have progressively improved this technique.
+Our achievements can be found in \cite{query1,query2,query3}.
+
+The techniques and tools already developed have been integrated in \MATITA{},
+which can either contact a remote \WHELP{} search engine \cite{whelp} or
+be directly linked to the \WHELP{} code. In either case the database
+used to store the metadata can be local or remote.
+
+Our current challenge consists in the exploitation of \WHELP{} inside
+\MATITA. In particular, we are developing a set of tactics, for instance
+based on paramodulation \cite{paramodulation}, that perform queries to \WHELP{}
+to restrict the scope on the library to a set of interesting candidates,
+greatly reducing the search space. Moreover, queries to \WHELP{} are performed
+during the parsing of user-provided terms to disambiguate them.
+
+In Sect.~\ref{sec:metadata} we describe the technique adopted in \MATITA.
+
+\subsubsection{Library management}
+
+
+\subsection{Searching and indexing}
+\label{sec:metadata}
+\ASSIGNEDTO{andrea}
+
+\subsection{auto}
+\ASSIGNEDTO{andrea}
+
+\subsection{Explicit substitutions vs modules}
+\ASSIGNEDTO{csc}
+
+\subsection{XML / library management}
+\ASSIGNEDTO{gares}
+
+
+\section{User Interface (to be changed)}
+
+\subsection{Absence of proof trees / natural language rendering}
+\ASSIGNEDTO{andrea}
+
+\subsection{Disambiguation}
+\label{sec:disambiguation}
+\ASSIGNEDTO{zack}
+
+ \begin{table}
+ \caption{\label{tab:termsyn} Concrete syntax of CIC terms: built-in
+ notation\strut}
+ \hrule
+ \[
+ \begin{array}{@{}rcll@{}}
+ \NT{term} & ::= & & \mbox{\bf terms} \\
+ & & x & \mbox{(identifier)} \\
+ & | & n & \mbox{(number)} \\
+ & | & s & \mbox{(symbol)} \\
+ & | & \mathrm{URI} & \mbox{(URI)} \\
+ & | & \verb+_+ & \mbox{(implicit)}\TODO{sync} \\
+ & | & \verb+?+n~[\verb+[+~\{\NT{subst}\}~\verb+]+] & \mbox{(meta)} \\
+ & | & \verb+let+~\NT{ptname}~\verb+\def+~\NT{term}~\verb+in+~\NT{term} \\
+ & | & \verb+let+~\NT{kind}~\NT{defs}~\verb+in+~\NT{term} \\
+ & | & \NT{binder}~\{\NT{ptnames}\}^{+}~\verb+.+~\NT{term} \\
+ & | & \NT{term}~\NT{term} & \mbox{(application)} \\
+ & | & \verb+Prop+ \mid \verb+Set+ \mid \verb+Type+ \mid \verb+CProp+ & \mbox{(sort)} \\
+ & | & \verb+match+~\NT{term}~ & \mbox{(pattern matching)} \\
+ & & ~ ~ [\verb+[+~\verb+in+~x~\verb+]+]
+ ~ [\verb+[+~\verb+return+~\NT{term}~\verb+]+] \\
+ & & ~ ~ \verb+with [+~[\NT{rule}~\{\verb+|+~\NT{rule}\}]~\verb+]+ & \\
+ & | & \verb+(+~\NT{term}~\verb+:+~\NT{term}~\verb+)+ & \mbox{(cast)} \\
+ & | & \verb+(+~\NT{term}~\verb+)+ \\
+ \NT{defs} & ::= & & \mbox{\bf mutual definitions} \\
+ & & \NT{fun}~\{\verb+and+~\NT{fun}\} \\
+ \NT{fun} & ::= & & \mbox{\bf functions} \\
+ & & \NT{arg}~\{\NT{ptnames}\}^{+}~[\verb+on+~x]~\verb+\def+~\NT{term} \\
+ \NT{binder} & ::= & & \mbox{\bf binders} \\
+ & & \verb+\forall+ \mid \verb+\lambda+ \\
+ \NT{arg} & ::= & & \mbox{\bf single argument} \\
+ & & \verb+_+ \mid x \\
+ \NT{ptname} & ::= & & \mbox{\bf possibly typed name} \\
+ & & \NT{arg} \\
+ & | & \verb+(+~\NT{arg}~\verb+:+~\NT{term}~\verb+)+ \\
+ \NT{ptnames} & ::= & & \mbox{\bf bound variables} \\
+ & & \NT{arg} \\
+ & | & \verb+(+~\NT{arg}~\{\verb+,+~\NT{arg}\}~[\verb+:+~\NT{term}]~\verb+)+ \\
+ \NT{kind} & ::= & & \mbox{\bf induction kind} \\
+ & & \verb+rec+ \mid \verb+corec+ \\
+ \NT{rule} & ::= & & \mbox{\bf rules} \\
+ & & x~\{\NT{ptname}\}~\verb+\Rightarrow+~\NT{term}
+ \end{array}
+ \]
+ \hrule
+ \end{table}
+
+
+\subsubsection{Term input}
+
+The primary form of user interaction in \MATITA{} is textual script
+editing: the user edits a script and evaluates, step by step, its constituent
+\emph{statements}. Examples of statements are inductive type definitions,
+theorem declarations, LCF-style tacticals, and macros (e.g. \texttt{Check} can
+be used to ask the system to refine a given term and pretty print the result).
+Since many statements refer to terms of the underlying calculus, \MATITA{} needs
+a concrete syntax able to encode terms of the Calculus of Inductive
+Constructions.
+
+Two of the requirements in the design of such a syntax are apparently in
+conflict:
+\begin{enumerate}
+ \item the syntax should be as close as possible to common mathematical practice
+ and implement widespread mathematical notations;
+ \item each term described by the syntax should be unambiguous, meaning that
+ there should exist a function which associates to it a CIC term.
+\end{enumerate}
+
+These two requirements are addressed in \MATITA{} by means of two mechanisms
+which work together: \emph{term disambiguation} and \emph{extensible notation}.
+Their interaction is visible in the architecture of the \MATITA{} input phase,
+depicted in Fig.~\ref{fig:inputphase}. The architecture is articulated as a
+pipeline of three levels: the concrete syntax level (level 0) is the one the user
+has to deal with when entering CIC terms; the abstract syntax level (level 2)
+is an internal representation which intuitively encodes mathematical formulae at
+the content level~\cite{adams,mkm-structure}; the last level is that of
+CIC terms.
+
+\begin{figure}[ht]
+ \begin{center}
+ \includegraphics[width=0.9\textwidth]{input_phase}
+ \caption{\MATITA{} input phase}
+ \label{fig:inputphase}
+ \end{center}
+\end{figure}
+
+Requirement (1) is addressed by a built-in concrete syntax for terms, described
+in Tab.~\ref{tab:termsyn}, and by the extensible notation mechanism, which offers a
+way to extend the available mathematical notations. Extensible notation, which
+is also in charge of providing the parsing function mapping concrete syntax terms
+to content level terms, is described in Sect.~\ref{sec:notation}. Requirement
+(2) is addressed by the joint action of that parsing function and of
+disambiguation, which provides a function from content level terms to CIC terms.
+
+\subsubsection{Sources of ambiguity}
+
+The translation from content level terms to CIC terms is not straightforward,
+because some nodes of the content encoding admit more than one CIC encoding,
+invalidating requirement (2).
+
+\begin{example}
+ \label{ex:disambiguation}
+
+ Consider the concrete syntax level term \texttt{\TEXMACRO{forall} x. x +
+ ln 1 = x} of Fig.~\ref{fig:inputphase}(a), which may be the type of a lemma the
+ user wants to prove. Assuming that both \texttt{+} and \texttt{=} are parsed
+ as infix operators, all the following questions are legitimate and must be
+ answered before obtaining a CIC term from its content level encoding
+ (Fig.~\ref{fig:inputphase}(b)):
+
+ \begin{enumerate}
+
+ \item Since \texttt{ln} is an unbound identifier, which CIC constant does it
+ represent? Many different theorems in the library may share its (rather
+ short) name \dots
+
+ \item Which kind of number (\IN, \IR, \dots) does the \texttt{1} literal stand for?
+ Which encoding is used in CIC to represent it? E.g., assuming $1\in\IN$, is
+ it a unary or a binary encoding?