+\subsection{Content level terms}
+\label{sec:contentintro}
+
+The language used to communicate proofs and especially formulae with the
+user must not only be extensible and accommodate the usual mathematical
+notation. It must also reflect the comfortable and suggestive degree of
+notational abuse so typical of the mathematical language.
+
+For instance, it is common practice in mathematics to speak of a generic
+equality that can be used to compare any two terms. However, it is well known
+that several equalities must be distinguished as soon as we care about
+decidability or computational properties. For instance, equality over real
+numbers is well known to be undecidable, whereas it is decidable over
+rational numbers.
+
+Similarly, we usually speak of natural numbers and their operations and
+properties without caring about their representation. However, the computational
+properties of addition over the binary representation are very different from
+those of addition over the unary representation. And addition over two natural
+numbers is definitely different from addition over two real numbers.
+
+Formalized mathematics cannot hide these differences and forces the user to be
+very precise about the types he is using and their representation. However,
+to communicate formulae with the user and with external tools, it seems good
+practice to stick to the usual imprecise mathematical ontology. In the
+Mathematical Knowledge Management community this imprecise language is called
+the \emph{content level}~\cite{adams} representation of formulae.
+
+In \MATITA{} we provide translations from partially specified terms
+to content level terms and the other way around.
+The former translation must
+discriminate between terms used to represent proofs and terms used to represent
+formulae. Using techniques inspired by~\cite{natural,YANNTHESIS}, the former
+are translated to a content level representation of
+proof steps that can in turn easily be rendered in natural language.
+The representation
+adopted has greatly influenced the OMDoc~\cite{omdoc} proof format that is now
+isomorphic to it. Terms that represent formulae are translated to \MATHML{}
+Content formulae. \MATHML{} Content~\cite{mathml} is a W3C standard
+for the representation of content level formulae in an extensible XML format.
+
+The translation to content level is implemented in the
+\texttt{acic\_content} \component. Its input consists of \emph{annotated
+partially specified terms}, which are
+partially specified terms enriched with additional typing information for each
+subterm. This information is used to discriminate between terms that represent
+proofs and terms that represent formulae. Part of it is also stored at the
+content level since it is required to generate the natural language rendering
+of proofs.
+The \texttt{cic\_acic} \component{} annotates terms. It is used
+by the \texttt{library} \component{} since fully specified terms are stored
+in the library in their annotated form.
+
+We do not yet provide a reverse translation from content level proofs to
+partially specified terms. However, in \texttt{cic\_disambiguation} we do provide
+the reverse translation for formulae. The mapping from
+content level formulae to partially specified terms is not unique due to
+the ambiguity of the content level. As a consequence the translation
+is guided by an \emph{interpretation}, that is a function that chooses for
+every ambiguous formula one partially specified term. The
+\texttt{cic\_disambiguation} \component{} implements the
+disambiguation algorithm presented in~\cite{disambiguation}, which is
+responsible for efficiently building the set of all correct
+interpretations. An interpretation is correct if the partially specified term
+obtained using the interpretation is refinable.
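The scheme just described can be illustrated with a toy sketch (Python rather than \MATITA{}'s actual OCaml code; the names and the trivial refiner are hypothetical):

```python
# Toy model of interpretation-based disambiguation (hypothetical names,
# not Matita's actual implementation). An interpretation maps each
# ambiguity source to one candidate alias; it is correct only if the
# induced term is refinable (modeled here by a predicate).
from itertools import product

def interpretations(sources, aliases):
    """Enumerate all functions from ambiguity sources to their aliases."""
    choices = [aliases[s] for s in sources]
    return (dict(zip(sources, combo)) for combo in product(*choices))

def correct_interpretations(sources, aliases, refinable):
    """Keep only the interpretations whose induced term is refinable."""
    return [i for i in interpretations(sources, aliases) if refinable(i)]

# Example: "<" is overloaded over nat and Z; in this toy refiner only
# the nat reading type-checks against the surrounding term.
aliases = {"<": ["lt_nat", "lt_Z"]}
refinable = lambda i: i["<"] == "lt_nat"
result = correct_interpretations(["<"], aliases, refinable)
```

The actual algorithm of~\cite{disambiguation} does not enumerate naively but prunes the search space; the sketch only shows the specification it implements.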
+
+In Sect.~\ref{sec:partiallyintro} we described the semantics of
+a command as a
+function from status to status. We also hinted that the formulae in a
+command are encoded as partially specified terms. However, consider the
+command ``\texttt{replace} $x$ \texttt{with} $y^2$''. Until the occurrence
+of $x$ to be replaced is located, its context is unknown. Since $y^2$ must
+replace $x$ in that context, its encoding as a term cannot be computed
+until $x$ is located. In other words, $y^2$ must be disambiguated in the
+context of the occurrence of $x$ it must replace.
+
+The elegant solution we have implemented consists in representing terms
+in a command as functions from a context to a partially refined term. The
+function is obtained by partially applying our disambiguation function to
+the content level term to be disambiguated. Our solution should be compared with
+the one adopted in the \COQ{} system (where ambiguity is only relative to
+De Bruijn indexes).
+In \COQ, variables can be bound either by name or by position. A term
+occurring in a command has all its variables bound by name to avoid the need
+for a context during disambiguation. This makes every operation over terms
+(i.e. according to our architecture every module that depends on \texttt{cic})
+more complex, since the code must deal consistently with both kinds
+of binding. Moreover, this solution cannot cope with other forms of ambiguity
+(such as the context dependent meaning of the exponent in the previous example).
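The closure-based representation can be sketched as follows (a toy Python model, not \MATITA{}'s actual OCaml code; all names are hypothetical):

```python
# Sketch of delaying disambiguation until a context is known
# (hypothetical names). A term occurring in a command is stored as a
# closure from a context to a disambiguated term, obtained by partially
# applying the disambiguation function to the content level term.
from functools import partial

def disambiguate(content_term, context):
    """Toy disambiguation: resolve an identifier against the context."""
    return context.get(content_term, content_term)

def delayed_term(content_term):
    """A command stores this closure instead of a fixed term."""
    return partial(disambiguate, content_term)

# "y" is only resolved once the occurrence's context has been located,
# e.g. by the "replace ... with ..." command.
t = delayed_term("y")
resolved = t({"y": "y_as_integer"})
```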
+
+\subsection{Presentation level terms}
+\label{sec:presentationintro}
+
+Content level terms are a sort of abstract syntax trees for mathematical
+formulae and proofs. The concrete syntax given to these abstract trees
+is called \emph{presentation level}.
+
+The most important difference between the content level language and the
+presentation level language is that only the former is extensible. Indeed,
+the presentation level language is a finite language that comprises all
+the usual mathematical symbols. Mathematicians invent new notions every
+single day, but they stick to a set of symbols that is more or less fixed.
+
+The fact that the presentation language is finite allows the definition of
+standard languages. In particular, for formulae we have adopted \MATHML{}
+Presentation~\cite{mathml}, an XML dialect standardized by the W3C. To
+visually
+represent proofs it is enough to embed formulae in plain text enriched with
+formatting boxes. Since the language of formatting boxes is very simple,
+many equivalent specifications exist and we have adopted our own, called
+\BOXML.
+
+The \texttt{content\_pres} \component{} contains the implementation of the
+translation from content level terms to presentation level terms. The
+rendering of presentation level terms is left to the application that uses
+the \component. However, in the \texttt{hgdome} \component{} we provide a few
+utility functions to build a \GDOME~\cite{gdome2} \MATHML+\BOXML{} tree from our
+presentation
+level terms. \GDOME{} \MATHML+\BOXML{} trees can be rendered by the
+\GTKMATHVIEW{}
+widget developed by Luca Padovani~\cite{padovani}. The widget is
+particularly interesting since it allows the implementation of \emph{semantic
+selection}.
+
+Semantic selection is a technique that consists in enriching the presentation
+level terms with pointers to the content level terms and to the partially
+specified terms they correspond to. Highlighting of formulae in the widget is
+constrained to the selection of meaningful expressions, i.e. expressions that
+correspond to a lower level term, that is a content term or a partially or
+fully specified term.
+Once the rendering of a lower level term is
+selected it is possible for the application to retrieve the pointer to the
+lower level term. An example application of semantic selection is
+\emph{semantic copy \& paste}: the user can select an expression and paste it
+elsewhere preserving its semantics (i.e. the partially specified term),
+possibly performing some semantic transformation over it (e.g. renaming
+variables that would be captured or lambda-lifting free variables).
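A minimal model of this cross-linking can be sketched as follows (a hypothetical Python data model, not \GTKMATHVIEW{}'s actual API):

```python
# Sketch of semantic selection (hypothetical data model): presentation
# nodes optionally carry a pointer to the lower level (content or
# partially specified) term they render. Selection snaps to nodes that
# carry such a pointer, and semantic copy retrieves the pointer.
class PresNode:
    def __init__(self, text, children=(), sem=None):
        self.text, self.children, self.sem = text, list(children), sem

def selectable(node):
    return node.sem is not None

def semantic_copy(node):
    """Return the semantic term behind a selection, if any."""
    return node.sem if selectable(node) else None

# A bare "+" glyph maps to no lower level term, so it is not selectable;
# the whole rendering of the sum is.
plus = PresNode("+")  # operator glyph only: no semantic pointer
summ = PresNode("x + y",
                [PresNode("x", sem="x"), plus, PresNode("y", sem="y")],
                sem="plus x y")
```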
+
+The reverse translation from presentation level terms to content level terms
+is implemented by a parser that is also found in \texttt{content\_pres}.
+Unlike the translation from content level terms to partially
+refined terms, this translation is not ambiguous. The reason is that the
+parsing tool we have adopted (CamlP4) is not able to parse ambiguous
+grammars. Thus we require the mapping from presentation level terms
+(concrete syntax) to content level terms (abstract syntax) to be unique.
+This means that the user must fix once and for all the associativity and
+precedence level of every operator he is using. In practice this limitation
+does not seem too strong. The reason is that the target of the
+translation is an ambiguous language and the user is free to associate
+to every content level term several different interpretations (as a
+partially specified term).
+
+Both the direct and reverse translation from presentation to content level
+terms are parameterized over the user provided mathematical notation.
+The \texttt{lexicon} \component{} is responsible for managing the lexicon,
+that is the set of active notations. It defines an abstract syntax tree
+of commands to declare and activate new notations and it implements the
+semantics of these commands. It also implements undoing of the semantic
+actions. Among the commands there are hints to the
+disambiguation algorithm that are used to control and speed up disambiguation.
+These mechanisms will be further discussed in Sect.~\ref{sec:disambiguation}.
+
+Finally, the \texttt{grafite\_parser} \component{} implements a parser for
+the concrete syntax of the commands of \MATITA. The parser processes a stream
+of characters and returns a stream of abstract syntax trees (the ones
+defined by the \texttt{grafite} component and whose semantics is given
+by \texttt{grafite\_engine}). When the parser meets a command that changes
+the lexicon, it invokes the \texttt{lexicon} \component{} to immediately
+process the command. When the parser needs to parse a term at the presentation
+level, it invokes the already described parser for terms contained in
+\texttt{content\_pres}.
+
+The \MATITA{} proof assistant and the \WHELP{} search engine are both linked
+against the \texttt{grafite\_parser} \component{}
+since they provide an interface to the user. In both cases the formulae
+written by the user are parsed using the \texttt{content\_pres} \component{} and
+then disambiguated using the \texttt{cic\_disambiguation} \component. However,
+only \MATITA{} is linked against the \texttt{grafite\_engine} and
+\texttt{tactics} components (summing up to a total of 11'200 lines of code)
+since \WHELP{} can only execute those ASTs that correspond to queries
+(implemented in the \texttt{whelp} component).
+
+The \UWOBO{} Web service wraps the \texttt{content\_pres} \component,
+providing a rendering service for the documents in the distributed library.
+To render a document given its URI, \UWOBO{} retrieves it using the
+\GETTER{} obtaining a document with fully specified terms. Then it translates
+it to the presentation level passing through the content level. Finally
+it returns the result document to be rendered by the user's
+browser.
+
+The \components{} not yet described (\texttt{extlib}, \texttt{xml},
+\texttt{logger}, \texttt{registry} and \texttt{utf8\_macros}) are
+minor \components{} that provide a core of useful functions and basic
+services missing from the standard library of the programming language.
+%In particular, the \texttt{xml} \component{} is used to easily represent,
+%parse and pretty-print XML files.
+
+\section{The interface to the library}
+\label{sec:library}
+
+A proof assistant provides both an interface to interact with its library and
+an authoring interface to develop new proofs and theories. In keeping
+with its historical origins, \MATITA{} strives to provide innovative
+functionalities for the interaction with the library. It is more traditional
+in its script based authoring interface. In the remaining part of the paper we
+focus on the user view of \MATITA.
+
+The library of \MATITA{} comprises mathematical concepts (theorems,
+axioms, definitions) and notation. The concepts are authored sequentially
+using scripts that are (ordered) sequences of procedural commands.
+Once they are produced we store them independently in the library.
+The only relations implicitly kept between the concepts are the logical,
+acyclic dependencies among them. This way the library forms a global (and
+distributed) hypertext.
+
+\begin{figure}[!ht]
+ \begin{center}
+ \includegraphics[width=0.45\textwidth]{pics/cicbrowser-screenshot-browsing}
+ \hspace{0.05\textwidth}
+ \includegraphics[width=0.45\textwidth]{pics/cicbrowser-screenshot-query}
+ \caption{Browsing and searching the library\strut}
+ \label{fig:cicbrowser1}
+ \end{center}
+\end{figure}
+
+\begin{figure}[!ht]
+ \begin{center}
+ \includegraphics[width=0.70\textwidth]{pics/cicbrowser-screenshot-con}
+ \caption[Natural language rendering]{Natural language rendering of a theorem
+ from the library\strut}
+ \label{fig:cicbrowser2}
+ \end{center}
+\end{figure}
+
+Several useful operations can be implemented on the library alone,
+regardless of the scripts. For instance, searching and browsing are
+implemented by the ``cicBrowser'' window available from the \MATITA{}
+GUI. Using it, the hierarchical structure of the library can be
+explored (on the left of Fig.~\ref{fig:cicbrowser1}), the natural
+language rendering of proofs can be inspected
+(Fig.~\ref{fig:cicbrowser2}), and content based searches on the
+library can be performed (on the right of Fig.~\ref{fig:cicbrowser1}).
+Content based searches are described in
+Sect.~\ref{sec:indexing}. Other examples of library operations are
+disambiguation of content level terms (see
+Sect.~\ref{sec:disambiguation}) and automatic proof searching (see
+Sect.~\ref{sec:automation}).
+
+The key requirement for the previous operations is that the library must
+be fully accessible and in a logically consistent state. To preserve
+consistency, a concept cannot be altered or removed unless the part of the
+library that depends on it is modified accordingly. To allow incremental
+changes and cooperative development, consistent revisions are necessary.
+For instance, to modify a definition, the user could fork a new version
+of the library where the definition is updated and all the concepts that
+used to rely on it are absent. The user is then responsible for restoring
+the removed part in the new branch, merging the branch when the library is
+fully restored.
+
+To implement the proposed versioning system on top of a standard one
+it is necessary to implement \emph{invalidation} first. Invalidation
+is the operation that locates and removes from the library all the concepts
+that depend on a given one. As described in Sect.~\ref{sec:libmanagement},
+removing a concept from the library also involves deleting its metadata from the
+database.
+
+For non collaborative development, full versioning can be avoided, but
+invalidation is still required. Since nobody else relies on the
+user's development, the user is free to change and invalidate parts of the library
+without branching. Invalidation is still necessary to avoid using a
+concept that is no longer valid.
+So far, in \MATITA{} we address only this non collaborative scenario
+(see Sect.~\ref{sec:libmanagement}). Collaborative development and versioning
+are still under design.
+
+Scripts are not seen as constituents of the library. They are not published
+and indexed, so they cannot be searched or browsed using \HELM{} tools.
+However, they play a central role for the maintenance of the library.
+Indeed, once a concept is invalidated, the only way to restore it is to
+fix the possibly broken script that used to generate it.
+Moreover, during the authoring phase, scripts are a natural way to
+group concepts together. They also constitute a less fine grained clustering
+of concepts for invalidation.
+
+In the rest of this section we present in more detail the functionalities of
+\MATITA{} related to library management and exploitation.
+Sect.~\ref{sec:authoring} is devoted to the description of the peculiarities of
+the \MATITA{} authoring interface.
+
+\subsection{Indexing and searching}
+\label{sec:indexing}
+
+The \MATITA{} system is first of all an interface between the user and
+the mathematical library. For this reason, it is important to be
+able to search and retrieve mathematical concepts in a quick and
+effective way, assuming as little knowledge as possible about the
+library. To this aim, \MATITA{} uses a sophisticated indexing mechanism
+for mathematical concepts, based on a rich metadata set that has been
+tuned during the European project \MOWGLIIST{} \MOWGLI. The metadata
+set, and the searching facilities built on top of them --- collected
+in the so called \WHELP{} search engine --- have been
+extensively described in~\cite{whelp}. Let us just recall here that
+the \WHELP{} metadata model is essentially based on a single ternary relation
+\REF{p}{s}{t} stating that a concept $s$ refers to a concept $t$ at a
+given position $p$, where the position specifies the place of the
+occurrence of $t$ inside $s$ (we currently work with a fixed set of
+positions, discriminating the hypothesis from the conclusion and
+outermost from innermost occurrences). This approach is extremely
+flexible, since by extending the set of positions
+we may improve the granularity and the precision of our indexing technique,
+with no additional architectural impact.
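A toy model of the relation (Python, with hypothetical position and concept names) illustrates how queries reduce to filtering triples:

```python
# Toy version of the REF(p, s, t) metadata relation (position and
# concept names are hypothetical): s refers to t at position p.
# Indexing stores triples; a search filters them, e.g. "all concepts
# that use 'le' in their conclusion".
refs = {
    ("conclusion", "le_trans", "le"),
    ("hypothesis", "le_trans", "le"),
    ("conclusion", "plus_comm", "eq"),
}

def referring(target, position=None):
    """All concepts s with REF(position, s, target); any position if None."""
    return sorted({s for (p, s, t) in refs
                   if t == target and (position is None or p == position)})
```

Refining the set of positions only changes the first component of the triples, which is why finer indexing has no architectural impact.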
+
+Every time a new mathematical concept is created and saved by the user, it
+gets indexed and becomes immediately visible in the library. Several
+interesting and innovative features of \MATITA{} described in the following
+sections rely directly or indirectly on its metadata system and
+the search functionalities. Here, we shall just recall some of its most
+direct applications.
+
+A first, very simple but not negligible feature is the \emph{duplicate check}.
+As soon as a theorem is stated, just before starting its proof,
+the library is searched
+to check that no other equivalent statement has already been proved
+(based on the pattern matching functionality of \WHELP); if this is the case,
+a warning is raised to the user. At present, the notion of equivalence
+adopted by \MATITA{} is convertibility, but we may imagine weakening it
+in the future, covering for instance isomorphisms~\cite{delahaye}.
+
+Another useful \WHELP{} operation is \HINT; we may invoke this query
+at any moment during the authoring of a proof, resulting in the list
+of all theorems of the library which can be applied to the current
+goal. In practice, this is mostly used not to discover which theorems
+can be applied to a given goal, but to retrieve a theorem that
+we wish to apply and whose name we have forgotten.
+In fact, even if \MATITA{} adopts a semi-rigid naming convention for
+statements (see Sect.~\ref{sec:naming}) that greatly simplifies the effort
+of recalling names, the naming discipline remains one of the most
+annoying aspects of formal developments, and \HINT{} provides
+a very friendly solution.
+
+In the near future, we expect to extend the \HINT{} query to
+a \REWRITEHINT, resulting in all equational statements that
+can be applied to rewrite the current goal.
+
+\subsection{Disambiguation}
+\label{sec:disambiguation}
+
+Software applications that involve input of mathematical content should strive
+to require from the user as little drift from informal mathematics as possible.
+We believe this to be a fundamental aspect of the user interfaces of such
+applications.
+The drift is still very large for
+proofs~\cite{debrujinfactor}, while better results may be achieved for
+mathematical formulae. In \MATITA{} formulae can be written using a
+\TeX-like syntax (corresponding to presentation level terms) and are then
+translated (in multiple steps) to partially specified terms as sketched in
+Sect.~\ref{sec:contentintro}.
+
+The key ingredient of the translation is the generic disambiguation algorithm
+implemented in the \texttt{disambiguation} component of Fig.~\ref{fig:libraries}
+and presented in~\cite{disambiguation}. In this section we detail how to use
+that algorithm in the context of the development of a library of formalized
+mathematics. We will see that using multiple passes of the algorithm, varying
+some of its parameters, helps in keeping the input terse without sacrificing
+expressiveness.
+
+\subsubsection{Disambiguation preferences}
+\label{sec:disambaliases}
+
+Consider the following command that states a theorem over integer numbers:
+
+\begin{grafite}
+theorem Zlt_compat:
+ \forall x,y,z. x < y \to y < z \to x < z.
+\end{grafite}
+
+The symbol \OP{<} is likely to be overloaded in the library
+(at least over natural numbers).
+Thus, according to the disambiguation algorithm, two different
+refinable partially specified terms could be associated with it.
+\MATITA{} asks the user what interpretation he meant. However, to avoid
+posing the same question in case of a future re-execution (e.g. undo/redo),
+the choice must be recorded. Since scripts need to be re-executed after
+invalidation, the choice record must be permanently stored somewhere. The most
+natural place is the script itself.
+
+In \MATITA{} disambiguation is governed by \emph{disambiguation aliases}.
+They are mappings, stored in the library, from ambiguity sources
+(identifiers, symbols and literal numbers at the content level) to partially
+specified terms. In case of overloaded sources there exist multiple aliases
+with the same source. It is possible to record \emph{disambiguation
+preferences} to select one of the aliases of an overloaded source.
+
+Preferences can be explicitly given in the script (using the \texttt{alias}
+command), but
+are also implicitly added when a new concept is introduced (\emph{implicit
+preferences}) or after a successful disambiguation that did not require
+user interaction. Explicit preferences are added automatically by \MATITA{} to
+record the disambiguation choices of the user. For instance, after the
+disambiguation of the command above, the script is altered as follows:
+
+\begin{grafite}
+alias symbol "lt" = "integer 'less than'".
+theorem Zlt_compat:
+ \forall x,y,z. x < y \to y < z \to x < z.
+\end{grafite}
+
+The ``alias'' command in the example sets the preferred alias for the
+\OP{lt} symbol.
+
+Implicit preferences for new concepts are set since a concept just defined is
+likely to be the preferred one in the rest of the script. Implicit preferences
+learned from the disambiguation of previous commands ensure the coherence of
+the disambiguation in the rest of the script and speed up disambiguation
+by reducing the search space.
+
+Disambiguation preferences are included in the lexicon status
+(see Sect.~\ref{sec:presentationintro}) that is part of the authoring interface
+status. Unlike aliases, they are not part of the library.
+
+When starting a new authoring session the set of disambiguation preferences
+is empty. Until it contains a preference for each overloaded symbol to be
+used in the script, the user can be faced with questions from the disambiguator.
+To reduce the likelihood of user interactions, we introduced
+the \texttt{include} command. With \texttt{include} it is possible to import
+at once in the current session the set of preferences that was in effect
+at the end of the execution of a given script.
+The inclusion mechanism is thus significantly different from that of other
+systems, where concepts are effectively loaded and made visible by inclusion;
+in \MATITA{} all concepts are always visible, and inclusion, which is optional,
+is only used to set up preferences.
+
+Preferences can be changed. For instance, at the beginning of the development
+of integer numbers the preference for the symbol \OP{<} is likely
+to be the one over natural numbers; sooner or later it will be set to the one
+over integer numbers.
+
+Nothing forbids the set of preferences from becoming incoherent. For this
+reason the disambiguator cannot always respect the user preferences.
+Consider, for example:
+\begin{grafite}
+theorem Zlt_mono:
+ \forall x,y,k. x < y \to x < y + k.
+\end{grafite}
+
+No refinable partially specified term corresponds to the preferences:
+\OP{+} over natural numbers, \OP{<} over integer numbers. To overcome this
+limitation we organized disambiguation in \emph{multiple passes}: when the
+disambiguator fails, disambiguation is tried again with a less restrictive set of
+preferences.
+
+Several disambiguation parameters can vary among passes. With respect to
+preference handling we implemented three passes. In the first pass, called
+\emph{mono-preferences}, we consider only the aliases corresponding to the
+current set of preferences. In the second pass, called
+\emph{multi-preferences}, we
+consider every alias corresponding to a current or past preference. For
+instance, in the example above disambiguation succeeds in the multi-preference
+pass. In the third pass, called \emph{library-preferences}, all aliases
+available in the library are considered.
+
+The rationale behind this choice is to respect user preferences in the early
+passes, which complete quickly in case of failure; later passes are slower but
+have more chances of success.
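The cascade of passes can be sketched as follows (a toy Python model; the pass contents and the trivial \texttt{try\_pass} are hypothetical):

```python
# Sketch of the multi-pass disambiguation strategy (hypothetical names):
# try passes from the most restrictive alias set to the least
# restrictive one, returning the results of the first pass that yields
# at least one refinable interpretation.
def disambiguate_multipass(term, passes, try_pass):
    for alias_set in passes:  # mono-, multi-, then library-preferences
        results = try_pass(term, alias_set)
        if results:
            return results
    return []

# Toy example: the mono-preference set fails, the multi-preference set
# succeeds, so the slow library-wide pass is never reached.
passes = [{"lt_Z"}, {"lt_Z", "lt_nat"}, {"lt_Z", "lt_nat", "lt_real"}]
try_pass = lambda term, s: ["ok"] if "lt_nat" in s else []
out = disambiguate_multipass("x < y", passes, try_pass)
```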
+
+\subsubsection{Operator instances}
+\label{sec:disambinstances}
+
+Consider now the following theorem, where \texttt{pos} injects natural numbers
+into positive integers:
+
+\begin{grafite}
+theorem lt_to_Zlt_pos_pos:
+ \forall n,m: nat. n < m \to pos n < pos m.
+\end{grafite}
+and assume that there exist in the library aliases for \OP{<} over natural
+numbers and over integer numbers. None of the passes described above is able to
+disambiguate \texttt{lt\_to\_Zlt\_pos\_pos}, no matter how preferences are set.
+This is because the \OP{<} operator occurs twice in the content level term (it
+has two \emph{instances}) and two different interpretations for it have to be
+used in order to obtain a refinable partially specified term.
+
+To address this issue, we have the ability to consider each instance of a single
+symbol as a different ambiguous expression in the content level term,
+enabling the use of a different alias for each of them.
+Exploiting or not this feature is
+one of the disambiguation pass parameters. A disambiguation pass which exploits
+it is said to be using \emph{fresh instances} (as opposed to a \emph{shared
+instances} pass).
+
+Fresh instances lead to a non negligible performance loss (since the choice of
+an alias for one instance does not constrain the choice of the others). For
+this reason we always attempt a fresh instances pass only after attempting a
+shared instances pass.
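The difference in search space can be illustrated with a small sketch (Python; the alias names are hypothetical):

```python
# Sketch of shared vs fresh instances (hypothetical model). With shared
# instances every occurrence of a symbol receives the same alias; with
# fresh instances each occurrence chooses independently, enlarging the
# search space from |aliases| to |aliases| ** occurrences.
from itertools import product

def shared_candidates(aliases, occurrences):
    """One alias chosen once, used for every occurrence."""
    return [[a] * occurrences for a in aliases]

def fresh_candidates(aliases, occurrences):
    """Each occurrence chooses its alias independently."""
    return [list(c) for c in product(aliases, repeat=occurrences)]

# Two occurrences of "<": 2 candidates when shared, 4 when fresh.
# Only a fresh pass can give the mixed nat/Z reading needed by
# lt_to_Zlt_pos_pos.
shared = shared_candidates(["lt_nat", "lt_Z"], 2)
fresh = fresh_candidates(["lt_nat", "lt_Z"], 2)
```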
+
+%\paragraph{One-shot preferences} Disambiguation preferences as seen so far are
+%instance-independent. However, implicit preferences obtained as a result of a
+%disambiguation pass which uses fresh instances ought to be instance-dependent.
+%Informally, the set of preferences that can be respected by the disambiguator on
+%the theorem above is: ``the first instance of the \OP{<} symbol is over natural
+%numbers, while the second is on integer numbers''.
+
+Instance-dependent preferences are meaningful only for the term whose
+disambiguation generated them. For this reason we call them \emph{one-shot
+preferences} and \MATITA{} does not use them to disambiguate further terms in
+the script.
+
+\subsubsection{Implicit coercions}
+\label{sec:disambcoercions}