diff -r a42c773ec8ab -r 841f7b9c0a6a thys/Paper/Paper.thy --- a/thys/Paper/Paper.thy Tue May 17 14:28:22 2016 +0100 +++ b/thys/Paper/Paper.thy Wed May 18 15:57:46 2016 +0100 @@ -97,7 +97,7 @@ derivatives is that they are neatly expressible in any functional language, and easily definable and reasoned about in theorem provers---the definitions just consist of inductive datatypes and simple recursive functions. A -completely formalised correctness proof of this matcher in for example HOL4 +mechanised correctness proof of Brzozowski's matcher in for example HOL4 has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part of the work by Krauss and Nipkow \cite{Krauss2011}. And another one in Coq is given by Coquand and Siles \cite{Coquand2012}. @@ -135,7 +135,7 @@ that can match determines the token. \end{itemize} -\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords +\noindent Consider for example a regular expression @{text "r\<^bsub>key\<^esub>"} for recognising keywords such as @{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising identifiers (say, a single character followed by characters or numbers). Then we can form the regular expression @@ -269,10 +269,15 @@ "s\<^sub>1 @ s\<^sub>2"} is in the star of this language. It will also be convenient to use the following notion of a \emph{semantic derivative} (or \emph{left quotient}) of a language defined as - @{thm (lhs) Der_def} $\dn$ @{thm (rhs) Der_def}. + % + \begin{center} + @{thm Der_def}\;. + \end{center} + + \noindent For semantic derivatives we have the following equations (for example mechanically proved in \cite{Krauss2011}): - + % \begin{equation}\label{SemDer} \begin{array}{lcl} @{thm (lhs) Der_null} & \dn & @{thm (rhs) Der_null}\\ @@ -322,9 +327,6 @@ \begin{center} \begin{tabular}{lcl} @{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}\\ - \end{tabular} - \hspace{20mm} - \begin{tabular}{lcl} @{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}\\ \end{tabular} \end{center} @@ -405,10 +407,10 @@ \begin{tabular}{c} \\[-8mm] @{thm[mode=Axiom] Prf.intros(4)} \qquad - @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[3mm] + @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[4mm] @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad - @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[3mm] - @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\\[3mm] + @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[4mm] + @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\\[4mm] @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad @{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]} \end{tabular} @@ -472,7 +474,7 @@ left to right) is \Brz's matcher building successive derivatives. If the last regular expression is @{term nullable}, then the functions of the second phase are called (the top-down and right-to-left arrows): first -@{term mkeps} calculates a value witnessing +@{term mkeps} calculates a value @{term "v\<^sub>4"} witnessing how the empty string has been recognised by @{term "r\<^sub>4"}. After that the function @{term inj} ``injects back'' the characters of the string into the values. @@ -502,12 +504,12 @@ The most interesting idea from Sulzmann and Lu \cite{Sulzmann2014} is the construction of a value for how @{term "r\<^sub>1"} can match the string @{term "[a,b,c]"} from the value how the last derivative, @{term - "r\<^sub>4"} in Fig~\ref{Sulz}, can match the empty string. Sulzmann and + "r\<^sub>4"} in Fig.~\ref{Sulz}, can match the empty string. Sulzmann and Lu achieve this by stepwise ``injecting back'' the characters into the values thus inverting the operation of building derivatives, but on the level of values. The corresponding function, called @{term inj}, takes three arguments, a regular expression, a character and a value. For example in - the first (or right-most) @{term inj}-step in Fig~\ref{Sulz} the regular + the first (or right-most) @{term inj}-step in Fig.~\ref{Sulz} the regular expression @{term "r\<^sub>3"}, the character @{term c} from the last derivative step and @{term "v\<^sub>4"}, which is the value corresponding to the derivative regular expression @{term "r\<^sub>4"}. The result is @@ -538,7 +540,7 @@ might be instructive to look first at the three sequence cases (clauses (4)--(6)). In each case we need to construct an ``injected value'' for @{term "SEQ r\<^sub>1 r\<^sub>2"}. This must be a value of the form @{term - "Seq DUMMY DUMMY"}. Recall the clause of the @{text derivative}-function + "Seq DUMMY DUMMY"}\,. Recall the clause of the @{text derivative}-function for sequence regular expressions: \begin{center} @@ -674,7 +676,7 @@ "PS"}). The first two premises state that @{term "v\<^sub>1"} and @{term "v\<^sub>2"} are the POSIX values for @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"} respectively. Consider now the third premise and note that the POSIX value - of this rule should match the string @{term "s\<^sub>1 @ s\<^sub>2"}. According to the + of this rule should match the string \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}}. According to the longest match rule, we want that the @{term "s\<^sub>1"} is the longest initial split of \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}} such that @{term "s\<^sub>2"} is still recognised by @{term "r\<^sub>2"}. Let us assume, contrary to the third premise, that there @@ -685,7 +687,7 @@ matched by @{term "r\<^sub>2"}. In this case @{term "s\<^sub>1"} would \emph{not} be the longest initial split of \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}} and therefore @{term "Seq v\<^sub>1 v\<^sub>2"} cannot be a POSIX value for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}. - The main point is that this side-condition ensures the longest + The main point is that our side-condition ensures the longest match rule is satisfied. A similar condition is imposed on the POSIX value in the @{text @@ -716,8 +718,10 @@ \end{lemma} \begin{proof} + By induction on @{text r}. We explain two cases. - By induction on @{text r}. Suppose @{term "r = ALT r\<^sub>1 r\<^sub>2"}. There are + \begin{itemize} + \item[$\bullet$] Case @{term "r = ALT r\<^sub>1 r\<^sub>2"}. There are two subcases, namely @{text "(a)"} \mbox{@{term "v = Left v'"}} and @{term "s \ der c r\<^sub>1 \ v'"}; and @{text "(b)"} @{term "v = Right v'"}, @{term "s \ L (der c r\<^sub>1)"} and @{term "s \ der c r\<^sub>2 \ v'"}. In @{text "(a)"} we @@ -728,7 +732,7 @@ Prop.~\ref{derprop}(2) in order to infer @{term "c # s \ L r\<^sub>1"} from @{term "s \ L (der c r\<^sub>1)"}. - Suppose @{term "r = SEQ r\<^sub>1 r\<^sub>2"}. There are three subcases: + \item[$\bullet$] Case @{term "r = SEQ r\<^sub>1 r\<^sub>2"}. There are three subcases: \begin{quote} \begin{description} @@ -768,6 +772,7 @@ c v\<^sub>1) \ []"}. This follows from @{term "(c # s\<^sub>1) \ r\<^sub>1 \ injval r\<^sub>1 c v\<^sub>1"} (which in turn follows from @{term "s\<^sub>1 \ der c r\<^sub>1 \ v\<^sub>1"} and the induction hypothesis).\qed + \end{itemize} \end{proof} \noindent @@ -787,11 +792,14 @@ By induction on @{term s} using Lem.~\ref{lemmkeps} and \ref{Posix2}.\qed \end{proof} - \noindent This concludes our correctness proof. Note that we have not - changed the algorithm of Sulzmann and Lu,\footnote{All deviations we - introduced are harmless.} but introduced our own specification for what a - correct result---a POSIX value---should be. A strong point in favour of - Sulzmann and Lu's algorithm is that it can be extended in various ways. + \noindent In (2) we further know by Thm.~\ref{posixdeterm} that the + value returned by the lexer must be unique. This concludes our + correctness proof. Note that we have not changed the algorithm of + Sulzmann and Lu,\footnote{All deviations we introduced are + harmless.} but introduced our own specification for what a correct + result---a POSIX value---should be. A strong point in favour of + Sulzmann and Lu's algorithm is that it can be extended in various + ways. *} @@ -843,9 +851,7 @@ algorithms considerably, as noted in \cite{Sulzmann2014}. One of the advantages of having a simple specification and correctness proof is that the latter can be refined to prove the correctness of such simplification - steps. - - While the simplification of regular expressions according to + steps. While the simplification of regular expressions according to rules like \begin{equation}\label{Simpl} @@ -930,7 +936,7 @@ @{term v} (that is construct @{term "f\<^sub>r v"}). Before we can establish the correctness of @{term "slexer"}, we need to show that simplification preserves the language and simplification preserves our POSIX relation once the value is rectified - (recall @{const "simp"} generates a regular expression / rectification function pair): + (recall @{const "simp"} generates a (regular expression, rectification function) pair): \begin{lemma}\mbox{}\smallskip\\\label{slexeraux} \begin{tabular}{ll} @@ -1108,7 +1114,7 @@ Although they do not give an explicit proof of the transitivity property, they give a closely related property about the existence of maximal - elements. They state that this can be verified by an induction on $r$. We + elements. They state that this can be verified by an induction on @{term r}. We disagree with this as we shall show next in case of transitivity. The case where the reasoning breaks down is the sequence case, say @{term "SEQ r\<^sub>1 r\<^sub>2"}. The induction hypotheses in this case are