lexing: comparison thys/Paper/Paper.thy


-:a42c773ec8ab
+:841f7b9c0a6a
 expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if
 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
 derivatives is that they are neatly expressible in any functional language,
 and easily definable and reasoned about in theorem provers---the definitions
 just consist of inductive datatypes and simple recursive functions. A
-completely formalised correctness proof of this matcher in for example HOL4
+mechanised correctness proof of Brzozowski's matcher in, for example, HOL4
 has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
 of the work by Krauss and Nipkow \cite{Krauss2011}. And another one in Coq is given
 by Coquand and Siles \cite{Coquand2012}.
 If a regular expression matches a string, then in general there is more than
 \item[$\bullet$] \emph{Priority Rule:}
 For a particular longest initial substring, the first regular expression
 that can match determines the token.
 \end{itemize}
-\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords
+\noindent Consider for example a regular expression @{text "r\<^bsub>key\<^esub>"} for recognising keywords
 such as @{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"}
 recognising identifiers (say, a single character followed by
 characters or numbers).  Then we can form the regular expression
 @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX matching to tokenise strings,
 say @{text "iffoo"} and @{text "if"}.  For @{text "iffoo"} we obtain
 the star of a language and @{text "(ii)"} if @{term "s\<^sub>1"} is in a
 language and @{term "s\<^sub>2"} in the star of this language, then also @{term
 "s\<^sub>1 @ s\<^sub>2"} is in the star of this language. It will also be convenient
 to use the following notion of a \emph{semantic derivative} (or \emph{left
 quotient}) of a language defined as
-@{thm (lhs) Der_def} $\dn$ @{thm (rhs) Der_def}.
+%
+\begin{center}
+@{thm Der_def}\;.
+\end{center}
+\noindent
 For semantic derivatives we have the following equations (for example
 mechanically proved in \cite{Krauss2011}):
+%
 \begin{equation}\label{SemDer}
 \begin{array}{lcl}
 @{thm (lhs) Der_null}  & \dn & @{thm (rhs) Der_null}\\
 @{thm (lhs) Der_empty}  & \dn & @{thm (rhs) Der_empty}\\
 @{thm (lhs) Der_char}  & \dn & @{thm (rhs) Der_char}\\
 We may extend this definition to give derivatives w.r.t.~strings:
 \begin{center}
 \begin{tabular}{lcl}
 @{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}\\
-\end{tabular}
-\hspace{20mm}
-\begin{tabular}{lcl}
 @{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}\\
 \end{tabular}
 \end{center}
 \noindent Given the equations in \eqref{SemDer}, it is a relatively easy
 \begin{center}
 \begin{tabular}{c}
 \\[-8mm]
 @{thm[mode=Axiom] Prf.intros(4)} \qquad
-@{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[3mm]
+@{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[4mm]
 @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad
-@{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[3mm]
+@{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[4mm]
-@{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\\[3mm]
+@{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\\[4mm]
 @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad
 @{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]}
 \end{tabular}
 \end{center}
 \caption{The two phases of the algorithm by Sulzmann \& Lu \cite{Sulzmann2014},
 matching the string @{term "[a,b,c]"}. The first phase (the arrows from
 left to right) is \Brz's matcher building successive derivatives. If the
 last regular expression is @{term nullable}, then the functions of the
 second phase are called (the top-down and right-to-left arrows): first
-@{term mkeps} calculates a value witnessing
+@{term mkeps} calculates a value @{term "v\<^sub>4"} witnessing
 how the empty string has been recognised by @{term "r\<^sub>4"}. After
 that the function @{term inj} ``injects back'' the characters of the string into
 the values.
 \label{Sulz}}
 \end{figure}
 string.
 The most interesting idea from Sulzmann and Lu \cite{Sulzmann2014} is
 the construction of a value for how @{term "r\<^sub>1"} can match the
 string @{term "[a,b,c]"} from the value for how the last derivative, @{term
-"r\<^sub>4"} in Fig~\ref{Sulz}, can match the empty string. Sulzmann and
+"r\<^sub>4"} in Fig.~\ref{Sulz}, can match the empty string. Sulzmann and
 Lu achieve this by stepwise ``injecting back'' the characters into the
 values thus inverting the operation of building derivatives, but on the level
 of values. The corresponding function, called @{term inj}, takes three
 arguments, a regular expression, a character and a value. For example in
-the first (or right-most) @{term inj}-step in Fig~\ref{Sulz} the regular
+the first (or right-most) @{term inj}-step in Fig.~\ref{Sulz} it takes the regular
 expression @{term "r\<^sub>3"}, the character @{term c} from the last
 derivative step and @{term "v\<^sub>4"}, which is the value corresponding
 to the derivative regular expression @{term "r\<^sub>4"}. The result is
 the new value @{term "v\<^sub>3"}. The final result of the algorithm is
 the value @{term "v\<^sub>1"}. The @{term inj} function is defined by recursion on regular
 \noindent To better understand what is going on in this definition it
 might be instructive to look first at the three sequence cases (clauses
 (4)--(6)). In each case we need to construct an ``injected value'' for
 @{term "SEQ r\<^sub>1 r\<^sub>2"}. This must be a value of the form @{term
-"Seq DUMMY DUMMY"}. Recall the clause of the @{text derivative}-function
+"Seq DUMMY DUMMY"}\,. Recall the clause of the @{text derivative}-function
 for sequence regular expressions:
 \begin{center}
 @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} $\dn$ @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}
 \end{center}
 is a @{text Right}-value (see the side-condition in @{text "P+R"}).
 Also interesting is the rule for sequence regular expressions (@{text
 "PS"}). The first two premises state that @{term "v\<^sub>1"} and @{term "v\<^sub>2"}
 are the POSIX values for @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"}
 respectively. Consider now the third premise and note that the POSIX value
-of this rule should match the string @{term "s\<^sub>1 @ s\<^sub>2"}. According to the
+of this rule should match the string \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}}. According to the
 longest match rule, we want @{term "s\<^sub>1"} to be the longest initial
 split of \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}} such that @{term "s\<^sub>2"} is still recognised
 by @{term "r\<^sub>2"}. Let us assume, contrary to the third premise, that there
 \emph{exist} an @{term "s\<^sub>3"} and @{term "s\<^sub>4"} such that @{term "s\<^sub>2"}
 can be split up into a non-empty string @{term "s\<^sub>3"} and a possibly empty
 string @{term "s\<^sub>4"}. Moreover the longer string @{term "s\<^sub>1 @ s\<^sub>3"} can be
 matched by @{text "r\<^sub>1"} and the shorter @{term "s\<^sub>4"} can still be
 matched by @{term "r\<^sub>2"}. In this case @{term "s\<^sub>1"} would \emph{not} be the
 longest initial split of \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}} and therefore @{term "Seq v\<^sub>1
 v\<^sub>2"} cannot be a POSIX value for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}.
-The main point is that this side-condition ensures the longest
+The main point is that our side-condition ensures the longest
 match rule is satisfied.
 A similar condition is imposed on the POSIX value in the @{text
 "P\<star>"}-rule. There too we want @{term "s\<^sub>1"} to be the longest initial
 split of @{term "s\<^sub>1 @ s\<^sub>2"} and furthermore the corresponding value
 \begin{lemma}\label{Posix2}
 @{thm[mode=IfThen] Posix_injval}
 \end{lemma}
 \begin{proof}
+By induction on @{text r}. We explain two cases.
-By induction on @{text r}. Suppose @{term "r = ALT r\<^sub>1 r\<^sub>2"}. There are
+\begin{itemize}
+\item[$\bullet$] Case @{term "r = ALT r\<^sub>1 r\<^sub>2"}. There are
 two subcases, namely @{text "(a)"} \mbox{@{term "v = Left v'"}} and @{term
 "s \<in> der c r\<^sub>1 \<rightarrow> v'"}; and @{text "(b)"} @{term "v = Right v'"}, @{term
 "s \<notin> L (der c r\<^sub>1)"} and @{term "s \<in> der c r\<^sub>2 \<rightarrow> v'"}. In @{text "(a)"} we
 know @{term "s \<in> der c r\<^sub>1 \<rightarrow> v'"}, from which we can infer @{term "(c # s)
 \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v'"} by induction hypothesis and hence @{term "(c #
 s) \<in> ALT r\<^sub>1 r\<^sub>2 \<rightarrow> injval (ALT r\<^sub>1 r\<^sub>2) c (Left v')"} as needed. Similarly
 in subcase @{text "(b)"} where, however, in addition we have to use
 Prop.~\ref{derprop}(2) in order to infer @{term "c # s \<notin> L r\<^sub>1"} from @{term
 "s \<notin> L (der c r\<^sub>1)"}.
-Suppose @{term "r = SEQ r\<^sub>1 r\<^sub>2"}. There are three subcases:
+\item[$\bullet$] Case @{term "r = SEQ r\<^sub>1 r\<^sub>2"}. There are three subcases:
 \begin{quote}
 \begin{description}
 \item[@{text "(a)"}] @{term "v = Left (Seq v\<^sub>1 v\<^sub>2)"} and @{term "nullable r\<^sub>1"}
 \item[@{text "(b)"}] @{term "v = Right v\<^sub>1"} and @{term "nullable r\<^sub>1"}
 Finally suppose @{term "r = STAR r\<^sub>1"}. This case is very similar to the
 sequence case, except that we need to also ensure that @{term "flat (injval r\<^sub>1
 c v\<^sub>1) \<noteq> []"}. This follows from @{term "(c # s\<^sub>1)
 \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}  (which in turn follows from @{term "s\<^sub>1 \<in> der c
 r\<^sub>1 \<rightarrow> v\<^sub>1"} and the induction hypothesis).\qed
+\end{itemize}
 \end{proof}
 \noindent
 With Lem.~\ref{Posix2} in place, it is completely routine to establish
 that the Sulzmann and Lu lexer satisfies our specification (returning
 \begin{proof}
 By induction on @{term s} using Lem.~\ref{lemmkeps} and \ref{Posix2}.\qed
 \end{proof}
-\noindent This concludes our correctness proof. Note that we have not
+\noindent In (2) we further know by Thm.~\ref{posixdeterm} that the
-changed the algorithm of Sulzmann and Lu,\footnote{All deviations we
+value returned by the lexer must be unique.  This concludes our
-introduced are harmless.} but introduced our own specification for what a
+correctness proof. Note that we have not changed the algorithm of
-correct result---a POSIX value---should be. A strong point in favour of
+Sulzmann and Lu,\footnote{All deviations we introduced are
-Sulzmann and Lu's algorithm is that it can be extended in various ways.
+harmless.} but introduced our own specification for what a correct
+result---a POSIX value---should be. A strong point in favour of
+Sulzmann and Lu's algorithm is that it can be extended in various
+ways.
 *}
 section {* Extensions and Optimisations*}
 of @{term "ALT ZERO r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and
 @{term "SEQ r ONE"} to @{term r}. These simplifications can speed up the
 algorithms considerably, as noted in \cite{Sulzmann2014}. One of the
 advantages of having a simple specification and correctness proof is that
 the latter can be refined to prove the correctness of such simplification
-steps.
+steps. While the simplification of regular expressions according to
-While the simplification of regular expressions according to
 rules like
 \begin{equation}\label{Simpl}
 \begin{array}{lcllcllcllcl}
 @{term "ALT ZERO r"} & @{text "\<Rightarrow>"} & @{term r} \hspace{8mm}%\\
 is then recursively called with the simplified derivative, but before
 we inject the character @{term c} into the value @{term v}, we need to rectify
 @{term v} (that is construct @{term "f\<^sub>r v"}). Before we can establish the correctness
 of @{term "slexer"}, we need to show that simplification preserves the language
 and simplification preserves our POSIX relation once the value is rectified
-(recall @{const "simp"} generates a regular expression / rectification function pair):
+(recall @{const "simp"} generates a (regular expression, rectification function) pair):
 \begin{lemma}\mbox{}\smallskip\\\label{slexeraux}
 \begin{tabular}{ll}
 (1) & @{thm L_fst_simp[symmetric]}\\
 (2) & @{thm[mode=IfThen] Posix_simp}
 cannot be maintained as one descends into the induction. This is a problem
 that occurs in a number of places in the proofs by Sulzmann and Lu.
 Although they do not give an explicit proof of the transitivity property,
 they give a closely related property about the existence of maximal
-elements. They state that this can be verified by an induction on $r$. We
+elements. They state that this can be verified by an induction on @{term r}. We
 disagree with this as we shall show next in the case of transitivity. The case
 where the reasoning breaks down is the sequence case, say @{term "SEQ r\<^sub>1 r\<^sub>2"}.
 The induction hypotheses in this case are
 \begin{center}

changeset 185	841f7b9c0a6a
parent 184	a42c773ec8ab
child 186	0b94800eb616