thys/Paper/Paper.thy
changeset 137 4178b7e71809
parent 136 fa0d8aa5d7de
child 138 a87b8a09ffe8
equal deleted inserted replaced
136:fa0d8aa5d7de 137:4178b7e71809
   552   an induction on @{term r}. There are no interesting cases.\qed
   552   an induction on @{term r}. There are no interesting cases.\qed
   553   \end{proof}
   553   \end{proof}
   554 
   554 
   555   Having defined the @{const mkeps} and @{text inj} functions, we can extend
   555   Having defined the @{const mkeps} and @{text inj} functions, we can extend
   556   \Brz's matcher so that a lexical value is constructed (assuming the
   556   \Brz's matcher so that a lexical value is constructed (assuming the
   557   regular expression matches the string). The clauses of the lexer are
   557   regular expression matches the string). The clauses of the Sulzmann and Lu lexer are
   558 
   558 
   559   \begin{center}
   559   \begin{center}
   560   \begin{tabular}{lcl}
   560   \begin{tabular}{lcl}
   561   @{thm (lhs) matcher.simps(1)} & $\dn$ & @{thm (rhs) matcher.simps(1)}\\
   561   @{thm (lhs) matcher.simps(1)} & $\dn$ & @{thm (rhs) matcher.simps(1)}\\
   562   @{thm (lhs) matcher.simps(2)} & $\dn$ & @{text "case"} @{term "matcher (der c r) s"} @{text of}\\
   562   @{thm (lhs) matcher.simps(2)} & $\dn$ & @{text "case"} @{term "matcher (der c r) s"} @{text of}\\
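
  To make these clauses concrete, the following is a minimal executable sketch in
  Scala of the matcher/lexer described in this section. The datatype, constructor
  and function names are ours, chosen for illustration; the definitions follow the
  standard derivative-based presentation, not the Isabelle formalisation verbatim.

  \begin{verbatim}
  // regular expressions and (lexical) values
  sealed trait Rexp
  case object ZERO extends Rexp
  case object ONE extends Rexp
  case class CHAR(c: Char) extends Rexp
  case class ALT(r1: Rexp, r2: Rexp) extends Rexp
  case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
  case class STAR(r: Rexp) extends Rexp

  sealed trait Val
  case object Void extends Val
  case class Chr(c: Char) extends Val
  case class Sequ(v1: Val, v2: Val) extends Val
  case class Lft(v: Val) extends Val
  case class Rgt(v: Val) extends Val
  case class Stars(vs: List[Val]) extends Val

  def nullable(r: Rexp): Boolean = r match {
    case ZERO => false
    case ONE => true
    case CHAR(_) => false
    case ALT(r1, r2) => nullable(r1) || nullable(r2)
    case SEQ(r1, r2) => nullable(r1) && nullable(r2)
    case STAR(_) => true
  }

  def der(c: Char, r: Rexp): Rexp = r match {
    case ZERO => ZERO
    case ONE => ZERO
    case CHAR(d) => if (c == d) ONE else ZERO
    case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
    case SEQ(r1, r2) =>
      if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
      else SEQ(der(c, r1), r2)
    case STAR(r1) => SEQ(der(c, r1), STAR(r1))
  }

  // mkeps is partial: it is only called on nullable regular expressions
  def mkeps(r: Rexp): Val = r match {
    case ONE => Void
    case ALT(r1, r2) => if (nullable(r1)) Lft(mkeps(r1)) else Rgt(mkeps(r2))
    case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
    case STAR(_) => Stars(Nil)
  }

  // inj undoes one derivative step on the level of values
  def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
    case (CHAR(d), Void) => Chr(d)
    case (ALT(r1, _), Lft(v1)) => Lft(inj(r1, c, v1))
    case (ALT(_, r2), Rgt(v2)) => Rgt(inj(r2, c, v2))
    case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
    case (SEQ(r1, _), Lft(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
    case (SEQ(r1, r2), Rgt(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
    case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
  }

  // the lexer, following the clauses shown above
  def lex(r: Rexp, s: List[Char]): Option[Val] = s match {
    case Nil => if (nullable(r)) Some(mkeps(r)) else None
    case c :: cs => lex(der(c, r), cs).map(v => inj(r, c, v))
  }
  \end{verbatim}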
   636 
   636 
   637   A similar condition is imposed on the POSIX value in the @{text
   637   A similar condition is imposed on the POSIX value in the @{text
   638   "P\<star>"}-rule. Also there we want that @{term "s\<^sub>1"} is the longest initial
   638   "P\<star>"}-rule. Also there we want that @{term "s\<^sub>1"} is the longest initial
   639   split of @{term "s\<^sub>1 @ s\<^sub>2"} and furthermore the corresponding value
   639   split of @{term "s\<^sub>1 @ s\<^sub>2"} and furthermore the corresponding value
   640   @{term v} cannot be flattened to the empty string. In effect, we require
   640   @{term v} cannot be flattened to the empty string. In effect, we require
   641   that in each ``iteration'' of the star, some non-empty substring need to
   641   that in each ``iteration'' of the star, some non-empty substring needs to
   642   be ``chipped'' away; only in the case of the empty string do we accept @{term
   642   be ``chipped'' away; only in the case of the empty string do we accept @{term
   643   "Stars []"} as the POSIX value.
   643   "Stars []"} as the POSIX value.
   644 
   644 
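  In rule form, the @{text "P\<star>"}-rule sketched above can be rendered roughly as
  follows (our informal reconstruction from the description; the formalised rule
  may differ in details):

  \[
  \begin{array}{c}
  s_1 \in r \rightarrow v \qquad s_2 \in r^\star \rightarrow \textit{Stars}\,vs
  \qquad \textit{flat}\,v \neq []\\
  \neg(\exists s_3\,s_4.\; s_3 \neq [] \,\wedge\, s_3\,@\,s_4 = s_2 \,\wedge\,
  s_1\,@\,s_3 \in L(r) \,\wedge\, s_4 \in L(r^\star))\\
  \hline
  (s_1\,@\,s_2) \in r^\star \rightarrow \textit{Stars}\,(v{::}vs)
  \end{array}
  \]
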
   645    We can prove that given a string @{term s} and regular expression @{term
   645    We can prove that given a string @{term s} and regular expression @{term
   646    r}, the POSIX value @{term v} is uniquely determined by @{term "s \<in> r \<rightarrow>
   646    r}, the POSIX value @{term v} is uniquely determined by @{term "s \<in> r \<rightarrow>
   707   \noindent From the latter we can infer by Prop.~\ref{derprop}(2):
   707   \noindent From the latter we can infer by Prop.~\ref{derprop}(2):
   708 
   708 
   709   \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> (c # s\<^sub>1) @ s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\]
   709   \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> (c # s\<^sub>1) @ s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\]
   710 
   710 
   711   \noindent We can use the induction hypothesis for @{text "r\<^sub>1"} to obtain
   711   \noindent We can use the induction hypothesis for @{text "r\<^sub>1"} to obtain
   712   @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}. This allows us to infer
   712   @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}. Taken together, this allows us to infer
   713   @{term "((c # s\<^sub>1) @ s\<^sub>2) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (injval r\<^sub>1 c v\<^sub>1) v\<^sub>2"}. The case @{text "(c)"}
   713   @{term "((c # s\<^sub>1) @ s\<^sub>2) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (injval r\<^sub>1 c v\<^sub>1) v\<^sub>2"}. The case @{text "(c)"}
   714   is similar.
   714   is similar.
   715 
   715 
   716   For @{text "(b)"} we know @{term "s \<in> der c r\<^sub>2 \<rightarrow> v\<^sub>1"} and 
   716   For @{text "(b)"} we know @{term "s \<in> der c r\<^sub>2 \<rightarrow> v\<^sub>1"} and 
   717   @{term "s\<^sub>1 @ s\<^sub>2 \<notin> L (SEQ (der c r\<^sub>1) r\<^sub>2)"}. From the former
   717   @{term "s\<^sub>1 @ s\<^sub>2 \<notin> L (SEQ (der c r\<^sub>1) r\<^sub>2)"}. From the former
   720 
   720 
   721   \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = c # s \<and> s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\]
   721   \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = c # s \<and> s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\]
   722 
   722 
   723   \noindent By Lem.~\ref{lemmkeps} we know @{term "[] \<in> r\<^sub>1 \<rightarrow> (mkeps r\<^sub>1)"}
   723   \noindent By Lem.~\ref{lemmkeps} we know @{term "[] \<in> r\<^sub>1 \<rightarrow> (mkeps r\<^sub>1)"}
   724   holds. Putting this all together, we can conclude with @{term "(c #
   724   holds. Putting this all together, we can conclude with @{term "(c #
   725   s) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (mkeps r\<^sub>1) (injval r\<^sub>2 c v\<^sub>1)"}.
   725   s) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (mkeps r\<^sub>1) (injval r\<^sub>2 c v\<^sub>1)"}, as required.
   726 
   726 
   727   Finally suppose @{term "r = STAR r\<^sub>1"}. This case is very similar to the
   727   Finally suppose @{term "r = STAR r\<^sub>1"}. This case is very similar to the
   728   sequence case, except that we need to ensure that @{term "flat (injval r\<^sub>1
   728   sequence case, except that we need to ensure that @{term "flat (injval r\<^sub>1
   729   c v\<^sub>1) \<noteq> []"}. This follows from @{term "(c # s\<^sub>1)
   729   c v\<^sub>1) \<noteq> []"}. This follows from @{term "(c # s\<^sub>1)
   730   \<in> r' \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}  (which in turn follows from @{term "s\<^sub>1 \<in> der c
   730   \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}  (which in turn follows from @{term "s\<^sub>1 \<in> der c
   731   r\<^sub>1 \<rightarrow> v\<^sub>1"} and the induction hypothesis).\qed
   731   r\<^sub>1 \<rightarrow> v\<^sub>1"} and the induction hypothesis).\qed
   732   \end{proof}
   732   \end{proof}
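
  As a concrete instance of the sequence cases in this proof, the Scala sketch
  given earlier can be run on a two-letter example (a hypothetical test of ours,
  not taken from the formalisation). The first assertion shows the value computed
  after the derivative w.r.t.~@{text a}, where the @{text Void} component is
  contributed by @{const mkeps}; injecting @{text a} back then yields the POSIX
  value for the complete string.

  \begin{verbatim}
  // reusing Rexp, Val, der, mkeps, inj and lex from the earlier sketch
  val r = SEQ(CHAR('a'), CHAR('b'))

  // lexing the suffix "b" against the derivative of r w.r.t. 'a':
  // the right alternative is taken and mkeps supplies the Void
  assert(lex(der('a', r), List('b')) == Some(Sequ(Void, Chr('b'))))

  // injecting 'a' back gives the POSIX value for the whole string "ab"
  assert(lex(r, "ab".toList) == Some(Sequ(Chr('a'), Chr('b'))))
  \end{verbatim}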
   733 
   733 
   734   \noindent
   734   \noindent
   735   With Lem.~\ref{PMatch2} in place, it is completely routine to establish
   735   With Lem.~\ref{PMatch2} in place, it is completely routine to establish
   747   \begin{proof}
   747   \begin{proof}
   748   By induction on @{term s} using Lem.~\ref{lemmkeps} and \ref{PMatch2}.\qed  
   748   By induction on @{term s} using Lem.~\ref{lemmkeps} and \ref{PMatch2}.\qed  
   749   \end{proof}
   749   \end{proof}
   750 
   750 
   751   \noindent This concludes our correctness proof. Note that we have not
   751   \noindent This concludes our correctness proof. Note that we have not
   752   changed the algorithm by Sulzmann and Lu, but introduced our own
   752   changed the algorithm by Sulzmann and Lu,\footnote{Any deviation we introduced is harmless.} 
       
   753   but introduced our own
   753   specification for what a correct result---a POSIX value---should be.
   754   specification for what a correct result---a POSIX value---should be.
   754   A strong point in favour of Sulzmann and Lu's algorithm is that it
   755   A strong point in favour of Sulzmann and Lu's algorithm is that it
   755   can be extended in various ways.
   756   can be extended in various ways.
   756 
   757 
   757 *}
   758 *}
   758 
   759 
   759   section {* Extensions and Optimisations *}
   760   section {* Extensions and Optimisations *}
   760 
   761 
   761 text {*
   762 text {*
   762 
   763 
   763   If we are interested in tokenising string, then we need to not just
   764   If we are interested in tokenising strings, then we need to not just
   764   split up the string into tokens, but also ``classify'' the tokens (for
   765   split up the string into tokens, but also ``classify'' the tokens (for
   765   example whether it is a keyword or an identifier). This can be
   766   example whether it is a keyword or an identifier). This can be
   766   done with only minor modifications by introducing \emph{record regular
   767   done with only minor modifications to the algorithm by introducing \emph{record regular
   767   expressions} and \emph{record values} (for example \cite{Sulzmann2014b}):
   768   expressions} and \emph{record values} (for example \cite{Sulzmann2014b}):
   768 
   769 
   769   \begin{center}  
   770   \begin{center}  
   770   @{text "r :="}
   771   @{text "r :="}
   771   @{text "..."} $\mid$
   772   @{text "..."} $\mid$
   775   @{text "(l : v)"}
   776   @{text "(l : v)"}
   776   \end{center}
   777   \end{center}
   777   
   778   
   778   \noindent where @{text l} is a label, say a string, @{text r} a regular
   779   \noindent where @{text l} is a label, say a string, @{text r} a regular
   779   expression and @{text v} a value. All functions can be smoothly extended
   780   expression and @{text v} a value. All functions can be smoothly extended
   780   to these regular expressions and values. For example @{text "(l : r)"} is
   781   to these regular expressions and values. For example \mbox{@{text "(l : r)"}} is
   781   nullable iff @{term r} is, and so on. The purpose of the record regular
   782   nullable iff @{term r} is, and so on. The purpose of the record regular
   782   expression is to mark certain parts of a regular expression and then
   783   expression is to mark certain parts of a regular expression and then
   783   record in the calculated value which parts of the string were matched by
   784   record in the calculated value which parts of the string were matched by
   784   this part. The label can then serve for classifying tokens. Recall the
   785   this part. The label can then serve as a classification of the tokens. For this, recall the
   785   regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for keywords and
   786   regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for keywords and
   786   identifiers from the Introduction. With record regular expressions we can
   787   identifiers from the Introduction. With record regular expressions we can
   787   form @{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"} and then traverse the
   788   form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}} and then traverse the
   788   calculated value and only collect the underlying strings in record values.
   789   calculated value and only collect the underlying strings in record values.
   789   With this we obtain finite sequences of pairs of labels and strings, for 
   790   With this we obtain finite sequences of pairs of labels and strings, for 
   790   example
   791   example
   791 
   792 
   792   \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]
   793   \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]
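
  One possible way of realising this extension, again on the level of the Scala
  sketch from earlier (constructor and function names are ours), is to add one
  constructor to each datatype, one clause to each lexer function, and a simple
  traversal that collects the labelled substrings from the calculated value:

  \begin{verbatim}
  // record regular expressions and record values
  case class REC(l: String, r: Rexp) extends Rexp
  case class Rec(l: String, v: Val) extends Val

  // the additional clauses of the lexer functions, one per function:
  //   nullable(REC(l, r))  = nullable(r)
  //   der(c, REC(l, r))    = der(c, r)          (labels play no role)
  //   mkeps(REC(l, r))     = Rec(l, mkeps(r))
  //   inj(REC(l, r), c, v) = Rec(l, inj(r, c, v))

  // flattening a value back into the string it represents
  def flat(v: Val): String = v match {
    case Void         => ""
    case Chr(c)       => c.toString
    case Sequ(v1, v2) => flat(v1) + flat(v2)
    case Lft(v1)      => flat(v1)
    case Rgt(v1)      => flat(v1)
    case Stars(vs)    => vs.map(flat).mkString
    case Rec(_, v1)   => flat(v1)
  }

  // collecting the label/string pairs recorded in a value
  def env(v: Val): List[(String, String)] = v match {
    case Rec(l, v1)   => (l, flat(v1)) :: env(v1)
    case Sequ(v1, v2) => env(v1) ++ env(v2)
    case Lft(v1)      => env(v1)
    case Rgt(v1)      => env(v1)
    case Stars(vs)    => vs.flatMap(env)
    case _            => Nil
  }
  \end{verbatim}

  For the keyword/identifier expression above one would then lex with
  \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}} and apply @{text env} to the
  resulting value in order to obtain the sequence of pairs of labels and strings.
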
   797   Derivatives as calculated by \Brz's method are usually more complex
   798   Derivatives as calculated by \Brz's method are usually more complex
   798   regular expressions than the initial one; the result is that the matching
   799   regular expressions than the initial one; the result is that the matching
   799   and lexing algorithms are often abysmally slow. However, various
   800   and lexing algorithms are often abysmally slow. However, various
   800   optimisations are possible, such as the simplifications of @{term "ALT
   801   optimisations are possible, such as the simplifications of @{term "ALT
   801   ZERO r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r
   802   ZERO r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r
   802   ONE"} to @{term r}. One of the advantages of having a simple specification
   803   ONE"} to @{term r}. These simplifications can speed up the algorithms considerably,
   803   and correctness proof is that the latter can be refined to allow for such
   804   as noted in \cite{Sulzmann2014}.
   804   optimisations and simple correctness proof.
   805   One of the advantages of having a simple specification
       
   806   and correctness proof is that the latter can be refined to prove the 
       
   807   correctness of such simplification steps.
   805 
   808 
   806   While the simplification of regular expressions according to 
   809   While the simplification of regular expressions according to 
   807   rules like
   810   rules like
   808 
   811 
   809   \begin{center}
   812   \begin{center}
   816   \end{center}
   819   \end{center}
   817 
   820 
   818   \noindent is well understood, there is an obstacle with the POSIX value
   821   \noindent is well understood, there is an obstacle with the POSIX value
   819   calculation algorithm by Sulzmann and Lu: if we build a derivative regular
   822   calculation algorithm by Sulzmann and Lu: if we build a derivative regular
   820   expression and then simplify it, we will calculate a POSIX value for this
   823   expression and then simplify it, we will calculate a POSIX value for this
   821   simplified regular expression, \emph{not} for the original (unsimplified)
   824   simplified derivative regular expression, \emph{not} for the original (unsimplified)
   822   derivative regular expression. Sulzmann and Lu overcome this obstacle by
   825   derivative regular expression. Sulzmann and Lu overcome this obstacle by
   823   not just calculating a simplified regular expression, but also calculating
   826   not just calculating a simplified regular expression, but also calculating
   824   a \emph{rectification function} that ``repairs'' the incorrect value.
   827   a \emph{rectification function} that ``repairs'' the incorrect value.
   825   
   828   
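  One way of implementing such a combination of simplification and rectification,
  again as a sketch on top of the earlier Scala definitions rather than Sulzmann
  and Lu's own code, is to let the simplification function return the simplified
  regular expression together with a function that maps values of the simplified
  expression back to values of the original one:

  \begin{verbatim}
  // simplification with rectification (only ALT and SEQ are simplified)
  def simp(r: Rexp): (Rexp, Val => Val) = r match {
    case ALT(r1, r2) =>
      val (s1, f1) = simp(r1)
      val (s2, f2) = simp(r2)
      (s1, s2) match {
        case (ZERO, _) => (s2, (v: Val) => Rgt(f2(v)))
        case (_, ZERO) => (s1, (v: Val) => Lft(f1(v)))
        case _ => (ALT(s1, s2), (v: Val) => v match {
          case Lft(v1) => Lft(f1(v1))
          case Rgt(v2) => Rgt(f2(v2))
        })
      }
    case SEQ(r1, r2) =>
      val (s1, f1) = simp(r1)
      val (s2, f2) = simp(r2)
      (s1, s2) match {
        case (ONE, _) => (s2, (v: Val) => Sequ(f1(Void), f2(v)))
        case (_, ONE) => (s1, (v: Val) => Sequ(f1(v), f2(Void)))
        case _ => (SEQ(s1, s2), (v: Val) => v match {
          case Sequ(v1, v2) => Sequ(f1(v1), f2(v2))
        })
      }
    case _ => (r, (v: Val) => v)
  }

  // a lexer that simplifies each derivative and rectifies the value
  // before injecting the character back
  def lexSimp(r: Rexp, s: List[Char]): Option[Val] = s match {
    case Nil => if (nullable(r)) Some(mkeps(r)) else None
    case c :: cs =>
      val (rs, rect) = simp(der(c, r))
      lexSimp(rs, cs).map(v => inj(r, c, rect(v)))
  }
  \end{verbatim}
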
   826   The rectification functions can be (slightly clumsily) implemented in
   829   The rectification functions can be (slightly clumsily) implemented in
  1122   by Sulzmann and Lu \cite{Sulzmann2014}, but we believe it contains
  1125   by Sulzmann and Lu \cite{Sulzmann2014}, but we believe it contains
  1123   unfillable gaps. In the online version of \cite{Sulzmann2014}, the authors
  1126   unfillable gaps. In the online version of \cite{Sulzmann2014}, the authors
  1124   already acknowledge some small problems, but our experience suggests
  1127   already acknowledge some small problems, but our experience suggests
  1125   there are more serious problems. 
  1128   there are more serious problems. 
  1126   
  1129   
  1127   Closesly related to our work is an automata-based lexer formalised by
  1130   Closely related to our work is an automata-based lexer formalised by
  1128   Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest
  1131   Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest
  1129   initial substrings, but Nipkow's algorithm is not completely
  1132   initial substrings, but Nipkow's algorithm is not completely
  1130   computational. The algorithm by Sulzmann and Lu, in contrast, can be
  1133   computational. The algorithm by Sulzmann and Lu, in contrast, can be
  1131   implemented easily in functional languages. A bespoke lexer for the
  1134   implemented easily in functional languages. A bespoke lexer for the
  1132   Imp-Language is formalised in Coq as part of the Software Foundations book
  1135   Imp-Language is formalised in Coq as part of the Software Foundations book