--- a/thys/Paper/Paper.thy Tue Mar 08 11:09:12 2016 +0000
+++ b/thys/Paper/Paper.thy Tue Mar 08 11:24:56 2016 +0000
@@ -554,7 +554,7 @@
Having defined the @{const mkeps} and @{text inj} function we can extend
\Brz's matcher so that a lexical value is constructed (assuming the
- regular expression matches the string). The clauses of the lexer are
+ regular expression matches the string). The clauses of the Sulzmann and Lu lexer are
\begin{center}
\begin{tabular}{lcl}
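
As a concrete, executable illustration, the lexer sketched by these clauses can be rendered in Haskell roughly as follows (a sketch, not the paper's Isabelle text: the datatype and constructor names, the Maybe-based error handling and the deliberate partiality of inj are our own choices):

    -- regular expressions and lexical values, following the paper's grammars
    data Rexp = ZERO | ONE | CHAR Char | ALT Rexp Rexp
              | SEQ Rexp Rexp | STAR Rexp

    data Val  = Empty | Chr Char | Lft Val | Rgt Val
              | Seq Val Val | Stars [Val]
      deriving Show

    nullable :: Rexp -> Bool
    nullable ZERO        = False
    nullable ONE         = True
    nullable (CHAR _)    = False
    nullable (ALT r1 r2) = nullable r1 || nullable r2
    nullable (SEQ r1 r2) = nullable r1 && nullable r2
    nullable (STAR _)    = True

    -- Brzozowski derivative of a regular expression w.r.t. a character
    der :: Char -> Rexp -> Rexp
    der _ ZERO        = ZERO
    der _ ONE         = ZERO
    der c (CHAR d)    = if c == d then ONE else ZERO
    der c (ALT r1 r2) = ALT (der c r1) (der c r2)
    der c (SEQ r1 r2)
      | nullable r1   = ALT (SEQ (der c r1) r2) (der c r2)
      | otherwise     = SEQ (der c r1) r2
    der c (STAR r)    = SEQ (der c r) (STAR r)

    -- the value with which a nullable regex matches the empty string
    mkeps :: Rexp -> Val
    mkeps ONE         = Empty
    mkeps (ALT r1 r2) = if nullable r1 then Lft (mkeps r1)
                                       else Rgt (mkeps r2)
    mkeps (SEQ r1 r2) = Seq (mkeps r1) (mkeps r2)
    mkeps (STAR _)    = Stars []
    mkeps _           = error "mkeps: regex not nullable"

    -- inject character c back into a value for (der c r); partial by
    -- design: only value shapes that can actually arise are covered
    inj :: Rexp -> Char -> Val -> Val
    inj (CHAR d)    _ Empty              = Chr d
    inj (ALT r1 _)  c (Lft v)            = Lft (inj r1 c v)
    inj (ALT _ r2)  c (Rgt v)            = Rgt (inj r2 c v)
    inj (SEQ r1 _)  c (Seq v1 v2)        = Seq (inj r1 c v1) v2
    inj (SEQ r1 _)  c (Lft (Seq v1 v2))  = Seq (inj r1 c v1) v2
    inj (SEQ r1 r2) c (Rgt v2)           = Seq (mkeps r1) (inj r2 c v2)
    inj (STAR r)    c (Seq v (Stars vs)) = Stars (inj r c v : vs)

    -- the lexer: peel off characters via derivatives, then inject them back
    lexer :: Rexp -> String -> Maybe Val
    lexer r []     = if nullable r then Just (mkeps r) else Nothing
    lexer r (c:cs) = fmap (inj r c) (lexer (der c r) cs)
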
@@ -638,7 +638,7 @@
"P\<star>"}-rule. Also there we want that @{term "s\<^sub>1"} is the longest initial
split of @{term "s\<^sub>1 @ s\<^sub>2"} and furthermore the corresponding value
@{term v} cannot be flattened to the empty string. In effect, we require
- that in each ``iteration'' of the star, some non-empty substring need to
+ that in each ``iteration'' of the star, some non-empty substring needs to
be ``chipped'' away; only in the case of the empty string do we accept @{term
"Stars []"} as the POSIX value.
@@ -709,7 +709,7 @@
\[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> (c # s\<^sub>1) @ s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\]
\noindent We can use the induction hypothesis for @{text "r\<^sub>1"} to obtain
- @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}. This allows us to infer
+ @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}. This all put together allows us to infer
@{term "((c # s\<^sub>1) @ s\<^sub>2) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (injval r\<^sub>1 c v\<^sub>1) v\<^sub>2"}. The case @{text "(c)"}
is similar.
@@ -722,12 +722,12 @@
\noindent By Lem.~\ref{lemmkeps} we know @{term "[] \<in> r\<^sub>1 \<rightarrow> (mkeps r\<^sub>1)"}
holds. Putting this all together, we can conclude with @{term "(c #
- s) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (mkeps r\<^sub>1) (injval r\<^sub>2 c v\<^sub>1)"}.
+ s) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (mkeps r\<^sub>1) (injval r\<^sub>2 c v\<^sub>1)"}, as required.
Finally suppose @{term "r = STAR r\<^sub>1"}. This case is very similar to the
sequence case, except that we need to ensure that @{term "flat (injval r\<^sub>1
c v\<^sub>1) \<noteq> []"}. This follows from @{term "(c # s\<^sub>1)
- \<in> r' \<rightarrow> injval r\<^sub>1 c v\<^sub>1"} (which in turn follows from @{term "s\<^sub>1 \<in> der c
+ \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"} (which in turn follows from @{term "s\<^sub>1 \<in> der c
r\<^sub>1 \<rightarrow> v\<^sub>1"} and the induction hypothesis).\qed
\end{proof}
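
A small worked trace may help to see the injections of the sequence case at work (using the Haskell sketch above; the regular expression and input are our own example):

    -- for r = SEQ (CHAR 'a') (CHAR 'b') and the input "ab":
    --   der 'a' r                     =  SEQ ONE (CHAR 'b')
    --   der 'b' (SEQ ONE (CHAR 'b'))  =  ALT (SEQ ZERO (CHAR 'b')) ONE
    --   mkeps (ALT (SEQ ZERO (CHAR 'b')) ONE)     =  Rgt Empty
    --   inj (SEQ ONE (CHAR 'b')) 'b' (Rgt Empty)  =  Seq Empty (Chr 'b')
    --   inj r 'a' (Seq Empty (Chr 'b'))           =  Seq (Chr 'a') (Chr 'b')
    -- hence lexer r "ab" = Just (Seq (Chr 'a') (Chr 'b'))

The second injection uses exactly the mkeps-clause discussed above: the left component of the sequence has been ``used up'' by the derivative, so its empty match is reconstructed with @{const mkeps}.
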
@@ -749,7 +749,8 @@
\end{proof}
\noindent This concludes our correctness proof. Note that we have not
- changed the algorithm by Sulzmann and Lu, but introduced our own
+ changed the algorithm by Sulzmann and Lu,\footnote{Any deviation we introduced is harmless.}
+ but introduced our own
specification for what a correct result---a POSIX value---should be.
A strong point in favour of Sulzmann and Lu's algorithm is that it
can be extended in various ways.
@@ -760,10 +761,10 @@
text {*
- If we are interested in tokenising string, then we need to not just
+ If we are interested in tokenising strings, then we need to not just
split up the string into tokens, but also ``classify'' the tokens (for
example whether they are keywords or identifiers). This can be
- done with only minor modifications by introducing \emph{record regular
+ done with only minor modifications to the algorithm by introducing \emph{record regular
expressions} and \emph{record values} (for example \cite{Sulzmann2014b}):
\begin{center}
@@ -777,14 +778,14 @@
\noindent where @{text l} is a label, say a string, @{text r} a regular
expression and @{text v} a value. All functions can be smoothly extended
- to these regular expressions and values. For example @{text "(l : r)"} is
+ to these regular expressions and values. For example \mbox{@{text "(l : r)"}} is
nullable iff @{term r} is, and so on. The purpose of the record regular
expression is to mark certain parts of a regular expression and then
record in the calculated value which parts of the string were matched by
- this part. The label can then serve for classifying tokens. Recall the
+ this part. The label can then be used to classify the tokens. For this recall the
regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for keywords and
identifiers from the Introduction. With record regular expressions we can
- form @{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"} and then traverse the
+ form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}} and then traverse the
calculated value and only collect the underlying strings in record values.
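
A sketch of how this traversal can be implemented (the RECD/Rec constructor names and the functions flat and env are our rendering; the paper's Isabelle names may differ):

    -- record regular expressions: the earlier datatypes with one
    -- extra constructor each
    data Rexp = ZERO | ONE | CHAR Char | ALT Rexp Rexp
              | SEQ Rexp Rexp | STAR Rexp
              | RECD String Rexp               -- (l : r)

    data Val  = Empty | Chr Char | Lft Val | Rgt Val
              | Seq Val Val | Stars [Val]
              | Rec String Val                 -- labelled sub-value

    -- the underlying string of a value
    flat :: Val -> String
    flat Empty       = ""
    flat (Chr c)     = [c]
    flat (Lft v)     = flat v
    flat (Rgt v)     = flat v
    flat (Seq v1 v2) = flat v1 ++ flat v2
    flat (Stars vs)  = concatMap flat vs
    flat (Rec _ v)   = flat v

    -- traverse a value and collect the labelled substrings
    env :: Val -> [(String, String)]
    env (Rec l v)   = (l, flat v) : env v
    env (Seq v1 v2) = env v1 ++ env v2
    env (Lft v)     = env v
    env (Rgt v)     = env v
    env (Stars vs)  = concatMap env vs
    env _           = []
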
With this we obtain finite sequences of pairs of labels and strings, for
example
@@ -799,9 +800,11 @@
and lexing algorithms are often abysmally slow. However, various
optimisations are possible, such as the simplifications of @{term "ALT
ZERO r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r
- ONE"} to @{term r}. One of the advantages of having a simple specification
- and correctness proof is that the latter can be refined to allow for such
- optimisations and simple correctness proof.
+ ONE"} to @{term r}. These simplifications can speed up the algorithms considerably,
+ as noted in \cite{Sulzmann2014}.
+ One of the advantages of having a simple specification
+ and correctness proof is that the latter can be refined to prove the
+ correctness of such simplification steps.
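
For instance, the four simplifications just mentioned might be implemented bottom-up as follows (a sketch over the Rexp type from the first code sketch; it rewrites only the regular expression and leaves aside the problem with values discussed shortly):

    -- ALT ZERO r, ALT r ZERO, SEQ ONE r and SEQ r ONE collapse to r
    simp :: Rexp -> Rexp
    simp (ALT r1 r2) = case (simp r1, simp r2) of
                         (ZERO, r2') -> r2'
                         (r1', ZERO) -> r1'
                         (r1', r2')  -> ALT r1' r2'
    simp (SEQ r1 r2) = case (simp r1, simp r2) of
                         (ONE, r2') -> r2'
                         (r1', ONE) -> r1'
                         (r1', r2') -> SEQ r1' r2'
    simp r = r
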
While the simplification of regular expressions according to
rules like
@@ -818,7 +821,7 @@
\noindent is well understood, there is an obstacle with the POSIX value
calculation algorithm by Sulzmann and Lu: if we build a derivative regular
expression and then simplify it, we will calculate a POSIX value for this
- simplified regular expression, \emph{not} for the original (unsimplified)
+ simplified derivative regular expression, \emph{not} for the original (unsimplified)
derivative regular expression. Sulzmann and Lu overcome this obstacle by
not just calculating a simplified regular expression, but also calculating
a \emph{rectification function} that ``repairs'' the incorrect value.
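
Their approach might be set out schematically as follows (a sketch of the idea only, reusing the datatypes of the first code sketch; Sulzmann and Lu's actual rectification functions differ in detail):

    -- return the simplified regex together with a function that
    -- repairs a value for the simplified regex into a value for
    -- the original, unsimplified one
    simpRect :: Rexp -> (Rexp, Val -> Val)
    simpRect (ALT r1 r2) =
      let (r1', f1) = simpRect r1
          (r2', f2) = simpRect r2
      in case (r1', r2') of
           (ZERO, _) -> (r2', \v -> Rgt (f2 v))  -- match must come from r2
           (_, ZERO) -> (r1', \v -> Lft (f1 v))  -- match must come from r1
           _         -> (ALT r1' r2',
                         \v -> case v of
                                 Lft w -> Lft (f1 w)
                                 Rgt w -> Rgt (f2 w))
    simpRect (SEQ r1 r2) =
      let (r1', f1) = simpRect r1
          (r2', f2) = simpRect r2
      in case (r1', r2') of
           (ONE, _) -> (r2', \v -> Seq (f1 Empty) (f2 v))  -- r1 matched ""
           (_, ONE) -> (r1', \v -> Seq (f1 v) (f2 Empty))  -- r2 matched ""
           _        -> (SEQ r1' r2',
                        \v -> case v of  -- other shapes cannot arise
                                Seq v1 v2 -> Seq (f1 v1) (f2 v2))
    simpRect r = (r, id)

In the lexer one would then simplify each derivative, lex the remaining string against the simplified regular expression, and pass the resulting value through the rectification function before injecting the next character back.
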
@@ -1124,7 +1127,7 @@
already acknowledge some small problems, but our experience suggests
there are more serious ones.
- Closesly related to our work is an automata-based lexer formalised by
+ Closely related to our work is an automata-based lexer formalised by
Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest
initial substrings, but Nipkow's algorithm is not completely
computational. The algorithm by Sulzmann and Lu, in contrast, can be