thys/Paper/Paper.thy
changeset 137 4178b7e71809
parent 136 fa0d8aa5d7de
child 138 a87b8a09ffe8
--- a/thys/Paper/Paper.thy	Tue Mar 08 11:09:12 2016 +0000
+++ b/thys/Paper/Paper.thy	Tue Mar 08 11:24:56 2016 +0000
@@ -554,7 +554,7 @@
 
   Having defined the @{const mkeps} and @{text inj} function we can extend
   \Brz's matcher so that a [lexical] value is constructed (assuming the
-  regular expression matches the string). The clauses of the lexer are
+  regular expression matches the string). The clauses of the Sulzmann and Lu lexer are
 
   \begin{center}
   \begin{tabular}{lcl}
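
For readers without the Isabelle sources at hand, the clauses referred to in this hunk can be rendered as a small Haskell program. This is our own sketch: the function names nullable, der, mkeps, injval and lexer follow the paper, but the constructor spellings Chr, Lft and Rgt (for the paper's Char, Left and Right values) are ours and the code is only illustrative, not the formalised Isabelle definition.

    -- regular expressions and lexical values, as in the paper
    data Rexp = ZERO | ONE | CHAR Char
              | ALT Rexp Rexp | SEQ Rexp Rexp | STAR Rexp
              deriving Show

    data Val  = Empty | Chr Char | Seq Val Val
              | Lft Val | Rgt Val | Stars [Val]
              deriving Show

    nullable :: Rexp -> Bool
    nullable ZERO        = False
    nullable ONE         = True
    nullable (CHAR _)    = False
    nullable (ALT r1 r2) = nullable r1 || nullable r2
    nullable (SEQ r1 r2) = nullable r1 && nullable r2
    nullable (STAR _)    = True

    -- Brzozowski derivative of a regular expression w.r.t. a character
    der :: Char -> Rexp -> Rexp
    der _ ZERO        = ZERO
    der _ ONE         = ZERO
    der c (CHAR d)    = if c == d then ONE else ZERO
    der c (ALT r1 r2) = ALT (der c r1) (der c r2)
    der c (SEQ r1 r2)
      | nullable r1   = ALT (SEQ (der c r1) r2) (der c r2)
      | otherwise     = SEQ (der c r1) r2
    der c (STAR r)    = SEQ (der c r) (STAR r)

    -- mkeps: a value witnessing that a nullable r matches the empty string
    mkeps :: Rexp -> Val
    mkeps ONE         = Empty
    mkeps (ALT r1 r2) = if nullable r1 then Lft (mkeps r1) else Rgt (mkeps r2)
    mkeps (SEQ r1 r2) = Seq (mkeps r1) (mkeps r2)
    mkeps (STAR _)    = Stars []
    mkeps _           = error "mkeps: regular expression not nullable"

    -- injval: turn a value for (der c r) into a value for r
    injval :: Rexp -> Char -> Val -> Val
    injval (CHAR d)    _ Empty              = Chr d
    injval (ALT r1 _)  c (Lft v1)           = Lft (injval r1 c v1)
    injval (ALT _  r2) c (Rgt v2)           = Rgt (injval r2 c v2)
    injval (SEQ r1 _)  c (Seq v1 v2)        = Seq (injval r1 c v1) v2
    injval (SEQ r1 _)  c (Lft (Seq v1 v2))  = Seq (injval r1 c v1) v2
    injval (SEQ r1 r2) c (Rgt v2)           = Seq (mkeps r1) (injval r2 c v2)
    injval (STAR r)    c (Seq v (Stars vs)) = Stars (injval r c v : vs)
    injval _           _ _                  = error "injval: ill-formed value"

    -- the three lexer clauses described above
    lexer :: Rexp -> String -> Maybe Val
    lexer r []     = if nullable r then Just (mkeps r) else Nothing
    lexer r (c:cs) = case lexer (der c r) cs of
                       Nothing -> Nothing
                       Just v  -> Just (injval r c v)
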
@@ -638,7 +638,7 @@
   "P\<star>"}-rule. Also there we want that @{term "s\<^sub>1"} is the longest initial
   split of @{term "s\<^sub>1 @ s\<^sub>2"} and furthermore the corresponding value
   @{term v} cannot be flattened to the empty string. In effect, we require
-  that in each ``iteration'' of the star, some non-empty substring need to
+  that in each ``iteration'' of the star, some non-empty substring needs to
  be ``chipped'' away; only in the case of the empty string do we accept @{term
   "Stars []"} as the POSIX value.
 
@@ -709,7 +709,7 @@
   \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> (c # s\<^sub>1) @ s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\]
 
   \noindent We can use the induction hypothesis for @{text "r\<^sub>1"} to obtain
-  @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}. This allows us to infer
+  @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}. This all put together allows us to infer
   @{term "((c # s\<^sub>1) @ s\<^sub>2) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (injval r\<^sub>1 c v\<^sub>1) v\<^sub>2"}. The case @{text "(c)"}
  is similar.
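
As a concrete instance of this sequence case, the Haskell sketch given after the lexer clauses above produces, for example (the example is ours):

    -- here (c # s1) = "a" is matched by r1 = CHAR 'a' and s2 = "b" by r2 = CHAR 'b'
    ghci> lexer (SEQ (CHAR 'a') (CHAR 'b')) "ab"
    Just (Seq (Chr 'a') (Chr 'b'))
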
 
@@ -722,12 +722,12 @@
 
   \noindent By Lem.~\ref{lemmkeps} we know @{term "[] \<in> r\<^sub>1 \<rightarrow> (mkeps r\<^sub>1)"}
   holds. Putting this all together, we can conclude with @{term "(c #
-  s) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (mkeps r\<^sub>1) (injval r\<^sub>2 c v\<^sub>1)"}.
+  s) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (mkeps r\<^sub>1) (injval r\<^sub>2 c v\<^sub>1)"}, as required.
 
   Finally suppose @{term "r = STAR r\<^sub>1"}. This case is very similar to the
   sequence case, except that we need to ensure that @{term "flat (injval r\<^sub>1
   c v\<^sub>1) \<noteq> []"}. This follows from @{term "(c # s\<^sub>1)
-  \<in> r' \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}  (which in turn follows from @{term "s\<^sub>1 \<in> der c
+  \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}  (which in turn follows from @{term "s\<^sub>1 \<in> der c
   r\<^sub>1 \<rightarrow> v\<^sub>1"} and the induction hypothesis).\qed
   \end{proof}
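
Both of the cases just discussed can be illustrated with the Haskell sketch from above (the examples are ours): in the first, the left component of the resulting Seq value is produced by mkeps; in the star case, every iteration chips off a non-empty substring.

    -- s1 = [], so the first component of the value comes from mkeps (STAR (CHAR 'a'))
    ghci> lexer (SEQ (STAR (CHAR 'a')) (CHAR 'b')) "b"
    Just (Seq (Stars []) (Chr 'b'))

    -- each iteration of the star chips off the non-empty substring "a"
    ghci> lexer (STAR (CHAR 'a')) "aa"
    Just (Stars [Chr 'a',Chr 'a'])
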
 
@@ -749,7 +749,8 @@
   \end{proof}
 
   \noindent This concludes our correctness proof. Note that we have not
-  changed the algorithm by Sulzmann and Lu, but introduced our own
+  changed the algorithm by Sulzmann and Lu,\footnote{Any deviation we introduced is harmless.} 
+  but introduced our own
   specification for what a correct result---a POSIX value---should be.
   A strong point in favour of Sulzmann and Lu's algorithm is that it
   can be extended in various ways.
@@ -760,10 +761,10 @@
 
 text {*
 
-  If we are interested in tokenising string, then we need to not just
+  If we are interested in tokenising strings, then we need to not just
   split up the string into tokens, but also ``classify'' the tokens (for
  example whether they are keywords or identifiers). This can be
-  done with only minor modifications by introducing \emph{record regular
+  done with only minor modifications to the algorithm by introducing \emph{record regular
   expressions} and \emph{record values} (for example \cite{Sulzmann2014b}):
 
   \begin{center}  
@@ -777,14 +778,14 @@
   
   \noindent where @{text l} is a label, say a string, @{text r} a regular
   expression and @{text v} a value. All functions can be smoothly extended
-  to these regular expressions and values. For example @{text "(l : r)"} is
+  to these regular expressions and values. For example \mbox{@{text "(l : r)"}} is
   nullable iff @{term r} is, and so on. The purpose of the record regular
   expression is to mark certain parts of a regular expression and then
   record in the calculated value which parts of the string were matched by
-  this part. The label can then serve for classifying tokens. Recall the
+  this part. The label can then serve to classify the tokens. For this, recall the
   regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for keywords and
  identifiers from the Introduction. With record regular expressions we can
-  form @{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"} and then traverse the
+  form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}} and then traverse the
   calculated value and only collect the underlying strings in record values.
   With this we obtain finite sequences of pairs of labels and strings, for 
   example
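
A rough Haskell sketch of this extension, in the style of the earlier one, may be useful; it is our own rendering of the idea just described, not Sulzmann and Lu's exact definitions, and the constructor spellings REC and Rec as well as the function name env are ours.

    -- assumed extension of the earlier datatypes, one constructor each:
    --   data Rexp = ... | REC String Rexp     -- record regular expression  (l : r)
    --   data Val  = ... | Rec String Val      -- record value               (l : v)
    -- the existing functions extend pointwise, e.g. as stated above
    --   nullable (REC _ r) = nullable r

    -- flatten a value back into the string it was calculated for
    flat :: Val -> String
    flat Empty       = ""
    flat (Chr c)     = [c]
    flat (Lft v)     = flat v
    flat (Rgt v)     = flat v
    flat (Seq v1 v2) = flat v1 ++ flat v2
    flat (Stars vs)  = concatMap flat vs
    flat (Rec _ v)   = flat v

    -- traverse a value, keeping only the label/string pairs of record values
    env :: Val -> [(String, String)]
    env (Rec l v)   = (l, flat v) : env v
    env (Seq v1 v2) = env v1 ++ env v2
    env (Lft v)     = env v
    env (Rgt v)     = env v
    env (Stars vs)  = concatMap env vs
    env _           = []

Applying env to the value calculated for a string of keywords and identifiers yields exactly such a finite sequence of label/string pairs.
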
@@ -799,9 +800,11 @@
   and lexing algorithms are often abysmally slow. However, various
   optimisations are possible, such as the simplifications of @{term "ALT
   ZERO r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r
-  ONE"} to @{term r}. One of the advantages of having a simple specification
-  and correctness proof is that the latter can be refined to allow for such
-  optimisations and simple correctness proof.
+  ONE"} to @{term r}. These simplifications can speed up the algorithms considerably,
+  as noted in \cite{Sulzmann2014}.
+  One of the advantages of having a simple specification
+  and correctness proof is that the latter can be refined to prove the 
+  correctness of such simplification steps.
 
   While the simplification of regular expressions according to 
   rules like
@@ -818,7 +821,7 @@
   \noindent is well understood, there is an obstacle with the POSIX value
   calculation algorithm by Sulzmann and Lu: if we build a derivative regular
   expression and then simplify it, we will calculate a POSIX value for this
-  simplified regular expression, \emph{not} for the original (unsimplified)
+  simplified derivative regular expression, \emph{not} for the original (unsimplified)
   derivative regular expression. Sulzmann and Lu overcome this obstacle by
   not just calculating a simplified regular expression, but also calculating
   a \emph{rectification function} that ``repairs'' the incorrect value.
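
To make the rectification idea concrete, here is a rough Haskell sketch in the style of the earlier one. It is our own illustration of the idea described in this hunk, not Sulzmann and Lu's actual definitions; the names simp and lexerSimp are ours. Simplification returns the simplified regular expression together with a function that maps a value for the simplified expression back to a value for the unsimplified one.

    -- simplify one level of ALT/SEQ and return a rectification function that
    -- repairs a value of the simplified regex into a value of the original
    simp :: Rexp -> (Rexp, Val -> Val)
    simp (ALT r1 r2) =
      case (simp r1, simp r2) of
        ((ZERO, _ ), (r2', f2)) -> (r2', Rgt . f2)          -- ALT ZERO r  ~>  r
        ((r1', f1), (ZERO, _ )) -> (r1', Lft . f1)          -- ALT r ZERO  ~>  r
        ((r1', f1), (r2', f2))  -> (ALT r1' r2', rect f1 f2)
      where rect f1 _  (Lft v) = Lft (f1 v)
            rect _  f2 (Rgt v) = Rgt (f2 v)
            rect _  _  v       = v
    simp (SEQ r1 r2) =
      case (simp r1, simp r2) of
        ((ONE, f1), (r2', f2)) -> (r2', \v -> Seq (f1 Empty) (f2 v))  -- SEQ ONE r  ~>  r
        ((r1', f1), (ONE, f2)) -> (r1', \v -> Seq (f1 v) (f2 Empty))  -- SEQ r ONE  ~>  r
        ((r1', f1), (r2', f2)) -> (SEQ r1' r2', rect f1 f2)
      where rect f1 f2 (Seq v1 v2) = Seq (f1 v1) (f2 v2)
            rect _  _  v           = v
    simp r = (r, id)

    -- a lexer that simplifies each derivative and repairs the value before
    -- injecting the character back
    lexerSimp :: Rexp -> String -> Maybe Val
    lexerSimp r []     = if nullable r then Just (mkeps r) else Nothing
    lexerSimp r (c:cs) = let (rs, f) = simp (der c r) in
                         case lexerSimp rs cs of
                           Nothing -> Nothing
                           Just v  -> Just (injval r c (f v))
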
@@ -1124,7 +1127,7 @@
   already acknowledge some small problems, but our experience suggests
   there are more serious problems. 
   
-  Closesly related to our work is an automata-based lexer formalised by
+  Closely related to our work is an automata-based lexer formalised by
   Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest
   initial substrings, but Nipkow's algorithm is not completely
   computational. The algorithm by Sulzmann and Lu, in contrast, can be