--- a/thys/Paper/Paper.thy Tue Mar 08 00:45:09 2016 +0000
+++ b/thys/Paper/Paper.thy Tue Mar 08 05:03:47 2016 +0000
@@ -155,32 +155,29 @@
the priority rule a keyword token, not an identifier token---even if @{text
"r\<^bsub>id\<^esub>"} matches also.\bigskip
-\noindent {\bf Contributions:} (NOT DONE YET) We have implemented in
-Isabelle/HOL the derivative-based regular expression matching algorithm as
-described by Sulzmann and Lu \cite{Sulzmann2014}. We have proved the
-correctness of this algorithm according to our specification of what a POSIX
-value is. Sulzmann and Lu sketch in \cite{Sulzmann2014} an informal
-correctness proof: but to us it contains unfillable gaps.
-
-informal correctness proof given in \cite{Sulzmann2014} is in final
-form\footnote{} and to us contains unfillable gaps.
-
+\noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
+derivative-based regular expression matching algorithm as described by
+Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
+algorithm according to our specification of what a POSIX value is. Sulzmann
+and Lu sketch in \cite{Sulzmann2014} an informal correctness proof, but to
+us it contains unfillable gaps.\footnote{An extended version of
+\cite{Sulzmann2014} is available at the website of its first author; this
+extended version already includes remarks in its appendix acknowledging that
+the informal proof contains gaps; possible fixes are not fully worked out.}
Our specification of a POSIX value consists of a simple inductive definition
that given a string and a regular expression uniquely determines this value.
Derivatives as calculated by Brzozowski's method are usually more complex
regular expressions than the initial one; various optimisations are
-possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r
-ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the
-advantages of having a simple specification and correctness proof is that
-the latter can be refined to allow for such optimisations and simple
-correctness proof.
+possible. We also prove the correctness of the algorithm when the
+simplifications of @{term "ALT ZERO r"}, @{term "ALT r ZERO"},
+@{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r} are applied.
-An extended version of \cite{Sulzmann2014} is available at the website of
-its first author; this includes some ``proofs'', claimed in
-\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
-final form, we make no comment thereon, preferring to give general reasons
-for our belief that the approach of \cite{Sulzmann2014} is problematic
-rather than to discuss details of unpublished work.
+%An extended version of \cite{Sulzmann2014} is available at the website of
+%its first author; this includes some ``proofs'', claimed in
+%\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
+%final form, we make no comment thereon, preferring to give general reasons
+%for our belief that the approach of \cite{Sulzmann2014} is problematic
+%rather than to discuss details of unpublished work.
*}
@@ -271,6 +268,11 @@
@{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\
+ \end{tabular}
+ \end{center}
+
+ \begin{center}
+ \begin{tabular}{lcl}
@{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
@{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
@{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
@@ -526,10 +528,13 @@
injval.simps(1)[of "c" "c"]} $\dn$ @{thm (rhs) injval.simps(1)[of "c"]},
but our deviation is harmless.}
- The idea of @{term inj} to ``inject back'' a character into a value can
- be made precise by the first part of the following lemma; the second
- part shows that the underlying string of an @{const mkeps}-value is
- always the empty string.
+ The idea of the @{term inj}-function to ``inject'' a character, say
+ @{term c}, into a value can be made precise by the first part of the
+ following lemma, which shows that the underlying string of an injected
+ value is the underlying string of the original value with the character
+ @{term c} prepended; the second part shows that the underlying string of
+ an @{const mkeps}-value is always the empty string (provided the regular
+ expression is nullable, since otherwise @{text mkeps} might not be
+ defined).
\begin{lemma}\mbox{}\smallskip\\\label{Prf_injval_flat}
\begin{tabular}{ll}
@@ -539,9 +544,9 @@
\end{lemma}
\begin{proof}
- Both properties are by routine inductions: the first one, for example,
- by an induction over the definition of @{term derivatives}; the second by
- induction on @{term r}. There are no interesting cases.\qed
+ Both properties are by routine inductions: the first one can, for example,
+ be proved by an induction over the definition of @{term derivatives}; the second by
+ an induction on @{term r}. There are no interesting cases.\qed
\end{proof}
Having defined the @{const mkeps} and @{text inj} function we can extend
@@ -558,7 +563,7 @@
\end{center}
\noindent If the regular expression does not match the string, @{const None} is
- returned, indicating an error is raised. If the regular expression does
+ returned, indicating that an error occurred. If the regular expression \emph{does}
match the string, then @{const Some} value is returned. One important
virtue of this algorithm is that it can be implemented with ease in a
functional programming language and also in Isabelle/HOL. In the remaining
@@ -603,46 +608,56 @@
\noindent We claim that this relation captures the idea behind the two
informal POSIX rules shown in the Introduction: Consider for example the
- rules @{text "P+L"} and @{text "P+R"} where the POSIX value for an
- alternative regular expression is specified---it is always a @{text
- "Left"}-value, \emph{except} when the string to be matched is not in the
- language of @{term "r\<^sub>1"}; only then it is a @{text Right}-value (see the
- side-condition in @{text "P+R"}). Interesting is also the rule for
- sequence regular expressions (@{text "PS"}). The first two premises state
- that @{term "v\<^sub>1"} and @{term "v\<^sub>2"} are the POSIX values for
- @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"}
+ rules @{text "P+L"} and @{text "P+R"} where the POSIX value for a string
+ and an alternative regular expression, that is @{term "(s, ALT r\<^sub>1 r\<^sub>2)"},
+ is specified---it is always a @{text "Left"}-value, \emph{except} when the
+ string to be matched is not in the language of @{term "r\<^sub>1"}; only then it
+ is a @{text Right}-value (see the side-condition in @{text "P+R"}).
+ The rule for sequence regular expressions (@{text "PS"}) is also
+ interesting. The first two premises state that @{term "v\<^sub>1"} and @{term "v\<^sub>2"}
+ are the POSIX values for @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"}
respectively. Consider now the third premise and note that the POSIX value
of this rule should match the string @{term "s\<^sub>1 @ s\<^sub>2"}. According to the
longest match rule, we want that the @{term "s\<^sub>1"} is the longest initial
split of @{term "s\<^sub>1 @ s\<^sub>2"} such that @{term "s\<^sub>2"} is still recognised
by @{term "r\<^sub>2"}. Let us assume, contrary to the third premise, that there
- \emph{exists} an @{term "s\<^sub>3"} and @{term "s\<^sub>4"} such that @{term "s\<^sub>2"}
- can be split up into a non-empty @{term "s\<^sub>3"} and @{term "s\<^sub>4"}. Moreover
- the longer @{term "s\<^sub>1 @ s\<^sub>3"} can be matched by @{text "r\<^sub>1"} and the
- shorter @{term "s\<^sub>4"} can still be matched by @{term "r\<^sub>2"}. In this case
- @{term "s\<^sub>1"} would not be the longest initial split of @{term "s\<^sub>1 @
- s\<^sub>2"} and therefore @{term "Seq v\<^sub>1 v\<^sub>2"} cannot be a POSIX value
- for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}. A similar condition is imposed
- onto the POSIX value in the @{text "P\<star>"}-rule. Also there we want that
- @{term "s\<^sub>1"} is the longest initial split of @{term "s\<^sub>1 @ s\<^sub>2"} and
- furthermore the corresponding value @{term v} cannot be flatten to
- the empty string. In effect, we require that in each ``iteration''
- of the star, some parts of the string need to be ``nibbled'' away; only
- in case of the empty string weBy accept @{term "Stars []"} as the
- POSIX value.
+ \emph{exist} an @{term "s\<^sub>3"} and an @{term "s\<^sub>4"} such that @{term "s\<^sub>2"}
+ can be split up into a non-empty string @{term "s\<^sub>3"} and a possibly empty
+ string @{term "s\<^sub>4"}. Moreover, the longer string @{term "s\<^sub>1 @ s\<^sub>3"} can be
+ matched by @{text "r\<^sub>1"} and the shorter @{term "s\<^sub>4"} can still be
+ matched by @{term "r\<^sub>2"}. In this case @{term "s\<^sub>1"} would not be the
+ longest initial split of @{term "s\<^sub>1 @ s\<^sub>2"} and therefore @{term "Seq v\<^sub>1
+ v\<^sub>2"} cannot be a POSIX value for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}.
+ The main point is that this side-condition ensures the longest
+ match rule is satisfied.
+
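+ To see the effect of the side-condition in @{text "PS"}, consider as a
+ small example the string @{text "ab"} and the regular expression
+ @{text "SEQ (ALT a (SEQ a b)) (ALT b ONE)"}, where @{text a} and
+ @{text b} stand for the corresponding character regular expressions.
+ Splitting @{text "ab"} into @{text "a"} and @{text "b"} is ruled out by
+ the side-condition: choosing @{text "s\<^sub>3"} to be @{text "b"} and
+ @{text "s\<^sub>4"} to be the empty string, the longer prefix @{text "ab"} is
+ in the language of the first component, while the empty string is still
+ matched by the second. The rules therefore single out the value
+ @{text "Seq (Right (Seq (Char a) (Char b))) (Right Void)"}, in which the
+ first component matches the whole string.
+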
+ A similar condition is imposed on the POSIX value in the @{text "P\<star>"}-rule.
+ There we also require that @{term "s\<^sub>1"} is the longest initial
+ split of @{term "s\<^sub>1 @ s\<^sub>2"} and, furthermore, that the corresponding value
+ @{term v} cannot be flattened to the empty string. In effect, we require
+ that in each ``iteration'' of the star some non-empty substring needs to
+ be ``chipped'' away; only in the case of the empty string do we accept
+ @{term "Stars []"} as the POSIX value.
We can prove that given a string @{term s} and regular expression @{term
r}, the POSIX value @{term v} is uniquely determined by @{term "s \<in> r \<rightarrow>
- v"}.
+ v"} (albeit in an uncomputable fashion---for example rule @{text "P+R"}
+ would require the calculation of the potentially infinite set @{term "L
+ r\<^sub>1"}).
\begin{theorem}
@{thm[mode=IfThen] PMatch_determ(1)[of _ _ "v\<^sub>1" "v\<^sub>2"]}
\end{theorem}
- \begin{proof}
- By induction on the definition of @{term "s \<in> r \<rightarrow> v\<^sub>1"} and a case
- analysis of @{term "s \<in> r \<rightarrow> v\<^sub>2"}.\qed
- \end{proof}
+ \begin{proof} By induction on the definition of @{term "s \<in> r \<rightarrow> v\<^sub>1"} and
+ a case analysis of @{term "s \<in> r \<rightarrow> v\<^sub>2"}. This proof requires the
+ auxiliary lemma that @{thm (prem 1) PMatch1(1)} implies @{thm (concl)
+ PMatch1(1)} and @{thm (concl) PMatch1(2)}, which are both easily
+ established by inductions.\qed \end{proof}
+
+ \noindent
+ The next lemma shows that the function @{term "mkeps"} calculates
+ the POSIX value for the empty string and a nullable regular expression.
\begin{lemma}\label{lemmkeps}
@{thm[mode=IfThen] PMatch_mkeps}
@@ -710,14 +725,14 @@
Finally suppose @{term "r = STAR r\<^sub>1"}. This case is very similar to the
sequence case, except that we need to ensure that @{term "flat (injval r\<^sub>1
- c v\<^sub>1) \<noteq> []"}. This follows by Lem.~\ref{posixbasic} from @{term "(c # s\<^sub>1)
- \<in> r' \<rightarrow> injval r\<^sub>1 c v\<^sub>1"} (which in turn follows from @{term "s\<^sub>1 \<in> der c
+ c v\<^sub>1) \<noteq> []"}. This follows from @{term "(c # s\<^sub>1)
+ \<in> r' \<rightarrow> injval r\<^sub>1 c v\<^sub>1"} (which in turn follows from @{term "s\<^sub>1 \<in> der c
r\<^sub>1 \<rightarrow> v\<^sub>1"} and the induction hypothesis).\qed
\end{proof}
\noindent
With Lem.~\ref{PMatch2} in place, it is completely routine to establish
- that the Sulzmann and Lu lexer satisfies its specification (returning
+ that the Sulzmann and Lu lexer satisfies our specification (returning
an ``error'' iff the string is not in the language of the regular expression,
and returning a unique POSIX value iff the string \emph{is} in the language):
@@ -729,30 +744,77 @@
\end{theorem}
\begin{proof}
- By induction on @{term s}.\qed
+ By induction on @{term s} using Lem.~\ref{lemmkeps} and \ref{PMatch2}.\qed
\end{proof}
- This concludes our correctness proof. Note that we have not changed the
- algorithm by Sulzmann and Lu, but introduced our own specification for
- what a correct result---a POSIX value---should be.
+ \noindent This concludes our correctness proof. Note that we have not
+ changed the algorithm by Sulzmann and Lu, but introduced our own
+ specification for what a correct result---a POSIX value---should be.
+ A strong point in favour of Sulzmann and Lu's algorithm is that it
+ can be extended in various ways.
+
*}
-section {* Extensions *}
+section {* Extensions and Optimisations *}
text {*
+ If we are interested in tokenising a string, then we not only need to
+ split it up into tokens, but also to ``classify'' the tokens (for
+ example as keywords or identifiers). This can be
+ done with only minor modifications by introducing \emph{record regular
+ expressions} and \emph{record values} (see for example \cite{Sulzmann2014b}):
+
+ \begin{center}
+ @{text "r :="}
+ @{text "..."} $\mid$
+ @{text "(l : r)"} \qquad\qquad
+ @{text "v :="}
+ @{text "..."} $\mid$
+ @{text "(l : v)"}
+ \end{center}
+
+ \noindent where @{text l} is a label, say a string, @{text r} a regular
+ expression and @{text v} a value. All functions can be smoothly extended
+ to these regular expressions and values. For example @{text "(l : r)"} is
+ nullable iff @{term r} is, and so on (a sketch of such clauses is given
+ below). The purpose of a record regular expression is to mark certain
+ parts of a regular expression and then to record in the calculated value
+ which parts of the string were matched by these parts. The labels can
+ then serve to classify tokens. Recall the
+ regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for keywords and
+ identifiers from the Introduction. With record regular expressions we can
+ form @{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"} and then traverse the
+ calculated value, collecting only the underlying strings of record values.
+ With this we obtain finite sequences of pairs of labels and strings, for
+ example
+
+ \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]
+
+ \noindent from which tokens with classifications (keyword-token,
+ identifier-token and so on) can be extracted.
+
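+ \noindent The smooth extension of the basic functions to record regular
+ expressions could, for example, take the following shape (the clauses
+ below are only meant as a sketch of the idea):
+
+ \begin{center}
+ \begin{tabular}{lcl}
+ @{text "nullable (l : r)"} & $\dn$ & @{text "nullable r"}\\
+ @{text "der c (l : r)"} & $\dn$ & @{text "(l : der c r)"}\\
+ @{text "mkeps (l : r)"} & $\dn$ & @{text "(l : mkeps r)"}\\
+ @{text "inj (l : r) c (l : v)"} & $\dn$ & @{text "(l : inj r c v)"}
+ \end{tabular}
+ \end{center}
+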
Derivatives as calculated by \Brz's method are usually more complex
regular expressions than the initial one; the result is that the matching
- and lexing algorithms are often absymally slow.
+ and lexing algorithms are often abysmally slow. However, various
+ optimisations are possible, such as the simplifications of @{term "ALT
+ ZERO r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r
+ ONE"} to @{term r}. One of the advantages of having a simple specification
+ and correctness proof is that the latter can be refined to accommodate
+ such optimisations while itself remaining simple.
+
+ While the simplification of regular expressions according to such
+ rules
- various optimisations are
- possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT
- r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of
- the advantages of having a simple specification and correctness proof is
- that the latter can be refined to allow for such optimisations and simple
- correctness proof.
-
+ is well understood, there is a problem with the POSIX
+ value calculation algorithm by Sulzmann and Lu: if we build a derivative
+ regular expression and then simplify it, we will calculate a POSIX value
+ for this simplified regular expression, \emph{not} for the original
+ (unsimplified) derivative regular expression. Sulzmann and Lu overcome
+ this problem by not just calculating a simplified regular expression, but
+ also calculating a \emph{rectification function} that ``repairs'' the
+ incorrect value.
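+
+ For example, when @{term "ALT ZERO r"} is simplified to @{term r}, the
+ corresponding rectification function wraps a value @{text v} calculated
+ for @{term r} into @{text "Right v"}, which is a value for the
+ unsimplified @{term "ALT ZERO r"}, since any match must lie in its right
+ alternative.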
*}
--- a/thys/Paper/document/root.bib Tue Mar 08 00:45:09 2016 +0000
+++ b/thys/Paper/document/root.bib Tue Mar 08 05:03:47 2016 +0000
@@ -9,6 +9,17 @@
series = {LNCS}
}
+@inproceedings{Sulzmann2014b,
+ author = {M.~Sulzmann and P.~van Steenhoven},
+ title = {{A} {F}lexible and {E}fficient {ML} {L}exer {T}ool {B}ased on {E}xtended {R}egular
+ {E}xpression {S}ubmatching},
+ booktitle = {Proc.~of the 23rd International Conference on Compiler Construction (CC)},
+ pages = {174--191},
+ year = {2014},
+ volume = {8409},
+ series = {LNCS}
+}
+
@Misc{Kuklewicz,
author = {C.~Kuklewicz},
title = {{R}egex {P}osix},
--- a/thys/ReStar.thy Tue Mar 08 00:45:09 2016 +0000
+++ b/thys/ReStar.thy Tue Mar 08 05:03:47 2016 +0000
@@ -797,6 +797,51 @@
apply(simp)
done
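+
+(* Rectification functions: each takes the rectification function(s) for the
+   subexpressions and turns a value for the simplified regular expression
+   into a value for the unsimplified one. *)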
+fun F_RIGHT where
+ "F_RIGHT f v = Right (f v)"
+fun F_LEFT where
+ "F_LEFT f v = Left (f v)"
+
+fun F_ALT where
+ "F_ALT f1 f2 (Right v) = Right (f2 v)"
+| "F_ALT f1 f2 (Left v) = Left (f1 v)"
+
+fun F_SEQ1 where
+ "F_SEQ1 f1 f2 v = Seq (f1 Void) (f2 v)"
+
+fun F_SEQ2 where
+ "F_SEQ2 f1 f2 v = Seq (f1 v) (f2 Void)"
+
+fun F_SEQ where
+ "F_SEQ f1 f2 (Seq v1 v2) = Seq (f1 v1) (f2 v2)"
+
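+(* simp_ALT and simp_SEQ combine two simplified subexpressions (each paired
+   with its rectification function) and perform the simplifications for
+   ALT ZERO r, ALT r ZERO, SEQ ONE r, SEQ r ONE and SEQs involving ZERO. *)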
+fun simp_ALT where
+ "simp_ALT (ZERO, _) (r2, f2) = (r2, F_RIGHT f2)"
+| "simp_ALT (r1, f1) (ZERO, _) = (r1, F_LEFT f1)"
+| "simp_ALT (r1, f1) (r2, f2) = (ALT r1 r2, F_ALT f1 f2)"
+
+fun simp_SEQ where
+ "simp_SEQ (ONE, f1) (r2, f2) = (r2, F_SEQ1 f1 f2)"
+| "simp_SEQ (r1, f1) (ONE, f2) = (r1, F_SEQ2 f1 f2)"
+| "simp_SEQ (ZERO, f1) (r2, f2) = (ZERO, f1)"
+| "simp_SEQ (r1, f1) (ZERO, f2) = (ZERO, f2)"
+| "simp_SEQ (r1, f1) (r2, f2) = (SEQ r1 r2, F_SEQ f1 f2)"
+
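+(* simp recursively simplifies a regular expression and returns it together
+   with a rectification function for its values. *)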
+fun
+ simp :: "rexp \<Rightarrow> rexp * (val \<Rightarrow> val)"
+where
+ "simp (ALT r1 r2) = simp_ALT (simp r1) (simp r2)"
+| "simp (SEQ r1 r2) = simp_SEQ (simp r1) (simp r2)"
+| "simp r = (r, id)"
+
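+(* matcher3: lexer that simplifies each derivative and uses the rectification
+   function to repair the value before injecting the character back. *)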
+fun
+ matcher3 :: "rexp \<Rightarrow> string \<Rightarrow> val option"
+where
+ "matcher3 r [] = (if nullable r then Some(mkeps r) else None)"
+| "matcher3 r (c#s) = (let (rs, fr) = simp (der c r) in
+ (case (matcher3 rs s) of
+ None \<Rightarrow> None
+ | Some(v) \<Rightarrow> Some(injval r c (fr v))))"
end
\ No newline at end of file
Binary file thys/paper.pdf has changed