--- a/thys/Paper/Paper.thy Tue Mar 08 00:45:09 2016 +0000
+++ b/thys/Paper/Paper.thy Tue Mar 08 05:03:47 2016 +0000
@@ -155,32 +155,29 @@
the priority rule a keyword token, not an identifier token---even if @{text
"r\<^bsub>id\<^esub>"} matches also.\bigskip
-\noindent {\bf Contributions:} (NOT DONE YET) We have implemented in
-Isabelle/HOL the derivative-based regular expression matching algorithm as
-described by Sulzmann and Lu \cite{Sulzmann2014}. We have proved the
-correctness of this algorithm according to our specification of what a POSIX
-value is. Sulzmann and Lu sketch in \cite{Sulzmann2014} an informal
-correctness proof: but to us it contains unfillable gaps.
-
-informal correctness proof given in \cite{Sulzmann2014} is in final
-form\footnote{} and to us contains unfillable gaps.
-
+\noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
+derivative-based regular expression matching algorithm as described by
+Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
+algorithm according to our specification of what a POSIX value is. Sulzmann
+and Lu sketch in \cite{Sulzmann2014} an informal correctness proof, but to
+us it contains unfillable gaps.\footnote{An extended version of
+\cite{Sulzmann2014} is available at the website of its first author; this
+extended version already includes remarks in its appendix acknowledging that
+the informal proof contains gaps; possible fixes are not fully worked out.}
Our specification of a POSIX value consists of a simple inductive definition
that given a string and a regular expression uniquely determines this value.
Derivatives as calculated by Brzozowski's method are usually more complex
regular expressions than the initial one; various optimisations are
-possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r
-ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the
-advantages of having a simple specification and correctness proof is that
-the latter can be refined to allow for such optimisations and simple
-correctness proof.
+possible. We also prove the correctness of the algorithm when the
+simplifications of @{term "ALT ZERO r"}, @{term "ALT r ZERO"},
+@{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r} are applied.
-An extended version of \cite{Sulzmann2014} is available at the website of
-its first author; this includes some ``proofs'', claimed in
-\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
-final form, we make no comment thereon, preferring to give general reasons
-for our belief that the approach of \cite{Sulzmann2014} is problematic
-rather than to discuss details of unpublished work.
+%An extended version of \cite{Sulzmann2014} is available at the website of
+%its first author; this includes some ``proofs'', claimed in
+%\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
+%final form, we make no comment thereon, preferring to give general reasons
+%for our belief that the approach of \cite{Sulzmann2014} is problematic
+%rather than to discuss details of unpublished work.
*}
@@ -271,6 +268,11 @@
@{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\
+ \end{tabular}
+ \end{center}
+
+ \begin{center}
+ \begin{tabular}{lcl}
@{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
@{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
@{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
@@ -526,10 +528,13 @@
injval.simps(1)[of "c" "c"]} $\dn$ @{thm (rhs) injval.simps(1)[of "c"]},
but our deviation is harmless.}
- The idea of @{term inj} to ``inject back'' a character into a value can
- be made precise by the first part of the following lemma; the second
- part shows that the underlying string of an @{const mkeps}-value is
- always the empty string.
+ The idea of the @{term inj}-function to ``inject'' a character, say
+ @{term c}, into a value can be made precise by the first part of the
+ following lemma, which shows that the underlying string of an injected
+ value is the underlying string of the original value with the character
+ @{term c} prepended; the second part shows that the underlying string of
+ an @{const mkeps}-value is always the empty string (provided the regular
+ expression is nullable, since otherwise @{text mkeps} might not be
+ defined).
\begin{lemma}\mbox{}\smallskip\\\label{Prf_injval_flat}
\begin{tabular}{ll}
@@ -539,9 +544,9 @@
\end{lemma}
\begin{proof}
- Both properties are by routine inductions: the first one, for example,
- by an induction over the definition of @{term derivatives}; the second by
- induction on @{term r}. There are no interesting cases.\qed
+ Both properties are by routine inductions: the first one can, for example,
+ be proved by an induction over the definition of @{term derivatives}; the second by
+ an induction on @{term r}. There are no interesting cases.\qed
\end{proof}
Having defined the @{const mkeps} and @{text inj} function we can extend
@@ -558,7 +563,7 @@
\end{center}
\noindent If the regular expression does not match the string, @{const None} is
- returned, indicating an error is raised. If the regular expression does
+ returned, indicating that an error occurred. If the regular expression \emph{does}
match the string, then @{const Some} value is returned. One important
virtue of this algorithm is that it can be implemented with ease in a
functional programming language and also in Isabelle/HOL. In the remaining
@@ -603,46 +608,56 @@
\noindent We claim that this relation captures the idea behind the two
informal POSIX rules shown in the Introduction: Consider for example the
- rules @{text "P+L"} and @{text "P+R"} where the POSIX value for an
- alternative regular expression is specified---it is always a @{text
- "Left"}-value, \emph{except} when the string to be matched is not in the
- language of @{term "r\<^sub>1"}; only then it is a @{text Right}-value (see the
- side-condition in @{text "P+R"}). Interesting is also the rule for
- sequence regular expressions (@{text "PS"}). The first two premises state
- that @{term "v\<^sub>1"} and @{term "v\<^sub>2"} are the POSIX values for
- @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"}
+ rules @{text "P+L"} and @{text "P+R"} where the POSIX value for a string
+ and an alternative regular expression, that is @{term "(s, ALT r\<^sub>1 r\<^sub>2)"},
+ is specified---it is always a @{text "Left"}-value, \emph{except} when the
+ string to be matched is not in the language of @{term "r\<^sub>1"}; only then it
+ is a @{text Right}-value (see the side-condition in @{text "P+R"}).
+ The rule for sequence regular expressions (@{text "PS"}) is also
+ interesting. The first two premises state that @{term "v\<^sub>1"} and @{term "v\<^sub>2"}
+ are the POSIX values for @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"}
respectively. Consider now the third premise and note that the POSIX value
of this rule should match the string @{term "s\<^sub>1 @ s\<^sub>2"}. According to the
longest match rule, we want that the @{term "s\<^sub>1"} is the longest initial
split of @{term "s\<^sub>1 @ s\<^sub>2"} such that @{term "s\<^sub>2"} is still recognised
by @{term "r\<^sub>2"}. Let us assume, contrary to the third premise, that there
- \emph{exists} an @{term "s\<^sub>3"} and @{term "s\<^sub>4"} such that @{term "s\<^sub>2"}
- can be split up into a non-empty @{term "s\<^sub>3"} and @{term "s\<^sub>4"}. Moreover
- the longer @{term "s\<^sub>1 @ s\<^sub>3"} can be matched by @{text "r\<^sub>1"} and the
- shorter @{term "s\<^sub>4"} can still be matched by @{term "r\<^sub>2"}. In this case
- @{term "s\<^sub>1"} would not be the longest initial split of @{term "s\<^sub>1 @
- s\<^sub>2"} and therefore @{term "Seq v\<^sub>1 v\<^sub>2"} cannot be a POSIX value
- for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}. A similar condition is imposed
- onto the POSIX value in the @{text "P\<star>"}-rule. Also there we want that
- @{term "s\<^sub>1"} is the longest initial split of @{term "s\<^sub>1 @ s\<^sub>2"} and
- furthermore the corresponding value @{term v} cannot be flatten to
- the empty string. In effect, we require that in each ``iteration''
- of the star, some parts of the string need to be ``nibbled'' away; only
- in case of the empty string weBy accept @{term "Stars []"} as the
- POSIX value.
+ \emph{exist} an @{term "s\<^sub>3"} and an @{term "s\<^sub>4"} such that @{term "s\<^sub>2"}
+ can be split up into a non-empty string @{term "s\<^sub>3"} and a possibly empty
+ string @{term "s\<^sub>4"}. Moreover, the longer string @{term "s\<^sub>1 @ s\<^sub>3"} can be
+ matched by @{text "r\<^sub>1"} and the shorter @{term "s\<^sub>4"} can still be
+ matched by @{term "r\<^sub>2"}. In this case @{term "s\<^sub>1"} would not be the
+ longest initial split of @{term "s\<^sub>1 @ s\<^sub>2"} and therefore @{term "Seq v\<^sub>1
+ v\<^sub>2"} cannot be a POSIX value for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}.
+ The main point is that this side-condition ensures the longest
+ match rule is satisfied.
+
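+ To see the effect of the side-condition in @{text "PS"}, consider as a
+ small example the string @{text "ab"} and the regular expression
+ @{text "SEQ (ALT a (SEQ a b)) (ALT b ONE)"}, where @{text a} and
+ @{text b} stand for the corresponding character regular expressions.
+ Splitting @{text "ab"} into @{text "a"} and @{text "b"} is ruled out by
+ the side-condition: choosing @{text "s\<^sub>3"} to be @{text "b"} and
+ @{text "s\<^sub>4"} to be the empty string, the longer prefix @{text "ab"} is
+ in the language of the first component, while the empty string is still
+ matched by the second. The rules therefore single out the value
+ @{text "Seq (Right (Seq (Char a) (Char b))) (Right Void)"}, in which the
+ first component matches the whole string.
+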
+ A similar condition is imposed on the POSIX value in the @{text "P\<star>"}-rule.
+ There we also require that @{term "s\<^sub>1"} is the longest initial
+ split of @{term "s\<^sub>1 @ s\<^sub>2"} and, furthermore, that the corresponding value
+ @{term v} cannot be flattened to the empty string. In effect, we require
+ that in each ``iteration'' of the star some non-empty substring needs to
+ be ``chipped'' away; only in the case of the empty string do we accept
+ @{term "Stars []"} as the POSIX value.
We can prove that given a string @{term s} and regular expression @{term
r}, the POSIX value @{term v} is uniquely determined by @{term "s \<in> r \<rightarrow>
- v"}.
+ v"} (albeit in an uncomputable fashion---for example rule @{text "P+R"}
+ would require the calculation of the potentially infinite set @{term "L
+ r\<^sub>1"}).
\begin{theorem}
@{thm[mode=IfThen] PMatch_determ(1)[of _ _ "v\<^sub>1" "v\<^sub>2"]}
\end{theorem}
- \begin{proof}
- By induction on the definition of @{term "s \<in> r \<rightarrow> v\<^sub>1"} and a case
- analysis of @{term "s \<in> r \<rightarrow> v\<^sub>2"}.\qed
- \end{proof}
+ \begin{proof} By induction on the definition of @{term "s \<in> r \<rightarrow> v\<^sub>1"} and
+ a case analysis of @{term "s \<in> r \<rightarrow> v\<^sub>2"}. This proof requires the
+ auxiliary lemma that @{thm (prem 1) PMatch1(1)} implies @{thm (concl)
+ PMatch1(1)} and @{thm (concl) PMatch1(2)}, which are both easily
+ established by inductions.\qed \end{proof}
+
+ \noindent
+ The next lemma shows that the function @{term "mkeps"} calculates
+ the POSIX value for the empty string and a nullable regular expression.
\begin{lemma}\label{lemmkeps}
@{thm[mode=IfThen] PMatch_mkeps}
@@ -710,14 +725,14 @@
Finally suppose @{term "r = STAR r\<^sub>1"}. This case is very similar to the
sequence case, except that we need to ensure that @{term "flat (injval r\<^sub>1
- c v\<^sub>1) \<noteq> []"}. This follows by Lem.~\ref{posixbasic} from @{term "(c # s\<^sub>1)
- \<in> r' \<rightarrow> injval r\<^sub>1 c v\<^sub>1"} (which in turn follows from @{term "s\<^sub>1 \<in> der c
+ c v\<^sub>1) \<noteq> []"}. This follows from @{term "(c # s\<^sub>1)
+ \<in> r' \<rightarrow> injval r\<^sub>1 c v\<^sub>1"} (which in turn follows from @{term "s\<^sub>1 \<in> der c
r\<^sub>1 \<rightarrow> v\<^sub>1"} and the induction hypothesis).\qed
\end{proof}
\noindent
With Lem.~\ref{PMatch2} in place, it is completely routine to establish
- that the Sulzmann and Lu lexer satisfies its specification (returning
+ that the Sulzmann and Lu lexer satisfies our specification (returning
an ``error'' iff the string is not in the language of the regular expression,
and returning a unique POSIX value iff the string \emph{is} in the language):
@@ -729,30 +744,77 @@
\end{theorem}
\begin{proof}
- By induction on @{term s}.\qed
+ By induction on @{term s} using Lem.~\ref{lemmkeps} and \ref{PMatch2}.\qed
\end{proof}
- This concludes our correctness proof. Note that we have not changed the
- algorithm by Sulzmann and Lu, but introduced our own specification for
- what a correct result---a POSIX value---should be.
+ \noindent This concludes our correctness proof. Note that we have not
+ changed the algorithm by Sulzmann and Lu, but introduced our own
+ specification for what a correct result---a POSIX value---should be.
+ A strong point in favour of Sulzmann and Lu's algorithm is that it
+ can be extended in various ways.
+
*}
-section {* Extensions *}
+section {* Extensions and Optimisations *}
text {*
+ If we are interested in tokenising a string, then we not only need to
+ split it up into tokens, but also to ``classify'' the tokens (for
+ example as keywords or identifiers). This can be
+ done with only minor modifications by introducing \emph{record regular
+ expressions} and \emph{record values} (see for example \cite{Sulzmann2014b}):
+
+ \begin{center}
+ @{text "r :="}
+ @{text "..."} $\mid$
+ @{text "(l : r)"} \qquad\qquad
+ @{text "v :="}
+ @{text "..."} $\mid$
+ @{text "(l : v)"}
+ \end{center}
+
+ \noindent where @{text l} is a label, say a string, @{text r} a regular
+ expression and @{text v} a value. All functions can be smoothly extended
+ to these regular expressions and values. For example @{text "(l : r)"} is
+ nullable iff @{term r} is, and so on (a sketch of such clauses is given
+ below). The purpose of a record regular expression is to mark certain
+ parts of a regular expression and then to record in the calculated value
+ which parts of the string were matched by these parts. The labels can
+ then serve to classify tokens. Recall the
+ regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for keywords and
+ identifiers from the Introduction. With record regular expressions we can
+ form @{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"} and then traverse the
+ calculated value, collecting only the underlying strings of record values.
+ With this we obtain finite sequences of pairs of labels and strings, for
+ example
+
+ \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]
+
+ \noindent from which tokens with classifications (keyword-token,
+ identifier-token and so on) can be extracted.
+
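+ \noindent The smooth extension of the basic functions to record regular
+ expressions could, for example, take the following shape (the clauses
+ below are only meant as a sketch of the idea):
+
+ \begin{center}
+ \begin{tabular}{lcl}
+ @{text "nullable (l : r)"} & $\dn$ & @{text "nullable r"}\\
+ @{text "der c (l : r)"} & $\dn$ & @{text "(l : der c r)"}\\
+ @{text "mkeps (l : r)"} & $\dn$ & @{text "(l : mkeps r)"}\\
+ @{text "inj (l : r) c (l : v)"} & $\dn$ & @{text "(l : inj r c v)"}
+ \end{tabular}
+ \end{center}
+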
Derivatives as calculated by \Brz's method are usually more complex
regular expressions than the initial one; the result is that the matching
- and lexing algorithms are often absymally slow.
+ and lexing algorithms are often abysmally slow. However, various
+ optimisations are possible, such as the simplifications of @{term "ALT
+ ZERO r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r
+ ONE"} to @{term r}. One of the advantages of having a simple specification
+ and correctness proof is that the latter can be refined to accommodate
+ such optimisations while itself remaining simple.
+
+ While the simplification of regular expressions according to such
+ rules
- various optimisations are
- possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT
- r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of
- the advantages of having a simple specification and correctness proof is
- that the latter can be refined to allow for such optimisations and simple
- correctness proof.
-
+ is well understood, there is a problem with the POSIX
+ value calculation algorithm by Sulzmann and Lu: if we build a derivative
+ regular expression and then simplify it, we will calculate a POSIX value
+ for this simplified regular expression, \emph{not} for the original
+ (unsimplified) derivative regular expression. Sulzmann and Lu overcome
+ this problem by not just calculating a simplified regular expression, but
+ also calculating a \emph{rectification function} that ``repairs'' the
+ incorrect value.
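+
+ For example, when @{term "ALT ZERO r"} is simplified to @{term r}, the
+ corresponding rectification function wraps a value @{text v} calculated
+ for @{term r} into @{text "Right v"}, which is a value for the
+ unsimplified @{term "ALT ZERO r"}, since any match must lie in its right
+ alternative.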
*}
--- a/thys/Paper/document/root.bib Tue Mar 08 00:45:09 2016 +0000
+++ b/thys/Paper/document/root.bib Tue Mar 08 05:03:47 2016 +0000
@@ -9,6 +9,17 @@
series = {LNCS}
}
+@inproceedings{Sulzmann2014b,
+ author = {M.~Sulzmann and P.~van Steenhoven},
+ title = {{A} {F}lexible and {E}fficient {ML} {L}exer {T}ool {B}ased on {E}xtended {R}egular
+ {E}xpression {S}ubmatching},
+ booktitle = {Proc.~of the 23rd International Conference on Compiler Construction (CC)},
+ pages = {174--191},
+ year = {2014},
+ volume = {8409},
+ series = {LNCS}
+}
+
@Misc{Kuklewicz,
author = {C.~Kuklewicz},
title = {{R}egex {P}osix},
--- a/thys/ReStar.thy Tue Mar 08 00:45:09 2016 +0000
+++ b/thys/ReStar.thy Tue Mar 08 05:03:47 2016 +0000
@@ -797,6 +797,51 @@
apply(simp)
done
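+
+(* Rectification functions: each takes the rectification function(s) for the
+   subexpressions and turns a value for the simplified regular expression
+   into a value for the unsimplified one. *)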
+fun F_RIGHT where
+ "F_RIGHT f v = Right (f v)"
+fun F_LEFT where
+ "F_LEFT f v = Left (f v)"
+
+fun F_ALT where
+ "F_ALT f1 f2 (Right v) = Right (f2 v)"
+| "F_ALT f1 f2 (Left v) = Left (f1 v)"
+
+fun F_SEQ1 where
+ "F_SEQ1 f1 f2 v = Seq (f1 Void) (f2 v)"
+
+fun F_SEQ2 where
+ "F_SEQ2 f1 f2 v = Seq (f1 v) (f2 Void)"
+
+fun F_SEQ where
+ "F_SEQ f1 f2 (Seq v1 v2) = Seq (f1 v1) (f2 v2)"
+
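+(* simp_ALT and simp_SEQ combine two simplified subexpressions (each paired
+   with its rectification function) and perform the simplifications for
+   ALT ZERO r, ALT r ZERO, SEQ ONE r, SEQ r ONE and SEQs involving ZERO. *)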
+fun simp_ALT where
+ "simp_ALT (ZERO, _) (r2, f2) = (r2, F_RIGHT f2)"
+| "simp_ALT (r1, f1) (ZERO, _) = (r1, F_LEFT f1)"
+| "simp_ALT (r1, f1) (r2, f2) = (ALT r1 r2, F_ALT f1 f2)"
+
+fun simp_SEQ where
+ "simp_SEQ (ONE, f1) (r2, f2) = (r2, F_SEQ1 f1 f2)"
+| "simp_SEQ (r1, f1) (ONE, f2) = (r1, F_SEQ2 f1 f2)"
+| "simp_SEQ (ZERO, f1) (r2, f2) = (ZERO, f1)"
+| "simp_SEQ (r1, f1) (ZERO, f2) = (ZERO, f2)"
+| "simp_SEQ (r1, f1) (r2, f2) = (SEQ r1 r2, F_SEQ f1 f2)"
+
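+(* simp recursively simplifies a regular expression and returns it together
+   with a rectification function for its values. *)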
+fun
+ simp :: "rexp \<Rightarrow> rexp * (val \<Rightarrow> val)"
+where
+ "simp (ALT r1 r2) = simp_ALT (simp r1) (simp r2)"
+| "simp (SEQ r1 r2) = simp_SEQ (simp r1) (simp r2)"
+| "simp r = (r, id)"
+
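+(* matcher3: lexer that simplifies each derivative and uses the rectification
+   function to repair the value before injecting the character back. *)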
+fun
+ matcher3 :: "rexp \<Rightarrow> string \<Rightarrow> val option"
+where
+ "matcher3 r [] = (if nullable r then Some(mkeps r) else None)"
+| "matcher3 r (c#s) = (let (rs, fr) = simp (der c r) in
+ (case (matcher3 rs s) of
+ None \<Rightarrow> None
+ | Some(v) \<Rightarrow> Some(injval r c (fr v))))"
end
\ No newline at end of file
Binary file thys/paper.pdf has changed