# HG changeset patch # User Christian Urban # Date 1457413427 0 # Node ID e866678c29cb51d973e189c10dd1f81dd629bbb9 # Parent ff084486098171a9121787040df719ee4e339fb1 updated diff -r ff0844860981 -r e866678c29cb thys/Paper/Paper.thy --- a/thys/Paper/Paper.thy Tue Mar 08 00:45:09 2016 +0000 +++ b/thys/Paper/Paper.thy Tue Mar 08 05:03:47 2016 +0000 @@ -155,32 +155,29 @@ the priority rule a keyword token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"} matches also.\bigskip -\noindent {\bf Contributions:} (NOT DONE YET) We have implemented in -Isabelle/HOL the derivative-based regular expression matching algorithm as -described by Sulzmann and Lu \cite{Sulzmann2014}. We have proved the -correctness of this algorithm according to our specification of what a POSIX -value is. Sulzmann and Lu sketch in \cite{Sulzmann2014} an informal -correctness proof: but to us it contains unfillable gaps. - -informal correctness proof given in \cite{Sulzmann2014} is in final -form\footnote{} and to us contains unfillable gaps. - +\noindent {\bf Contributions:} We have implemented in Isabelle/HOL the +derivative-based regular expression matching algorithm as described by +Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this +algorithm according to our specification of what a POSIX value is. Sulzmann +and Lu sketch in \cite{Sulzmann2014} an informal correctness proof: but to +us it contains unfillable gaps.\footnote{An extended version of +\cite{Sulzmann2014} is available at the website of its first author; this +extended version already includes remarks in the appendix that their +informal proof contains gaps, and possible fixes are not fully worked out.} Our specification of a POSIX value consists of a simple inductive definition that given a string and a regular expression uniquely determines this value. Derivatives as calculated by Brzozowski's method are usually more complex regular expressions than the initial one; various optimisations are -possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r -ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the -advantages of having a simple specification and correctness proof is that -the latter can be refined to allow for such optimisations and simple -correctness proof. +possible. We prove the correctness when simplifications of @{term "ALT ZERO +r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to +@{term r} are applied. -An extended version of \cite{Sulzmann2014} is available at the website of -its first author; this includes some ``proofs'', claimed in -\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in -final form, we make no comment thereon, preferring to give general reasons -for our belief that the approach of \cite{Sulzmann2014} is problematic -rather than to discuss details of unpublished work. +%An extended version of \cite{Sulzmann2014} is available at the website of +%its first author; this includes some ``proofs'', claimed in +%\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in +%final form, we make no comment thereon, preferring to give general reasons +%for our belief that the approach of \cite{Sulzmann2014} is problematic +%rather than to discuss details of unpublished work. *} @@ -271,6 +268,11 @@ @{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\ @{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\ @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\ + \end{tabular} + \end{center} + + \begin{center} + \begin{tabular}{lcl} @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\ @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\ @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\ @@ -526,10 +528,13 @@ injval.simps(1)[of "c" "c"]} $\dn$ @{thm (rhs) injval.simps(1)[of "c"]}, but our deviation is harmless.} - The idea of @{term inj} to ``inject back'' a character into a value can - be made precise by the first part of the following lemma; the second - part shows that the underlying string of an @{const mkeps}-value is - always the empty string. + The idea of the @{term inj}-function to ``inject'' a character, say + @{term c}, into a value can be made precise by the first part of the + following lemma, which shows that the underlying string of an injected + value has a prepend character @{term c}; the second part shows that the + underlying string of an @{const mkeps}-value is always the empty string + (given the regular expression is nullable since otherwise @{text mkeps} + might not be defined). \begin{lemma}\mbox{}\smallskip\\\label{Prf_injval_flat} \begin{tabular}{ll} @@ -539,9 +544,9 @@ \end{lemma} \begin{proof} - Both properties are by routine inductions: the first one, for example, - by an induction over the definition of @{term derivatives}; the second by - induction on @{term r}. There are no interesting cases.\qed + Both properties are by routine inductions: the first one can, for example, + be proved by an induction over the definition of @{term derivatives}; the second by + an induction on @{term r}. There are no interesting cases.\qed \end{proof} Having defined the @{const mkeps} and @{text inj} function we can extend @@ -558,7 +563,7 @@ \end{center} \noindent If the regular expression does not match the string, @{const None} is - returned, indicating an error is raised. If the regular expression does + returned, indicating an error is raised. If the regular expression \emph{does} match the string, then @{const Some} value is returned. One important virtue of this algorithm is that it can be implemented with ease in a functional programming language and also in Isabelle/HOL. In the remaining @@ -603,46 +608,56 @@ \noindent We claim that this relation captures the idea behind the two informal POSIX rules shown in the Introduction: Consider for example the - rules @{text "P+L"} and @{text "P+R"} where the POSIX value for an - alternative regular expression is specified---it is always a @{text - "Left"}-value, \emph{except} when the string to be matched is not in the - language of @{term "r\<^sub>1"}; only then it is a @{text Right}-value (see the - side-condition in @{text "P+R"}). Interesting is also the rule for - sequence regular expressions (@{text "PS"}). The first two premises state - that @{term "v\<^sub>1"} and @{term "v\<^sub>2"} are the POSIX values for - @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"} + rules @{text "P+L"} and @{text "P+R"} where the POSIX value for a string + and an alternative regular expression, that is @{term "(s, ALT r\<^sub>1 r\<^sub>2)"}, + is specified---it is always a @{text "Left"}-value, \emph{except} when the + string to be matched is not in the language of @{term "r\<^sub>1"}; only then it + is a @{text Right}-value (see the side-condition in @{text "P+R"}). + Interesting is also the rule for sequence regular expressions (@{text + "PS"}). The first two premises state that @{term "v\<^sub>1"} and @{term "v\<^sub>2"} + are the POSIX values for @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"} respectively. Consider now the third premise and note that the POSIX value of this rule should match the string @{term "s\<^sub>1 @ s\<^sub>2"}. According to the longest match rule, we want that the @{term "s\<^sub>1"} is the longest initial split of @{term "s\<^sub>1 @ s\<^sub>2"} such that @{term "s\<^sub>2"} is still recognised by @{term "r\<^sub>2"}. Let us assume, contrary to the third premise, that there - \emph{exists} an @{term "s\<^sub>3"} and @{term "s\<^sub>4"} such that @{term "s\<^sub>2"} - can be split up into a non-empty @{term "s\<^sub>3"} and @{term "s\<^sub>4"}. Moreover - the longer @{term "s\<^sub>1 @ s\<^sub>3"} can be matched by @{text "r\<^sub>1"} and the - shorter @{term "s\<^sub>4"} can still be matched by @{term "r\<^sub>2"}. In this case - @{term "s\<^sub>1"} would not be the longest initial split of @{term "s\<^sub>1 @ - s\<^sub>2"} and therefore @{term "Seq v\<^sub>1 v\<^sub>2"} cannot be a POSIX value - for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}. A similar condition is imposed - onto the POSIX value in the @{text "P\"}-rule. Also there we want that - @{term "s\<^sub>1"} is the longest initial split of @{term "s\<^sub>1 @ s\<^sub>2"} and - furthermore the corresponding value @{term v} cannot be flatten to - the empty string. In effect, we require that in each ``iteration'' - of the star, some parts of the string need to be ``nibbled'' away; only - in case of the empty string weBy accept @{term "Stars []"} as the - POSIX value. + \emph{exist} an @{term "s\<^sub>3"} and @{term "s\<^sub>4"} such that @{term "s\<^sub>2"} + can be split up into a non-empty string @{term "s\<^sub>3"} and possibly empty + string @{term "s\<^sub>4"}. Moreover the longer string @{term "s\<^sub>1 @ s\<^sub>3"} can be + matched by @{text "r\<^sub>1"} and the shorter @{term "s\<^sub>4"} can still be + matched by @{term "r\<^sub>2"}. In this case @{term "s\<^sub>1"} would not be the + longest initial split of @{term "s\<^sub>1 @ s\<^sub>2"} and therefore @{term "Seq v\<^sub>1 + v\<^sub>2"} cannot be a POSIX value for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}. + The main point is that this side-condition ensures the longest + match rule is satisfied. + + A similar condition is imposed on the POSIX value in the @{text + "P\"}-rule. Also there we want that @{term "s\<^sub>1"} is the longest initial + split of @{term "s\<^sub>1 @ s\<^sub>2"} and furthermore the corresponding value + @{term v} cannot be flatten to the empty string. In effect, we require + that in each ``iteration'' of the star, some non-empty substring need to + be ``chipped'' away; only in case of the empty string we accept @{term + "Stars []"} as the POSIX value. We can prove that given a string @{term s} and regular expression @{term r}, the POSIX value @{term v} is uniquely determined by @{term "s \ r \ - v"}. + v"} (albeilt in an uncomputable fashion---for example rule @{term "P+R"} + would require the calculation of the potentially infinite set @{term "L + r\<^sub>1"}). \begin{theorem} @{thm[mode=IfThen] PMatch_determ(1)[of _ _ "v\<^sub>1" "v\<^sub>2"]} \end{theorem} - \begin{proof} - By induction on the definition of @{term "s \ r \ v\<^sub>1"} and a case - analysis of @{term "s \ r \ v\<^sub>2"}.\qed - \end{proof} + \begin{proof} By induction on the definition of @{term "s \ r \ v\<^sub>1"} and + a case analysis of @{term "s \ r \ v\<^sub>2"}. This proof requires the + auxiliary lemma that @{thm (prem 1) PMatch1(1)} implies @{thm (concl) + PMatch1(1)} and @{thm (concl) PMatch1(2)}, which are both easily + established by inductions.\qed \end{proof} + + \noindent + Next is the lemma that shows the function @{term "mkeps"} calculates + the posix value for the empty string and a nullable regular expression. \begin{lemma}\label{lemmkeps} @{thm[mode=IfThen] PMatch_mkeps} @@ -710,14 +725,14 @@ Finally suppose @{term "r = STAR r\<^sub>1"}. This case is very similar to the sequence case, except that we need to ensure that @{term "flat (injval r\<^sub>1 - c v\<^sub>1) \ []"}. This follows by Lem.~\ref{posixbasic} from @{term "(c # s\<^sub>1) - \ r' \ injval r\<^sub>1 c v\<^sub>1"} (which in turn follows from @{term "s\<^sub>1 \ der c + c v\<^sub>1) \ []"}. This follows from @{term "(c # s\<^sub>1) + \ r' \ injval r\<^sub>1 c v\<^sub>1"} (which in turn follows from @{term "s\<^sub>1 \ der c r\<^sub>1 \ v\<^sub>1"} and the induction hypothesis).\qed \end{proof} \noindent With Lem.~\ref{PMatch2} in place, it is completely routine to establish - that the Sulzmann and Lu lexer satisfies its specification (returning + that the Sulzmann and Lu lexer satisfies our specification (returning an ``error'' iff the string is not in the language of the regular expression, and returning a unique POSIX value iff the string \emph{is} in the language): @@ -729,30 +744,77 @@ \end{theorem} \begin{proof} - By induction on @{term s}.\qed + By induction on @{term s} using Lem.~\ref{lemmkeps} and \ref{PMatch2}.\qed \end{proof} - This concludes our correctness proof. Note that we have not changed the - algorithm by Sulzmann and Lu, but introduced our own specification for - what a correct result---a POSIX value---should be. + \noindent This concludes our correctness proof. Note that we have not + changed the algorithm by Sulzmann and Lu, but introduced our own + specification for what a correct result---a POSIX value---should be. + A strong point in favour of Sulzmann and Lu's algorithm is that it + can be extended in various ways. + *} -section {* Extensions *} +section {* Extensions and Optimisations*} text {* + If we are interested in tokenising string, then we need to not just + split up the string into tokens, but also ``classify'' the tokens (for + example whether it is a keyword or an identifier). This can be + done with only minor modifications by introducing \emph{record regular + expressions} and \emph{record values} (for example \cite{Sulzmann2014b}): + + \begin{center} + @{text "r :="} + @{text "..."} $\mid$ + @{text "(l : r)"} \qquad\qquad + @{text "v :="} + @{text "..."} $\mid$ + @{text "(l : v)"} + \end{center} + + \noindent where @{text l} is a label, say a string, @{text r} a regular + expression and @{text v} a value. All functions can be smoothly extended + to these regular expressions and values. For example @{text "(l : r)"} is + nullable iff @{term r} is, and so on. The purpose of the record regular + expression is to mark certain parts of a regular expression and then + record in the calculated value which parts of the string were matched by + this part. The label can then serve for classifying tokens. Recall the + regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\"} for keywords and + identifiers from the Introduction. With record regular expression we can + form @{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\"} and then traverse the + calculated value and only collect the underlying strings in record values. + With this we obtain finite sequences of pairs of labels and strings, for + example + + \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\] + + \noindent from which tokens with classifications (keyword-token, + identifier-token and so on) can be extracted. + Derivatives as calculated by \Brz's method are usually more complex regular expressions than the initial one; the result is that the matching - and lexing algorithms are often absymally slow. + and lexing algorithms are often abysmally slow. However, various + optimisations are possible, such as the simplifications of @{term "ALT + ZERO r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r + ONE"} to @{term r}. One of the advantages of having a simple specification + and correctness proof is that the latter can be refined to allow for such + optimisations and simple correctness proof. + + While the simplification of regular expressions according to + simplification rules - various optimisations are - possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT - r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of - the advantages of having a simple specification and correctness proof is - that the latter can be refined to allow for such optimisations and simple - correctness proof. - + \noindent + is well understood, there is a problem with the POSIX + value calculation algorithm by Sulzmann and Lu: if we build a derivative + regular expression and then simplify it, we will calculate a POSIX value + for this simplified regular expression, \emph{not} for the original + (unsimplified) derivative regular expression. Sulzmann and Lu overcome + this problem by not just calculating a simplified regular expression, but + also calculating a \emph{rectification function} that ``repairs'' the + incorrect value. *} diff -r ff0844860981 -r e866678c29cb thys/Paper/document/root.bib --- a/thys/Paper/document/root.bib Tue Mar 08 00:45:09 2016 +0000 +++ b/thys/Paper/document/root.bib Tue Mar 08 05:03:47 2016 +0000 @@ -9,6 +9,17 @@ series = {LNCS} } +@inproceedings{Sulzmann2014b, + author = {M.~Sulzmann and P.~van Steenhoven}, + title = {{A} {F}lexible and {E}fficient {ML} {L}exer {T}ool {B}ased on {E}xtended {R}egular + {E}xpression {S}ubmatching}, + booktitle = {Proc.~of the 23rd International Conference on Compiler Construction (CC)}, + pages = {174--191}, + year = {2014}, + volume = {8409}, + series = {LNCS} +} + @Misc{Kuklewicz, author = {C.~Kuklewicz}, title = {{R}egex {P}osix}, diff -r ff0844860981 -r e866678c29cb thys/ReStar.thy --- a/thys/ReStar.thy Tue Mar 08 00:45:09 2016 +0000 +++ b/thys/ReStar.thy Tue Mar 08 05:03:47 2016 +0000 @@ -797,6 +797,51 @@ apply(simp) done +fun F_RIGHT where + "F_RIGHT f v = Right (f v)" +fun F_LEFT where + "F_LEFT f v = Left (f v)" + +fun F_ALT where + "F_ALT f1 f2 (Right v) = Right (f2 v)" +| "F_ALT f1 f2 (Left v) = Left (f1 v)" + +fun F_SEQ1 where + "F_SEQ1 f1 f2 v = Seq (f1 Void) (f2 v)" + +fun F_SEQ2 where + "F_SEQ2 f1 f2 v = Seq (f1 v) (f2 Void)" + +fun F_SEQ where + "F_SEQ f1 f2 (Seq v1 v2) = Seq (f1 v1) (f2 v2)" + +fun simp_ALT where + "simp_ALT (ZERO, _) (r2, f2) = (r2, F_RIGHT f2)" +| "simp_ALT (r1, f1) (ZERO, _) = (r1, F_LEFT f1)" +| "simp_ALT (r1, f1) (r2, f2) = (ALT r1 r2, F_ALT f1 f2)" + +fun simp_SEQ where + "simp_SEQ (ONE, f1) (r2, f2) = (r2, F_SEQ1 f1 f2)" +| "simp_SEQ (r1, f1) (ONE, f2) = (r1, F_SEQ2 f1 f2)" +| "simp_SEQ (ZERO, f1) (r2, f2) = (ZERO, f1)" +| "simp_SEQ (r1, f1) (ZERO, f2) = (ZERO, f2)" +| "simp_SEQ (r1, f1) (r2, f2) = (SEQ r1 r2, F_SEQ f1 f2)" + +fun + simp :: "rexp \ rexp * (val \ val)" +where + "simp (ALT r1 r2) = simp_ALT (simp r1) (simp r2)" +| "simp (SEQ r1 r2) = simp_SEQ (simp r1) (simp r2)" +| "simp r = (r, id)" + +fun + matcher3 :: "rexp \ string \ val option" +where + "matcher3 r [] = (if nullable r then Some(mkeps r) else None)" +| "matcher3 r (c#s) = (let (rs, fr) = simp (der c r) in + (case (matcher3 rs s) of + None \ None + | Some(v) \ Some(injval r c (fr v))))" end \ No newline at end of file diff -r ff0844860981 -r e866678c29cb thys/paper.pdf Binary file thys/paper.pdf has changed