# HG changeset patch # User Christian Urban # Date 1457150470 0 # Node ID 90fe1a1d7d0e0a50d5330bb62f97c7d390555cc7 # Parent 698967eceaf152efee5b152f276e7a9e6e135d0d updated diff -r 698967eceaf1 -r 90fe1a1d7d0e thys/Paper/Paper.thy --- a/thys/Paper/Paper.thy Thu Mar 03 08:22:39 2016 +0000 +++ b/thys/Paper/Paper.thy Sat Mar 05 04:01:10 2016 +0000 @@ -19,23 +19,27 @@ SEQ ("_ \ _" [77,77] 78) and STAR ("_\<^sup>\" [1000] 78) and - val.Void ("'(')" 78) and - val.Char ("Char _" [1000] 78) and + val.Void ("'(')" 79) and + val.Char ("Char _" [1000] 79) and val.Left ("Left _" [1000] 78) and val.Right ("Right _" [1000] 78) and L ("L'(_')" [10] 78) and der_syn ("_\\_" [79, 1000] 76) and - flat ("|_|" [70] 73) and + flat ("|_|" [75] 73) and Sequ ("_ @ _" [78,77] 63) and - injval ("inj _ _ _" [1000,77,1000] 77) and + injval ("inj _ _ _" [78,77,78] 77) and projval ("proj _ _ _" [1000,77,1000] 77) and - length ("len _" [78] 73) + length ("len _" [78] 73) and + + Prf ("\ _ : _" [75,75] 75) and + PMatch ("_ : _ \ _" [75,75,75] 75) (* and ValOrd ("_ \\<^bsub>_\<^esub> _" [78,77,77] 73) *) (*>*) section {* Introduction *} + text {* Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em @@ -59,7 +63,7 @@ Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow generation not just of a YES/NO answer but of an actual matching, called a [lexical] {\em value}. They give a simple algorithm to calculate a value -that appears to be the value associated with POSIX lexing +that appears to be the value associated with POSIX matching \cite{Kuklewicz,Vansummeren2006}. The challenge then is to specify that value, in an algorithm-independent fashion, and to show that Sulzamann and Lu's derivative-based algorithm does indeed calculate a value that is @@ -86,17 +90,17 @@ noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}. -If a regular expression matches a string, then in general there is more -than one way of how the string is matched. There are two commonly used +If a regular expression matches a string, then in general there is more than +one way of how the string is matched. There are two commonly used disambiguation strategies to generate a unique answer: one is called GREEDY matching \cite{Frisch2004} and the other is POSIX -matching~\cite{Kuklewicz,Sulzmann2014}. For example consider the string -@{term xy} and the regular expression \mbox{@{term "STAR (ALT (ALT x y) -xy)"}}. Either the string can be matched in two `iterations' by the single -letter-regular expressions @{term x} and @{term y}, or directly in one -iteration by @{term xy}. The first case corresponds to GREEDY matching, -which first matches with the left-most symbol and only matches the next -symbol in case of a mismatch (this is greedy in the sense of preferring +matching~\cite{Kuklewicz,Sulzmann2014,Vansummeren2006}. For example consider +the string @{term xy} and the regular expression \mbox{@{term "STAR (ALT +(ALT x y) xy)"}}. Either the string can be matched in two `iterations' by +the single letter-regular expressions @{term x} and @{term y}, or directly +in one iteration by @{term xy}. The first case corresponds to GREEDY +matching, which first matches with the left-most symbol and only matches the +next symbol in case of a mismatch (this is greedy in the sense of preferring instant gratification to delayed repletion). The second case is POSIX matching, which prefers the longest match. @@ -120,38 +124,37 @@ that can match determines the token. \end{itemize} -\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising -keywords such as @{text "if"}, @{text "then"} and so on; and @{text -"r\<^bsub>id\<^esub>"} recognising identifiers (a single character followed -by characters or numbers). Then we can form the regular expression @{text -"(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\"} and use POSIX +\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as +@{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising +identifiers (a single character followed by characters or numbers). Then we +can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\"} and use POSIX matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. In the first case we obtain by the longest match rule a single identifier token, -not a keyword followed by identifier. In the second case we obtain by rule -priority a keyword token, not an identifier token---even if @{text -"r\<^bsub>id\<^esub>"} matches also.\bigskip - -\noindent\textcolor{green}{Not Done Yet} - - -\medskip\noindent -{\bf Contributions:} +not a keyword followed by an identifier. In the second case we obtain by rule +priority a keyword token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"} +matches also.\bigskip -Derivatives as calculated by Brzozowski's method are usually more complex -regular expressions than the initial one; various optimisations are -possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r -ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the -advantages of having a simple specification and correctness proof is that -the latter can be refined to allow for such optimisations and simple -correctness proof. +\noindent {\bf Contributions:} We have implemented in Isabelle/HOL the +derivative-based regular expression matching algorithm as described by +Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this +algorithms according to our specification what a POSIX value is. The +informal correctness proof given in \cite{Sulzmann2014} is in final +form\footnote{} and to us contains unfillable gaps. Our specification of a +POSIX value consists of a simple inductive definition that given a string +and a regular expression uniquely determines this value. Derivatives as +calculated by Brzozowski's method are usually more complex regular +expressions than the initial one; various optimisations are possible, such +as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r ZERO"}, @{term +"SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the advantages of +having a simple specification and correctness proof is that the latter can +be refined to allow for such optimisations and simple correctness proof. -An extended version of \cite{Sulzmann2014} is available at the website -of its first author; this includes some ``proofs'', claimed in +An extended version of \cite{Sulzmann2014} is available at the website of +its first author; this includes some ``proofs'', claimed in \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in -final form, we make no comment thereon, preferring to give general -reasons for our belief that the approach of \cite{Sulzmann2014} is -problematic rather than to discuss details of unpublished work. - +final form, we make no comment thereon, preferring to give general reasons +for our belief that the approach of \cite{Sulzmann2014} is problematic +rather than to discuss details of unpublished work. *} @@ -160,11 +163,11 @@ text {* \noindent Strings in Isabelle/HOL are lists of characters with the empty string being represented by the empty list, written @{term "[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}. Often we use the usual -bracket notation for strings; for example a string consisting of just a -single character is written @{term "[c]"}. By using the type @{type char} -for characters we have a supply of finitely many characters roughly -corresponding to the ASCII character set. Regular expressions are defined as -usual as the following inductive datatype: +bracket notation for lists also for strings; for example a string consisting +of just a single character @{term c} is written @{term "[c]"}. By using the +type @{type char} for characters we have a supply of finitely many +characters roughly corresponding to the ASCII character set. Regular +expressions are defined as usual as the following inductive datatype: \begin{center} @{text "r :="} @@ -176,11 +179,11 @@ @{term "STAR r"} \end{center} - \noindent where @{const ZERO} stands for the regular expression that - does not match any string and @{const ONE} for the regular - expression that matches only the empty string. The language of a regular expression - is again defined routinely by the recursive function @{term L} with the - clauses: + \noindent where @{const ZERO} stands for the regular expression that does + not match any string, @{const ONE} for the regular expression that matches + only the empty string and @{term c} for matching a character literal. The + language of a regular expression is again defined as usual by the + recursive function @{term L} with the clauses: \begin{center} \begin{tabular}{rcl} @@ -195,13 +198,14 @@ \noindent In the fourth clause we use @{term "DUMMY ;; DUMMY"} for the concatenation of two languages (it is also list-append for strings). We - use the star-notation for regular expressions and also for languages (in - the last clause). The star for languages is defined inductively as usual - by two clauses for the empty string being in the star of a language and if - @{term "s\<^sub>1"} is in a language and @{term "s\<^sub>2"} in the star of this - language, then also @{term "s\<^sub>1 @ s\<^sub>2"} is in the star of this language. - - \emph{Semantic derivatives} of sets of strings are defined as + use the star-notation for regular expressions and languages (in the last + clause above). The star on languages is defined inductively by two + clauses: @{text "(i)"} for the empty string being in the star of a + language and @{text "(ii)"} if @{term "s\<^sub>1"} is in a language and @{term + "s\<^sub>2"} in the star of this language, then also @{term "s\<^sub>1 @ s\<^sub>2"} is in + the star of this language. It will also be convenient to use the following + notion of a \emph{semantic derivative} (or left quotient) of a language, + say @{text A}, defined as: \begin{center} \begin{tabular}{lcl} @@ -209,10 +213,23 @@ \end{tabular} \end{center} - + \noindent + For semantic derivatives we have the following equations (for example + \cite{Krauss2011}): - \noindent - The nullable function + \begin{equation} + \begin{array}{lcl} + @{thm (lhs) Der_null} & \dn & @{thm (rhs) Der_null}\\ + \end{array} + \end{equation} + + + \noindent \emph{\Brz's derivatives} of regular expressions + \cite{Brzozowski1964} can be easily defined by two recursive functions: + the first is from regular expressions to booleans (implementing a test + when a regular expression can match the empty string), and the second + takes a regular expression and a character to a (derivative) regular + expression: \begin{center} \begin{tabular}{lcl} @@ -221,15 +238,7 @@ @{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\ @{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\ @{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\ - @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\\ - \end{tabular} - \end{center} - - \noindent - The derivative function for characters and strings - - \begin{center} - \begin{tabular}{lcp{8cm}} + @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\ @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\ @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\ @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\ @@ -239,27 +248,78 @@ \end{tabular} \end{center} - It is a relatively easy exercise to prove that + \noindent + Given the equations in ???, + it is a relatively easy exercise in mechanical reasoning to establish that + + \begin{proposition} + \begin{tabular}{ll} + @{text "(1)"} & @{thm (lhs) nullable_correctness} if and only if + @{thm (rhs) nullable_correctness} \\ + @{text "(2)"} & @{thm[mode=IfThen] der_correctness} + \end{tabular} + \end{proposition} + + \noindent With this in place it is also very routine to prove that the + regular expression matcher defined as \begin{center} - \begin{tabular}{l} - @{thm[mode=IfThen] nullable_correctness}\\ - @{thm[mode=IfThen] der_correctness}\\ - \end{tabular} + \end{center} + + While the matcher above calculates a provably correct a YES/NO answer for + whether a regular expression matches a string, the novel idea of Sulzmann + and Lu \cite{Sulzmann2014} is to append another phase to calculate a + value. + *} section {* POSIX Regular Expression Matching *} text {* -The clever idea in \cite{Sulzmann2014} is to define a function on values that mirrors -(but inverts) the construction of the derivative on regular expressions. We -begin with the case of a nullable regular expression: from the nullability -we need to construct a value that witnesses the nullability. This is as -follows. The @{const mkeps} function (from \cite{Sulzmann2014}) is a partial (but -unambiguous) function from regular expressions to values, total on exactly -the set of nullable regular expressions. +The clever idea in \cite{Sulzmann2014} is to introduce values for encoding +\emph{how} a regular expression matches a string and then define a function +on values that mirrors (but inverts) the construction of the derivative on +regular expressions. Values are defined as the inductive datatype + + \begin{center} + @{text "v :="} + @{const "Void"} $\mid$ + @{term "val.Char c"} $\mid$ + @{term "Left v"} $\mid$ + @{term "Right v"} $\mid$ + @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ + @{term "Stars vs"} + \end{center} + +The inhabitation relation: + + \begin{center} + \begin{tabular}{c} + @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\ + @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad + @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\medskip\\ + @{thm[mode=Axiom] Prf.intros(4)} \qquad + @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\medskip\\ + @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad + @{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]}\medskip\\ + \end{tabular} + \end{center} + + +Note that no values are associated with the regular expression $Zero$, and +that the only value associated with the regular expression $One$ is $Void$, +pronounced (if one must) as {\em ``Void''}. We use $\in$ for the usual +membership relation from set theory and take $[]$ as the standard name for +the empty string (rather than, as in \cite{Sulzmann2014}, the regular +expression that we call $One$). + +We begin with the case of a nullable regular expression: from +the nullability we need to construct a value that witnesses the nullability. +The @{const mkeps} function (from \cite{Sulzmann2014}) +is a partial (but unambiguous) function from regular expressions to values, +total on exactly the set of nullable regular expressions. \begin{center} \begin{tabular}{lcl} @@ -323,15 +383,7 @@ \noindent Values - \begin{center} - @{text "v :="} - @{const "Void"} $\mid$ - @{term "val.Char c"} $\mid$ - @{term "Left v"} $\mid$ - @{term "Right v"} $\mid$ - @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ - @{term "Stars vs"} - \end{center} + diff -r 698967eceaf1 -r 90fe1a1d7d0e thys/ReStar.thy --- a/thys/ReStar.thy Thu Mar 03 08:22:39 2016 +0000 +++ b/thys/ReStar.thy Sat Mar 05 04:01:10 2016 +0000 @@ -81,8 +81,7 @@ lemma Der_star [simp]: shows "Der c (A\) = (Der c A) ;; A\" proof - - have "Der c (A\) = Der c ({[]} \ A ;; A\)" - + have "Der c (A\) = Der c ({[]} \ A ;; A\)" by (simp only: star_cases[symmetric]) also have "... = Der c (A ;; A\)" by (simp only: Der_union Der_empty) (simp) diff -r 698967eceaf1 -r 90fe1a1d7d0e thys/paper.pdf Binary file thys/paper.pdf has changed