diff -r 6adda4a667b1 -r 73f7dc60c285 thys/Paper/Paper.thy --- a/thys/Paper/Paper.thy Sun Feb 28 14:01:12 2016 +0000 +++ b/thys/Paper/Paper.thy Tue Mar 01 12:10:11 2016 +0000 @@ -5,19 +5,23 @@ declare [[show_question_marks = false]] +abbreviation + "der_syn r c \ der c r" + notation (latex output) If ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [78,77] 73) and ZERO ("\<^raw:\textrm{0}>" 78) and ONE ("\<^raw:\textrm{1}>" 78) and CHAR ("_" [1000] 10) and - ALT ("_ + _" [1000,1000] 78) and - SEQ ("_ \ _" [1000,1000] 78) and + ALT ("_ + _" [77,77] 78) and + SEQ ("_ \ _" [77,77] 78) and STAR ("_\<^sup>\" [1000] 78) and val.Char ("Char _" [1000] 78) and val.Left ("Left _" [1000] 78) and val.Right ("Right _" [1000] 78) and - L ("L _" [1000] 0) and + L ("L'(_')" [10] 78) and + der_syn ("_\\_" [1000, 1000] 78) and flat ("|_|" [70] 73) and Sequ ("_ @ _" [78,77] 63) and injval ("inj _ _ _" [1000,77,1000] 77) and @@ -29,9 +33,74 @@ section {* Introduction *} text {* - + +Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em +derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a +character~@{text c}, and showed that it gave a simple solution to the +problem of matching a string @{term s} with a regular expression @{term r}: +if the derivative of @{term r} w.r.t.\ (in succession) all the characters of +the string matches the empty string $\mts$, then @{term r} matches @{term s} +(and {\em vice versa}). The derivative has the property (which may be +regarded as its specification) that, for every string @{term s} and regular +expression @{term r} and character @{term c}, one has @{term "cs \ L(r)"} if +and only if \mbox{@{term "s \ L(der c r)"}}. The beauty of Brzozowski's +derivatives is that they are neatly expressible in any functional language, and +very easy to be defined and reasoned about in a theorem prover---the +definitions consist just of inductive datatypes and recursive functions. A +completely formalised proof of this matcher has for example been given in +\cite{Owens2008}. + +One limitation of Brzozowski's matcher is that it only generates a YES/NO +answer for a string being matched by a regular expression. Sulzmann and Lu +\cite{Sulzmann2014} extended this matcher to allow generation not just of a +YES/NO answer but of an actual matching, called a [lexical] {\em value}. +They give a still very simple algorithm to calculate a value that appears to +be the value associated with POSIX lexing (posix) %%\cite{posix}. The +challenge then is to specify that value, in an algorithm-independent +fashion, and to show that Sulzamman and Lu's derivative-based algorithm does +indeed calculate a value that is correct according to the specification. + +Inspired by work of Frisch and Cardelli \cite{Frisch2004} on a GREEDY +regular expression matching algorithm, the answer given in +\cite{Sulzmann2014} is to define a relation (called an ``Order Relation'') +on the set of values of @{term r}, and to show that (once a string to be +matched is chosen) there is a maximum element and that it is computed by the +derivative-based algorithm. Beginning with our observations that, without +evidence that it is transitive, it cannot be called an ``order relation'', +and that the relation is called a ``total order'' despite being evidently +not total\footnote{\textcolor{green}{Why is it not total?}}, we identify +problems with this approach (of which some of the proofs are not published +in \cite{Sulzmann2014}); perhaps more importantly, we give a simple +inductive (and algorithm-independent) definition of what we call being a +{\em POSIX value} for a regular expression @{term r} and a string @{term s}; +we show that the algorithm computes such a value and that such a value is +unique. Proofs are both done by hand and checked in {\em Isabelle} +%\cite{isabelle}. Our experience of doing our proofs has been that this +mechanical checking was absolutely essential: this subject area has hidden +snares. This was also noted by Kuklewitz \cite{Kuklewicz} who found that +nearly all POSIX matching implementations are ``buggy'' \cite[Page +203]{Sulzmann2014}. + +\textcolor{green}{Say something about POSIX lexing} + +An extended version of \cite{Sulzmann2014} is available at the website +of its first author; this includes some ``proofs'', claimed in +\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in +final form, we make no comment thereon, preferring to give general +reasons for our belief that the approach of \cite{Sulzmann2014} is +problematic rather than to discuss details of unpublished work. + +Derivatives as calculated by Brzozowski's method are usually more +complex regular expressions than the initial one; various optimisations +are possible, such as the simplifications of @{term "ALT ZERO r"}, +@{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to +@{term r}. One of the advantages of having a simple specification and +correctness proof is that the latter can be refined to allow for such +optimisations and simple correctness proof. + Sulzmann and Lu \cite{Sulzmann2014} +\cite{Brzozowski1964} there are two commonly used disambiguation strategies to create a unique matching tree: @@ -96,7 +165,7 @@ many gaps and attempting to formalise the proof in Isabelle. - +\medskip\noindent {\bf Contributions:} *} @@ -105,10 +174,13 @@ text {* \noindent Strings in Isabelle/HOL are lists of characters with the empty string being represented by the empty list, written @{term - "[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}. By - using the type char we have a supply of finitely many characters - roughly corresponding to the ASCII character set. - Regular exprtessions + "[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}. + Often we use the usual bracket notation for strings; for example a + string consiting of a single character is written @{term "[c]"}. By + using the type @{type char} for characters we have a supply of + finitely many characters roughly corresponding to the ASCII + character set. Regular expression are as usual and defined as the + following inductive datatype: \begin{center} @{text "r :="} @@ -120,6 +192,76 @@ @{term "STAR r"} \end{center} + \noindent where @{const ZERO} stands for the regular expression that + does not macth any string and @{const ONE} for the regular + expression that matches only the empty string. The language of a regular expression + is again defined as usual by the following clauses + + \begin{center} + \begin{tabular}{rcl} + @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\ + @{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\ + @{thm (lhs) L.simps(3)} & $\dn$ & @{thm (rhs) L.simps(3)}\\ + @{thm (lhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\ + @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\ + @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\ + \end{tabular} + \end{center} + + \noindent We use the star-notation for regular expressions and sets of strings. + The Kleene-star on sets is defined inductively. + + \emph{Semantic derivatives} of sets of strings are defined as + + \begin{center} + \begin{tabular}{lcl} + @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\ + @{thm (lhs) Ders_def} & $\dn$ & @{thm (rhs) Ders_def}\\ + \end{tabular} + \end{center} + + \noindent where the second definitions lifts the notion of semantic + derivatives from characters to strings. + + \noindent + The nullable function + + \begin{center} + \begin{tabular}{lcl} + @{thm (lhs) nullable.simps(1)} & $\dn$ & @{thm (rhs) nullable.simps(1)}\\ + @{thm (lhs) nullable.simps(2)} & $\dn$ & @{thm (rhs) nullable.simps(2)}\\ + @{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\ + @{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\ + @{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\ + @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\\ + \end{tabular} + \end{center} + + \noindent + The derivative function for characters and strings + + \begin{center} + \begin{tabular}{lcp{7.5cm}} + @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\ + @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\ + @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\ + @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\ + @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\ + @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}\medskip\\ + + @{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}\\ + @{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}\\ + \end{tabular} + \end{center} + + It is a relatively easy exercise to prove that + + \begin{center} + \begin{tabular}{l} + @{thm[mode=IfThen] nullable_correctness}\\ + @{thm[mode=IfThen] der_correctness}\\ + \end{tabular} + \end{center} *} section {* POSIX Regular Expression Matching *} @@ -167,36 +309,7 @@ \end{tabular} \end{center} - \noindent - The nullable function - - \begin{center} - \begin{tabular}{lcl} - @{thm (lhs) nullable.simps(1)} & $\dn$ & @{thm (rhs) nullable.simps(1)}\\ - @{thm (lhs) nullable.simps(2)} & $\dn$ & @{thm (rhs) nullable.simps(2)}\\ - @{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\ - @{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\ - @{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\ - @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\\ - \end{tabular} - \end{center} - - \noindent - The derivative function for characters and strings - - \begin{center} - \begin{tabular}{lcp{7.5cm}} - @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\ - @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\ - @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\ - @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\ - @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\ - @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}\medskip\\ - - @{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}\\ - @{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}\\ - \end{tabular} - \end{center} + \noindent The @{const flat} function for values