thys/Paper/Paper.thy
changeset 108 73f7dc60c285
parent 107 6adda4a667b1
child 109 2c38f10643ae
--- a/thys/Paper/Paper.thy	Sun Feb 28 14:01:12 2016 +0000
+++ b/thys/Paper/Paper.thy	Tue Mar 01 12:10:11 2016 +0000
@@ -5,19 +5,23 @@
 
 declare [[show_question_marks = false]]
 
+abbreviation 
+ "der_syn r c \<equiv> der c r"
+
 notation (latex output)
    If  ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
   Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [78,77] 73) and
   ZERO ("\<^raw:\textrm{0}>" 78) and 
   ONE ("\<^raw:\textrm{1}>" 78) and 
   CHAR ("_" [1000] 10) and
-  ALT ("_ + _" [1000,1000] 78) and
-  SEQ ("_ \<cdot> _" [1000,1000] 78) and
+  ALT ("_ + _" [77,77] 78) and
+  SEQ ("_ \<cdot> _" [77,77] 78) and
   STAR ("_\<^sup>\<star>" [1000] 78) and
   val.Char ("Char _" [1000] 78) and
   val.Left ("Left _" [1000] 78) and
   val.Right ("Right _" [1000] 78) and
-  L ("L _" [1000] 0) and
+  L ("L'(_')" [10] 78) and
+  der_syn ("_\\_" [1000, 1000] 78) and
   flat ("|_|" [70] 73) and
   Sequ ("_ @ _" [78,77] 63) and
   injval ("inj _ _ _" [1000,77,1000] 77) and 
@@ -29,9 +33,74 @@
 section {* Introduction *}
 
 text {*
-  
+
+Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em
+derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a
+character~@{text c}, and showed that it gave a simple solution to the
+problem of matching a string @{term s} with a regular expression @{term r}:
+if the derivative of @{term r} w.r.t.\ (in succession) all the characters of
+the string matches the empty string $\mts$, then @{term r} matches @{term s}
+(and {\em vice versa}). The derivative has the property (which may be
+regarded as its specification) that, for every string @{term s} and regular
+expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if
+and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
+derivatives is that they are neatly expressible in any functional language, and
+very easy to be defined and reasoned about in a theorem prover---the
+definitions consist just of inductive datatypes and recursive functions. A
+completely formalised proof of this matcher has for example been given in
+\cite{Owens2008}.
+
+One limitation of Brzozowski's matcher is that it only generates a YES/NO
+answer for a string being matched by a regular expression. Sulzmann and Lu
+\cite{Sulzmann2014} extended this matcher to allow generation not just of a
+YES/NO answer but of an actual matching, called a [lexical] {\em value}.
+They give a still very simple algorithm to calculate a value that appears to
+be the value associated with POSIX lexing (posix) %%\cite{posix}. The
+challenge then is to specify that value, in an algorithm-independent
+fashion, and to show that Sulzamman and Lu's derivative-based algorithm does
+indeed calculate a value that is correct according to the specification.
+
+Inspired by work of Frisch and Cardelli \cite{Frisch2004} on a GREEDY
+regular expression matching algorithm, the answer given in
+\cite{Sulzmann2014} is to define a relation (called an ``Order Relation'')
+on the set of values of @{term r}, and to show that (once a string to be
+matched is chosen) there is a maximum element and that it is computed by the
+derivative-based algorithm. Beginning with our observations that, without
+evidence that it is transitive, it cannot be called an ``order relation'',
+and that the relation is called a ``total order'' despite being evidently
+not total\footnote{\textcolor{green}{Why is it not total?}}, we identify
+problems with this approach (of which some of the proofs are not published
+in \cite{Sulzmann2014}); perhaps more importantly, we give a simple
+inductive (and algorithm-independent) definition of what we call being a
+{\em POSIX value} for a regular expression @{term r} and a string @{term s};
+we show that the algorithm computes such a value and that such a value is
+unique. Proofs are both done by hand and checked in {\em Isabelle}
+%\cite{isabelle}. Our experience of doing our proofs has been that this
+mechanical checking was absolutely essential: this subject area has hidden
+snares. This was also noted by Kuklewitz \cite{Kuklewicz} who found that
+nearly all POSIX matching implementations are ``buggy'' \cite[Page
+203]{Sulzmann2014}.
+
+\textcolor{green}{Say something about POSIX lexing}
+
+An extended version of \cite{Sulzmann2014} is available at the website
+of its first author; this includes some ``proofs'', claimed in
+\cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
+final form, we make no comment thereon, preferring to give general
+reasons for our belief that the approach of \cite{Sulzmann2014} is
+problematic rather than to discuss details of unpublished work.
+
+Derivatives as calculated by Brzozowski's method are usually more
+complex regular expressions than the initial one; various optimisations
+are possible, such as the simplifications of @{term "ALT ZERO r"},
+@{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to
+@{term r}. One of the advantages of having a simple specification and
+correctness proof is that the latter can be refined to allow for such
+optimisations and simple correctness proof.
+
 
 Sulzmann and Lu \cite{Sulzmann2014}
+\cite{Brzozowski1964}
 
 there are two commonly used
 disambiguation strategies to create a unique matching tree:
@@ -96,7 +165,7 @@
 many gaps and attempting to formalise the proof in Isabelle. 
 
 
-
+\medskip\noindent
 {\bf Contributions:}
 
 *}
@@ -105,10 +174,13 @@
 
 text {* \noindent Strings in Isabelle/HOL are lists of characters with
   the empty string being represented by the empty list, written @{term
-  "[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}. By
-  using the type char we have a supply of finitely many characters
-  roughly corresponding to the ASCII character set.
-  Regular exprtessions
+  "[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}.
+  Often we use the usual bracket notation for strings; for example a
+  string consiting of a single character is written @{term "[c]"}.  By
+  using the type @{type char} for characters we have a supply of
+  finitely many characters roughly corresponding to the ASCII
+  character set.  Regular expression are as usual and defined as the
+  following inductive datatype:
 
   \begin{center}
   @{text "r :="}
@@ -120,6 +192,76 @@
   @{term "STAR r"} 
   \end{center}
 
+  \noindent where @{const ZERO} stands for the regular expression that
+  does not macth any string and @{const ONE} for the regular
+  expression that matches only the empty string. The language of a regular expression
+  is again defined as usual by the following clauses
+
+  \begin{center}
+  \begin{tabular}{rcl}
+  @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\
+  @{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\
+  @{thm (lhs) L.simps(3)} & $\dn$ & @{thm (rhs) L.simps(3)}\\
+  @{thm (lhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
+  @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
+  @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\
+  \end{tabular}
+  \end{center}
+  
+  \noindent We use the star-notation for regular expressions and sets of strings.
+  The Kleene-star on sets is defined inductively.
+  
+  \emph{Semantic derivatives} of sets of strings are defined as
+
+  \begin{center}
+  \begin{tabular}{lcl}
+  @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\
+  @{thm (lhs) Ders_def} & $\dn$ & @{thm (rhs) Ders_def}\\
+  \end{tabular}
+  \end{center}
+  
+  \noindent where the second definitions lifts the notion of semantic
+  derivatives from characters to strings.  
+
+  \noindent
+  The nullable function
+
+  \begin{center}
+  \begin{tabular}{lcl}
+  @{thm (lhs) nullable.simps(1)} & $\dn$ & @{thm (rhs) nullable.simps(1)}\\
+  @{thm (lhs) nullable.simps(2)} & $\dn$ & @{thm (rhs) nullable.simps(2)}\\
+  @{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\
+  @{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
+  @{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
+  @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\\
+  \end{tabular}
+  \end{center}
+
+  \noindent
+  The derivative function for characters and strings
+
+  \begin{center}
+  \begin{tabular}{lcp{7.5cm}}
+  @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
+  @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
+  @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
+  @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\
+  @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
+  @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}\medskip\\
+
+  @{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}\\
+  @{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}\\
+  \end{tabular}
+  \end{center}
+
+  It is a relatively easy exercise to prove that
+
+  \begin{center}
+  \begin{tabular}{l}
+  @{thm[mode=IfThen] nullable_correctness}\\
+  @{thm[mode=IfThen] der_correctness}\\
+  \end{tabular}
+  \end{center}
 *}
 
 section {* POSIX Regular Expression Matching *}
@@ -167,36 +309,7 @@
   \end{tabular}
   \end{center}
 
-  \noindent
-  The nullable function
-
-  \begin{center}
-  \begin{tabular}{lcl}
-  @{thm (lhs) nullable.simps(1)} & $\dn$ & @{thm (rhs) nullable.simps(1)}\\
-  @{thm (lhs) nullable.simps(2)} & $\dn$ & @{thm (rhs) nullable.simps(2)}\\
-  @{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\
-  @{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
-  @{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
-  @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\\
-  \end{tabular}
-  \end{center}
-
-  \noindent
-  The derivative function for characters and strings
-
-  \begin{center}
-  \begin{tabular}{lcp{7.5cm}}
-  @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
-  @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
-  @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
-  @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\
-  @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
-  @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}\medskip\\
-
-  @{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}\\
-  @{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}\\
-  \end{tabular}
-  \end{center}
+ 
 
   \noindent
   The @{const flat} function for values