thys/Paper/Paper.thy
changeset 113 90fe1a1d7d0e
parent 112 698967eceaf1
child 114 8b41d01b5e5d
--- a/thys/Paper/Paper.thy	Thu Mar 03 08:22:39 2016 +0000
+++ b/thys/Paper/Paper.thy	Sat Mar 05 04:01:10 2016 +0000
@@ -19,23 +19,27 @@
   SEQ ("_ \<cdot> _" [77,77] 78) and
   STAR ("_\<^sup>\<star>" [1000] 78) and
   
-  val.Void ("'(')" 78) and
-  val.Char ("Char _" [1000] 78) and
+  val.Void ("'(')" 79) and
+  val.Char ("Char _" [1000] 79) and
   val.Left ("Left _" [1000] 78) and
   val.Right ("Right _" [1000] 78) and
   
   L ("L'(_')" [10] 78) and
   der_syn ("_\\_" [79, 1000] 76) and
-  flat ("|_|" [70] 73) and
+  flat ("|_|" [75] 73) and
   Sequ ("_ @ _" [78,77] 63) and
-  injval ("inj _ _ _" [1000,77,1000] 77) and 
+  injval ("inj _ _ _" [78,77,78] 77) and 
   projval ("proj _ _ _" [1000,77,1000] 77) and 
-  length ("len _" [78] 73) 
+  length ("len _" [78] 73) and
+
+  Prf ("\<triangleright> _ : _" [75,75] 75) and  
+  PMatch ("_ : _ \<rightarrow> _" [75,75,75] 75)
   (* and ValOrd ("_ \<succeq>\<^bsub>_\<^esub> _" [78,77,77] 73) *)
 (*>*)
 
 section {* Introduction *}
 
+
 text {*
 
 Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em
@@ -59,7 +63,7 @@
 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow
 generation not just of a YES/NO answer but of an actual matching, called a
 [lexical] {\em value}. They give a simple algorithm to calculate a value
-that appears to be the value associated with POSIX lexing
+that appears to be the value associated with POSIX matching
 \cite{Kuklewicz,Vansummeren2006}. The challenge then is to specify that
 value, in an algorithm-independent fashion, and to show that Sulzamann and
 Lu's derivative-based algorithm does indeed calculate a value that is
@@ -86,17 +90,17 @@
 noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching
 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
 
-If a regular expression matches a string, then in general there is more
-than one way of how the string is matched. There are two commonly used
+If a regular expression matches a string, then in general there is more than
+one way of how the string is matched. There are two commonly used
 disambiguation strategies to generate a unique answer: one is called GREEDY
 matching \cite{Frisch2004} and the other is POSIX
-matching~\cite{Kuklewicz,Sulzmann2014}. For example consider the string
-@{term xy} and the regular expression \mbox{@{term "STAR (ALT (ALT x y)
-xy)"}}. Either the string can be matched in two `iterations' by the single
-letter-regular expressions @{term x} and @{term y}, or directly in one
-iteration by @{term xy}. The first case corresponds to GREEDY matching,
-which first matches with the left-most symbol and only matches the next
-symbol in case of a mismatch (this is greedy in the sense of preferring
+matching~\cite{Kuklewicz,Sulzmann2014,Vansummeren2006}. For example consider
+the string @{term xy} and the regular expression \mbox{@{term "STAR (ALT
+(ALT x y) xy)"}}. Either the string can be matched in two `iterations' by
+the single letter-regular expressions @{term x} and @{term y}, or directly
+in one iteration by @{term xy}. The first case corresponds to GREEDY
+matching, which first matches with the left-most symbol and only matches the
+next symbol in case of a mismatch (this is greedy in the sense of preferring
 instant gratification to delayed repletion). The second case is POSIX
 matching, which prefers the longest match.
 
@@ -120,38 +124,37 @@
 that can match determines the token.
 \end{itemize}
  
-\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising
-keywords such as @{text "if"}, @{text "then"} and so on; and @{text
-"r\<^bsub>id\<^esub>"} recognising identifiers (a single character followed
-by characters or numbers). Then we can form the regular expression @{text
-"(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX
+\noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as
+@{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising
+identifiers (a single character followed by characters or numbers). Then we
+can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX
 matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. In the
 first case we obtain by the longest match rule a single identifier token,
-not a keyword followed by identifier. In the second case we obtain by rule
-priority a keyword token, not an identifier token---even if @{text
-"r\<^bsub>id\<^esub>"} matches also.\bigskip
-
-\noindent\textcolor{green}{Not Done Yet}
-
-
-\medskip\noindent
-{\bf Contributions:}
+not a keyword followed by an identifier. In the second case we obtain by rule
+priority a keyword token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"}
+matches also.\bigskip
 
-Derivatives as calculated by Brzozowski's method are usually more complex
-regular expressions than the initial one; various optimisations are
-possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r
-ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the
-advantages of having a simple specification and correctness proof is that
-the latter can be refined to allow for such optimisations and simple
-correctness proof.
+\noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
+derivative-based regular expression matching algorithm as described by
+Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
+algorithms according to our specification what a POSIX value is. The
+informal correctness proof given in \cite{Sulzmann2014} is in final
+form\footnote{} and to us contains unfillable gaps. Our specification of a
+POSIX value consists of a simple inductive definition that given a string
+and a regular expression uniquely determines this value. Derivatives as
+calculated by Brzozowski's method are usually more complex regular
+expressions than the initial one; various optimisations are possible, such
+as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r ZERO"}, @{term
+"SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the advantages of
+having a simple specification and correctness proof is that the latter can
+be refined to allow for such optimisations and simple correctness proof.
 
-An extended version of \cite{Sulzmann2014} is available at the website
-of its first author; this includes some ``proofs'', claimed in
+An extended version of \cite{Sulzmann2014} is available at the website of
+its first author; this includes some ``proofs'', claimed in
 \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
-final form, we make no comment thereon, preferring to give general
-reasons for our belief that the approach of \cite{Sulzmann2014} is
-problematic rather than to discuss details of unpublished work.
-
+final form, we make no comment thereon, preferring to give general reasons
+for our belief that the approach of \cite{Sulzmann2014} is problematic
+rather than to discuss details of unpublished work.
 
 *}
 
@@ -160,11 +163,11 @@
 text {* \noindent Strings in Isabelle/HOL are lists of characters with the
 empty string being represented by the empty list, written @{term "[]"}, and
 list-cons being written as @{term "DUMMY # DUMMY"}. Often we use the usual
-bracket notation for strings; for example a string consisting of just a
-single character is written @{term "[c]"}. By using the type @{type char}
-for characters we have a supply of finitely many characters roughly
-corresponding to the ASCII character set. Regular expressions are defined as
-usual as the following inductive datatype:
+bracket notation for lists also for strings; for example a string consisting
+of just a single character @{term c} is written @{term "[c]"}. By using the
+type @{type char} for characters we have a supply of finitely many
+characters roughly corresponding to the ASCII character set. Regular
+expressions are defined as usual as the following inductive datatype:
 
   \begin{center}
   @{text "r :="}
@@ -176,11 +179,11 @@
   @{term "STAR r"} 
   \end{center}
 
-  \noindent where @{const ZERO} stands for the regular expression that
-  does not match any string and @{const ONE} for the regular
-  expression that matches only the empty string. The language of a regular expression
-  is again defined routinely by the recursive function @{term L} with the
-  clauses:
+  \noindent where @{const ZERO} stands for the regular expression that does
+  not match any string, @{const ONE} for the regular expression that matches
+  only the empty string and @{term c} for matching a character literal. The
+  language of a regular expression is again defined as usual by the
+  recursive function @{term L} with the clauses:
 
   \begin{center}
   \begin{tabular}{rcl}
@@ -195,13 +198,14 @@
   
   \noindent In the fourth clause we use @{term "DUMMY ;; DUMMY"} for the
   concatenation of two languages (it is also list-append for strings). We
-  use the star-notation for regular expressions and also for languages (in
-  the last clause). The star for languages is defined inductively as usual
-  by two clauses for the empty string being in the star of a language and if
-  @{term "s\<^sub>1"} is in a language and @{term "s\<^sub>2"} in the star of this
-  language, then also @{term "s\<^sub>1 @ s\<^sub>2"} is in the star of this language.
-  
-  \emph{Semantic derivatives} of sets of strings are defined as
+  use the star-notation for regular expressions and languages (in the last
+  clause above). The star on languages is defined inductively by two
+  clauses: @{text "(i)"} for the empty string being in the star of a
+  language and @{text "(ii)"} if @{term "s\<^sub>1"} is in a language and @{term
+  "s\<^sub>2"} in the star of this language, then also @{term "s\<^sub>1 @ s\<^sub>2"} is in
+  the star of this language. It will also be convenient to use the following
+  notion of a \emph{semantic derivative} (or left quotient) of a language,
+  say @{text A}, defined as:
 
   \begin{center}
   \begin{tabular}{lcl}
@@ -209,10 +213,23 @@
   \end{tabular}
   \end{center}
   
-  
+  \noindent 
+  For semantic derivatives we have the following equations (for example
+  \cite{Krauss2011}):
 
-  \noindent
-  The nullable function
+  \begin{equation}
+  \begin{array}{lcl}
+  @{thm (lhs) Der_null}  & \dn & @{thm (rhs) Der_null}\\
+  \end{array}
+  \end{equation}
+
+
+  \noindent \emph{\Brz's derivatives} of regular expressions
+  \cite{Brzozowski1964} can be easily defined by two recursive functions:
+  the first is from regular expressions to booleans (implementing a test
+  when a regular expression can match the empty string), and the second
+  takes a regular expression and a character to a (derivative) regular
+  expression:
 
   \begin{center}
   \begin{tabular}{lcl}
@@ -221,15 +238,7 @@
   @{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\
   @{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
   @{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
-  @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\\
-  \end{tabular}
-  \end{center}
-
-  \noindent
-  The derivative function for characters and strings
-
-  \begin{center}
-  \begin{tabular}{lcp{8cm}}
+  @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\
   @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
   @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
   @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
@@ -239,27 +248,78 @@
   \end{tabular}
   \end{center}
 
-  It is a relatively easy exercise to prove that
+  \noindent
+  Given the equations in ???, 
+  it is a relatively easy exercise in mechanical reasoning to establish that
+
+  \begin{proposition} 
+  \begin{tabular}{ll}
+  @{text "(1)"} & @{thm (lhs) nullable_correctness} if and only if
+  @{thm (rhs) nullable_correctness} \\ 
+  @{text "(2)"} & @{thm[mode=IfThen] der_correctness}
+  \end{tabular}
+  \end{proposition}
+
+  \noindent With this in place it is also very routine to prove that the
+  regular expression matcher defined as
 
   \begin{center}
-  \begin{tabular}{l}
-  @{thm[mode=IfThen] nullable_correctness}\\
-  @{thm[mode=IfThen] der_correctness}\\
-  \end{tabular}
+
   \end{center}
+
+  While the matcher above calculates a provably correct a YES/NO answer for
+  whether a regular expression matches a string, the novel idea of Sulzmann
+  and Lu \cite{Sulzmann2014} is to append another phase to calculate a
+  value.
+
 *}
 
 section {* POSIX Regular Expression Matching *}
 
 text {* 
 
-The clever idea in \cite{Sulzmann2014} is to define a function on values that mirrors
-(but inverts) the construction of the derivative on regular expressions. We
-begin with the case of a nullable regular expression: from the nullability
-we need to construct a value that witnesses the nullability. This is as
-follows. The @{const mkeps} function (from \cite{Sulzmann2014}) is a partial (but
-unambiguous) function from regular expressions to values, total on exactly
-the set of nullable regular expressions. 
+The clever idea in \cite{Sulzmann2014} is to introduce values for encoding
+\emph{how} a regular expression matches a string and then define a function
+on values that mirrors (but inverts) the construction of the derivative on
+regular expressions. Values are defined as the inductive datatype
+
+  \begin{center}
+  @{text "v :="}
+  @{const "Void"} $\mid$
+  @{term "val.Char c"} $\mid$
+  @{term "Left v"} $\mid$
+  @{term "Right v"} $\mid$
+  @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ 
+  @{term "Stars vs"} 
+  \end{center}  
+
+The inhabitation relation:
+
+  \begin{center}
+  \begin{tabular}{c}
+  @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\ 
+  @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad 
+  @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\medskip\\
+  @{thm[mode=Axiom] Prf.intros(4)} \qquad 
+  @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\medskip\\
+  @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad 
+  @{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]}\medskip\\
+  \end{tabular}
+  \end{center}
+
+
+Note that no values are associated with the regular expression $Zero$, and
+that the only value associated with the regular expression $One$ is $Void$,
+pronounced (if one must) as {\em ``Void''}. We use $\in$ for the usual
+membership relation from set theory and take $[]$ as the standard name for
+the empty string (rather than, as in \cite{Sulzmann2014}, the regular
+expression that we call $One$).
+
+We begin with the case of a nullable regular expression: from
+the nullability we need to construct a value that witnesses the nullability.
+The @{const mkeps} function (from \cite{Sulzmann2014})
+is a partial (but unambiguous) function from regular expressions to values,
+total on exactly the set of nullable regular expressions.
 
  \begin{center}
   \begin{tabular}{lcl}
@@ -323,15 +383,7 @@
   \noindent
   Values
 
-  \begin{center}
-  @{text "v :="}
-  @{const "Void"} $\mid$
-  @{term "val.Char c"} $\mid$
-  @{term "Left v"} $\mid$
-  @{term "Right v"} $\mid$
-  @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ 
-  @{term "Stars vs"} 
-  \end{center}  
+