thys/Paper/Paper.thy
changeset 116 022503caa187
parent 115 15ef2af1a6f2
child 117 2c4ffcc95399
--- a/thys/Paper/Paper.thy	Sat Mar 05 12:10:01 2016 +0000
+++ b/thys/Paper/Paper.thy	Sun Mar 06 12:48:31 2016 +0000
@@ -13,10 +13,10 @@
 
 notation (latex output)
   If  ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
-  Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [78,77] 73) and
+  Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [75,75] 73) and
   
-  ZERO ("\<^bold>0" 80) and 
-  ONE ("\<^bold>1" 80) and 
+  ZERO ("\<^bold>0" 78) and 
+  ONE ("\<^bold>1" 78) and 
   CHAR ("_" [1000] 80) and
   ALT ("_ + _" [77,77] 78) and
   SEQ ("_ \<cdot> _" [77,77] 78) and
@@ -24,20 +24,23 @@
   
   val.Void ("'(')" 79) and
   val.Char ("Char _" [1000] 79) and
-  val.Left ("Left _" [1000] 78) and
-  val.Right ("Right _" [1000] 78) and
-  
+  val.Left ("Left _" [79] 78) and
+  val.Right ("Right _" [79] 78) and
+  val.Seq ("Seq _ _" [79,79] 78) and
+  val.Stars ("Stars _" [79] 78) and
+
   L ("L'(_')" [10] 78) and
   der_syn ("_\\_" [79, 1000] 76) and  
   ders_syn ("_\\_" [79, 1000] 76) and
   flat ("|_|" [75] 73) and
   Sequ ("_ @ _" [78,77] 63) and
-  injval ("inj _ _ _" [78,77,78] 77) and 
+  injval ("inj _ _ _" [79,77,79] 76) and 
+  mkeps ("mkeps _" [79] 76) and 
   projval ("proj _ _ _" [1000,77,1000] 77) and 
   length ("len _" [78] 73) and
 
   Prf ("\<triangleright> _ : _" [75,75] 75) and  
-  PMatch ("_ : _ \<rightarrow> _" [75,75,75] 75)
+  PMatch ("'(_, _') \<rightarrow> _" [63,75,75] 75)
   (* and ValOrd ("_ \<succeq>\<^bsub>_\<^esub> _" [78,77,77] 73) *)
 
 definition 
@@ -64,7 +67,8 @@
 and easily definable and reasoned about in theorem provers---the definitions
 just consist of inductive datatypes and simple recursive functions. A
 completely formalised correctness proof of this matcher in for example HOL4
-has been mentioned in~\cite{Owens2008}.
+has been mentioned in~\cite{Owens2008}. Another one in Isabelle/HOL is
+in \cite{Krauss2011}.
 
 One limitation of Brzozowski's matcher is that it only generates a YES/NO
 answer for whether a string is being matched by a regular expression.
@@ -86,25 +90,25 @@
 observations that, without evidence that it is transitive, it cannot be
 called an ``order relation'', and that the relation is called a ``total
 order'' despite being evidently not total\footnote{The relation @{text
-"\<ge>\<^bsub>r\<^esub>"} defined in \cite{Sulzmann2014} is a relation on the values for
-the regular expression @{term r}; but it only holds between @{term v} and
-@{term "v'"} in cases where @{term v} and @{term "v'"} have the same
-flattening (underlying string). So a counterexample to totality is given by
-taking two values @{term v} and @{term "v'"} for @{term r} that have
-different flattenings. A different relation @{text "\<ge>\<^bsub>r,s\<^esub>"}
-on the set of values for @{term r} with flattening @{term s} is definable by
-the same approach, and is indeed total; but that is not what Proposition 1
-of \cite{Sulzmann2014} does.}, we identify problems with this approach (of
-which some of the proofs are not published in \cite{Sulzmann2014}); perhaps
-more importantly, we give a simple inductive (and algorithm-independent)
-definition of what we call being a {\em POSIX value} for a regular
-expression @{term r} and a string @{term s}; we show that the algorithm
-computes such a value and that such a value is unique. Proofs are both done
-by hand and checked in Isabelle/HOL. The experience of doing our proofs has
-been that this mechanical checking was absolutely essential: this subject
-area has hidden snares. This was also noted by Kuklewitz \cite{Kuklewicz}
-who found that nearly all POSIX matching implementations are ``buggy''
-\cite[Page 203]{Sulzmann2014}.
+"\<ge>\<^bsub>r\<^esub>"} defined in \cite{Sulzmann2014} is a relation on the
+values for the regular expression @{term r}; but it only holds between
+@{term v} and @{term "v'"} in cases where @{term v} and @{term "v'"} have
+the same flattening (underlying string). So a counterexample to totality is
+given by taking two values @{term v} and @{term "v'"} for @{term r} that
+have different flattenings (see Section~\ref{posixsec}). A different
+relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r}
+with flattening @{term s} is definable by the same approach, and is indeed
+total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.}, we
+identify problems with this approach (of which some of the proofs are not
+published in \cite{Sulzmann2014}); perhaps more importantly, we give a
+simple inductive (and algorithm-independent) definition of what we call
+being a {\em POSIX value} for a regular expression @{term r} and a string
+@{term s}; we show that the algorithm computes such a value and that such a
+value is unique. Proofs are both done by hand and checked in Isabelle/HOL.
+The experience of doing our proofs has been that this mechanical checking
+was absolutely essential: this subject area has hidden snares. This was also
+noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching
+implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
 
 If a regular expression matches a string, then in general there is more than
 one way of how the string is matched. There are two commonly used
@@ -142,7 +146,7 @@
  
 \noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as
 @{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising
-identifiers (a single character followed by characters or numbers). Then we
+identifiers (say, a single character followed by characters or numbers). Then we
 can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX
 matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. In the
 first case we obtain by the longest match rule a single identifier token,
@@ -150,10 +154,10 @@
 priority a keyword token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"}
 matches also.\bigskip
 
-\noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
+\noindent {\bf Contributions:} (NOT DONE YET) We have implemented in Isabelle/HOL the
 derivative-based regular expression matching algorithm as described by
 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
-algorithms according to our specification what a POSIX value is. The
+algorithm according to our specification of what a POSIX value is. The
 informal correctness proof given in \cite{Sulzmann2014} is in final
 form\footnote{} and to us contains unfillable gaps. Our specification of a
 POSIX value consists of a simple inductive definition that given a string
@@ -183,7 +187,8 @@
 of just a single character @{term c} is written @{term "[c]"}. By using the
 type @{type char} for characters we have a supply of finitely many
 characters roughly corresponding to the ASCII character set. Regular
-expressions are defined as usual as the following inductive datatype:
+expressions are defined as usual as the elements of the following inductive
+datatype:
 
   \begin{center}
   @{text "r :="}
@@ -198,7 +203,7 @@
   \noindent where @{const ZERO} stands for the regular expression that does
   not match any string, @{const ONE} for the regular expression that matches
   only the empty string and @{term c} for matching a character literal. The
-  language of a regular expression is again defined as usual by the
+  language of a regular expression is also defined as usual by the
   recursive function @{term L} with the clauses:
 
   \begin{center}
@@ -212,16 +217,16 @@
   \end{tabular}
   \end{center}
   
-  \noindent In the fourth clause we use @{term "DUMMY ;; DUMMY"} for the
-  concatenation of two languages (it is also list-append for strings). We
-  use the star-notation for regular expressions and languages (in the last
-  clause above). The star on languages is defined inductively by two
-  clauses: @{text "(i)"} for the empty string being in the star of a
+  \noindent In the fourth clause we use the operation @{term "DUMMY ;;
+  DUMMY"} for the concatenation of two languages (it is also list-append for
+  strings). We use the star-notation for regular expressions and languages
+  (in the last clause above). The star on languages is defined inductively
+  by two clauses: @{text "(i)"} for the empty string being in the star of a
   language and @{text "(ii)"} if @{term "s\<^sub>1"} is in a language and @{term
   "s\<^sub>2"} in the star of this language, then also @{term "s\<^sub>1 @ s\<^sub>2"} is in
   the star of this language. It will also be convenient to use the following
-  notion of a \emph{semantic derivative} (or left quotient) of a language,
-  say @{text A}, defined as:
+  notion of a \emph{semantic derivative} (or \emph{left quotient}) of a
+  language, say @{text A}, defined as:
 
   \begin{center}
   \begin{tabular}{lcl}
@@ -231,7 +236,7 @@
   
   \noindent 
   For semantic derivatives we have the following equations (for example
-  \cite{Krauss2011}):
+  mechanically proved in \cite{Krauss2011}):
 
   \begin{equation}\label{SemDer}
   \begin{array}{lcl}
@@ -285,8 +290,8 @@
   \begin{proposition}\mbox{}\\ 
   \begin{tabular}{ll}
   @{text "(1)"} & @{thm (lhs) nullable_correctness} if and only if
-  @{thm (rhs) nullable_correctness} \\ 
-  @{text "(2)"} & @{thm[mode=IfThen] der_correctness}
+  @{thm (rhs) nullable_correctness}, and \\ 
+  @{text "(2)"} & @{thm[mode=IfThen] der_correctness}.
   \end{tabular}
   \end{proposition}
 
@@ -297,22 +302,24 @@
   @{thm match_def}
   \end{center}
 
-  \noindent gives a positive answer if and only if @{term "s \<in> L r"}. While
-  the matcher above calculates a provably correct a YES/NO answer for
-  whether a regular expression matches a string, the novel idea of Sulzmann
-  and Lu \cite{Sulzmann2014} is to append another phase to calculate a
-  value. We will explain the details next.
+  \noindent gives a positive answer if and only if @{term "s \<in> L r"}.
+  Consequently, this regular expression matching algorithm satisfies the
+  usual specification. While the matcher above calculates a provably correct
+  a YES/NO answer for whether a regular expression matches a string, the
+  novel idea of Sulzmann and Lu \cite{Sulzmann2014} is to append another
+  phase to this algorithm in order to calculate a [lexical] value. We will
+  explain the details next.
 
 *}
 
-section {* POSIX Regular Expression Matching *}
+section {* POSIX Regular Expression Matching\label{posixsec} *}
 
 text {* 
 
 The clever idea in \cite{Sulzmann2014} is to introduce values for encoding
 \emph{how} a regular expression matches a string and then define a function
 on values that mirrors (but inverts) the construction of the derivative on
-regular expressions. Values are defined as the inductive datatype
+regular expressions. \emph{Values} are defined as the inductive datatype
 
   \begin{center}
   @{text "v :="}
@@ -324,9 +331,11 @@
   @{term "Stars vs"} 
   \end{center}  
 
-  \noindent where we use @{term vs} standing for a list of values. The string
-  underlying a values can be calculated by the @{const flat} function, written
-  @{term "flat DUMMY"} and defined as:
+  \noindent where we use @{term vs} standing for a list of values. (This is
+  similar to the approach taken by Frisch and Cardelli for GREEDY matching
+  \cite{Frisch2014}, and Sulzmann and Lu \cite{2014} for POSIX matching).
+  The string underlying a value can be calculated by the @{const flat}
+  function, written @{term "flat DUMMY"} and defined as:
 
   \begin{center}
   \begin{tabular}{lcl}
@@ -345,12 +354,12 @@
 
   \begin{center}
   \begin{tabular}{c}
-  @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\ 
+  @{thm[mode=Axiom] Prf.intros(4)} \qquad
+  @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\medskip\\
   @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad 
   @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\medskip\\
-  @{thm[mode=Axiom] Prf.intros(4)} \qquad 
-  @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\medskip\\
-  @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad 
+  @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\ 
+  @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad  
   @{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]}\medskip\\
   \end{tabular}
   \end{center}
@@ -358,22 +367,30 @@
   \noindent Note that no values are associated with the regular expression
   @{term ZERO}, and that the only value associated with the regular
   expression @{term ONE} is @{term Void}, pronounced (if one must) as {\em
-  ``Void''}. It is routine to stablish how values `inhabitated' by a regular
+  ``Void''}. It is routine to stablish how values ``inhabiting'' a regular
   expression correspond to the language of a regular expression, namely
 
   \begin{proposition}
   @{thm L_flat_Prf}
   \end{proposition}
 
-  Graphically the algorithm by Sulzmann \& Lu can be illustrated by the
-  picture in Figure~\ref{Sulz} where the path from the left to the right
-  involving $der/nullable$ is the first phase of the algorithm and
-  $mkeps/inj$, the path from right to left, the second phase. This picture
-  shows the steps required when a regular expression, say $r_1$, matches the
-  string $abc$. We first build the three derivatives (according to $a$, $b$
-  and $c$). We then use $nullable$ to find out whether the resulting regular
-  expression can match the empty string. If yes, we call the function
-  $mkeps$.
+  In general there are more than one value associated with a regular
+  expression. In case of POSIX matching the problem is to calculate the
+  unique value that satisfies the (informal) POSIX constraints from the
+  Introduction. Graphically the regular expression matching algorithm by
+  Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz}
+  where the path from the left to the right involving @{const der}/@{const
+  nullable} is the first phase of the algorithm (calculating successive
+  \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
+  left, the second phase. This picture shows the steps required when a
+  regular expression, say @{text "r\<^sub>1"}, matches the string @{term
+  "[a,b,c]"}. We first build the three derivatives (according to @{term a},
+  @{term b} and @{term c}). We then use @{const nullable} to find out
+  whether the resulting derivative regular expression @{term "r\<^sub>4"}
+  can match the empty string. If yes, we call the function @{const mkeps}
+  that produces a value @{term "v\<^sub>4"} for how @{term "r\<^sub>4"} can
+  match the empty string (taking into account the POSIX constraints in case
+  there are several ways). This functions is defined by the clauses:
 
 \begin{figure}[t]
 \begin{center}
@@ -409,15 +426,7 @@
 \label{Sulz}}
 \end{figure} 
 
- NOT DONE YET 
-
-  We begin with the case of a nullable regular expression: from the
-  nullability we need to construct a value that witnesses the nullability.
-  The @{const mkeps} function (from \cite{Sulzmann2014}) is a partial (but
-  unambiguous) function from regular expressions to values, total on exactly
-  the set of nullable regular expressions.
-
- \begin{center}
+  \begin{center}
   \begin{tabular}{lcl}
   @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\
   @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
@@ -426,6 +435,70 @@
   \end{tabular}
   \end{center}
 
+  \noindent Note that this function needs only to be partially defined,
+  namely only for regular expressions that are nullable. In case @{const
+  nullable} fails, the string @{term "[a,b,c]"} cannot be matched by @{term
+  "r\<^sub>1"} and an error is raised. Note also how this function makes
+  some subtle choices leading to a POSIX value: for example if the
+  alternative, say @{term "ALT r\<^sub>1 r\<^sub>2"}, can match the empty
+  string and furthermore @{term "r\<^sub>1"} can match the empty string,
+  then we return a @{const Left}-value. The @{const Right}-value will only
+  be returned if @{term "r\<^sub>1"} is not nullable.
+
+  The most interesting novelty from Sulzmann and Lu \cite{Sulzmann2014} is
+  the construction value for how @{term "r\<^sub>1"} can match the string
+  @{term "[a,b,c]"} from the value how the last derivative, @{term
+  "r\<^sub>4"} in Fig~\ref{Sulz}, can match the empty string. Sulzmann and
+  Lu acchieve this by stepwise ``injecting back'' the characters into the
+  values thus inverting the operation of building derivatives on the level
+  of values. The corresponding function, called @{term inj}, takes three
+  arguments, a regular expression, a character and a value. For example in
+  the first @{term inj}-step in Fig~\ref{Sulz} the regular expression @{term
+  "r\<^sub>3"}, the character @{term c} from the last derivative step and
+  @{term "v\<^sub>4"}, which is the value corresponding to the derivative
+  regular expression @{term "r\<^sub>4"}. The result is the new value @{term
+  "v\<^sub>3"}. The final result of the algorithm is the value @{term
+  "v\<^sub>1"} corresponding to the input regular expression. The @{term
+  inj} function is by recursion on the regular expression and by analysing
+  the shape of values.
+
+  \begin{center}
+  \begin{tabular}{lcl}
+  @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
+  @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & 
+      @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
+  @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & 
+      @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
+  @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ 
+      & @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
+  @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ 
+      & @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
+  @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ 
+      & @{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
+  @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ 
+      & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\
+  \end{tabular}
+  \end{center}
+
+  \noindent To better understand what is going on in this definition it
+  might be instructive to look first at the three sequence cases (clauses
+  4--6). Recall the corresponding clause of the derivative
+
+  \begin{center}
+  @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} $\dn$ @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}
+  \end{center}
+
+  \noindent
+  NOT DONE YET 
+
+  Therefore there are, for example, three
+  cases for sequence regular expressions (for all possible shapes of the
+  value).
+ 
+  Again the virtues of this algorithm is that it can be
+  implemented with ease in a functional programming language and
+  also in Isabelle/HOL.
+
 The well-known idea of POSIX lexing is informally defined in (for example)
 \cite{posix}; as correctly argued in \cite{Sulzmann2014}, this needs formal
 specification. The rough idea is that, in contrast to the so-called GREEDY
@@ -493,23 +566,7 @@
   \noindent
   The @{text inj} function
 
-  \begin{center}
-  \begin{tabular}{lcl}
-  @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
-  @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & 
-      @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
-  @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & 
-      @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
-  @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ 
-      & @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
-  @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ 
-      & @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
-  @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ 
-      & @{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
-  @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ 
-      & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\
-  \end{tabular}
-  \end{center}
+  
 
   \noindent
   The inhabitation relation: