thys/Paper/Paper.thy
changeset 116 022503caa187
parent 115 15ef2af1a6f2
child 117 2c4ffcc95399
equal deleted inserted replaced
115:15ef2af1a6f2 116:022503caa187
    11 abbreviation 
    11 abbreviation 
    12  "ders_syn r s \<equiv> ders s r"
    12  "ders_syn r s \<equiv> ders s r"
    13 
    13 
    14 notation (latex output)
    14 notation (latex output)
    15   If  ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
    15   If  ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
    16   Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [78,77] 73) and
    16   Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [75,75] 73) and
    17   
    17   
    18   ZERO ("\<^bold>0" 80) and 
    18   ZERO ("\<^bold>0" 78) and 
    19   ONE ("\<^bold>1" 80) and 
    19   ONE ("\<^bold>1" 78) and 
    20   CHAR ("_" [1000] 80) and
    20   CHAR ("_" [1000] 80) and
    21   ALT ("_ + _" [77,77] 78) and
    21   ALT ("_ + _" [77,77] 78) and
    22   SEQ ("_ \<cdot> _" [77,77] 78) and
    22   SEQ ("_ \<cdot> _" [77,77] 78) and
    23   STAR ("_\<^sup>\<star>" [1000] 78) and
    23   STAR ("_\<^sup>\<star>" [1000] 78) and
    24   
    24   
    25   val.Void ("'(')" 79) and
    25   val.Void ("'(')" 79) and
    26   val.Char ("Char _" [1000] 79) and
    26   val.Char ("Char _" [1000] 79) and
    27   val.Left ("Left _" [1000] 78) and
    27   val.Left ("Left _" [79] 78) and
    28   val.Right ("Right _" [1000] 78) and
    28   val.Right ("Right _" [79] 78) and
    29   
    29   val.Seq ("Seq _ _" [79,79] 78) and
       
    30   val.Stars ("Stars _" [79] 78) and
       
    31 
    30   L ("L'(_')" [10] 78) and
    32   L ("L'(_')" [10] 78) and
    31   der_syn ("_\\_" [79, 1000] 76) and  
    33   der_syn ("_\\_" [79, 1000] 76) and  
    32   ders_syn ("_\\_" [79, 1000] 76) and
    34   ders_syn ("_\\_" [79, 1000] 76) and
    33   flat ("|_|" [75] 73) and
    35   flat ("|_|" [75] 73) and
    34   Sequ ("_ @ _" [78,77] 63) and
    36   Sequ ("_ @ _" [78,77] 63) and
    35   injval ("inj _ _ _" [78,77,78] 77) and 
    37   injval ("inj _ _ _" [79,77,79] 76) and 
       
    38   mkeps ("mkeps _" [79] 76) and 
    36   projval ("proj _ _ _" [1000,77,1000] 77) and 
    39   projval ("proj _ _ _" [1000,77,1000] 77) and 
    37   length ("len _" [78] 73) and
    40   length ("len _" [78] 73) and
    38 
    41 
    39   Prf ("\<triangleright> _ : _" [75,75] 75) and  
    42   Prf ("\<triangleright> _ : _" [75,75] 75) and  
    40   PMatch ("_ : _ \<rightarrow> _" [75,75,75] 75)
    43   PMatch ("'(_, _') \<rightarrow> _" [63,75,75] 75)
    41   (* and ValOrd ("_ \<succeq>\<^bsub>_\<^esub> _" [78,77,77] 73) *)
    44   (* and ValOrd ("_ \<succeq>\<^bsub>_\<^esub> _" [78,77,77] 73) *)
    42 
    45 
    43 definition 
    46 definition 
    44   "match r s \<equiv> nullable (ders s r)"
    47   "match r s \<equiv> nullable (ders s r)"
    45 
    48 
    62 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
    65 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
    63 derivatives is that they are neatly expressible in any functional language,
    66 derivatives is that they are neatly expressible in any functional language,
    64 and easily definable and reasoned about in theorem provers---the definitions
    67 and easily definable and reasoned about in theorem provers---the definitions
    65 just consist of inductive datatypes and simple recursive functions. A
    68 just consist of inductive datatypes and simple recursive functions. A
    66 completely formalised correctness proof of this matcher in for example HOL4
    69 completely formalised correctness proof of this matcher in for example HOL4
    67 has been mentioned in~\cite{Owens2008}.
    70 has been mentioned in~\cite{Owens2008}. Another one in Isabelle/HOL is
       
    71 in \cite{Krauss2011}.
    68 
    72 
    69 One limitation of Brzozowski's matcher is that it only generates a YES/NO
    73 One limitation of Brzozowski's matcher is that it only generates a YES/NO
    70 answer for whether a string is being matched by a regular expression.
    74 answer for whether a string is being matched by a regular expression.
    71 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow
    75 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow
    72 generation not just of a YES/NO answer but of an actual matching, called a
    76 generation not just of a YES/NO answer but of an actual matching, called a
    84 proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a
    88 proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a
    85 GREEDY regular expression matching algorithm. Beginning with our
    89 GREEDY regular expression matching algorithm. Beginning with our
    86 observations that, without evidence that it is transitive, it cannot be
    90 observations that, without evidence that it is transitive, it cannot be
    87 called an ``order relation'', and that the relation is called a ``total
    91 called an ``order relation'', and that the relation is called a ``total
    88 order'' despite being evidently not total\footnote{The relation @{text
    92 order'' despite being evidently not total\footnote{The relation @{text
    89 "\<ge>\<^bsub>r\<^esub>"} defined in \cite{Sulzmann2014} is a relation on the values for
    93 "\<ge>\<^bsub>r\<^esub>"} defined in \cite{Sulzmann2014} is a relation on the
    90 the regular expression @{term r}; but it only holds between @{term v} and
    94 values for the regular expression @{term r}; but it only holds between
    91 @{term "v'"} in cases where @{term v} and @{term "v'"} have the same
    95 @{term v} and @{term "v'"} in cases where @{term v} and @{term "v'"} have
    92 flattening (underlying string). So a counterexample to totality is given by
    96 the same flattening (underlying string). So a counterexample to totality is
    93 taking two values @{term v} and @{term "v'"} for @{term r} that have
    97 given by taking two values @{term v} and @{term "v'"} for @{term r} that
    94 different flattenings. A different relation @{text "\<ge>\<^bsub>r,s\<^esub>"}
    98 have different flattenings (see Section~\ref{posixsec}). A different
    95 on the set of values for @{term r} with flattening @{term s} is definable by
    99 relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r}
    96 the same approach, and is indeed total; but that is not what Proposition 1
   100 with flattening @{term s} is definable by the same approach, and is indeed
    97 of \cite{Sulzmann2014} does.}, we identify problems with this approach (of
   101 total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.}, we
    98 which some of the proofs are not published in \cite{Sulzmann2014}); perhaps
   102 identify problems with this approach (of which some of the proofs are not
    99 more importantly, we give a simple inductive (and algorithm-independent)
   103 published in \cite{Sulzmann2014}); perhaps more importantly, we give a
   100 definition of what we call being a {\em POSIX value} for a regular
   104 simple inductive (and algorithm-independent) definition of what we call
   101 expression @{term r} and a string @{term s}; we show that the algorithm
   105 being a {\em POSIX value} for a regular expression @{term r} and a string
   102 computes such a value and that such a value is unique. Proofs are both done
   106 @{term s}; we show that the algorithm computes such a value and that such a
   103 by hand and checked in Isabelle/HOL. The experience of doing our proofs has
   107 value is unique. Proofs are both done by hand and checked in Isabelle/HOL.
   104 been that this mechanical checking was absolutely essential: this subject
   108 The experience of doing our proofs has been that this mechanical checking
   105 area has hidden snares. This was also noted by Kuklewitz \cite{Kuklewicz}
   109 was absolutely essential: this subject area has hidden snares. This was also
   106 who found that nearly all POSIX matching implementations are ``buggy''
   110 noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching
   107 \cite[Page 203]{Sulzmann2014}.
   111 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
   108 
   112 
   109 If a regular expression matches a string, then in general there is more than
   113 If a regular expression matches a string, then in general there is more than
   110 one way of how the string is matched. There are two commonly used
   114 one way of how the string is matched. There are two commonly used
   111 disambiguation strategies to generate a unique answer: one is called GREEDY
   115 disambiguation strategies to generate a unique answer: one is called GREEDY
   112 matching \cite{Frisch2004} and the other is POSIX
   116 matching \cite{Frisch2004} and the other is POSIX
   140 that can match determines the token.
   144 that can match determines the token.
   141 \end{itemize}
   145 \end{itemize}
   142  
   146  
   143 \noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as
   147 \noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as
   144 @{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising
   148 @{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising
   145 identifiers (a single character followed by characters or numbers). Then we
   149 identifiers (say, a single character followed by characters or numbers). Then we
   146 can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX
   150 can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX
   147 matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. In the
   151 matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. In the
   148 first case we obtain by the longest match rule a single identifier token,
   152 first case we obtain by the longest match rule a single identifier token,
   149 not a keyword followed by an identifier. In the second case we obtain by rule
   153 not a keyword followed by an identifier. In the second case we obtain by rule
   150 priority a keyword token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"}
   154 priority a keyword token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"}
   151 matches also.\bigskip
   155 matches also.\bigskip
   152 
   156 
   153 \noindent {\bf Contributions:} We have implemented in Isabelle/HOL the
   157 \noindent {\bf Contributions:} (NOT DONE YET) We have implemented in Isabelle/HOL the
   154 derivative-based regular expression matching algorithm as described by
   158 derivative-based regular expression matching algorithm as described by
   155 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
   159 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
   156 algorithms according to our specification what a POSIX value is. The
   160 algorithm according to our specification of what a POSIX value is. The
   157 informal correctness proof given in \cite{Sulzmann2014} is in final
   161 informal correctness proof given in \cite{Sulzmann2014} is in final
   158 form\footnote{} and to us contains unfillable gaps. Our specification of a
   162 form\footnote{} and to us contains unfillable gaps. Our specification of a
   159 POSIX value consists of a simple inductive definition that given a string
   163 POSIX value consists of a simple inductive definition that given a string
   160 and a regular expression uniquely determines this value. Derivatives as
   164 and a regular expression uniquely determines this value. Derivatives as
   161 calculated by Brzozowski's method are usually more complex regular
   165 calculated by Brzozowski's method are usually more complex regular
   181 list-cons being written as @{term "DUMMY # DUMMY"}. Often we use the usual
   185 list-cons being written as @{term "DUMMY # DUMMY"}. Often we use the usual
   182 bracket notation for lists also for strings; for example a string consisting
   186 bracket notation for lists also for strings; for example a string consisting
   183 of just a single character @{term c} is written @{term "[c]"}. By using the
   187 of just a single character @{term c} is written @{term "[c]"}. By using the
   184 type @{type char} for characters we have a supply of finitely many
   188 type @{type char} for characters we have a supply of finitely many
   185 characters roughly corresponding to the ASCII character set. Regular
   189 characters roughly corresponding to the ASCII character set. Regular
   186 expressions are defined as usual as the following inductive datatype:
   190 expressions are defined as usual as the elements of the following inductive
       
   191 datatype:
   187 
   192 
   188   \begin{center}
   193   \begin{center}
   189   @{text "r :="}
   194   @{text "r :="}
   190   @{const "ZERO"} $\mid$
   195   @{const "ZERO"} $\mid$
   191   @{const "ONE"} $\mid$
   196   @{const "ONE"} $\mid$
   196   \end{center}
   201   \end{center}
   197 
   202 
   198   \noindent where @{const ZERO} stands for the regular expression that does
   203   \noindent where @{const ZERO} stands for the regular expression that does
   199   not match any string, @{const ONE} for the regular expression that matches
   204   not match any string, @{const ONE} for the regular expression that matches
   200   only the empty string and @{term c} for matching a character literal. The
   205   only the empty string and @{term c} for matching a character literal. The
   201   language of a regular expression is again defined as usual by the
   206   language of a regular expression is also defined as usual by the
   202   recursive function @{term L} with the clauses:
   207   recursive function @{term L} with the clauses:
   203 
   208 
   204   \begin{center}
   209   \begin{center}
   205   \begin{tabular}{rcl}
   210   \begin{tabular}{rcl}
   206   @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\
   211   @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\
   210   @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
   215   @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
   211   @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\
   216   @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\
   212   \end{tabular}
   217   \end{tabular}
   213   \end{center}
   218   \end{center}
   214   
   219   
   215   \noindent In the fourth clause we use @{term "DUMMY ;; DUMMY"} for the
   220   \noindent In the fourth clause we use the operation @{term "DUMMY ;;
   216   concatenation of two languages (it is also list-append for strings). We
   221   DUMMY"} for the concatenation of two languages (it is also list-append for
   217   use the star-notation for regular expressions and languages (in the last
   222   strings). We use the star-notation for regular expressions and languages
   218   clause above). The star on languages is defined inductively by two
   223   (in the last clause above). The star on languages is defined inductively
   219   clauses: @{text "(i)"} for the empty string being in the star of a
   224   by two clauses: @{text "(i)"} for the empty string being in the star of a
   220   language and @{text "(ii)"} if @{term "s\<^sub>1"} is in a language and @{term
   225   language and @{text "(ii)"} if @{term "s\<^sub>1"} is in a language and @{term
   221   "s\<^sub>2"} in the star of this language, then also @{term "s\<^sub>1 @ s\<^sub>2"} is in
   226   "s\<^sub>2"} in the star of this language, then also @{term "s\<^sub>1 @ s\<^sub>2"} is in
   222   the star of this language. It will also be convenient to use the following
   227   the star of this language. It will also be convenient to use the following
   223   notion of a \emph{semantic derivative} (or left quotient) of a language,
   228   notion of a \emph{semantic derivative} (or \emph{left quotient}) of a
   224   say @{text A}, defined as:
   229   language, say @{text A}, defined as:
   225 
   230 
   226   \begin{center}
   231   \begin{center}
   227   \begin{tabular}{lcl}
   232   \begin{tabular}{lcl}
   228   @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\
   233   @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\
   229   \end{tabular}
   234   \end{tabular}
   230   \end{center}
   235   \end{center}
   231   
   236   
   232   \noindent 
   237   \noindent 
   233   For semantic derivatives we have the following equations (for example
   238   For semantic derivatives we have the following equations (for example
   234   \cite{Krauss2011}):
   239   mechanically proved in \cite{Krauss2011}):
   235 
   240 
   236   \begin{equation}\label{SemDer}
   241   \begin{equation}\label{SemDer}
   237   \begin{array}{lcl}
   242   \begin{array}{lcl}
   238   @{thm (lhs) Der_null}  & \dn & @{thm (rhs) Der_null}\\
   243   @{thm (lhs) Der_null}  & \dn & @{thm (rhs) Der_null}\\
   239   @{thm (lhs) Der_empty}  & \dn & @{thm (rhs) Der_empty}\\
   244   @{thm (lhs) Der_empty}  & \dn & @{thm (rhs) Der_empty}\\
   283   exercise in mechanical reasoning to establish that
   288   exercise in mechanical reasoning to establish that
   284 
   289 
   285   \begin{proposition}\mbox{}\\ 
   290   \begin{proposition}\mbox{}\\ 
   286   \begin{tabular}{ll}
   291   \begin{tabular}{ll}
   287   @{text "(1)"} & @{thm (lhs) nullable_correctness} if and only if
   292   @{text "(1)"} & @{thm (lhs) nullable_correctness} if and only if
   288   @{thm (rhs) nullable_correctness} \\ 
   293   @{thm (rhs) nullable_correctness}, and \\ 
   289   @{text "(2)"} & @{thm[mode=IfThen] der_correctness}
   294   @{text "(2)"} & @{thm[mode=IfThen] der_correctness}.
   290   \end{tabular}
   295   \end{tabular}
   291   \end{proposition}
   296   \end{proposition}
   292 
   297 
   293   \noindent With this in place it is also very routine to prove that the
   298   \noindent With this in place it is also very routine to prove that the
   294   regular expression matcher defined as
   299   regular expression matcher defined as
   295 
   300 
   296   \begin{center}
   301   \begin{center}
   297   @{thm match_def}
   302   @{thm match_def}
   298   \end{center}
   303   \end{center}
   299 
   304 
   300   \noindent gives a positive answer if and only if @{term "s \<in> L r"}. While
   305   \noindent gives a positive answer if and only if @{term "s \<in> L r"}.
   301   the matcher above calculates a provably correct a YES/NO answer for
   306   Consequently, this regular expression matching algorithm satisfies the
   302   whether a regular expression matches a string, the novel idea of Sulzmann
   307   usual specification. While the matcher above calculates a provably correct
   303   and Lu \cite{Sulzmann2014} is to append another phase to calculate a
   308   a YES/NO answer for whether a regular expression matches a string, the
   304   value. We will explain the details next.
   309   novel idea of Sulzmann and Lu \cite{Sulzmann2014} is to append another
       
   310   phase to this algorithm in order to calculate a [lexical] value. We will
       
   311   explain the details next.
   305 
   312 
   306 *}
   313 *}
   307 
   314 
   308 section {* POSIX Regular Expression Matching *}
   315 section {* POSIX Regular Expression Matching\label{posixsec} *}
   309 
   316 
   310 text {* 
   317 text {* 
   311 
   318 
   312 The clever idea in \cite{Sulzmann2014} is to introduce values for encoding
   319 The clever idea in \cite{Sulzmann2014} is to introduce values for encoding
   313 \emph{how} a regular expression matches a string and then define a function
   320 \emph{how} a regular expression matches a string and then define a function
   314 on values that mirrors (but inverts) the construction of the derivative on
   321 on values that mirrors (but inverts) the construction of the derivative on
   315 regular expressions. Values are defined as the inductive datatype
   322 regular expressions. \emph{Values} are defined as the inductive datatype
   316 
   323 
   317   \begin{center}
   324   \begin{center}
   318   @{text "v :="}
   325   @{text "v :="}
   319   @{const "Void"} $\mid$
   326   @{const "Void"} $\mid$
   320   @{term "val.Char c"} $\mid$
   327   @{term "val.Char c"} $\mid$
   322   @{term "Right v"} $\mid$
   329   @{term "Right v"} $\mid$
   323   @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ 
   330   @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ 
   324   @{term "Stars vs"} 
   331   @{term "Stars vs"} 
   325   \end{center}  
   332   \end{center}  
   326 
   333 
   327   \noindent where we use @{term vs} standing for a list of values. The string
   334   \noindent where we use @{term vs} standing for a list of values. (This is
   328   underlying a values can be calculated by the @{const flat} function, written
   335   similar to the approach taken by Frisch and Cardelli for GREEDY matching
   329   @{term "flat DUMMY"} and defined as:
   336   \cite{Frisch2014}, and Sulzmann and Lu \cite{2014} for POSIX matching).
       
   337   The string underlying a value can be calculated by the @{const flat}
       
   338   function, written @{term "flat DUMMY"} and defined as:
   330 
   339 
   331   \begin{center}
   340   \begin{center}
   332   \begin{tabular}{lcl}
   341   \begin{tabular}{lcl}
   333   @{thm (lhs) flat.simps(1)} & $\dn$ & @{thm (rhs) flat.simps(1)}\\
   342   @{thm (lhs) flat.simps(1)} & $\dn$ & @{thm (rhs) flat.simps(1)}\\
   334   @{thm (lhs) flat.simps(2)} & $\dn$ & @{thm (rhs) flat.simps(2)}\\
   343   @{thm (lhs) flat.simps(2)} & $\dn$ & @{thm (rhs) flat.simps(2)}\\
   343   \noindent Sulzmann and Lu also define inductively an inhabitation relation
   352   \noindent Sulzmann and Lu also define inductively an inhabitation relation
   344   that associates values to regular expressions:
   353   that associates values to regular expressions:
   345 
   354 
   346   \begin{center}
   355   \begin{center}
   347   \begin{tabular}{c}
   356   \begin{tabular}{c}
   348   @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\ 
   357   @{thm[mode=Axiom] Prf.intros(4)} \qquad
       
   358   @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\medskip\\
   349   @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad 
   359   @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad 
   350   @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\medskip\\
   360   @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\medskip\\
   351   @{thm[mode=Axiom] Prf.intros(4)} \qquad 
   361   @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\ 
   352   @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\medskip\\
   362   @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad  
   353   @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad 
       
   354   @{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]}\medskip\\
   363   @{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]}\medskip\\
   355   \end{tabular}
   364   \end{tabular}
   356   \end{center}
   365   \end{center}
   357 
   366 
   358   \noindent Note that no values are associated with the regular expression
   367   \noindent Note that no values are associated with the regular expression
   359   @{term ZERO}, and that the only value associated with the regular
   368   @{term ZERO}, and that the only value associated with the regular
   360   expression @{term ONE} is @{term Void}, pronounced (if one must) as {\em
   369   expression @{term ONE} is @{term Void}, pronounced (if one must) as {\em
   361   ``Void''}. It is routine to stablish how values `inhabitated' by a regular
   370   ``Void''}. It is routine to stablish how values ``inhabiting'' a regular
   362   expression correspond to the language of a regular expression, namely
   371   expression correspond to the language of a regular expression, namely
   363 
   372 
   364   \begin{proposition}
   373   \begin{proposition}
   365   @{thm L_flat_Prf}
   374   @{thm L_flat_Prf}
   366   \end{proposition}
   375   \end{proposition}
   367 
   376 
   368   Graphically the algorithm by Sulzmann \& Lu can be illustrated by the
   377   In general there are more than one value associated with a regular
   369   picture in Figure~\ref{Sulz} where the path from the left to the right
   378   expression. In case of POSIX matching the problem is to calculate the
   370   involving $der/nullable$ is the first phase of the algorithm and
   379   unique value that satisfies the (informal) POSIX constraints from the
   371   $mkeps/inj$, the path from right to left, the second phase. This picture
   380   Introduction. Graphically the regular expression matching algorithm by
   372   shows the steps required when a regular expression, say $r_1$, matches the
   381   Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz}
   373   string $abc$. We first build the three derivatives (according to $a$, $b$
   382   where the path from the left to the right involving @{const der}/@{const
   374   and $c$). We then use $nullable$ to find out whether the resulting regular
   383   nullable} is the first phase of the algorithm (calculating successive
   375   expression can match the empty string. If yes, we call the function
   384   \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
   376   $mkeps$.
   385   left, the second phase. This picture shows the steps required when a
       
   386   regular expression, say @{text "r\<^sub>1"}, matches the string @{term
       
   387   "[a,b,c]"}. We first build the three derivatives (according to @{term a},
       
   388   @{term b} and @{term c}). We then use @{const nullable} to find out
       
   389   whether the resulting derivative regular expression @{term "r\<^sub>4"}
       
   390   can match the empty string. If yes, we call the function @{const mkeps}
       
   391   that produces a value @{term "v\<^sub>4"} for how @{term "r\<^sub>4"} can
       
   392   match the empty string (taking into account the POSIX constraints in case
       
   393   there are several ways). This functions is defined by the clauses:
   377 
   394 
   378 \begin{figure}[t]
   395 \begin{figure}[t]
   379 \begin{center}
   396 \begin{center}
   380 \begin{tikzpicture}[scale=2,node distance=1.3cm,
   397 \begin{tikzpicture}[scale=2,node distance=1.3cm,
   381                     every node/.style={minimum size=7mm}]
   398                     every node/.style={minimum size=7mm}]
   407 that the function @{term inj} `injects back' the characters of the string into
   424 that the function @{term inj} `injects back' the characters of the string into
   408 the values (the arrows from right to left).
   425 the values (the arrows from right to left).
   409 \label{Sulz}}
   426 \label{Sulz}}
   410 \end{figure} 
   427 \end{figure} 
   411 
   428 
   412  NOT DONE YET 
   429   \begin{center}
   413 
       
   414   We begin with the case of a nullable regular expression: from the
       
   415   nullability we need to construct a value that witnesses the nullability.
       
   416   The @{const mkeps} function (from \cite{Sulzmann2014}) is a partial (but
       
   417   unambiguous) function from regular expressions to values, total on exactly
       
   418   the set of nullable regular expressions.
       
   419 
       
   420  \begin{center}
       
   421   \begin{tabular}{lcl}
   430   \begin{tabular}{lcl}
   422   @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\
   431   @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\
   423   @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
   432   @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
   424   @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
   433   @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
   425   @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
   434   @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
   426   \end{tabular}
   435   \end{tabular}
   427   \end{center}
   436   \end{center}
   428 
   437 
   429 The well-known idea of POSIX lexing is informally defined in (for example)
   438   \noindent Note that this function needs only to be partially defined,
   430 \cite{posix}; as correctly argued in \cite{Sulzmann2014}, this needs formal
   439   namely only for regular expressions that are nullable. In case @{const
   431 specification. The rough idea is that, in contrast to the so-called GREEDY
   440   nullable} fails, the string @{term "[a,b,c]"} cannot be matched by @{term
   432 algorithm, POSIX lexing chooses to match more deeply and using left choices
   441   "r\<^sub>1"} and an error is raised. Note also how this function makes
   433 rather than a right choices. For example, note that to match the string 
   442   some subtle choices leading to a POSIX value: for example if the
   434 @{term "[a, b]"} with the regular expression $(a + \mts)\circ (b+ab)$ the matching
   443   alternative, say @{term "ALT r\<^sub>1 r\<^sub>2"}, can match the empty
   435 will return $( Void, Right(ab))$ rather than $(Left\ a, Left\ b)$. [The
   444   string and furthermore @{term "r\<^sub>1"} can match the empty string,
   436 regular expression $ab$ is short for $(Lit\ a) \circ (Lit\ b)$.] Similarly,
   445   then we return a @{const Left}-value. The @{const Right}-value will only
   437 to match {\em ``a''} with $(a+a)$ the leftmost $a$ will be chosen.
   446   be returned if @{term "r\<^sub>1"} is not nullable.
   438 
   447 
   439 We use a simple inductive definition to specify this notion, incorporating
   448   The most interesting novelty from Sulzmann and Lu \cite{Sulzmann2014} is
   440 the POSIX-specific choices into the side-conditions for the rules $R tl
   449   the construction value for how @{term "r\<^sub>1"} can match the string
   441 +_2$, $R tl\circ$ and $R tl*$ (as they are now called). By contrast,
   450   @{term "[a,b,c]"} from the value how the last derivative, @{term
   442 \cite{Sulzmann2014} defines a relation between values and argues that there is a
   451   "r\<^sub>4"} in Fig~\ref{Sulz}, can match the empty string. Sulzmann and
   443 maximum value, as given by the derivative-based algorithm yet to be spelt
   452   Lu acchieve this by stepwise ``injecting back'' the characters into the
   444 out. The relation we define is ternary, relating strings, values and regular
   453   values thus inverting the operation of building derivatives on the level
   445 expressions.
   454   of values. The corresponding function, called @{term inj}, takes three
   446 
   455   arguments, a regular expression, a character and a value. For example in
   447 Our Posix relation @{term "s \<in> r \<rightarrow> v"}
   456   the first @{term inj}-step in Fig~\ref{Sulz} the regular expression @{term
   448 
   457   "r\<^sub>3"}, the character @{term c} from the last derivative step and
   449   \begin{center}
   458   @{term "v\<^sub>4"}, which is the value corresponding to the derivative
   450   \begin{tabular}{c}
   459   regular expression @{term "r\<^sub>4"}. The result is the new value @{term
   451   @{thm[mode=Axiom] PMatch.intros(1)} \qquad
   460   "v\<^sub>3"}. The final result of the algorithm is the value @{term
   452   @{thm[mode=Axiom] PMatch.intros(2)}\medskip\\
   461   "v\<^sub>1"} corresponding to the input regular expression. The @{term
   453   @{thm[mode=Rule] PMatch.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\qquad
   462   inj} function is by recursion on the regular expression and by analysing
   454   @{thm[mode=Rule] PMatch.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\medskip\\
   463   the shape of values.
   455   \multicolumn{1}{p{5cm}}{@{thm[mode=Rule] PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}\medskip\\
       
   456   @{thm[mode=Rule] PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}\medskip\\
       
   457   @{thm[mode=Axiom] PMatch.intros(7)}\medskip\\
       
   458   \end{tabular}
       
   459   \end{center}
       
   460 
       
   461 *}
       
   462 
       
   463 section {* The Argument by Sulzmmann and Lu *}
       
   464 
       
   465 section {* Conclusion *}
       
   466 
       
   467 text {*
       
   468 
       
   469   Nipkow lexer from 2000
       
   470 
       
   471 *}
       
   472 
       
   473 
       
   474 text {*
       
   475 
       
   476 
       
   477 
       
   478 
       
   479   \noindent
       
   480   Values
       
   481 
       
   482   
       
   483 
       
   484  
       
   485 
       
   486  
       
   487 
       
   488   \noindent
       
   489   The @{const mkeps} function
       
   490 
       
   491  
       
   492 
       
   493   \noindent
       
   494   The @{text inj} function
       
   495 
   464 
   496   \begin{center}
   465   \begin{center}
   497   \begin{tabular}{lcl}
   466   \begin{tabular}{lcl}
   498   @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
   467   @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
   499   @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & 
   468   @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & 
   508       & @{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
   477       & @{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
   509   @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ 
   478   @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ 
   510       & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\
   479       & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\
   511   \end{tabular}
   480   \end{tabular}
   512   \end{center}
   481   \end{center}
       
   482 
       
   483   \noindent To better understand what is going on in this definition it
       
   484   might be instructive to look first at the three sequence cases (clauses
       
   485   4--6). Recall the corresponding clause of the derivative
       
   486 
       
   487   \begin{center}
       
   488   @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} $\dn$ @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}
       
   489   \end{center}
       
   490 
       
   491   \noindent
       
   492   NOT DONE YET 
       
   493 
       
   494   Therefore there are, for example, three
       
   495   cases for sequence regular expressions (for all possible shapes of the
       
   496   value).
       
   497  
       
   498   Again the virtues of this algorithm is that it can be
       
   499   implemented with ease in a functional programming language and
       
   500   also in Isabelle/HOL.
       
   501 
       
   502 The well-known idea of POSIX lexing is informally defined in (for example)
       
   503 \cite{posix}; as correctly argued in \cite{Sulzmann2014}, this needs formal
       
   504 specification. The rough idea is that, in contrast to the so-called GREEDY
       
   505 algorithm, POSIX lexing chooses to match more deeply and using left choices
       
   506 rather than a right choices. For example, note that to match the string 
       
   507 @{term "[a, b]"} with the regular expression $(a + \mts)\circ (b+ab)$ the matching
       
   508 will return $( Void, Right(ab))$ rather than $(Left\ a, Left\ b)$. [The
       
   509 regular expression $ab$ is short for $(Lit\ a) \circ (Lit\ b)$.] Similarly,
       
   510 to match {\em ``a''} with $(a+a)$ the leftmost $a$ will be chosen.
       
   511 
       
   512 We use a simple inductive definition to specify this notion, incorporating
       
   513 the POSIX-specific choices into the side-conditions for the rules $R tl
       
   514 +_2$, $R tl\circ$ and $R tl*$ (as they are now called). By contrast,
       
   515 \cite{Sulzmann2014} defines a relation between values and argues that there is a
       
   516 maximum value, as given by the derivative-based algorithm yet to be spelt
       
   517 out. The relation we define is ternary, relating strings, values and regular
       
   518 expressions.
       
   519 
       
   520 Our Posix relation @{term "s \<in> r \<rightarrow> v"}
       
   521 
       
   522   \begin{center}
       
   523   \begin{tabular}{c}
       
   524   @{thm[mode=Axiom] PMatch.intros(1)} \qquad
       
   525   @{thm[mode=Axiom] PMatch.intros(2)}\medskip\\
       
   526   @{thm[mode=Rule] PMatch.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\qquad
       
   527   @{thm[mode=Rule] PMatch.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\medskip\\
       
   528   \multicolumn{1}{p{5cm}}{@{thm[mode=Rule] PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}\medskip\\
       
   529   @{thm[mode=Rule] PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}\medskip\\
       
   530   @{thm[mode=Axiom] PMatch.intros(7)}\medskip\\
       
   531   \end{tabular}
       
   532   \end{center}
       
   533 
       
   534 *}
       
   535 
       
   536 section {* The Argument by Sulzmmann and Lu *}
       
   537 
       
   538 section {* Conclusion *}
       
   539 
       
   540 text {*
       
   541 
       
   542   Nipkow lexer from 2000
       
   543 
       
   544 *}
       
   545 
       
   546 
       
   547 text {*
       
   548 
       
   549 
       
   550 
       
   551 
       
   552   \noindent
       
   553   Values
       
   554 
       
   555   
       
   556 
       
   557  
       
   558 
       
   559  
       
   560 
       
   561   \noindent
       
   562   The @{const mkeps} function
       
   563 
       
   564  
       
   565 
       
   566   \noindent
       
   567   The @{text inj} function
       
   568 
       
   569   
   513 
   570 
   514   \noindent
   571   \noindent
   515   The inhabitation relation:
   572   The inhabitation relation:
   516 
   573 
   517   \begin{center}
   574   \begin{center}