thys/Paper/Paper.thy
changeset 120 d74bfa11802c
parent 119 71e26f43c896
child 121 4c85af262ee7
equal deleted inserted replaced
119:71e26f43c896 120:d74bfa11802c
    11 abbreviation 
    11 abbreviation 
    12  "ders_syn r s \<equiv> ders s r"
    12  "ders_syn r s \<equiv> ders s r"
    13 
    13 
    14 notation (latex output)
    14 notation (latex output)
    15   If  ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
    15   If  ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and
    16   Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [75,75] 73) and  
    16   Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [75,73] 73) and  
    17 
    17 
    18   ZERO ("\<^bold>0" 78) and 
    18   ZERO ("\<^bold>0" 78) and 
    19   ONE ("\<^bold>1" 78) and 
    19   ONE ("\<^bold>1" 78) and 
    20   CHAR ("_" [1000] 80) and
    20   CHAR ("_" [1000] 80) and
    21   ALT ("_ + _" [77,77] 78) and
    21   ALT ("_ + _" [77,77] 78) and
    30   val.Stars ("Stars _" [79] 78) and
    30   val.Stars ("Stars _" [79] 78) and
    31 
    31 
    32   L ("L'(_')" [10] 78) and
    32   L ("L'(_')" [10] 78) and
    33   der_syn ("_\\_" [79, 1000] 76) and  
    33   der_syn ("_\\_" [79, 1000] 76) and  
    34   ders_syn ("_\\_" [79, 1000] 76) and
    34   ders_syn ("_\\_" [79, 1000] 76) and
    35   flat ("|_|" [75] 73) and
    35   flat ("|_|" [75] 74) and
    36   Sequ ("_ @ _" [78,77] 63) and
    36   Sequ ("_ @ _" [78,77] 63) and
    37   injval ("inj _ _ _" [79,77,79] 76) and 
    37   injval ("inj _ _ _" [79,77,79] 76) and 
    38   mkeps ("mkeps _" [79] 76) and 
    38   mkeps ("mkeps _" [79] 76) and 
    39   projval ("proj _ _ _" [1000,77,1000] 77) and 
    39   projval ("proj _ _ _" [1000,77,1000] 77) and 
    40   length ("len _" [78] 73) and
    40   length ("len _" [78] 73) and
    41   matcher ("lexer _ _" [78,78] 77) and
    41   matcher ("lexer _ _" [78,78] 77) and
    42 
    42 
    43   Prf ("\<triangleright> _ : _" [75,75] 75) and  
    43   Prf ("_ : _" [75,75] 75) and  
    44   PMatch ("'(_, _') \<rightarrow> _" [63,75,75] 75)
    44   PMatch ("'(_, _') \<rightarrow> _" [63,75,75] 75)
    45   (* and ValOrd ("_ \<succeq>\<^bsub>_\<^esub> _" [78,77,77] 73) *)
    45   (* and ValOrd ("_ \<succeq>\<^bsub>_\<^esub> _" [78,77,77] 73) *)
    46 
    46 
    47 definition 
    47 definition 
    48   "match r s \<equiv> nullable (ders s r)"
    48   "match r s \<equiv> nullable (ders s r)"
    57 Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em
    57 Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em
    58 derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a
    58 derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a
    59 character~@{text c}, and showed that it gave a simple solution to the
    59 character~@{text c}, and showed that it gave a simple solution to the
    60 problem of matching a string @{term s} with a regular expression @{term r}:
    60 problem of matching a string @{term s} with a regular expression @{term r}:
    61 if the derivative of @{term r} w.r.t.\ (in succession) all the characters of
    61 if the derivative of @{term r} w.r.t.\ (in succession) all the characters of
    62 the string matches the empty string, then @{term r} matches @{term s}
    62 the string matches the empty string, then @{term r} matches @{term s} (and
    63 (and {\em vice versa}). The derivative has the property (which may be
    63 {\em vice versa}). The derivative has the property (which may almost be
    64 regarded as its specification) that, for every string @{term s} and regular
    64 regarded as its specification) that, for every string @{term s} and regular
    65 expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if
    65 expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if
    66 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
    66 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
    67 derivatives is that they are neatly expressible in any functional language,
    67 derivatives is that they are neatly expressible in any functional language,
    68 and easily definable and reasoned about in theorem provers---the definitions
    68 and easily definable and reasoned about in theorem provers---the definitions
    69 just consist of inductive datatypes and simple recursive functions. A
    69 just consist of inductive datatypes and simple recursive functions. A
    70 completely formalised correctness proof of this matcher in for example HOL4
    70 completely formalised correctness proof of this matcher in for example HOL4
    71 has been mentioned in~\cite{Owens2008}. Another one in Isabelle/HOL is
    71 has been mentioned in~\cite{Owens2008}. Another one in Isabelle/HOL is part
    72 in \cite{Krauss2011}.
    72 of the work in \cite{Krauss2011}.
    73 
    73 
    74 One limitation of Brzozowski's matcher is that it only generates a YES/NO
    74 One limitation of Brzozowski's matcher is that it only generates a YES/NO
    75 answer for whether a string is being matched by a regular expression.
    75 answer for whether a string is being matched by a regular expression.
    76 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow
    76 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow
    77 generation not just of a YES/NO answer but of an actual matching, called a
    77 generation not just of a YES/NO answer but of an actual matching, called a
    78 [lexical] {\em value}. They give a simple algorithm to calculate a value
    78 [lexical] {\em value}. They give a simple algorithm to calculate a value
    79 that appears to be the value associated with POSIX matching
    79 that appears to be the value associated with POSIX matching
    80 \cite{Kuklewicz,Vansummeren2006}. The challenge then is to specify that
    80 \cite{Kuklewicz,Vansummeren2006}. The challenge then is to specify that
    81 value, in an algorithm-independent fashion, and to show that Sulzamann and
    81 value, in an algorithm-independent fashion, and to show that Sulzmann and
    82 Lu's derivative-based algorithm does indeed calculate a value that is
    82 Lu's derivative-based algorithm does indeed calculate a value that is
    83 correct according to the specification.
    83 correct according to the specification.
    84 
    84 
    85 The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a
    85 The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a
    86 relation (called an ``Order Relation'') on the set of values of @{term r},
    86 relation (called an ``Order Relation'') on the set of values of @{term r},
   145 that can match determines the token.
   145 that can match determines the token.
   146 \end{itemize}
   146 \end{itemize}
   147  
   147  
   148 \noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as
   148 \noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as
   149 @{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising
   149 @{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising
   150 identifiers (say, a single character followed by characters or numbers). Then we
   150 identifiers (say, a single character followed by characters or numbers).
   151 can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX
   151 Then we can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use
   152 matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. In the
   152 POSIX matching to tokenise strings, say @{text "iffoo"} and @{text "if"}.
   153 first case we obtain by the longest match rule a single identifier token,
   153 For @{text "iffoo"} we obtain by the longest match rule a single identifier
   154 not a keyword followed by an identifier. In the second case we obtain by rule
   154 token, not a keyword followed by an identifier. For @{text "if"} we obtain by
   155 priority a keyword token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"}
   155 rule priority a keyword token, not an identifier token---even if @{text
   156 matches also.\bigskip
   156 "r\<^bsub>id\<^esub>"} matches also.\bigskip
   157 
   157 
   158 \noindent {\bf Contributions:} (NOT DONE YET) We have implemented in Isabelle/HOL the
   158 \noindent {\bf Contributions:} (NOT DONE YET) We have implemented in
   159 derivative-based regular expression matching algorithm as described by
   159 Isabelle/HOL the derivative-based regular expression matching algorithm as
   160 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this
   160 described by Sulzmann and Lu \cite{Sulzmann2014}. We have proved the
   161 algorithm according to our specification of what a POSIX value is. The
   161 correctness of this algorithm according to our specification of what a POSIX
       
   162 value is. Sulzmann and Lu sketch in \cite{Sulzmann2014} an informal
       
   163 correctness proof: but to us it contains unfillable gaps.
       
   164 
   162 informal correctness proof given in \cite{Sulzmann2014} is in final
   165 informal correctness proof given in \cite{Sulzmann2014} is in final
   163 form\footnote{} and to us contains unfillable gaps. Our specification of a
   166 form\footnote{} and to us contains unfillable gaps.
   164 POSIX value consists of a simple inductive definition that given a string
   167 
   165 and a regular expression uniquely determines this value. Derivatives as
   168 Our specification of a POSIX value consists of a simple inductive definition
   166 calculated by Brzozowski's method are usually more complex regular
   169 that given a string and a regular expression uniquely determines this value.
   167 expressions than the initial one; various optimisations are possible, such
   170 Derivatives as calculated by Brzozowski's method are usually more complex
   168 as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r ZERO"}, @{term
   171 regular expressions than the initial one; various optimisations are
   169 "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the advantages of
   172 possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r
   170 having a simple specification and correctness proof is that the latter can
   173 ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the
   171 be refined to allow for such optimisations and simple correctness proof.
   174 advantages of having a simple specification and correctness proof is that
       
   175 the latter can be refined to allow for such optimisations and simple
       
   176 correctness proof.
   172 
   177 
   173 An extended version of \cite{Sulzmann2014} is available at the website of
   178 An extended version of \cite{Sulzmann2014} is available at the website of
   174 its first author; this includes some ``proofs'', claimed in
   179 its first author; this includes some ``proofs'', claimed in
   175 \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
   180 \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in
   176 final form, we make no comment thereon, preferring to give general reasons
   181 final form, we make no comment thereon, preferring to give general reasons
   218   \end{tabular}
   223   \end{tabular}
   219   \end{center}
   224   \end{center}
   220   
   225   
   221   \noindent In the fourth clause we use the operation @{term "DUMMY ;;
   226   \noindent In the fourth clause we use the operation @{term "DUMMY ;;
   222   DUMMY"} for the concatenation of two languages (it is also list-append for
   227   DUMMY"} for the concatenation of two languages (it is also list-append for
   223   strings). We use the star-notation for regular expressions and languages
   228   strings). We use the star-notation for regular expressions and for
   224   (in the last clause above). The star on languages is defined inductively
   229   languages (in the last clause above). The star for languages is defined
   225   by two clauses: @{text "(i)"} for the empty string being in the star of a
   230   inductively by two clauses: @{text "(i)"} the empty string being in
   226   language and @{text "(ii)"} if @{term "s\<^sub>1"} is in a language and @{term
   231   the star of a language and @{text "(ii)"} if @{term "s\<^sub>1"} is in a
   227   "s\<^sub>2"} in the star of this language, then also @{term "s\<^sub>1 @ s\<^sub>2"} is in
   232   language and @{term "s\<^sub>2"} in the star of this language, then also @{term
   228   the star of this language. It will also be convenient to use the following
   233   "s\<^sub>1 @ s\<^sub>2"} is in the star of this language. It will also be convenient
   229   notion of a \emph{semantic derivative} (or \emph{left quotient}) of a
   234   to use the following notion of a \emph{semantic derivative} (or \emph{left
   230   language, say @{text A}, defined as:
   235   quotient}) of a language defined as:
   231 
   236 
   232   \begin{center}
   237   \begin{center}
   233   \begin{tabular}{lcl}
   238   \begin{tabular}{lcl}
   234   @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\
   239   @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\
   235   \end{tabular}
   240   \end{tabular}
   304   \end{center}
   309   \end{center}
   305 
   310 
   306   \noindent gives a positive answer if and only if @{term "s \<in> L r"}.
   311   \noindent gives a positive answer if and only if @{term "s \<in> L r"}.
   307   Consequently, this regular expression matching algorithm satisfies the
   312   Consequently, this regular expression matching algorithm satisfies the
   308   usual specification. While the matcher above calculates a provably correct
   313   usual specification. While the matcher above calculates a provably correct
   309   a YES/NO answer for whether a regular expression matches a string, the
   314   YES/NO answer for whether a regular expression matches a string, the novel
   310   novel idea of Sulzmann and Lu \cite{Sulzmann2014} is to append another
   315   idea of Sulzmann and Lu \cite{Sulzmann2014} is to append another phase to
   311   phase to this algorithm in order to calculate a [lexical] value. We will
   316   this algorithm in order to calculate a [lexical] value. We will explain
   312   explain the details next.
   317   the details next.
   313 
   318 
   314 *}
   319 *}
   315 
   320 
   316 section {* POSIX Regular Expression Matching\label{posixsec} *}
   321 section {* POSIX Regular Expression Matching\label{posixsec} *}
   317 
   322 
   318 text {* 
   323 text {* 
   319 
   324 
   320 The clever idea in \cite{Sulzmann2014} is to introduce values for encoding
   325   The clever idea in \cite{Sulzmann2014} is to introduce values for encoding
   321 \emph{how} a regular expression matches a string and then define a function
   326   \emph{how} a regular expression matches a string and then define a
   322 on values that mirrors (but inverts) the construction of the derivative on
   327   function on values that mirrors (but inverts) the construction of the
   323 regular expressions. \emph{Values} are defined as the inductive datatype
   328   derivative on regular expressions. \emph{Values} are defined as the
       
   329   inductive datatype
   324 
   330 
   325   \begin{center}
   331   \begin{center}
   326   @{text "v :="}
   332   @{text "v :="}
   327   @{const "Void"} $\mid$
   333   @{const "Void"} $\mid$
   328   @{term "val.Char c"} $\mid$
   334   @{term "val.Char c"} $\mid$
   332   @{term "Stars vs"} 
   338   @{term "Stars vs"} 
   333   \end{center}  
   339   \end{center}  
   334 
   340 
   335   \noindent where we use @{term vs} standing for a list of values. (This is
   341   \noindent where we use @{term vs} standing for a list of values. (This is
   336   similar to the approach taken by Frisch and Cardelli for GREEDY matching
   342   similar to the approach taken by Frisch and Cardelli for GREEDY matching
   337   \cite{Frisch2014}, and Sulzmann and Lu \cite{2014} for POSIX matching).
   343   \cite{Frisch2004}, and Sulzmann and Lu \cite{Sulzmann2014} for POSIX matching).
   338   The string underlying a value can be calculated by the @{const flat}
   344   The string underlying a value can be calculated by the @{const flat}
   339   function, written @{term "flat DUMMY"} and defined as:
   345   function, written @{term "flat DUMMY"} and defined as:
   340 
   346 
   341   \begin{center}
   347   \begin{center}
   342   \begin{tabular}{lcl}
   348   \begin{tabular}{lcl}
   365   \end{tabular}
   371   \end{tabular}
   366   \end{center}
   372   \end{center}
   367 
   373 
   368   \noindent Note that no values are associated with the regular expression
   374   \noindent Note that no values are associated with the regular expression
   369   @{term ZERO}, and that the only value associated with the regular
   375   @{term ZERO}, and that the only value associated with the regular
   370   expression @{term ONE} is @{term Void}, pronounced (if one must) as {\em
   376   expression @{term ONE} is @{term Void}, pronounced (if one must) as @{text
   371   ``Void''}. It is routine to establish how values ``inhabiting'' a regular
   377   "Void"}. It is routine to establish how values ``inhabiting'' a regular
   372   expression correspond to the language of a regular expression, namely
   378   expression correspond to the language of a regular expression, namely
   373 
   379 
   374   \begin{proposition}
   380   \begin{proposition}
   375   @{thm L_flat_Prf}
   381   @{thm L_flat_Prf}
   376   \end{proposition}
   382   \end{proposition}
   377 
   383 
   378   In general there are more than one value associated with a regular
   384   In general there is more than one value associated with a regular
   379   expression. In case of POSIX matching the problem is to calculate the
   385   expression. In case of POSIX matching the problem is to calculate the
   380   unique value that satisfies the (informal) POSIX constraints from the
   386   unique value that satisfies the (informal) POSIX rules from the
   381   Introduction. Graphically the regular expression matching algorithm by
   387   Introduction. Graphically the POSIX value calculation algorithm by
   382   Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz}
   388   Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz}
   383   where the path from the left to the right involving @{const der}/@{const
   389   where the path from the left to the right involving @{const der}/@{const
   384   nullable} is the first phase of the algorithm (calculating successive
   390   nullable} is the first phase of the algorithm (calculating successive
   385   \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
   391   \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
   386   left, the second phase. This picture shows the steps required when a
   392   left, the second phase. This picture shows the steps required when a
   388   "[a,b,c]"}. We first build the three derivatives (according to @{term a},
   394   "[a,b,c]"}. We first build the three derivatives (according to @{term a},
   389   @{term b} and @{term c}). We then use @{const nullable} to find out
   395   @{term b} and @{term c}). We then use @{const nullable} to find out
   390   whether the resulting derivative regular expression @{term "r\<^sub>4"}
   396   whether the resulting derivative regular expression @{term "r\<^sub>4"}
   391   can match the empty string. If yes, we call the function @{const mkeps}
   397   can match the empty string. If yes, we call the function @{const mkeps}
   392   that produces a value @{term "v\<^sub>4"} for how @{term "r\<^sub>4"} can
   398   that produces a value @{term "v\<^sub>4"} for how @{term "r\<^sub>4"} can
   393   match the empty string (taking into account the POSIX constraints in case
   399   match the empty string (taking into account the POSIX rules in case
   394   there are several ways). This functions is defined by the clauses:
   400   there are several ways). This functions is defined by the clauses:
   395 
   401 
   396 \begin{figure}[t]
   402 \begin{figure}[t]
   397 \begin{center}
   403 \begin{center}
   398 \begin{tikzpicture}[scale=2,node distance=1.3cm,
   404 \begin{tikzpicture}[scale=2,node distance=1.3cm,
   437   \end{center}
   443   \end{center}
   438 
   444 
   439   \noindent Note that this function needs only to be partially defined,
   445   \noindent Note that this function needs only to be partially defined,
   440   namely only for regular expressions that are nullable. In case @{const
   446   namely only for regular expressions that are nullable. In case @{const
   441   nullable} fails, the string @{term "[a,b,c]"} cannot be matched by @{term
   447   nullable} fails, the string @{term "[a,b,c]"} cannot be matched by @{term
   442   "r\<^sub>1"} and an error is raised. Note also how this function makes
   448   "r\<^sub>1"} and an error is raised instead. Note also how this function
   443   some subtle choices leading to a POSIX value: for example if the
   449   makes some subtle choices leading to a POSIX value: for example if an
   444   alternative, say @{term "ALT r\<^sub>1 r\<^sub>2"}, can match the empty
   450   alternative regular expression, say @{term "ALT r\<^sub>1 r\<^sub>2"}, can
   445   string and furthermore @{term "r\<^sub>1"} can match the empty string,
   451   match the empty string and furthermore @{term "r\<^sub>1"} can match the
   446   then we return a @{const Left}-value. The @{const Right}-value will only
   452   empty string, then we return a @{text Left}-value. The @{text
   447   be returned if @{term "r\<^sub>1"} is not nullable.
   453   Right}-value will only be returned if @{term "r\<^sub>1"} is not nullable.
   448 
   454 
   449   The most interesting novelty from Sulzmann and Lu \cite{Sulzmann2014} is
   455   The most interesting novelty from Sulzmann and Lu \cite{Sulzmann2014} is
   450   the construction value for how @{term "r\<^sub>1"} can match the string
   456   the construction of a value for how @{term "r\<^sub>1"} can match the
   451   @{term "[a,b,c]"} from the value how the last derivative, @{term
   457   string @{term "[a,b,c]"} from the value how the last derivative, @{term
   452   "r\<^sub>4"} in Fig~\ref{Sulz}, can match the empty string. Sulzmann and
   458   "r\<^sub>4"} in Fig~\ref{Sulz}, can match the empty string. Sulzmann and
   453   Lu acchieve this by stepwise ``injecting back'' the characters into the
   459   Lu achieve this by stepwise ``injecting back'' the characters into the
   454   values thus inverting the operation of building derivatives on the level
   460   values thus inverting the operation of building derivatives on the level
   455   of values. The corresponding function, called @{term inj}, takes three
   461   of values. The corresponding function, called @{term inj}, takes three
   456   arguments, a regular expression, a character and a value. For example in
   462   arguments, a regular expression, a character and a value. For example in
   457   the first @{term inj}-step in Fig~\ref{Sulz} the regular expression @{term
   463   the first (or right-most) @{term inj}-step in Fig~\ref{Sulz} the regular
   458   "r\<^sub>3"}, the character @{term c} from the last derivative step and
   464   expression @{term "r\<^sub>3"}, the character @{term c} from the last
   459   @{term "v\<^sub>4"}, which is the value corresponding to the derivative
   465   derivative step and @{term "v\<^sub>4"}, which is the value corresponding
   460   regular expression @{term "r\<^sub>4"}. The result is the new value @{term
   466   to the derivative regular expression @{term "r\<^sub>4"}. The result is
   461   "v\<^sub>3"}. The final result of the algorithm is the value @{term
   467   the new value @{term "v\<^sub>3"}. The final result of the algorithm is
   462   "v\<^sub>1"} corresponding to the input regular expression. The @{term
   468   the value @{term "v\<^sub>1"} corresponding to the input regular
   463   inj} function is by recursion on the regular expression and by analysing
   469   expression. The @{term inj} function is by recursion on the regular
   464   the shape of values.
   470   expressions and by analysing the shape of values (corresponding to 
   465 
   471   the derivative regular expressions).
   466   \begin{center}
   472 
   467   \begin{tabular}{llcl}
   473   \begin{center}
       
   474   \begin{tabular}{l@ {\hspace{5mm}}lcl}
   468   (1) & @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
   475   (1) & @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
   469   (2) & @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & 
   476   (2) & @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & 
   470       @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
   477       @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
   471   (3) & @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & 
   478   (3) & @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & 
   472       @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
   479       @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
   481   \end{tabular}
   488   \end{tabular}
   482   \end{center}
   489   \end{center}
   483 
   490 
   484   \noindent To better understand what is going on in this definition it
   491   \noindent To better understand what is going on in this definition it
   485   might be instructive to look first at the three sequence cases (clauses
   492   might be instructive to look first at the three sequence cases (clauses
   486   (4)--(6)). In each case we need to construct an ``injected value'' for @{term
   493   (4)--(6)). In each case we need to construct an ``injected value'' for
   487   "SEQ r\<^sub>1 r\<^sub>2"}. Recall the clause of the @{const der}-function
   494   @{term "SEQ r\<^sub>1 r\<^sub>2"}. This must be a value of the form @{term
       
   495   "Seq DUMMY DUMMY"}. Recall the clause of the @{const der}-function
   488   for sequence regular expressions:
   496   for sequence regular expressions:
   489 
   497 
   490   \begin{center}
   498   \begin{center}
   491   @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} $\dn$ @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}
   499   @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} $\dn$ @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}
   492   \end{center}
   500   \end{center}
   493 
   501 
   494   \noindent Consider first the else-branch where the derivative is @{term
   502   \noindent Consider first the else-branch where the derivative is @{term
   495   "SEQ (der c r\<^sub>1) r\<^sub>2"}. The corresponding value must therefore
   503   "SEQ (der c r\<^sub>1) r\<^sub>2"}. The corresponding value must therefore
   496   be the form @{term "Seq v\<^sub>1 v\<^sub>2"}, which matches clause (4) of
   504   be the form @{term "Seq v\<^sub>1 v\<^sub>2"}, which matches the left-hand
   497   @{term inj}. In the if-branch the derivative is an alternative, namely
   505   side in clause (4) of @{term inj}. In the if-branch the derivative is an
   498   @{term "ALT (SEQ (der c r\<^sub>1) r\<^sub>2) (der c r\<^sub>2)"}. This
   506   alternative, namely @{term "ALT (SEQ (der c r\<^sub>1) r\<^sub>2) (der c
   499   means we either have to consider a @{text Left}- or @{text Right}-value.
   507   r\<^sub>2)"}. This means we either have to consider a @{text Left}- or
   500   In case of the @{text Left}-value we know further it must be a value for a
   508   @{text Right}-value. In case of the @{text Left}-value we know further it
   501   sequence regular expression. Therefore the pattern we match in the clause
   509   must be a value for a sequence regular expression. Therefore the pattern
   502   (5) is @{term "Left (Seq v\<^sub>1 v\<^sub>2)"}, while in (6) it is just
   510   we match in the clause (5) is @{term "Left (Seq v\<^sub>1 v\<^sub>2)"},
   503   @{term "Right v\<^sub>2"}. One more interesting point is in the right-hand
   511   while in (6) it is just @{term "Right v\<^sub>2"}. One more interesting
   504   side of clause (6): since in this case the regular expression @{text
   512   point is in the right-hand side of clause (6): since in this case the
   505   "r\<^sub>1"} does not ``contribute'' in matching the string, that is only
   513   regular expression @{text "r\<^sub>1"} does not ``contribute'' for
   506   matches the empty string, we need to call @{const mkeps} in order to
   514   matching the string, that is only matches the empty string, we need to
   507   construct a value how @{term "r\<^sub>1"} can match this empty string. A
   515   call @{const mkeps} in order to construct a value how @{term "r\<^sub>1"}
   508   similar argument applies for why we can expect in clause (7) that the
   516   can match this empty string. A similar argument applies for why we can
   509   value is of the form @{term "Seq v (Stars vs)"} (the derivative of a star
   517   expect in the left-hand side of clause (7) that the value is of the form
   510   is @{term "SEQ r (STAR r)"}). Finally, the reason for why we can ignore
   518   @{term "Seq v (Stars vs)"}---the derivative of a star is @{term "SEQ r
   511   the second argument in clause (1) of @{term inj} is that it will only ever
   519   (STAR r)"}. Finally, the reason for why we can ignore the second argument
   512   be called in cases where @{term "c=d"}, but the usual linearity
   520   in clause (1) of @{term inj} is that it will only ever be called in cases
   513   restrictions in pattern-matches do not allow is to build this constraint
   521   where @{term "c=d"}, but the usual linearity restrictions in patterns do
   514   explicitly into the pattern.
   522   not allow is to build this constraint explicitly into our function
       
   523   definition.\footnote{Sulzmann and Lu state this clause as @{thm (lhs)
       
   524   injval.simps(1)[of "c" "c"]} $\dn$ @{thm (rhs) injval.simps(1)[of "c"]},
       
   525   but our deviation is harmless.}
   515 
   526 
   516   Having defined the @{const mkeps} and @{text inj} function we can extend
   527   Having defined the @{const mkeps} and @{text inj} function we can extend
   517   \Brz's matcher so that a [lexical] value is constructed (assuming the
   528   \Brz's matcher so that a [lexical] value is constructed (assuming the
   518   regular expression matches the string). The clauses of the lexer are
   529   regular expression matches the string). The clauses of the lexer are
   519 
   530 
   525                      & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}                          
   536                      & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}                          
   526   \end{tabular}
   537   \end{tabular}
   527   \end{center}
   538   \end{center}
   528 
   539 
   529   \noindent If the regular expression does not match, @{const None} is
   540   \noindent If the regular expression does not match, @{const None} is
   530   returned. If the regular expression does match the string, then @{const
   541   returned, indicating an error is raised. If the regular expression does
   531   Some} value is returned. Again the virtues of this algorithm is that it
   542   match the string, then @{const Some} value is returned. One important
   532   can be implemented with ease in a functional programming language and also
   543   virtue of this algorithm is that it can be implemented with ease in a
   533   in Isabelle/HOL. In the remaining part of this section we prove that
   544   functional programming language and also in Isabelle/HOL. In the remaining
   534   this algorithm is correct.
   545   part of this section we prove that this algorithm is correct.
   535 
   546 
   536   The well-known idea of POSIX matching is informally defined by the longest
   547   The well-known idea of POSIX matching is informally defined by the longest
   537   match and priority rule; as correctly argued in \cite{Sulzmann2014}, this
   548   match and priority rule; as correctly argued in \cite{Sulzmann2014}, this
   538   needs formal specification. 
   549   needs formal specification. Sulzmann and Lu define a \emph{dominance}
   539 
   550   relation\footnote{Sulzmann and Lu call it an ordering relation, but
   540   We use a simple inductive definition to specify this notion, incorporating
   551   without giving evidence that it is transitive.} between values and argue that
   541   the POSIX-specific choices into the side-conditions for the rules $R tl
   552   there is a maximum value, as given by the derivative-based algorithm. In
   542   +_2$, $R tl\circ$ and $R tl*$ (as they are now called). By contrast,
   553   contrast, we shall next introduce a simple inductive definition to specify
   543   \cite{Sulzmann2014} defines a relation between values and argues that there is a
   554   what a \emph{POSIX value} is, incorporating the POSIX-specific choices
   544   maximum value, as given by the derivative-based algorithm yet to be spelt
   555   into the side-conditions of our rules. Our definition is inspired by the
   545   out. The relation we define is ternary, relating strings, values and regular
   556   matching relation given in \cite{Vansummeren2006}. The relation we define
   546   expressions.
   557   is ternary and written as \mbox{@{term "s \<in> r \<rightarrow> v"}}, relating strings,
   547 
   558   regular expressions and values.
   548 Our Posix relation @{term "s \<in> r \<rightarrow> v"}
       
   549 
   559 
   550   \begin{center}
   560   \begin{center}
   551   \begin{tabular}{c}
   561   \begin{tabular}{c}
   552   @{thm[mode=Axiom] PMatch.intros(1)} \qquad
   562   @{thm[mode=Axiom] PMatch.intros(1)} \qquad
   553   @{thm[mode=Axiom] PMatch.intros(2)}\medskip\\
   563   @{thm[mode=Axiom] PMatch.intros(2)}\bigskip\\
   554   @{thm[mode=Rule] PMatch.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\qquad
   564   @{thm[mode=Rule] PMatch.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\qquad
   555   @{thm[mode=Rule] PMatch.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\medskip\\
   565   @{thm[mode=Rule] PMatch.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\bigskip\\
   556   \multicolumn{1}{p{5cm}}{@{thm[mode=Rule] PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}\medskip\\
   566   $\mprset{flushleft}
   557   @{thm[mode=Rule] PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}\medskip\\
   567    \inferrule
   558   @{thm[mode=Axiom] PMatch.intros(7)}\medskip\\
   568    {@{thm (prem 1) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
   559   \end{tabular}
   569     @{thm (prem 2) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\
   560   \end{center}
   570     @{thm (prem 3) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}
   561 
   571    {@{thm (concl) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\\
       
   572   @{thm[mode=Axiom] PMatch.intros(7)}\bigskip\\
       
   573   $\mprset{flushleft}
       
   574    \inferrule
       
   575    {@{thm (prem 1) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
       
   576     @{thm (prem 2) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
       
   577     @{thm (prem 3) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
       
   578     @{thm (prem 4) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
       
   579    {@{thm (concl) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$
       
   580   \end{tabular}
       
   581   \end{center}
       
   582 
       
   583   \noindent We claim that this relation captures the idea behind the two
       
   584   informal POSIX rules shown in the Introduction: Consider the second line
       
   585   where the POSIX values for an alternative regular expression is
       
   586   specified---it is always a @{text "Left"}-value, \emph{except} when the
       
   587   string to be matched is not in the language of @{term "r\<^sub>1"}; only then it
       
   588   is a @{text Right}-value (see the side-condition in the rule on the
       
   589   right). Interesting is also the rule for sequence regular expressions
       
   590   (third line). The first two premises state that @{term "v\<^sub>1"} and
       
   591   @{term "v\<^sub>2"} are the POSIX values for @{term "(s\<^sub>1,
       
   592   r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"} respectively.
       
   593   
       
   594   \begin{theorem}
       
   595   @{thm[mode=IfThen] PMatch_determ(1)[of _ _ "v\<^sub>1" "v\<^sub>2"]}
       
   596   \end{theorem}
       
   597 
       
   598   \begin{lemma}
       
   599   @{thm[mode=IfThen] PMatch_mkeps}
       
   600   \end{lemma}
       
   601 
       
   602   \begin{lemma}
       
   603   @{thm[mode=IfThen] Prf_injval_flat}
       
   604   \end{lemma}
       
   605 
       
   606   \begin{lemma}
       
   607   @{thm[mode=IfThen] PMatch2_roy_version}
       
   608   \end{lemma}
       
   609 
       
   610   \begin{theorem}\mbox{}\smallskip\\
       
   611   \begin{tabular}{ll}
       
   612   (1) & @{thm (lhs) lex_correct1a} if and only if @{thm (rhs) lex_correct1a}\\
       
   613   (2) & @{thm (lhs) lex_correct3a} if and only if @{thm (rhs) lex_correct3a}\\
       
   614   \end{tabular}
       
   615   \end{theorem}
   562 *}
   616 *}
   563 
   617 
   564 section {* The Argument by Sulzmmann and Lu *}
   618 section {* The Argument by Sulzmmann and Lu *}
   565 
   619 
   566 section {* Conclusion *}
   620 section {* Conclusion *}
   572 *}
   626 *}
   573 
   627 
   574 
   628 
   575 text {*
   629 text {*
   576 
   630 
   577 
       
   578 
       
   579 
       
   580   \noindent
       
   581   Values
       
   582 
       
   583   
       
   584 
       
   585  
       
   586 
       
   587  
       
   588 
       
   589   \noindent
       
   590   The @{const mkeps} function
       
   591 
       
   592  
       
   593 
       
   594   \noindent
       
   595   The @{text inj} function
       
   596 
       
   597   
       
   598 
       
   599   \noindent
       
   600   The inhabitation relation:
       
   601 
       
   602   \begin{center}
       
   603   \begin{tabular}{c}
       
   604   @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\ 
       
   605   @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad 
       
   606   @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\medskip\\
       
   607   @{thm[mode=Axiom] Prf.intros(4)} \qquad 
       
   608   @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\medskip\\
       
   609   @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad 
       
   610   @{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]}\medskip\\
       
   611   \end{tabular}
       
   612   \end{center}
       
   613 
   631 
   614   \noindent
   632   \noindent
   615   We have also introduced a slightly restricted version of this relation
   633   We have also introduced a slightly restricted version of this relation
   616   where the last rule is restricted so that @{term "flat v \<noteq> []"}.
   634   where the last rule is restricted so that @{term "flat v \<noteq> []"}.
   617   This relation for \emph{non-problematic} is written @{term "\<Turnstile> v : r"}.
   635   This relation for \emph{non-problematic} is written @{term "\<Turnstile> v : r"}.