thys/Paper/Paper.thy
changeset 115 15ef2af1a6f2
parent 114 8b41d01b5e5d
child 116 022503caa187
equal deleted inserted replaced
114:8b41d01b5e5d 115:15ef2af1a6f2
    62 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
    62 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's
    63 derivatives is that they are neatly expressible in any functional language,
    63 derivatives is that they are neatly expressible in any functional language,
    64 and easily definable and reasoned about in theorem provers---the definitions
    64 and easily definable and reasoned about in theorem provers---the definitions
    65 just consist of inductive datatypes and simple recursive functions. A
    65 just consist of inductive datatypes and simple recursive functions. A
    66 completely formalised correctness proof of this matcher in for example HOL4
    66 completely formalised correctness proof of this matcher in for example HOL4
    67 has been given in~\cite{Owens2008}.
    67 has been mentioned in~\cite{Owens2008}.
    68 
    68 
    69 One limitation of Brzozowski's matcher is that it only generates a YES/NO
    69 One limitation of Brzozowski's matcher is that it only generates a YES/NO
    70 answer for whether a string is being matched by a regular expression.
    70 answer for whether a string is being matched by a regular expression.
    71 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow
    71 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow
    72 generation not just of a YES/NO answer but of an actual matching, called a
    72 generation not just of a YES/NO answer but of an actual matching, called a
    83 element and that it is computed by their derivative-based algorithm. This
    83 element and that it is computed by their derivative-based algorithm. This
    84 proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a
    84 proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a
    85 GREEDY regular expression matching algorithm. Beginning with our
    85 GREEDY regular expression matching algorithm. Beginning with our
    86 observations that, without evidence that it is transitive, it cannot be
    86 observations that, without evidence that it is transitive, it cannot be
    87 called an ``order relation'', and that the relation is called a ``total
    87 called an ``order relation'', and that the relation is called a ``total
    88 order'' despite being evidently not total\footnote{\textcolor{green}{We
    88 order'' despite being evidently not total\footnote{The relation @{text
    89 should give an argument as footnote}}, we identify problems with this
    89 "\<ge>\<^bsub>r\<^esub>"} defined in \cite{Sulzmann2014} is a relation on the values for
    90 approach (of which some of the proofs are not published in
    90 the regular expression @{term r}; but it only holds between @{term v} and
    91 \cite{Sulzmann2014}); perhaps more importantly, we give a simple inductive
    91 @{term "v'"} in cases where @{term v} and @{term "v'"} have the same
    92 (and algorithm-independent) definition of what we call being a {\em POSIX
    92 flattening (underlying string). So a counterexample to totality is given by
    93 value} for a regular expression @{term r} and a string @{term s}; we show
    93 taking two values @{term v} and @{term "v'"} for @{term r} that have
    94 that the algorithm computes such a value and that such a value is unique.
    94 different flattenings. A different relation @{text "\<ge>\<^bsub>r,s\<^esub>"}
    95 Proofs are both done by hand and checked in Isabelle/HOL. The
    95 on the set of values for @{term r} with flattening @{term s} is definable by
    96 experience of doing our proofs has been that this mechanical checking was
    96 the same approach, and is indeed total; but that is not what Proposition 1
    97 absolutely essential: this subject area has hidden snares. This was also
    97 of \cite{Sulzmann2014} does.}, we identify problems with this approach (of
    98 noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching
    98 which some of the proofs are not published in \cite{Sulzmann2014}); perhaps
    99 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}.
    99 more importantly, we give a simple inductive (and algorithm-independent)
       
   100 definition of what we call being a {\em POSIX value} for a regular
       
   101 expression @{term r} and a string @{term s}; we show that the algorithm
       
   102 computes such a value and that such a value is unique. Proofs are both done
       
   103 by hand and checked in Isabelle/HOL. The experience of doing our proofs has
       
   104 been that this mechanical checking was absolutely essential: this subject
       
   105 area has hidden snares. This was also noted by Kuklewitz \cite{Kuklewicz}
       
   106 who found that nearly all POSIX matching implementations are ``buggy''
       
   107 \cite[Page 203]{Sulzmann2014}.
   100 
   108 
   101 If a regular expression matches a string, then in general there is more than
   109 If a regular expression matches a string, then in general there is more than
   102 one way of how the string is matched. There are two commonly used
   110 one way of how the string is matched. There are two commonly used
   103 disambiguation strategies to generate a unique answer: one is called GREEDY
   111 disambiguation strategies to generate a unique answer: one is called GREEDY
   104 matching \cite{Frisch2004} and the other is POSIX
   112 matching \cite{Frisch2004} and the other is POSIX
   367   expression can match the empty string. If yes, we call the function
   375   expression can match the empty string. If yes, we call the function
   368   $mkeps$.
   376   $mkeps$.
   369 
   377 
   370 \begin{figure}[t]
   378 \begin{figure}[t]
   371 \begin{center}
   379 \begin{center}
   372 \begin{tikzpicture}[scale=2,node distance=1.2cm,
   380 \begin{tikzpicture}[scale=2,node distance=1.3cm,
   373                     every node/.style={minimum size=7mm}]
   381                     every node/.style={minimum size=7mm}]
   374 \node (r1)  {@{term "r\<^sub>1"}};
   382 \node (r1)  {@{term "r\<^sub>1"}};
   375 \node (r2) [right=of r1]{@{term "r\<^sub>2"}};
   383 \node (r2) [right=of r1]{@{term "r\<^sub>2"}};
   376 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {@{term "der a DUMMY"}};
   384 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {@{term "der a DUMMY"}};
   377 \node (r3) [right=of r2]{@{term "r\<^sub>3"}};
   385 \node (r3) [right=of r2]{@{term "r\<^sub>3"}};
   380 \draw[->,line width=1mm](r3)--(r4) node[above,midway] {@{term "der c DUMMY"}};
   388 \draw[->,line width=1mm](r3)--(r4) node[above,midway] {@{term "der c DUMMY"}};
   381 \draw (r4) node[anchor=west] {\;\raisebox{3mm}{@{term nullable}}};
   389 \draw (r4) node[anchor=west] {\;\raisebox{3mm}{@{term nullable}}};
   382 \node (v4) [below=of r4]{@{term "v\<^sub>4"}};
   390 \node (v4) [below=of r4]{@{term "v\<^sub>4"}};
   383 \draw[->,line width=1mm](r4) -- (v4);
   391 \draw[->,line width=1mm](r4) -- (v4);
   384 \node (v3) [left=of v4] {@{term "v\<^sub>3"}};
   392 \node (v3) [left=of v4] {@{term "v\<^sub>3"}};
   385 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {@{term "inj c"}};
   393 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {@{text "inj r\<^sub>3 c"}};
   386 \node (v2) [left=of v3]{@{term "v\<^sub>2"}};
   394 \node (v2) [left=of v3]{@{term "v\<^sub>2"}};
   387 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {@{term "inj b"}};
   395 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {@{text "inj r\<^sub>2 b"}};
   388 \node (v1) [left=of v2] {@{term "v\<^sub>1"}};
   396 \node (v1) [left=of v2] {@{term "v\<^sub>1"}};
   389 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {@{term "inj a"}};
   397 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {@{text "inj r\<^sub>1 a"}};
   390 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{@{term "mkeps"}}};
   398 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{@{term "mkeps"}}};
   391 \end{tikzpicture}
   399 \end{tikzpicture}
   392 \end{center}
   400 \end{center}
   393 \caption{The two phases of the algorithm by Sulzmann \& Lu \cite{Sulzmann2014}
   401 \caption{The two phases of the algorithm by Sulzmann \& Lu \cite{Sulzmann2014}
   394 analysing the string @{term "[a,b,c]"}. The first phase (the arrows from 
   402 matching the string @{term "[a,b,c]"}. The first phase (the arrows from 
   395 left to right) is \Brz's matcher building succesive derivatives. If at the 
   403 left to right) is \Brz's matcher building succesive derivatives. If at the 
   396 last regular expression is @{term nullable}, then functions of the 
   404 last regular expression is @{term nullable}, then functions of the 
   397 second phase are called: first @{term mkeps} calculates a value witnessing
   405 second phase are called: first @{term mkeps} calculates a value witnessing
   398 how the empty string has been recognised by @{term "r\<^sub>4"}. After
   406 how the empty string has been recognised by @{term "r\<^sub>4"}. After
   399 that the function @{term inj} `injects back' the characters of the string into
   407 that the function @{term inj} `injects back' the characters of the string into