62 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's |
62 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's |
63 derivatives is that they are neatly expressible in any functional language, |
63 derivatives is that they are neatly expressible in any functional language, |
64 and easily definable and reasoned about in theorem provers---the definitions |
64 and easily definable and reasoned about in theorem provers---the definitions |
65 just consist of inductive datatypes and simple recursive functions. A |
65 just consist of inductive datatypes and simple recursive functions. A |
66 completely formalised correctness proof of this matcher in for example HOL4 |
66 completely formalised correctness proof of this matcher in for example HOL4 |
67 has been given in~\cite{Owens2008}. |
67 has been mentioned in~\cite{Owens2008}. |
68 |
68 |
69 One limitation of Brzozowski's matcher is that it only generates a YES/NO |
69 One limitation of Brzozowski's matcher is that it only generates a YES/NO |
70 answer for whether a string is being matched by a regular expression. |
70 answer for whether a string is being matched by a regular expression. |
71 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow |
71 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow |
72 generation not just of a YES/NO answer but of an actual matching, called a |
72 generation not just of a YES/NO answer but of an actual matching, called a |
83 element and that it is computed by their derivative-based algorithm. This |
83 element and that it is computed by their derivative-based algorithm. This |
84 proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a |
84 proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a |
85 GREEDY regular expression matching algorithm. Beginning with our |
85 GREEDY regular expression matching algorithm. Beginning with our |
86 observations that, without evidence that it is transitive, it cannot be |
86 observations that, without evidence that it is transitive, it cannot be |
87 called an ``order relation'', and that the relation is called a ``total |
87 called an ``order relation'', and that the relation is called a ``total |
88 order'' despite being evidently not total\footnote{\textcolor{green}{We |
88 order'' despite being evidently not total\footnote{The relation @{text |
89 should give an argument as footnote}}, we identify problems with this |
89 "\<ge>\<^bsub>r\<^esub>"} defined in \cite{Sulzmann2014} is a relation on the values for |
90 approach (of which some of the proofs are not published in |
90 the regular expression @{term r}; but it only holds between @{term v} and |
91 \cite{Sulzmann2014}); perhaps more importantly, we give a simple inductive |
91 @{term "v'"} in cases where @{term v} and @{term "v'"} have the same |
92 (and algorithm-independent) definition of what we call being a {\em POSIX |
92 flattening (underlying string). So a counterexample to totality is given by |
93 value} for a regular expression @{term r} and a string @{term s}; we show |
93 taking two values @{term v} and @{term "v'"} for @{term r} that have |
94 that the algorithm computes such a value and that such a value is unique. |
94 different flattenings. A different relation @{text "\<ge>\<^bsub>r,s\<^esub>"} |
95 Proofs are both done by hand and checked in Isabelle/HOL. The |
95 on the set of values for @{term r} with flattening @{term s} is definable by |
96 experience of doing our proofs has been that this mechanical checking was |
96 the same approach, and is indeed total; but that is not what Proposition 1 |
97 absolutely essential: this subject area has hidden snares. This was also |
97 of \cite{Sulzmann2014} does.}, we identify problems with this approach (of |
98 noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching |
98 which some of the proofs are not published in \cite{Sulzmann2014}); perhaps |
99 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}. |
99 more importantly, we give a simple inductive (and algorithm-independent) |
|
100 definition of what we call being a {\em POSIX value} for a regular |
|
101 expression @{term r} and a string @{term s}; we show that the algorithm |
|
102 computes such a value and that such a value is unique. Proofs are both done |
|
103 by hand and checked in Isabelle/HOL. The experience of doing our proofs has |
|
104 been that this mechanical checking was absolutely essential: this subject |
|
105 area has hidden snares. This was also noted by Kuklewitz \cite{Kuklewicz} |
|
106 who found that nearly all POSIX matching implementations are ``buggy'' |
|
107 \cite[Page 203]{Sulzmann2014}. |
100 |
108 |
101 If a regular expression matches a string, then in general there is more than |
109 If a regular expression matches a string, then in general there is more than |
102 one way of how the string is matched. There are two commonly used |
110 one way of how the string is matched. There are two commonly used |
103 disambiguation strategies to generate a unique answer: one is called GREEDY |
111 disambiguation strategies to generate a unique answer: one is called GREEDY |
104 matching \cite{Frisch2004} and the other is POSIX |
112 matching \cite{Frisch2004} and the other is POSIX |
367 expression can match the empty string. If yes, we call the function |
375 expression can match the empty string. If yes, we call the function |
368 $mkeps$. |
376 $mkeps$. |
369 |
377 |
370 \begin{figure}[t] |
378 \begin{figure}[t] |
371 \begin{center} |
379 \begin{center} |
372 \begin{tikzpicture}[scale=2,node distance=1.2cm, |
380 \begin{tikzpicture}[scale=2,node distance=1.3cm, |
373 every node/.style={minimum size=7mm}] |
381 every node/.style={minimum size=7mm}] |
374 \node (r1) {@{term "r\<^sub>1"}}; |
382 \node (r1) {@{term "r\<^sub>1"}}; |
375 \node (r2) [right=of r1]{@{term "r\<^sub>2"}}; |
383 \node (r2) [right=of r1]{@{term "r\<^sub>2"}}; |
376 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {@{term "der a DUMMY"}}; |
384 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {@{term "der a DUMMY"}}; |
377 \node (r3) [right=of r2]{@{term "r\<^sub>3"}}; |
385 \node (r3) [right=of r2]{@{term "r\<^sub>3"}}; |
380 \draw[->,line width=1mm](r3)--(r4) node[above,midway] {@{term "der c DUMMY"}}; |
388 \draw[->,line width=1mm](r3)--(r4) node[above,midway] {@{term "der c DUMMY"}}; |
381 \draw (r4) node[anchor=west] {\;\raisebox{3mm}{@{term nullable}}}; |
389 \draw (r4) node[anchor=west] {\;\raisebox{3mm}{@{term nullable}}}; |
382 \node (v4) [below=of r4]{@{term "v\<^sub>4"}}; |
390 \node (v4) [below=of r4]{@{term "v\<^sub>4"}}; |
383 \draw[->,line width=1mm](r4) -- (v4); |
391 \draw[->,line width=1mm](r4) -- (v4); |
384 \node (v3) [left=of v4] {@{term "v\<^sub>3"}}; |
392 \node (v3) [left=of v4] {@{term "v\<^sub>3"}}; |
385 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {@{term "inj c"}}; |
393 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {@{text "inj r\<^sub>3 c"}}; |
386 \node (v2) [left=of v3]{@{term "v\<^sub>2"}}; |
394 \node (v2) [left=of v3]{@{term "v\<^sub>2"}}; |
387 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {@{term "inj b"}}; |
395 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {@{text "inj r\<^sub>2 b"}}; |
388 \node (v1) [left=of v2] {@{term "v\<^sub>1"}}; |
396 \node (v1) [left=of v2] {@{term "v\<^sub>1"}}; |
389 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {@{term "inj a"}}; |
397 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {@{text "inj r\<^sub>1 a"}}; |
390 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{@{term "mkeps"}}}; |
398 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{@{term "mkeps"}}}; |
391 \end{tikzpicture} |
399 \end{tikzpicture} |
392 \end{center} |
400 \end{center} |
393 \caption{The two phases of the algorithm by Sulzmann \& Lu \cite{Sulzmann2014} |
401 \caption{The two phases of the algorithm by Sulzmann \& Lu \cite{Sulzmann2014} |
394 analysing the string @{term "[a,b,c]"}. The first phase (the arrows from |
402 matching the string @{term "[a,b,c]"}. The first phase (the arrows from |
395 left to right) is \Brz's matcher building succesive derivatives. If at the |
403 left to right) is \Brz's matcher building succesive derivatives. If at the |
396 last regular expression is @{term nullable}, then functions of the |
404 last regular expression is @{term nullable}, then functions of the |
397 second phase are called: first @{term mkeps} calculates a value witnessing |
405 second phase are called: first @{term mkeps} calculates a value witnessing |
398 how the empty string has been recognised by @{term "r\<^sub>4"}. After |
406 how the empty string has been recognised by @{term "r\<^sub>4"}. After |
399 that the function @{term inj} `injects back' the characters of the string into |
407 that the function @{term inj} `injects back' the characters of the string into |