82 derivatives is that they are neatly expressible in any functional language, |
82 derivatives is that they are neatly expressible in any functional language, |
83 and easily definable and reasoned about in theorem provers---the definitions |
83 and easily definable and reasoned about in theorem provers---the definitions |
84 just consist of inductive datatypes and simple recursive functions. A |
84 just consist of inductive datatypes and simple recursive functions. A |
85 completely formalised correctness proof of this matcher in for example HOL4 |
85 completely formalised correctness proof of this matcher in for example HOL4 |
86 has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part |
86 has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part |
87 of the work by Krauss and Nipkow \cite{Krauss2011}. |
87 of the work by Krauss and Nipkow \cite{Krauss2011}. And another one in Coq is given |
|
88 by Coquand and Siles \cite{Coquand2012}. |
88 |
89 |
89 One limitation of Brzozowski's matcher is that it only generates a YES/NO |
90 One limitation of Brzozowski's matcher is that it only generates a YES/NO |
90 answer for whether a string is being matched by a regular expression. |
91 answer for whether a string is being matched by a regular expression. |
91 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow |
92 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow |
92 generation not just of a YES/NO answer but of an actual matching, called a |
93 generation not just of a YES/NO answer but of an actual matching, called a |
100 The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a |
101 The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a |
101 relation (called an ``order relation'') on the set of values of @{term r}, |
102 relation (called an ``order relation'') on the set of values of @{term r}, |
102 and to show that (once a string to be matched is chosen) there is a maximum |
103 and to show that (once a string to be matched is chosen) there is a maximum |
103 element and that it is computed by their derivative-based algorithm. This |
104 element and that it is computed by their derivative-based algorithm. This |
104 proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a |
105 proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a |
105 GREEDY regular expression matching algorithm. Beginning with our |
106 GREEDY regular expression matching algorithm. However, we were not able to |
106 observations that, without evidence that it is transitive, it cannot be |
107 establish transitivity and totality for the ``order relation'' by |
107 called an ``order relation'', and that the relation is called a ``total |
108 Sulzmann and Lu. In Section \ref{arg} we |
108 order'' despite being evidently not total\footnote{The relation @{text |
109 identify problems with their approach (of which some of the proofs are not |
109 "\<ge>\<^bsub>r\<^esub>"} defined by Sulzmann and Lu \cite{Sulzmann2014} is a relation on the |
|
110 values for the regular expression @{term r}; but it only holds between |
|
111 @{term "v\<^sub>1"} and @{term "v\<^sub>2"} in cases where @{term "v\<^sub>1"} and @{term "v\<^sub>2"} have |
|
112 the same flattening (underlying string). So a counterexample to totality is |
|
113 given by taking two values @{term "v\<^sub>1"} and @{term "v\<^sub>2"} for @{term r} that |
|
114 have different flattenings (see Section~\ref{posixsec}). A different |
|
115 relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r} |
|
116 with flattening @{term s} is definable by the same approach, and is indeed |
|
117 total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.}, we |
|
118 identify problems with this approach (of which some of the proofs are not |
|
119 published in \cite{Sulzmann2014}); perhaps more importantly, we give a |
110 published in \cite{Sulzmann2014}); perhaps more importantly, we give a |
120 simple inductive (and algorithm-independent) definition of what we call |
111 simple inductive (and algorithm-independent) definition of what we call |
121 being a {\em POSIX value} for a regular expression @{term r} and a string |
112 being a {\em POSIX value} for a regular expression @{term r} and a string |
122 @{term s}; we show that the algorithm computes such a value and that such a |
113 @{term s}; we show that the algorithm computes such a value and that such a |
123 value is unique. Proofs are both done by hand and checked in Isabelle/HOL. |
114 value is unique. Proofs are both done by hand and checked in Isabelle/HOL. |
124 The experience of doing our proofs has been that this mechanical checking |
115 The experience of doing our proofs has been that this mechanical checking |
125 was absolutely essential: this subject area has hidden snares. This was also |
116 was absolutely essential: this subject area has hidden snares. This was also |
126 noted by Kuklewicz \cite{Kuklewicz} who found that nearly all POSIX matching |
117 noted by Kuklewicz \cite{Kuklewicz} who found that nearly all POSIX matching |
127 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}. |
118 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}. |
|
119 |
|
120 %\footnote{The relation @{text "\<ge>\<^bsub>r\<^esub>"} defined by Sulzmann and Lu \cite{Sulzmann2014} |
|
121 %is a relation on the |
|
122 %values for the regular expression @{term r}; but it only holds between |
|
123 %@{term "v\<^sub>1"} and @{term "v\<^sub>2"} in cases where @{term "v\<^sub>1"} and @{term "v\<^sub>2"} have |
|
124 %the same flattening (underlying string). So a counterexample to totality is |
|
125 %given by taking two values @{term "v\<^sub>1"} and @{term "v\<^sub>2"} for @{term r} that |
|
126 %have different flattenings (see Section~\ref{posixsec}). A different |
|
127 %relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r} |
|
128 %with flattening @{term s} is definable by the same approach, and is indeed |
|
129 %total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.} |
|
130 |
128 |
131 |
129 If a regular expression matches a string, then in general there is more than |
132 If a regular expression matches a string, then in general there is more than |
130 one way of how the string is matched. There are two commonly used |
133 one way of how the string is matched. There are two commonly used |
131 disambiguation strategies to generate a unique answer: one is called GREEDY |
134 disambiguation strategies to generate a unique answer: one is called GREEDY |
132 matching \cite{Frisch2004} and the other is POSIX |
135 matching \cite{Frisch2004} and the other is POSIX |