81 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's |
81 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's |
82 derivatives is that they are neatly expressible in any functional language, |
82 derivatives is that they are neatly expressible in any functional language, |
83 and easily definable and reasoned about in theorem provers---the definitions |
83 and easily definable and reasoned about in theorem provers---the definitions |
84 just consist of inductive datatypes and simple recursive functions. A |
84 just consist of inductive datatypes and simple recursive functions. A |
85 completely formalised correctness proof of this matcher in for example HOL4 |
85 completely formalised correctness proof of this matcher in for example HOL4 |
86 has been mentioned in~\cite{Owens2008}. Another one in Isabelle/HOL is part |
86 has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part |
87 of the work in \cite{Krauss2011}. |
87 of the work by Krauss and Nipkow \cite{Krauss2011}. |
88 |
88 |
89 One limitation of Brzozowski's matcher is that it only generates a YES/NO |
89 One limitation of Brzozowski's matcher is that it only generates a YES/NO |
90 answer for whether a string is being matched by a regular expression. |
90 answer for whether a string is being matched by a regular expression. |
91 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow |
91 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow |
92 generation not just of a YES/NO answer but of an actual matching, called a |
92 generation not just of a YES/NO answer but of an actual matching, called a |
104 proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a |
104 proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a |
105 GREEDY regular expression matching algorithm. Beginning with our |
105 GREEDY regular expression matching algorithm. Beginning with our |
106 observations that, without evidence that it is transitive, it cannot be |
106 observations that, without evidence that it is transitive, it cannot be |
107 called an ``order relation'', and that the relation is called a ``total |
107 called an ``order relation'', and that the relation is called a ``total |
108 order'' despite being evidently not total\footnote{The relation @{text |
108 order'' despite being evidently not total\footnote{The relation @{text |
109 "\<ge>\<^bsub>r\<^esub>"} defined in \cite{Sulzmann2014} is a relation on the |
109 "\<ge>\<^bsub>r\<^esub>"} defined by Sulzmann and Lu \cite{Sulzmann2014} is a relation on the |
110 values for the regular expression @{term r}; but it only holds between |
110 values for the regular expression @{term r}; but it only holds between |
111 @{term v} and @{term "v'"} in cases where @{term v} and @{term "v'"} have |
111 @{term "v\<^sub>1"} and @{term "v\<^sub>2"} in cases where @{term "v\<^sub>1"} and @{term "v\<^sub>2"} have |
112 the same flattening (underlying string). So a counterexample to totality is |
112 the same flattening (underlying string). So a counterexample to totality is |
113 given by taking two values @{term v} and @{term "v'"} for @{term r} that |
113 given by taking two values @{term "v\<^sub>1"} and @{term "v\<^sub>2"} for @{term r} that |
114 have different flattenings (see Section~\ref{posixsec}). A different |
114 have different flattenings (see Section~\ref{posixsec}). A different |
115 relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r} |
115 relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r} |
116 with flattening @{term s} is definable by the same approach, and is indeed |
116 with flattening @{term s} is definable by the same approach, and is indeed |
117 total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.}, we |
117 total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.}, we |
118 identify problems with this approach (of which some of the proofs are not |
118 identify problems with this approach (of which some of the proofs are not |
121 being a {\em POSIX value} for a regular expression @{term r} and a string |
121 being a {\em POSIX value} for a regular expression @{term r} and a string |
122 @{term s}; we show that the algorithm computes such a value and that such a |
122 @{term s}; we show that the algorithm computes such a value and that such a |
123 value is unique. Proofs are both done by hand and checked in Isabelle/HOL. |
123 value is unique. Proofs are both done by hand and checked in Isabelle/HOL. |
124 The experience of doing our proofs has been that this mechanical checking |
124 The experience of doing our proofs has been that this mechanical checking |
125 was absolutely essential: this subject area has hidden snares. This was also |
125 was absolutely essential: this subject area has hidden snares. This was also |
126 noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching |
126 noted by Kuklewicz \cite{Kuklewicz} who found that nearly all POSIX matching |
127 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}. |
127 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}. |
128 |
128 |
129 If a regular expression matches a string, then in general there is more than |
129 If a regular expression matches a string, then in general there is more than |
130 one way of how the string is matched. There are two commonly used |
130 one way of how the string is matched. There are two commonly used |
131 disambiguation strategies to generate a unique answer: one is called GREEDY |
131 disambiguation strategies to generate a unique answer: one is called GREEDY |
148 identifiers, respectively. There are two underlying (informal) rules behind |
148 identifiers, respectively. There are two underlying (informal) rules behind |
149 tokenising a string in a POSIX fashion according to a collection of regular |
149 tokenising a string in a POSIX fashion according to a collection of regular |
150 expressions: |
150 expressions: |
151 |
151 |
152 \begin{itemize} |
152 \begin{itemize} |
153 \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):} |
153 \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}\smallskip |
154 |
154 |
155 The longest initial substring matched by any regular expression is taken as |
155 The longest initial substring matched by any regular expression is taken as |
156 next token.\smallskip |
156 next token.\smallskip |
157 |
157 |
158 \item[$\bullet$] \underline{Priority Rule:} |
158 \item[$\bullet$] \underline{Priority Rule:}\smallskip |
159 |
159 |
160 For a particular longest initial substring, the first regular expression |
160 For a particular longest initial substring, the first regular expression |
161 that can match determines the token. |
161 that can match determines the token. |
162 \end{itemize} |
162 \end{itemize} |
163 |
163 |
331 |
331 |
332 section {* POSIX Regular Expression Matching\label{posixsec} *} |
332 section {* POSIX Regular Expression Matching\label{posixsec} *} |
333 |
333 |
334 text {* |
334 text {* |
335 |
335 |
336 The clever idea in \cite{Sulzmann2014} is to introduce values for encoding |
336 The clever idea by Sulzmann and Lu \cite{Sulzmann2014} is to introduce values for encoding |
337 \emph{how} a regular expression matches a string and then define a |
337 \emph{how} a regular expression matches a string and then define a |
338 function on values that mirrors (but inverts) the construction of the |
338 function on values that mirrors (but inverts) the construction of the |
339 derivative on regular expressions. \emph{Values} are defined as the |
339 derivative on regular expressions. \emph{Values} are defined as the |
340 inductive datatype |
340 inductive datatype |
341 |
341 |
347 @{term "Right v"} $\mid$ |
347 @{term "Right v"} $\mid$ |
348 @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ |
348 @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ |
349 @{term "Stars vs"} |
349 @{term "Stars vs"} |
350 \end{center} |
350 \end{center} |
351 |
351 |
352 \noindent where we use @{term vs} to stand for a list of values. (This is |
352 \noindent where we use @{term vs} to stand for a list of |
353 similar to the approach taken by Frisch and Cardelli for GREEDY matching |
353 values. (This is similar to the approach taken by Frisch and |
354 \cite{Frisch2004}, and Sulzmann and Lu \cite{Sulzmann2014} for POSIX |
354 Cardelli for GREEDY matching \cite{Frisch2004}, and Sulzmann and Lu |
355 matching). The string underlying a value can be calculated by the @{const |
355 for POSIX matching \cite{Sulzmann2014}). The string underlying a |
356 flat} function, written @{term "flat DUMMY"} and defined as: |
356 value can be calculated by the @{const flat} function, written |
|
357 @{term "flat DUMMY"} and defined as: |
357 |
358 |
358 \begin{center} |
359 \begin{center} |
359 \begin{tabular}{lcl} |
360 \begin{tabular}{lcl} |
360 @{thm (lhs) flat.simps(1)} & $\dn$ & @{thm (rhs) flat.simps(1)}\\ |
361 @{thm (lhs) flat.simps(1)} & $\dn$ & @{thm (rhs) flat.simps(1)}\\ |
361 @{thm (lhs) flat.simps(2)} & $\dn$ & @{thm (rhs) flat.simps(2)}\\ |
362 @{thm (lhs) flat.simps(2)} & $\dn$ & @{thm (rhs) flat.simps(2)}\\ |
584 without giving evidence that it is transitive.} between values and argue |
585 without giving evidence that it is transitive.} between values and argue |
585 that there is a maximum value, as given by the derivative-based algorithm. |
586 that there is a maximum value, as given by the derivative-based algorithm. |
586 In contrast, we shall introduce a simple inductive definition that |
587 In contrast, we shall introduce a simple inductive definition that |
587 specifies directly what a \emph{POSIX value} is, incorporating the |
588 specifies directly what a \emph{POSIX value} is, incorporating the |
588 POSIX-specific choices into the side-conditions of our rules. Our |
589 POSIX-specific choices into the side-conditions of our rules. Our |
589 definition is inspired by the matching relation given in |
590 definition is inspired by the matching relation given by Vansummeren |
590 \cite{Vansummeren2006}. The relation we define is ternary and written as |
591 \cite{Vansummeren2006}. The relation we define is ternary and written as |
591 \mbox{@{term "s \<in> r \<rightarrow> v"}}, relating strings, regular expressions and |
592 \mbox{@{term "s \<in> r \<rightarrow> v"}}, relating strings, regular expressions and |
592 values. |
593 values. |
593 |
594 |
594 \begin{center} |
595 \begin{center} |
1094 |
1095 |
1095 section {* Conclusion *} |
1096 section {* Conclusion *} |
1096 |
1097 |
1097 text {* |
1098 text {* |
1098 |
1099 |
1099 We have implemented the POSIX value calculation algorithm introduced in |
1100 We have implemented the POSIX value calculation algorithm introduced by |
|
1101 Sulzmann and Lu |
1100 \cite{Sulzmann2014}. Our implementation is nearly identical to the |
1102 \cite{Sulzmann2014}. Our implementation is nearly identical to the |
1101 original and all modifications we introduced are harmless (like our char-clause for |
1103 original and all modifications we introduced are harmless (like our char-clause for |
1102 @{text inj}). We have proved this algorithm to be correct, but correct |
1104 @{text inj}). We have proved this algorithm to be correct, but correct |
1103 according to our own specification of what POSIX values are. Our |
1105 according to our own specification of what POSIX values are. Our |
1104 specification (inspired from work in \cite{Vansummeren2006}) appears to be |
1106 specification (inspired from work by Vansummeren \cite{Vansummeren2006}) appears to be |
1105 much simpler than in \cite{Sulzmann2014} and our proofs are nearly always |
1107 much simpler than in \cite{Sulzmann2014} and our proofs are nearly always |
1106 straightforward. We have attempted to formalise the original proof |
1108 straightforward. We have attempted to formalise the original proof |
1107 by Sulzmann and Lu \cite{Sulzmann2014}, but we believe it contains |
1109 by Sulzmann and Lu \cite{Sulzmann2014}, but we believe it contains |
1108 unfillable gaps. In the online version of \cite{Sulzmann2014}, the authors |
1110 unfillable gaps. In the online version of \cite{Sulzmann2014}, the authors |
1109 already acknowledge some small problems, but our experience suggests |
1111 already acknowledge some small problems, but our experience suggests |
1127 Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest |
1129 Nipkow \cite{Nipkow98}. This lexer also splits up strings into longest |
1128 initial substrings, but Nipkow's algorithm is not completely |
1130 initial substrings, but Nipkow's algorithm is not completely |
1129 computational. The algorithm by Sulzmann and Lu, in contrast, can be |
1131 computational. The algorithm by Sulzmann and Lu, in contrast, can be |
1130 implemented with easy in any functional language. A bespoke lexer for the |
1132 implemented with easy in any functional language. A bespoke lexer for the |
1131 Imp-language is formalised in Coq as part of the Software Foundations book |
1133 Imp-language is formalised in Coq as part of the Software Foundations book |
1132 \cite{Pierce2015}. The disadvantage of such bespoke lexers is that they |
1134 by Pierce et al\cite{Pierce2015}. The disadvantage of such bespoke lexers is that they |
1133 do not generalise easily to more advanced features. |
1135 do not generalise easily to more advanced features. |
1134 Our formalisation is available from |
1136 Our formalisation is available from |
1135 \url{http://www.inf.kcl.ac.uk/staff/urbanc/lex}. |
1137 \url{http://www.inf.kcl.ac.uk/staff/urbanc/lex}.\medskip |
1136 |
1138 |
1137 %\noindent |
1139 \noindent |
1138 %{\bf Acknowledgements:} |
1140 {\bf Acknowledgements:} |
1139 %We are grateful for the comments we received from anonymous |
1141 We are very grateful to Martin Sulzmann for his comments on our work and |
1140 %referees. |
1142 also patiently explaining to us the details in \cite{Sulzmann2014}. |
1141 \small |
1143 \small |
1142 \bibliographystyle{plain} |
1144 \bibliographystyle{plain} |
1143 \bibliography{root} |
1145 \bibliography{root} |
1144 |
1146 |
1145 *} |
1147 *} |