120 |
121 |
121 |
122 |
122 text {* |
123 text {* |
123 |
124 |
124 Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em |
125 Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em |
125 derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a |
126 derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ |
126 character~@{text c}, and showed that it gave a simple solution to the |
127 a character~@{text c}, and showed that it gave a simple solution to the |
127 problem of matching a string @{term s} with a regular expression @{term r}: |
128 problem of matching a string @{term s} with a regular expression @{term |
128 if the derivative of @{term r} w.r.t.\ (in succession) all the characters of |
129 r}: if the derivative of @{term r} w.r.t.\ (in succession) all the |
129 the string matches the empty string, then @{term r} matches @{term s} (and |
130 characters of the string matches the empty string, then @{term r} |
130 {\em vice versa}). The derivative has the property (which may almost be |
131 matches @{term s} (and {\em vice versa}). The derivative has the |
131 regarded as its specification) that, for every string @{term s} and regular |
132 property (which may almost be regarded as its specification) that, for |
132 expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if |
133 every string @{term s} and regular expression @{term r} and character |
133 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's |
134 @{term c}, one has @{term "cs \<in> L(r)"} if and only if \mbox{@{term "s \<in> L(der c r)"}}. |
134 derivatives is that they are neatly expressible in any functional language, |
135 The beauty of Brzozowski's derivatives is that |
135 and easily definable and reasoned about in theorem provers---the definitions |
136 they are neatly expressible in any functional language, and easily |
136 just consist of inductive datatypes and simple recursive functions. A |
137 definable and reasoned about in theorem provers---the definitions just |
|
138 consist of inductive datatypes and simple recursive functions. A |
137 mechanised correctness proof of Brzozowski's matcher in for example HOL4 |
139 mechanised correctness proof of Brzozowski's matcher in for example HOL4 |
138 has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part |
140 has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in |
139 of the work by Krauss and Nipkow \cite{Krauss2011}. And another one in Coq is given |
141 Isabelle/HOL is part of the work by Krauss and Nipkow \cite{Krauss2011}. |
140 by Coquand and Siles \cite{Coquand2012}. |
142 And another one in Coq is given by Coquand and Siles \cite{Coquand2012}. |
141 |
143 |
142 If a regular expression matches a string, then in general there is more than |
144 If a regular expression matches a string, then in general there is more |
143 one way of how the string is matched. There are two commonly used |
145 than one way of how the string is matched. There are two commonly used |
144 disambiguation strategies to generate a unique answer: one is called GREEDY |
146 disambiguation strategies to generate a unique answer: one is called |
145 matching \cite{Frisch2004} and the other is POSIX |
147 GREEDY matching \cite{Frisch2004} and the other is POSIX |
146 matching~\cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}. For example consider |
148 matching~\cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}. |
147 the string @{term xy} and the regular expression \mbox{@{term "STAR (ALT |
149 For example consider the string @{term xy} and the regular expression |
148 (ALT x y) xy)"}}. Either the string can be matched in two `iterations' by |
150 \mbox{@{term "STAR (ALT (ALT x y) xy)"}}. Either the string can be |
149 the single letter-regular expressions @{term x} and @{term y}, or directly |
151 matched in two `iterations' by the single letter-regular expressions |
150 in one iteration by @{term xy}. The first case corresponds to GREEDY |
152 @{term x} and @{term y}, or directly in one iteration by @{term xy}. The |
151 matching, which first matches with the left-most symbol and only matches the |
153 first case corresponds to GREEDY matching, which first matches with the |
152 next symbol in case of a mismatch (this is greedy in the sense of preferring |
154 left-most symbol and only matches the next symbol in case of a mismatch |
153 instant gratification to delayed repletion). The second case is POSIX |
155 (this is greedy in the sense of preferring instant gratification to |
154 matching, which prefers the longest match. |
156 delayed repletion). The second case is POSIX matching, which prefers the |
|
157 longest match. |
155 |
158 |
156 In the context of lexing, where an input string needs to be split up |
159 In the context of lexing, where an input string needs to be split up |
157 into a sequence of tokens, POSIX is the more natural disambiguation |
160 into a sequence of tokens, POSIX is the more natural disambiguation |
158 strategy for what programmers consider basic syntactic building blocks |
161 strategy for what programmers consider basic syntactic building blocks |
159 in their programs. These building blocks are often specified by some |
162 in their programs. These building blocks are often specified by some |
160 regular expressions, say @{text "r\<^bsub>key\<^esub>"} and @{text |
163 regular expressions, say @{text "r\<^bsub>key\<^esub>"} and @{text |
161 "r\<^bsub>id\<^esub>"} for recognising keywords and identifiers, |
164 "r\<^bsub>id\<^esub>"} for recognising keywords and identifiers, |
162 respectively. There are a few underlying (informal) rules behind |
165 respectively. There are a few underlying (informal) rules behind |
163 tokenising a string in a POSIX \cite{POSIX} fashion according to a |
166 tokenising a string in a POSIX \cite{POSIX} fashion: |
164 collection of regular expressions: |
|
165 |
167 |
166 \begin{itemize} |
168 \begin{itemize} |
167 \item[$\bullet$] \emph{The Longest Match Rule} (or \emph{``{M}aximal {M}unch {R}ule''}): |
169 \item[$\bullet$] \emph{The Longest Match Rule} (or \emph{``{M}aximal {M}unch {R}ule''}): |
168 The longest initial substring matched by any regular expression is taken as |
170 The longest initial substring matched by any regular expression is taken as |
169 next token.\smallskip |
171 next token.\smallskip |
224 Char} are constructors for values. @{text Stars} records how many |
226 Char} are constructors for values. @{text Stars} records how many |
225 iterations were used; @{text Left}, respectively @{text Right}, which |
227 iterations were used; @{text Left}, respectively @{text Right}, which |
226 alternative is used. This `tree view' leads naturally to the idea that |
228 alternative is used. This `tree view' leads naturally to the idea that |
227 regular expressions act as types and values as inhabiting those types |
229 regular expressions act as types and values as inhabiting those types |
228 (see, for example, \cite{HosoyaVouillonPierce2005}). The value for |
230 (see, for example, \cite{HosoyaVouillonPierce2005}). The value for |
229 the single `iteration', i.e.~the POSIX value, would look as follows |
231 matching @{text "xy"} in a single `iteration', i.e.~the POSIX value, |
|
232 would look as follows |
230 |
233 |
231 \begin{center} |
234 \begin{center} |
232 @{term "Stars [Seq (Char x) (Char y)]"} |
235 @{term "Stars [Seq (Char x) (Char y)]"} |
233 \end{center} |
236 \end{center} |
234 |
237 |
235 \noindent where @{const Stars} has only a single-element list for the |
238 \noindent where @{const Stars} has only a single-element list for the |
236 single iteration and @{const Seq} indicates that @{term xy} is matched |
239 single iteration and @{const Seq} indicates that @{term xy} is matched |
237 by a sequence regular expression, which we will in what follows |
240 by a sequence regular expression. |
238 write more formally as @{term "SEQ x y"}. |
241 |
|
242 %, which we will in what follows |
|
243 %write more formally as @{term "SEQ x y"}. |
239 |
244 |
240 |
245 |
241 Sulzmann and Lu give a simple algorithm to calculate a value that |
246 Sulzmann and Lu give a simple algorithm to calculate a value that |
242 appears to be the value associated with POSIX matching. The challenge |
247 appears to be the value associated with POSIX matching. The challenge |
243 then is to specify that value, in an algorithm-independent fashion, |
248 then is to specify that value, in an algorithm-independent fashion, |
269 \it{}``The POSIX strategy is more complicated than the greedy because of |
274 \it{}``The POSIX strategy is more complicated than the greedy because of |
270 the dependence on information about the length of matched strings in the |
275 the dependence on information about the length of matched strings in the |
271 various subexpressions.'' |
276 various subexpressions.'' |
272 \end{quote} |
277 \end{quote} |
273 |
278 |
274 %\footnote{The relation @{text "\<ge>\<^bsub>r\<^esub>"} defined by Sulzmann and Lu \cite{Sulzmann2014} |
|
275 %is a relation on the |
|
276 %values for the regular expression @{term r}; but it only holds between |
|
277 %@{term "v\<^sub>1"} and @{term "v\<^sub>2"} in cases where @{term "v\<^sub>1"} and @{term "v\<^sub>2"} have |
|
278 %the same flattening (underlying string). So a counterexample to totality is |
|
279 %given by taking two values @{term "v\<^sub>1"} and @{term "v\<^sub>2"} for @{term r} that |
|
280 %have different flattenings (see Section~\ref{posixsec}). A different |
|
281 %relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r} |
|
282 %with flattening @{term s} is definable by the same approach, and is indeed |
|
283 %total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.} |
|
284 |
279 |
285 |
280 |
286 \noindent {\bf Contributions:} We have implemented in Isabelle/HOL the |
281 \noindent {\bf Contributions:} We have implemented in Isabelle/HOL the |
287 derivative-based regular expression matching algorithm of |
282 derivative-based regular expression matching algorithm of |
288 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this |
283 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this |
295 informal proof contains gaps, and possible fixes are not fully worked out.} |
290 informal proof contains gaps, and possible fixes are not fully worked out.} |
296 Our specification of a POSIX value consists of a simple inductive definition |
291 Our specification of a POSIX value consists of a simple inductive definition |
297 that given a string and a regular expression uniquely determines this value. |
292 that given a string and a regular expression uniquely determines this value. |
298 We also show that our definition is equivalent to an ordering |
293 We also show that our definition is equivalent to an ordering |
299 of values based on positions by Okui and Suzuki \cite{OkuiSuzuki2010}. |
294 of values based on positions by Okui and Suzuki \cite{OkuiSuzuki2010}. |
300 Derivatives as calculated by Brzozowski's method are usually more complex |
295 |
301 regular expressions than the initial one; various optimisations are |
296 %Derivatives as calculated by Brzozowski's method are usually more complex |
302 possible. We prove the correctness when simplifications of @{term "ALT ZERO |
297 %regular expressions than the initial one; various optimisations are |
303 r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to |
298 %possible. We prove the correctness when simplifications of @{term "ALT ZERO r"}, |
304 @{term r} are applied. We extend our results to ??? |
299 %@{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to |
|
300 %@{term r} are applied. |
|
301 |
|
302 We extend our results to ??? |
305 |
303 |
306 *} |
304 *} |
307 |
305 |
308 section {* Preliminaries *} |
306 section {* Preliminaries *} |
309 |
307 |