100 completely formalised correctness proof of this matcher in for example HOL4 |
100 completely formalised correctness proof of this matcher in for example HOL4 |
101 has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part |
101 has been mentioned by Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part |
102 of the work by Krauss and Nipkow \cite{Krauss2011}. And another one in Coq is given |
102 of the work by Krauss and Nipkow \cite{Krauss2011}. And another one in Coq is given |
103 by Coquand and Siles \cite{Coquand2012}. |
103 by Coquand and Siles \cite{Coquand2012}. |
104 |
104 |
|
105 If a regular expression matches a string, then in general there is more than |
|
106 one way of how the string is matched. There are two commonly used |
|
107 disambiguation strategies to generate a unique answer: one is called GREEDY |
|
108 matching \cite{Frisch2004} and the other is POSIX |
|
109 matching~\cite{Kuklewicz,Sulzmann2014,Vansummeren2006}. For example consider |
|
110 the string @{term xy} and the regular expression \mbox{@{term "STAR (ALT |
|
111 (ALT x y) xy)"}}. Either the string can be matched in two `iterations' by |
|
112 the single letter-regular expressions @{term x} and @{term y}, or directly |
|
113 in one iteration by @{term xy}. The first case corresponds to GREEDY |
|
114 matching, which first matches with the left-most symbol and only matches the |
|
115 next symbol in case of a mismatch (this is greedy in the sense of preferring |
|
116 instant gratification to delayed repletion). The second case is POSIX |
|
117 matching, which prefers the longest match. |
|
118 |
|
119 In the context of lexing, where an input string needs to be split up into a |
|
120 sequence of tokens, POSIX is the more natural disambiguation strategy for |
|
121 what programmers consider basic syntactic building blocks in their programs. |
|
122 These building blocks are often specified by some regular expressions, say |
|
123 @{text "r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising keywords and |
|
124 identifiers, respectively. There are two underlying (informal) rules behind |
|
125 tokenising a string in a POSIX fashion according to a collection of regular |
|
126 expressions: |
|
127 |
|
128 \begin{itemize} |
|
129 \item[$\bullet$] \emph{The Longest Match Rule} (or \emph{``maximal munch rule''}): |
|
130 The longest initial substring matched by any regular expression is taken as |
|
131 next token.\smallskip |
|
132 |
|
133 \item[$\bullet$] \emph{Priority Rule:} |
|
134 For a particular longest initial substring, the first regular expression |
|
135 that can match determines the token. |
|
136 \end{itemize} |
|
137 |
|
138 \noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords |
|
139 such as @{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} |
|
140 recognising identifiers (say, a single character followed by |
|
141 characters or numbers). Then we can form the regular expression |
|
142 @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX matching to tokenise strings, |
|
143 say @{text "iffoo"} and @{text "if"}. For @{text "iffoo"} we obtain |
|
144 by the Longest Match Rule a single identifier token, not a keyword |
|
145 followed by an identifier. For @{text "if"} we obtain by the Priority |
|
146 Rule a keyword token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"} |
|
147 matches also. |
|
148 |
105 One limitation of Brzozowski's matcher is that it only generates a |
149 One limitation of Brzozowski's matcher is that it only generates a |
106 YES/NO answer for whether a string is being matched by a regular |
150 YES/NO answer for whether a string is being matched by a regular |
107 expression. Sulzmann and Lu \cite{Sulzmann2014} extended this matcher |
151 expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this matcher |
108 to allow generation not just of a YES/NO answer but of an actual |
152 to allow generation not just of a YES/NO answer but of an actual |
109 matching, called a [lexical] {\em value}. They give a simple algorithm |
153 matching, called a [lexical] {\em value}. They give a simple algorithm |
110 to calculate a value that appears to be the value associated with |
154 to calculate a value that appears to be the value associated with |
111 POSIX matching. The challenge then is to specify that value, in an |
155 POSIX matching. The challenge then is to specify that value, in an |
112 algorithm-independent fashion, and to show that Sulzmann and Lu's |
156 algorithm-independent fashion, and to show that Sulzmann and Lu's |
113 derivative-based algorithm does indeed calculate a value that is |
157 derivative-based algorithm does indeed calculate a value that is |
114 correct according to the specification. |
158 correct according to the specification. |
115 |
159 |
116 The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a |
160 The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a |
125 which some of the proofs are not published in \cite{Sulzmann2014}); |
169 which some of the proofs are not published in \cite{Sulzmann2014}); |
126 perhaps more importantly, we give a simple inductive (and |
170 perhaps more importantly, we give a simple inductive (and |
127 algorithm-independent) definition of what we call being a {\em POSIX |
171 algorithm-independent) definition of what we call being a {\em POSIX |
128 value} for a regular expression @{term r} and a string @{term s}; we |
172 value} for a regular expression @{term r} and a string @{term s}; we |
129 show that the algorithm computes such a value and that such a value is |
173 show that the algorithm computes such a value and that such a value is |
130 unique. Proofs are both done by hand and checked in Isabelle/HOL. The |
174 unique. Our proofs are both done by hand and checked in Isabelle/HOL. The |
131 experience of doing our proofs has been that this mechanical checking |
175 experience of doing our proofs has been that this mechanical checking |
132 was absolutely essential: this subject area has hidden snares. This |
176 was absolutely essential: this subject area has hidden snares. This |
133 was also noted by Kuklewicz \cite{Kuklewicz} who found that nearly all |
177 was also noted by Kuklewicz \cite{Kuklewicz} who found that nearly all |
134 POSIX matching implementations are ``buggy'' \cite[Page |
178 POSIX matching implementations are ``buggy'' \cite[Page |
135 203]{Sulzmann2014}. |
179 203]{Sulzmann2014} and by Grathwohl et al \cite[Page 36]{CrashCourse2014} |
|
180 who wrote: |
|
181 |
|
182 \begin{quote} |
|
183 \it{}``The POSIX strategy is more complicated than the greedy because of |
|
184 the dependence on information about the length of matched strings in the |
|
185 various subexpressions.'' |
|
186 \end{quote} |
136 |
187 |
137 %\footnote{The relation @{text "\<ge>\<^bsub>r\<^esub>"} defined by Sulzmann and Lu \cite{Sulzmann2014} |
188 %\footnote{The relation @{text "\<ge>\<^bsub>r\<^esub>"} defined by Sulzmann and Lu \cite{Sulzmann2014} |
138 %is a relation on the |
189 %is a relation on the |
139 %values for the regular expression @{term r}; but it only holds between |
190 %values for the regular expression @{term r}; but it only holds between |
140 %@{term "v\<^sub>1"} and @{term "v\<^sub>2"} in cases where @{term "v\<^sub>1"} and @{term "v\<^sub>2"} have |
191 %@{term "v\<^sub>1"} and @{term "v\<^sub>2"} in cases where @{term "v\<^sub>1"} and @{term "v\<^sub>2"} have |
143 %have different flattenings (see Section~\ref{posixsec}). A different |
194 %have different flattenings (see Section~\ref{posixsec}). A different |
144 %relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r} |
195 %relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r} |
145 %with flattening @{term s} is definable by the same approach, and is indeed |
196 %with flattening @{term s} is definable by the same approach, and is indeed |
146 %total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.} |
197 %total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.} |
147 |
198 |
148 |
|
149 If a regular expression matches a string, then in general there is more than |
|
150 one way of how the string is matched. There are two commonly used |
|
151 disambiguation strategies to generate a unique answer: one is called GREEDY |
|
152 matching \cite{Frisch2004} and the other is POSIX |
|
153 matching~\cite{Kuklewicz,Sulzmann2014,Vansummeren2006}. For example consider |
|
154 the string @{term xy} and the regular expression \mbox{@{term "STAR (ALT |
|
155 (ALT x y) xy)"}}. Either the string can be matched in two `iterations' by |
|
156 the single letter-regular expressions @{term x} and @{term y}, or directly |
|
157 in one iteration by @{term xy}. The first case corresponds to GREEDY |
|
158 matching, which first matches with the left-most symbol and only matches the |
|
159 next symbol in case of a mismatch (this is greedy in the sense of preferring |
|
160 instant gratification to delayed repletion). The second case is POSIX |
|
161 matching, which prefers the longest match. |
|
162 |
|
163 In the context of lexing, where an input string needs to be split up into a |
|
164 sequence of tokens, POSIX is the more natural disambiguation strategy for |
|
165 what programmers consider basic syntactic building blocks in their programs. |
|
166 These building blocks are often specified by some regular expressions, say |
|
167 @{text "r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising keywords and |
|
168 identifiers, respectively. There are two underlying (informal) rules behind |
|
169 tokenising a string in a POSIX fashion according to a collection of regular |
|
170 expressions: |
|
171 |
|
172 \begin{itemize} |
|
173 \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):}\smallskip |
|
174 |
|
175 The longest initial substring matched by any regular expression is taken as |
|
176 next token.\smallskip |
|
177 |
|
178 \item[$\bullet$] \underline{Priority Rule:}\smallskip |
|
179 |
|
180 For a particular longest initial substring, the first regular expression |
|
181 that can match determines the token. |
|
182 \end{itemize} |
|
183 |
|
184 \noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as |
|
185 @{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising |
|
186 identifiers (say, a single character followed by characters or numbers). |
|
187 Then we can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use |
|
188 POSIX matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. |
|
189 For @{text "iffoo"} we obtain by the longest match rule a single identifier |
|
190 token, not a keyword followed by an identifier. For @{text "if"} we obtain by |
|
191 the priority rule a keyword token, not an identifier token---even if @{text |
|
192 "r\<^bsub>id\<^esub>"} matches also.\bigskip |
|
193 |
199 |
194 \noindent {\bf Contributions:} We have implemented in Isabelle/HOL the |
200 \noindent {\bf Contributions:} We have implemented in Isabelle/HOL the |
195 derivative-based regular expression matching algorithm of |
201 derivative-based regular expression matching algorithm of |
196 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this |
202 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this |
197 algorithm according to our specification of what a POSIX value is. Sulzmann |
203 algorithm according to our specification of what a POSIX value is. Sulzmann |
237 only the empty string and @{term c} for matching a character literal. The |
243 only the empty string and @{term c} for matching a character literal. The |
238 language of a regular expression is also defined as usual by the |
244 language of a regular expression is also defined as usual by the |
239 recursive function @{term L} with the clauses: |
245 recursive function @{term L} with the clauses: |
240 |
246 |
241 \begin{center} |
247 \begin{center} |
242 \begin{tabular}{l@ {\hspace{5mm}}rcl} |
248 \begin{tabular}{l@ {\hspace{3mm}}rcl} |
243 (1) & @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\ |
249 (1) & @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\ |
244 (2) & @{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\ |
250 (2) & @{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\ |
245 (3) & @{thm (lhs) L.simps(3)} & $\dn$ & @{thm (rhs) L.simps(3)}\\ |
251 (3) & @{thm (lhs) L.simps(3)} & $\dn$ & @{thm (rhs) L.simps(3)}\\ |
|
252 \end{tabular} |
|
253 \hspace{14mm} |
|
254 \begin{tabular}{l@ {\hspace{3mm}}rcl} |
246 (4) & @{thm (lhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
255 (4) & @{thm (lhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
247 (5) & @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
256 (5) & @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
248 (6) & @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\ |
257 (6) & @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\ |
249 \end{tabular} |
258 \end{tabular} |
250 \end{center} |
259 \end{center} |
377 for POSIX matching \cite{Sulzmann2014}). The string underlying a |
386 for POSIX matching \cite{Sulzmann2014}). The string underlying a |
378 value can be calculated by the @{const flat} function, written |
387 value can be calculated by the @{const flat} function, written |
379 @{term "flat DUMMY"} and defined as: |
388 @{term "flat DUMMY"} and defined as: |
380 |
389 |
381 \begin{center} |
390 \begin{center} |
382 \begin{tabular}{lcl} |
391 \begin{tabular}[t]{lcl} |
383 @{thm (lhs) flat.simps(1)} & $\dn$ & @{thm (rhs) flat.simps(1)}\\ |
392 @{thm (lhs) flat.simps(1)} & $\dn$ & @{thm (rhs) flat.simps(1)}\\ |
384 @{thm (lhs) flat.simps(2)} & $\dn$ & @{thm (rhs) flat.simps(2)}\\ |
393 @{thm (lhs) flat.simps(2)} & $\dn$ & @{thm (rhs) flat.simps(2)}\\ |
385 @{thm (lhs) flat.simps(3)} & $\dn$ & @{thm (rhs) flat.simps(3)}\\ |
394 @{thm (lhs) flat.simps(3)} & $\dn$ & @{thm (rhs) flat.simps(3)}\\ |
386 @{thm (lhs) flat.simps(4)} & $\dn$ & @{thm (rhs) flat.simps(4)}\\ |
395 @{thm (lhs) flat.simps(4)} & $\dn$ & @{thm (rhs) flat.simps(4)} |
|
396 \end{tabular}\hspace{14mm} |
|
397 \begin{tabular}[t]{lcl} |
387 @{thm (lhs) flat.simps(5)[of "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) flat.simps(5)[of "v\<^sub>1" "v\<^sub>2"]}\\ |
398 @{thm (lhs) flat.simps(5)[of "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) flat.simps(5)[of "v\<^sub>1" "v\<^sub>2"]}\\ |
388 @{thm (lhs) flat.simps(6)} & $\dn$ & @{thm (rhs) flat.simps(6)}\\ |
399 @{thm (lhs) flat.simps(6)} & $\dn$ & @{thm (rhs) flat.simps(6)}\\ |
389 @{thm (lhs) flat.simps(7)} & $\dn$ & @{thm (rhs) flat.simps(7)}\\ |
400 @{thm (lhs) flat.simps(7)} & $\dn$ & @{thm (rhs) flat.simps(7)}\\ |
390 \end{tabular} |
401 \end{tabular} |
391 \end{center} |
402 \end{center} |
393 \noindent Sulzmann and Lu also define inductively an inhabitation relation |
404 \noindent Sulzmann and Lu also define inductively an inhabitation relation |
394 that associates values to regular expressions: |
405 that associates values to regular expressions: |
395 |
406 |
396 \begin{center} |
407 \begin{center} |
397 \begin{tabular}{c} |
408 \begin{tabular}{c} |
|
409 \\[-8mm] |
398 @{thm[mode=Axiom] Prf.intros(4)} \qquad |
410 @{thm[mode=Axiom] Prf.intros(4)} \qquad |
399 @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\smallskip\\ |
411 @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[3mm] |
400 @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad |
412 @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad |
401 @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\smallskip\\ |
413 @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[3mm] |
402 @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\smallskip\\ |
414 @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\\[3mm] |
403 @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad |
415 @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad |
404 @{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]} |
416 @{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]} |
405 \end{tabular} |
417 \end{tabular} |
406 \end{center} |
418 \end{center} |
407 |
419 |