11 abbreviation |
11 abbreviation |
12 "ders_syn r s \<equiv> ders s r" |
12 "ders_syn r s \<equiv> ders s r" |
13 |
13 |
14 notation (latex output) |
14 notation (latex output) |
15 If ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and |
15 If ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and |
16 Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [78,77] 73) and |
16 Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [75,75] 73) and |
17 |
17 |
18 ZERO ("\<^bold>0" 80) and |
18 ZERO ("\<^bold>0" 78) and |
19 ONE ("\<^bold>1" 80) and |
19 ONE ("\<^bold>1" 78) and |
20 CHAR ("_" [1000] 80) and |
20 CHAR ("_" [1000] 80) and |
21 ALT ("_ + _" [77,77] 78) and |
21 ALT ("_ + _" [77,77] 78) and |
22 SEQ ("_ \<cdot> _" [77,77] 78) and |
22 SEQ ("_ \<cdot> _" [77,77] 78) and |
23 STAR ("_\<^sup>\<star>" [1000] 78) and |
23 STAR ("_\<^sup>\<star>" [1000] 78) and |
24 |
24 |
25 val.Void ("'(')" 79) and |
25 val.Void ("'(')" 79) and |
26 val.Char ("Char _" [1000] 79) and |
26 val.Char ("Char _" [1000] 79) and |
27 val.Left ("Left _" [1000] 78) and |
27 val.Left ("Left _" [79] 78) and |
28 val.Right ("Right _" [1000] 78) and |
28 val.Right ("Right _" [79] 78) and |
29 |
29 val.Seq ("Seq _ _" [79,79] 78) and |
|
30 val.Stars ("Stars _" [79] 78) and |
|
31 |
30 L ("L'(_')" [10] 78) and |
32 L ("L'(_')" [10] 78) and |
31 der_syn ("_\\_" [79, 1000] 76) and |
33 der_syn ("_\\_" [79, 1000] 76) and |
32 ders_syn ("_\\_" [79, 1000] 76) and |
34 ders_syn ("_\\_" [79, 1000] 76) and |
33 flat ("|_|" [75] 73) and |
35 flat ("|_|" [75] 73) and |
34 Sequ ("_ @ _" [78,77] 63) and |
36 Sequ ("_ @ _" [78,77] 63) and |
35 injval ("inj _ _ _" [78,77,78] 77) and |
37 injval ("inj _ _ _" [79,77,79] 76) and |
|
38 mkeps ("mkeps _" [79] 76) and |
36 projval ("proj _ _ _" [1000,77,1000] 77) and |
39 projval ("proj _ _ _" [1000,77,1000] 77) and |
37 length ("len _" [78] 73) and |
40 length ("len _" [78] 73) and |
38 |
41 |
39 Prf ("\<triangleright> _ : _" [75,75] 75) and |
42 Prf ("\<triangleright> _ : _" [75,75] 75) and |
40 PMatch ("_ : _ \<rightarrow> _" [75,75,75] 75) |
43 PMatch ("'(_, _') \<rightarrow> _" [63,75,75] 75) |
41 (* and ValOrd ("_ \<succeq>\<^bsub>_\<^esub> _" [78,77,77] 73) *) |
44 (* and ValOrd ("_ \<succeq>\<^bsub>_\<^esub> _" [78,77,77] 73) *) |
42 |
45 |
43 definition |
46 definition |
44 "match r s \<equiv> nullable (ders s r)" |
47 "match r s \<equiv> nullable (ders s r)" |
45 |
48 |
62 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's |
65 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's |
63 derivatives is that they are neatly expressible in any functional language, |
66 derivatives is that they are neatly expressible in any functional language, |
64 and easily definable and reasoned about in theorem provers---the definitions |
67 and easily definable and reasoned about in theorem provers---the definitions |
65 just consist of inductive datatypes and simple recursive functions. A |
68 just consist of inductive datatypes and simple recursive functions. A |
66 completely formalised correctness proof of this matcher in for example HOL4 |
69 completely formalised correctness proof of this matcher in for example HOL4 |
67 has been mentioned in~\cite{Owens2008}. |
70 has been mentioned in~\cite{Owens2008}. Another one in Isabelle/HOL is |
|
71 in \cite{Krauss2011}. |
68 |
72 |
69 One limitation of Brzozowski's matcher is that it only generates a YES/NO |
73 One limitation of Brzozowski's matcher is that it only generates a YES/NO |
70 answer for whether a string is being matched by a regular expression. |
74 answer for whether a string is being matched by a regular expression. |
71 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow |
75 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow |
72 generation not just of a YES/NO answer but of an actual matching, called a |
76 generation not just of a YES/NO answer but of an actual matching, called a |
84 proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a |
88 proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a |
85 GREEDY regular expression matching algorithm. Beginning with our |
89 GREEDY regular expression matching algorithm. Beginning with our |
86 observations that, without evidence that it is transitive, it cannot be |
90 observations that, without evidence that it is transitive, it cannot be |
87 called an ``order relation'', and that the relation is called a ``total |
91 called an ``order relation'', and that the relation is called a ``total |
88 order'' despite being evidently not total\footnote{The relation @{text |
92 order'' despite being evidently not total\footnote{The relation @{text |
89 "\<ge>\<^bsub>r\<^esub>"} defined in \cite{Sulzmann2014} is a relation on the values for |
93 "\<ge>\<^bsub>r\<^esub>"} defined in \cite{Sulzmann2014} is a relation on the |
90 the regular expression @{term r}; but it only holds between @{term v} and |
94 values for the regular expression @{term r}; but it only holds between |
91 @{term "v'"} in cases where @{term v} and @{term "v'"} have the same |
95 @{term v} and @{term "v'"} in cases where @{term v} and @{term "v'"} have |
92 flattening (underlying string). So a counterexample to totality is given by |
96 the same flattening (underlying string). So a counterexample to totality is |
93 taking two values @{term v} and @{term "v'"} for @{term r} that have |
97 given by taking two values @{term v} and @{term "v'"} for @{term r} that |
94 different flattenings. A different relation @{text "\<ge>\<^bsub>r,s\<^esub>"} |
98 have different flattenings (see Section~\ref{posixsec}). A different |
95 on the set of values for @{term r} with flattening @{term s} is definable by |
99 relation @{text "\<ge>\<^bsub>r,s\<^esub>"} on the set of values for @{term r} |
96 the same approach, and is indeed total; but that is not what Proposition 1 |
100 with flattening @{term s} is definable by the same approach, and is indeed |
97 of \cite{Sulzmann2014} does.}, we identify problems with this approach (of |
101 total; but that is not what Proposition 1 of \cite{Sulzmann2014} does.}, we |
98 which some of the proofs are not published in \cite{Sulzmann2014}); perhaps |
102 identify problems with this approach (of which some of the proofs are not |
99 more importantly, we give a simple inductive (and algorithm-independent) |
103 published in \cite{Sulzmann2014}); perhaps more importantly, we give a |
100 definition of what we call being a {\em POSIX value} for a regular |
104 simple inductive (and algorithm-independent) definition of what we call |
101 expression @{term r} and a string @{term s}; we show that the algorithm |
105 being a {\em POSIX value} for a regular expression @{term r} and a string |
102 computes such a value and that such a value is unique. Proofs are both done |
106 @{term s}; we show that the algorithm computes such a value and that such a |
103 by hand and checked in Isabelle/HOL. The experience of doing our proofs has |
107 value is unique. Proofs are both done by hand and checked in Isabelle/HOL. |
104 been that this mechanical checking was absolutely essential: this subject |
108 The experience of doing our proofs has been that this mechanical checking |
105 area has hidden snares. This was also noted by Kuklewitz \cite{Kuklewicz} |
109 was absolutely essential: this subject area has hidden snares. This was also |
106 who found that nearly all POSIX matching implementations are ``buggy'' |
110 noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching |
107 \cite[Page 203]{Sulzmann2014}. |
111 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}. |
108 |
112 |
109 If a regular expression matches a string, then in general there is more than |
113 If a regular expression matches a string, then in general there is more than |
110 one way of how the string is matched. There are two commonly used |
114 one way of how the string is matched. There are two commonly used |
111 disambiguation strategies to generate a unique answer: one is called GREEDY |
115 disambiguation strategies to generate a unique answer: one is called GREEDY |
112 matching \cite{Frisch2004} and the other is POSIX |
116 matching \cite{Frisch2004} and the other is POSIX |
140 that can match determines the token. |
144 that can match determines the token. |
141 \end{itemize} |
145 \end{itemize} |
142 |
146 |
143 \noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as |
147 \noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as |
144 @{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising |
148 @{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising |
145 identifiers (a single character followed by characters or numbers). Then we |
149 identifiers (say, a single character followed by characters or numbers). Then we |
146 can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX |
150 can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX |
147 matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. In the |
151 matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. In the |
148 first case we obtain by the longest match rule a single identifier token, |
152 first case we obtain by the longest match rule a single identifier token, |
149 not a keyword followed by an identifier. In the second case we obtain by rule |
153 not a keyword followed by an identifier. In the second case we obtain by rule |
150 priority a keyword token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"} |
154 priority a keyword token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"} |
151 matches also.\bigskip |
155 matches also.\bigskip |
152 |
156 |
153 \noindent {\bf Contributions:} We have implemented in Isabelle/HOL the |
157 \noindent {\bf Contributions:} (NOT DONE YET) We have implemented in Isabelle/HOL the |
154 derivative-based regular expression matching algorithm as described by |
158 derivative-based regular expression matching algorithm as described by |
155 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this |
159 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this |
156 algorithms according to our specification what a POSIX value is. The |
160 algorithm according to our specification of what a POSIX value is. The |
157 informal correctness proof given in \cite{Sulzmann2014} is in final |
161 informal correctness proof given in \cite{Sulzmann2014} is in final |
158 form\footnote{} and to us contains unfillable gaps. Our specification of a |
162 form\footnote{} and to us contains unfillable gaps. Our specification of a |
159 POSIX value consists of a simple inductive definition that given a string |
163 POSIX value consists of a simple inductive definition that given a string |
160 and a regular expression uniquely determines this value. Derivatives as |
164 and a regular expression uniquely determines this value. Derivatives as |
161 calculated by Brzozowski's method are usually more complex regular |
165 calculated by Brzozowski's method are usually more complex regular |
210 @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
215 @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
211 @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\ |
216 @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\ |
212 \end{tabular} |
217 \end{tabular} |
213 \end{center} |
218 \end{center} |
214 |
219 |
215 \noindent In the fourth clause we use @{term "DUMMY ;; DUMMY"} for the |
220 \noindent In the fourth clause we use the operation @{term "DUMMY ;; |
216 concatenation of two languages (it is also list-append for strings). We |
221 DUMMY"} for the concatenation of two languages (it is also list-append for |
217 use the star-notation for regular expressions and languages (in the last |
222 strings). We use the star-notation for regular expressions and languages |
218 clause above). The star on languages is defined inductively by two |
223 (in the last clause above). The star on languages is defined inductively |
219 clauses: @{text "(i)"} for the empty string being in the star of a |
224 by two clauses: @{text "(i)"} for the empty string being in the star of a |
220 language and @{text "(ii)"} if @{term "s\<^sub>1"} is in a language and @{term |
225 language and @{text "(ii)"} if @{term "s\<^sub>1"} is in a language and @{term |
221 "s\<^sub>2"} in the star of this language, then also @{term "s\<^sub>1 @ s\<^sub>2"} is in |
226 "s\<^sub>2"} in the star of this language, then also @{term "s\<^sub>1 @ s\<^sub>2"} is in |
222 the star of this language. It will also be convenient to use the following |
227 the star of this language. It will also be convenient to use the following |
223 notion of a \emph{semantic derivative} (or left quotient) of a language, |
228 notion of a \emph{semantic derivative} (or \emph{left quotient}) of a |
224 say @{text A}, defined as: |
229 language, say @{text A}, defined as: |
225 |
230 |
226 \begin{center} |
231 \begin{center} |
227 \begin{tabular}{lcl} |
232 \begin{tabular}{lcl} |
228 @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\ |
233 @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\ |
229 \end{tabular} |
234 \end{tabular} |
230 \end{center} |
235 \end{center} |
231 |
236 |
232 \noindent |
237 \noindent |
233 For semantic derivatives we have the following equations (for example |
238 For semantic derivatives we have the following equations (for example |
234 \cite{Krauss2011}): |
239 mechanically proved in \cite{Krauss2011}): |
235 |
240 |
236 \begin{equation}\label{SemDer} |
241 \begin{equation}\label{SemDer} |
237 \begin{array}{lcl} |
242 \begin{array}{lcl} |
238 @{thm (lhs) Der_null} & \dn & @{thm (rhs) Der_null}\\ |
243 @{thm (lhs) Der_null} & \dn & @{thm (rhs) Der_null}\\ |
239 @{thm (lhs) Der_empty} & \dn & @{thm (rhs) Der_empty}\\ |
244 @{thm (lhs) Der_empty} & \dn & @{thm (rhs) Der_empty}\\ |
283 exercise in mechanical reasoning to establish that |
288 exercise in mechanical reasoning to establish that |
284 |
289 |
285 \begin{proposition}\mbox{}\\ |
290 \begin{proposition}\mbox{}\\ |
286 \begin{tabular}{ll} |
291 \begin{tabular}{ll} |
287 @{text "(1)"} & @{thm (lhs) nullable_correctness} if and only if |
292 @{text "(1)"} & @{thm (lhs) nullable_correctness} if and only if |
288 @{thm (rhs) nullable_correctness} \\ |
293 @{thm (rhs) nullable_correctness}, and \\ |
289 @{text "(2)"} & @{thm[mode=IfThen] der_correctness} |
294 @{text "(2)"} & @{thm[mode=IfThen] der_correctness}. |
290 \end{tabular} |
295 \end{tabular} |
291 \end{proposition} |
296 \end{proposition} |
292 |
297 |
293 \noindent With this in place it is also very routine to prove that the |
298 \noindent With this in place it is also very routine to prove that the |
294 regular expression matcher defined as |
299 regular expression matcher defined as |
295 |
300 |
296 \begin{center} |
301 \begin{center} |
297 @{thm match_def} |
302 @{thm match_def} |
298 \end{center} |
303 \end{center} |
299 |
304 |
300 \noindent gives a positive answer if and only if @{term "s \<in> L r"}. While |
305 \noindent gives a positive answer if and only if @{term "s \<in> L r"}. |
301 the matcher above calculates a provably correct a YES/NO answer for |
306 Consequently, this regular expression matching algorithm satisfies the |
302 whether a regular expression matches a string, the novel idea of Sulzmann |
307 usual specification. While the matcher above calculates a provably correct |
303 and Lu \cite{Sulzmann2014} is to append another phase to calculate a |
308 a YES/NO answer for whether a regular expression matches a string, the |
304 value. We will explain the details next. |
309 novel idea of Sulzmann and Lu \cite{Sulzmann2014} is to append another |
|
310 phase to this algorithm in order to calculate a [lexical] value. We will |
|
311 explain the details next. |
305 |
312 |
306 *} |
313 *} |
307 |
314 |
308 section {* POSIX Regular Expression Matching *} |
315 section {* POSIX Regular Expression Matching\label{posixsec} *} |
309 |
316 |
310 text {* |
317 text {* |
311 |
318 |
312 The clever idea in \cite{Sulzmann2014} is to introduce values for encoding |
319 The clever idea in \cite{Sulzmann2014} is to introduce values for encoding |
313 \emph{how} a regular expression matches a string and then define a function |
320 \emph{how} a regular expression matches a string and then define a function |
314 on values that mirrors (but inverts) the construction of the derivative on |
321 on values that mirrors (but inverts) the construction of the derivative on |
315 regular expressions. Values are defined as the inductive datatype |
322 regular expressions. \emph{Values} are defined as the inductive datatype |
316 |
323 |
317 \begin{center} |
324 \begin{center} |
318 @{text "v :="} |
325 @{text "v :="} |
319 @{const "Void"} $\mid$ |
326 @{const "Void"} $\mid$ |
320 @{term "val.Char c"} $\mid$ |
327 @{term "val.Char c"} $\mid$ |
343 \noindent Sulzmann and Lu also define inductively an inhabitation relation |
352 \noindent Sulzmann and Lu also define inductively an inhabitation relation |
344 that associates values to regular expressions: |
353 that associates values to regular expressions: |
345 |
354 |
346 \begin{center} |
355 \begin{center} |
347 \begin{tabular}{c} |
356 \begin{tabular}{c} |
348 @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\ |
357 @{thm[mode=Axiom] Prf.intros(4)} \qquad |
|
358 @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\medskip\\ |
349 @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad |
359 @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad |
350 @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\medskip\\ |
360 @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\medskip\\ |
351 @{thm[mode=Axiom] Prf.intros(4)} \qquad |
361 @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\ |
352 @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\medskip\\ |
362 @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad |
353 @{thm[mode=Axiom] Prf.intros(6)[of "r"]} \qquad |
|
354 @{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]}\medskip\\ |
363 @{thm[mode=Rule] Prf.intros(7)[of "v" "r" "vs"]}\medskip\\ |
355 \end{tabular} |
364 \end{tabular} |
356 \end{center} |
365 \end{center} |
357 |
366 |
358 \noindent Note that no values are associated with the regular expression |
367 \noindent Note that no values are associated with the regular expression |
359 @{term ZERO}, and that the only value associated with the regular |
368 @{term ZERO}, and that the only value associated with the regular |
360 expression @{term ONE} is @{term Void}, pronounced (if one must) as {\em |
369 expression @{term ONE} is @{term Void}, pronounced (if one must) as {\em |
361 ``Void''}. It is routine to stablish how values `inhabitated' by a regular |
370 ``Void''}. It is routine to stablish how values ``inhabiting'' a regular |
362 expression correspond to the language of a regular expression, namely |
371 expression correspond to the language of a regular expression, namely |
363 |
372 |
364 \begin{proposition} |
373 \begin{proposition} |
365 @{thm L_flat_Prf} |
374 @{thm L_flat_Prf} |
366 \end{proposition} |
375 \end{proposition} |
367 |
376 |
368 Graphically the algorithm by Sulzmann \& Lu can be illustrated by the |
377 In general there are more than one value associated with a regular |
369 picture in Figure~\ref{Sulz} where the path from the left to the right |
378 expression. In case of POSIX matching the problem is to calculate the |
370 involving $der/nullable$ is the first phase of the algorithm and |
379 unique value that satisfies the (informal) POSIX constraints from the |
371 $mkeps/inj$, the path from right to left, the second phase. This picture |
380 Introduction. Graphically the regular expression matching algorithm by |
372 shows the steps required when a regular expression, say $r_1$, matches the |
381 Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz} |
373 string $abc$. We first build the three derivatives (according to $a$, $b$ |
382 where the path from the left to the right involving @{const der}/@{const |
374 and $c$). We then use $nullable$ to find out whether the resulting regular |
383 nullable} is the first phase of the algorithm (calculating successive |
375 expression can match the empty string. If yes, we call the function |
384 \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to |
376 $mkeps$. |
385 left, the second phase. This picture shows the steps required when a |
|
386 regular expression, say @{text "r\<^sub>1"}, matches the string @{term |
|
387 "[a,b,c]"}. We first build the three derivatives (according to @{term a}, |
|
388 @{term b} and @{term c}). We then use @{const nullable} to find out |
|
389 whether the resulting derivative regular expression @{term "r\<^sub>4"} |
|
390 can match the empty string. If yes, we call the function @{const mkeps} |
|
391 that produces a value @{term "v\<^sub>4"} for how @{term "r\<^sub>4"} can |
|
392 match the empty string (taking into account the POSIX constraints in case |
|
393 there are several ways). This functions is defined by the clauses: |
377 |
394 |
378 \begin{figure}[t] |
395 \begin{figure}[t] |
379 \begin{center} |
396 \begin{center} |
380 \begin{tikzpicture}[scale=2,node distance=1.3cm, |
397 \begin{tikzpicture}[scale=2,node distance=1.3cm, |
381 every node/.style={minimum size=7mm}] |
398 every node/.style={minimum size=7mm}] |
407 that the function @{term inj} `injects back' the characters of the string into |
424 that the function @{term inj} `injects back' the characters of the string into |
408 the values (the arrows from right to left). |
425 the values (the arrows from right to left). |
409 \label{Sulz}} |
426 \label{Sulz}} |
410 \end{figure} |
427 \end{figure} |
411 |
428 |
412 NOT DONE YET |
429 \begin{center} |
413 |
|
414 We begin with the case of a nullable regular expression: from the |
|
415 nullability we need to construct a value that witnesses the nullability. |
|
416 The @{const mkeps} function (from \cite{Sulzmann2014}) is a partial (but |
|
417 unambiguous) function from regular expressions to values, total on exactly |
|
418 the set of nullable regular expressions. |
|
419 |
|
420 \begin{center} |
|
421 \begin{tabular}{lcl} |
430 \begin{tabular}{lcl} |
422 @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\ |
431 @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\ |
423 @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
432 @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
424 @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
433 @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
425 @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\ |
434 @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\ |
426 \end{tabular} |
435 \end{tabular} |
427 \end{center} |
436 \end{center} |
428 |
437 |
429 The well-known idea of POSIX lexing is informally defined in (for example) |
438 \noindent Note that this function needs only to be partially defined, |
430 \cite{posix}; as correctly argued in \cite{Sulzmann2014}, this needs formal |
439 namely only for regular expressions that are nullable. In case @{const |
431 specification. The rough idea is that, in contrast to the so-called GREEDY |
440 nullable} fails, the string @{term "[a,b,c]"} cannot be matched by @{term |
432 algorithm, POSIX lexing chooses to match more deeply and using left choices |
441 "r\<^sub>1"} and an error is raised. Note also how this function makes |
433 rather than a right choices. For example, note that to match the string |
442 some subtle choices leading to a POSIX value: for example if the |
434 @{term "[a, b]"} with the regular expression $(a + \mts)\circ (b+ab)$ the matching |
443 alternative, say @{term "ALT r\<^sub>1 r\<^sub>2"}, can match the empty |
435 will return $( Void, Right(ab))$ rather than $(Left\ a, Left\ b)$. [The |
444 string and furthermore @{term "r\<^sub>1"} can match the empty string, |
436 regular expression $ab$ is short for $(Lit\ a) \circ (Lit\ b)$.] Similarly, |
445 then we return a @{const Left}-value. The @{const Right}-value will only |
437 to match {\em ``a''} with $(a+a)$ the leftmost $a$ will be chosen. |
446 be returned if @{term "r\<^sub>1"} is not nullable. |
438 |
447 |
439 We use a simple inductive definition to specify this notion, incorporating |
448 The most interesting novelty from Sulzmann and Lu \cite{Sulzmann2014} is |
440 the POSIX-specific choices into the side-conditions for the rules $R tl |
449 the construction value for how @{term "r\<^sub>1"} can match the string |
441 +_2$, $R tl\circ$ and $R tl*$ (as they are now called). By contrast, |
450 @{term "[a,b,c]"} from the value how the last derivative, @{term |
442 \cite{Sulzmann2014} defines a relation between values and argues that there is a |
451 "r\<^sub>4"} in Fig~\ref{Sulz}, can match the empty string. Sulzmann and |
443 maximum value, as given by the derivative-based algorithm yet to be spelt |
452 Lu acchieve this by stepwise ``injecting back'' the characters into the |
444 out. The relation we define is ternary, relating strings, values and regular |
453 values thus inverting the operation of building derivatives on the level |
445 expressions. |
454 of values. The corresponding function, called @{term inj}, takes three |
446 |
455 arguments, a regular expression, a character and a value. For example in |
447 Our Posix relation @{term "s \<in> r \<rightarrow> v"} |
456 the first @{term inj}-step in Fig~\ref{Sulz} the regular expression @{term |
448 |
457 "r\<^sub>3"}, the character @{term c} from the last derivative step and |
449 \begin{center} |
458 @{term "v\<^sub>4"}, which is the value corresponding to the derivative |
450 \begin{tabular}{c} |
459 regular expression @{term "r\<^sub>4"}. The result is the new value @{term |
451 @{thm[mode=Axiom] PMatch.intros(1)} \qquad |
460 "v\<^sub>3"}. The final result of the algorithm is the value @{term |
452 @{thm[mode=Axiom] PMatch.intros(2)}\medskip\\ |
461 "v\<^sub>1"} corresponding to the input regular expression. The @{term |
453 @{thm[mode=Rule] PMatch.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\qquad |
462 inj} function is by recursion on the regular expression and by analysing |
454 @{thm[mode=Rule] PMatch.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\medskip\\ |
463 the shape of values. |
455 \multicolumn{1}{p{5cm}}{@{thm[mode=Rule] PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}\medskip\\ |
|
456 @{thm[mode=Rule] PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}\medskip\\ |
|
457 @{thm[mode=Axiom] PMatch.intros(7)}\medskip\\ |
|
458 \end{tabular} |
|
459 \end{center} |
|
460 |
|
461 *} |
|
462 |
|
463 section {* The Argument by Sulzmmann and Lu *} |
|
464 |
|
465 section {* Conclusion *} |
|
466 |
|
467 text {* |
|
468 |
|
469 Nipkow lexer from 2000 |
|
470 |
|
471 *} |
|
472 |
|
473 |
|
474 text {* |
|
475 |
|
476 |
|
477 |
|
478 |
|
479 \noindent |
|
480 Values |
|
481 |
|
482 |
|
483 |
|
484 |
|
485 |
|
486 |
|
487 |
|
488 \noindent |
|
489 The @{const mkeps} function |
|
490 |
|
491 |
|
492 |
|
493 \noindent |
|
494 The @{text inj} function |
|
495 |
464 |
496 \begin{center} |
465 \begin{center} |
497 \begin{tabular}{lcl} |
466 \begin{tabular}{lcl} |
498 @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\ |
467 @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\ |
499 @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & |
468 @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & |
508 & @{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ |
477 & @{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ |
509 @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ |
478 @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ |
510 & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\ |
479 & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\ |
511 \end{tabular} |
480 \end{tabular} |
512 \end{center} |
481 \end{center} |
|
482 |
|
483 \noindent To better understand what is going on in this definition it |
|
484 might be instructive to look first at the three sequence cases (clauses |
|
485 4--6). Recall the corresponding clause of the derivative |
|
486 |
|
487 \begin{center} |
|
488 @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} $\dn$ @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} |
|
489 \end{center} |
|
490 |
|
491 \noindent |
|
492 NOT DONE YET |
|
493 |
|
494 Therefore there are, for example, three |
|
495 cases for sequence regular expressions (for all possible shapes of the |
|
496 value). |
|
497 |
|
498 Again the virtues of this algorithm is that it can be |
|
499 implemented with ease in a functional programming language and |
|
500 also in Isabelle/HOL. |
|
501 |
|
502 The well-known idea of POSIX lexing is informally defined in (for example) |
|
503 \cite{posix}; as correctly argued in \cite{Sulzmann2014}, this needs formal |
|
504 specification. The rough idea is that, in contrast to the so-called GREEDY |
|
505 algorithm, POSIX lexing chooses to match more deeply and using left choices |
|
506 rather than a right choices. For example, note that to match the string |
|
507 @{term "[a, b]"} with the regular expression $(a + \mts)\circ (b+ab)$ the matching |
|
508 will return $( Void, Right(ab))$ rather than $(Left\ a, Left\ b)$. [The |
|
509 regular expression $ab$ is short for $(Lit\ a) \circ (Lit\ b)$.] Similarly, |
|
510 to match {\em ``a''} with $(a+a)$ the leftmost $a$ will be chosen. |
|
511 |
|
512 We use a simple inductive definition to specify this notion, incorporating |
|
513 the POSIX-specific choices into the side-conditions for the rules $R tl |
|
514 +_2$, $R tl\circ$ and $R tl*$ (as they are now called). By contrast, |
|
515 \cite{Sulzmann2014} defines a relation between values and argues that there is a |
|
516 maximum value, as given by the derivative-based algorithm yet to be spelt |
|
517 out. The relation we define is ternary, relating strings, values and regular |
|
518 expressions. |
|
519 |
|
520 Our Posix relation @{term "s \<in> r \<rightarrow> v"} |
|
521 |
|
522 \begin{center} |
|
523 \begin{tabular}{c} |
|
524 @{thm[mode=Axiom] PMatch.intros(1)} \qquad |
|
525 @{thm[mode=Axiom] PMatch.intros(2)}\medskip\\ |
|
526 @{thm[mode=Rule] PMatch.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\qquad |
|
527 @{thm[mode=Rule] PMatch.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\medskip\\ |
|
528 \multicolumn{1}{p{5cm}}{@{thm[mode=Rule] PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}\medskip\\ |
|
529 @{thm[mode=Rule] PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}\medskip\\ |
|
530 @{thm[mode=Axiom] PMatch.intros(7)}\medskip\\ |
|
531 \end{tabular} |
|
532 \end{center} |
|
533 |
|
534 *} |
|
535 |
|
536 section {* The Argument by Sulzmmann and Lu *} |
|
537 |
|
538 section {* Conclusion *} |
|
539 |
|
540 text {* |
|
541 |
|
542 Nipkow lexer from 2000 |
|
543 |
|
544 *} |
|
545 |
|
546 |
|
547 text {* |
|
548 |
|
549 |
|
550 |
|
551 |
|
552 \noindent |
|
553 Values |
|
554 |
|
555 |
|
556 |
|
557 |
|
558 |
|
559 |
|
560 |
|
561 \noindent |
|
562 The @{const mkeps} function |
|
563 |
|
564 |
|
565 |
|
566 \noindent |
|
567 The @{text inj} function |
|
568 |
|
569 |
513 |
570 |
514 \noindent |
571 \noindent |
515 The inhabitation relation: |
572 The inhabitation relation: |
516 |
573 |
517 \begin{center} |
574 \begin{center} |