11 abbreviation |
11 abbreviation |
12 "ders_syn r s \<equiv> ders s r" |
12 "ders_syn r s \<equiv> ders s r" |
13 |
13 |
14 notation (latex output) |
14 notation (latex output) |
15 If ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and |
15 If ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and |
16 Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [75,75] 73) and |
16 Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [75,73] 73) and |
17 |
17 |
18 ZERO ("\<^bold>0" 78) and |
18 ZERO ("\<^bold>0" 78) and |
19 ONE ("\<^bold>1" 78) and |
19 ONE ("\<^bold>1" 78) and |
20 CHAR ("_" [1000] 80) and |
20 CHAR ("_" [1000] 80) and |
21 ALT ("_ + _" [77,77] 78) and |
21 ALT ("_ + _" [77,77] 78) and |
30 val.Stars ("Stars _" [79] 78) and |
30 val.Stars ("Stars _" [79] 78) and |
31 |
31 |
32 L ("L'(_')" [10] 78) and |
32 L ("L'(_')" [10] 78) and |
33 der_syn ("_\\_" [79, 1000] 76) and |
33 der_syn ("_\\_" [79, 1000] 76) and |
34 ders_syn ("_\\_" [79, 1000] 76) and |
34 ders_syn ("_\\_" [79, 1000] 76) and |
35 flat ("|_|" [75] 73) and |
35 flat ("|_|" [75] 74) and |
36 Sequ ("_ @ _" [78,77] 63) and |
36 Sequ ("_ @ _" [78,77] 63) and |
37 injval ("inj _ _ _" [79,77,79] 76) and |
37 injval ("inj _ _ _" [79,77,79] 76) and |
38 mkeps ("mkeps _" [79] 76) and |
38 mkeps ("mkeps _" [79] 76) and |
39 projval ("proj _ _ _" [1000,77,1000] 77) and |
39 projval ("proj _ _ _" [1000,77,1000] 77) and |
40 length ("len _" [78] 73) and |
40 length ("len _" [78] 73) and |
41 matcher ("lexer _ _" [78,78] 77) and |
41 matcher ("lexer _ _" [78,78] 77) and |
42 |
42 |
43 Prf ("\<triangleright> _ : _" [75,75] 75) and |
43 Prf ("_ : _" [75,75] 75) and |
44 PMatch ("'(_, _') \<rightarrow> _" [63,75,75] 75) |
44 PMatch ("'(_, _') \<rightarrow> _" [63,75,75] 75) |
45 (* and ValOrd ("_ \<succeq>\<^bsub>_\<^esub> _" [78,77,77] 73) *) |
45 (* and ValOrd ("_ \<succeq>\<^bsub>_\<^esub> _" [78,77,77] 73) *) |
46 |
46 |
47 definition |
47 definition |
48 "match r s \<equiv> nullable (ders s r)" |
48 "match r s \<equiv> nullable (ders s r)" |
57 Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em |
57 Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em |
58 derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a |
58 derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a |
59 character~@{text c}, and showed that it gave a simple solution to the |
59 character~@{text c}, and showed that it gave a simple solution to the |
60 problem of matching a string @{term s} with a regular expression @{term r}: |
60 problem of matching a string @{term s} with a regular expression @{term r}: |
61 if the derivative of @{term r} w.r.t.\ (in succession) all the characters of |
61 if the derivative of @{term r} w.r.t.\ (in succession) all the characters of |
62 the string matches the empty string, then @{term r} matches @{term s} |
62 the string matches the empty string, then @{term r} matches @{term s} (and |
63 (and {\em vice versa}). The derivative has the property (which may be |
63 {\em vice versa}). The derivative has the property (which may almost be |
64 regarded as its specification) that, for every string @{term s} and regular |
64 regarded as its specification) that, for every string @{term s} and regular |
65 expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if |
65 expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if |
66 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's |
66 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's |
67 derivatives is that they are neatly expressible in any functional language, |
67 derivatives is that they are neatly expressible in any functional language, |
68 and easily definable and reasoned about in theorem provers---the definitions |
68 and easily definable and reasoned about in theorem provers---the definitions |
69 just consist of inductive datatypes and simple recursive functions. A |
69 just consist of inductive datatypes and simple recursive functions. A |
70 completely formalised correctness proof of this matcher in for example HOL4 |
70 completely formalised correctness proof of this matcher in for example HOL4 |
71 has been mentioned in~\cite{Owens2008}. Another one in Isabelle/HOL is |
71 has been mentioned in~\cite{Owens2008}. Another one in Isabelle/HOL is part |
72 in \cite{Krauss2011}. |
72 of the work in \cite{Krauss2011}. |
73 |
73 |
74 One limitation of Brzozowski's matcher is that it only generates a YES/NO |
74 One limitation of Brzozowski's matcher is that it only generates a YES/NO |
75 answer for whether a string is being matched by a regular expression. |
75 answer for whether a string is being matched by a regular expression. |
76 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow |
76 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow |
77 generation not just of a YES/NO answer but of an actual matching, called a |
77 generation not just of a YES/NO answer but of an actual matching, called a |
78 [lexical] {\em value}. They give a simple algorithm to calculate a value |
78 [lexical] {\em value}. They give a simple algorithm to calculate a value |
79 that appears to be the value associated with POSIX matching |
79 that appears to be the value associated with POSIX matching |
80 \cite{Kuklewicz,Vansummeren2006}. The challenge then is to specify that |
80 \cite{Kuklewicz,Vansummeren2006}. The challenge then is to specify that |
81 value, in an algorithm-independent fashion, and to show that Sulzamann and |
81 value, in an algorithm-independent fashion, and to show that Sulzmann and |
82 Lu's derivative-based algorithm does indeed calculate a value that is |
82 Lu's derivative-based algorithm does indeed calculate a value that is |
83 correct according to the specification. |
83 correct according to the specification. |
84 |
84 |
85 The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a |
85 The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a |
86 relation (called an ``Order Relation'') on the set of values of @{term r}, |
86 relation (called an ``Order Relation'') on the set of values of @{term r}, |
145 that can match determines the token. |
145 that can match determines the token. |
146 \end{itemize} |
146 \end{itemize} |
147 |
147 |
148 \noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as |
148 \noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising keywords such as |
149 @{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising |
149 @{text "if"}, @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} recognising |
150 identifiers (say, a single character followed by characters or numbers). Then we |
150 identifiers (say, a single character followed by characters or numbers). |
151 can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX |
151 Then we can form the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use |
152 matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. In the |
152 POSIX matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. |
153 first case we obtain by the longest match rule a single identifier token, |
153 For @{text "iffoo"} we obtain by the longest match rule a single identifier |
154 not a keyword followed by an identifier. In the second case we obtain by rule |
154 token, not a keyword followed by an identifier. For @{text "if"} we obtain by |
155 priority a keyword token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"} |
155 rule priority a keyword token, not an identifier token---even if @{text |
156 matches also.\bigskip |
156 "r\<^bsub>id\<^esub>"} matches also.\bigskip |
157 |
157 |
158 \noindent {\bf Contributions:} (NOT DONE YET) We have implemented in Isabelle/HOL the |
158 \noindent {\bf Contributions:} (NOT DONE YET) We have implemented in |
159 derivative-based regular expression matching algorithm as described by |
159 Isabelle/HOL the derivative-based regular expression matching algorithm as |
160 Sulzmann and Lu \cite{Sulzmann2014}. We have proved the correctness of this |
160 described by Sulzmann and Lu \cite{Sulzmann2014}. We have proved the |
161 algorithm according to our specification of what a POSIX value is. The |
161 correctness of this algorithm according to our specification of what a POSIX |
|
162 value is. Sulzmann and Lu sketch in \cite{Sulzmann2014} an informal |
|
163 correctness proof: but to us it contains unfillable gaps. |
|
164 |
162 informal correctness proof given in \cite{Sulzmann2014} is in final |
165 informal correctness proof given in \cite{Sulzmann2014} is in final |
163 form\footnote{} and to us contains unfillable gaps. Our specification of a |
166 form\footnote{} and to us contains unfillable gaps. |
164 POSIX value consists of a simple inductive definition that given a string |
167 |
165 and a regular expression uniquely determines this value. Derivatives as |
168 Our specification of a POSIX value consists of a simple inductive definition |
166 calculated by Brzozowski's method are usually more complex regular |
169 that given a string and a regular expression uniquely determines this value. |
167 expressions than the initial one; various optimisations are possible, such |
170 Derivatives as calculated by Brzozowski's method are usually more complex |
168 as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r ZERO"}, @{term |
171 regular expressions than the initial one; various optimisations are |
169 "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the advantages of |
172 possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r |
170 having a simple specification and correctness proof is that the latter can |
173 ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the |
171 be refined to allow for such optimisations and simple correctness proof. |
174 advantages of having a simple specification and correctness proof is that |
|
175 the latter can be refined to allow for such optimisations and simple |
|
176 correctness proof. |
172 |
177 |
173 An extended version of \cite{Sulzmann2014} is available at the website of |
178 An extended version of \cite{Sulzmann2014} is available at the website of |
174 its first author; this includes some ``proofs'', claimed in |
179 its first author; this includes some ``proofs'', claimed in |
175 \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in |
180 \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in |
176 final form, we make no comment thereon, preferring to give general reasons |
181 final form, we make no comment thereon, preferring to give general reasons |
218 \end{tabular} |
223 \end{tabular} |
219 \end{center} |
224 \end{center} |
220 |
225 |
221 \noindent In the fourth clause we use the operation @{term "DUMMY ;; |
226 \noindent In the fourth clause we use the operation @{term "DUMMY ;; |
222 DUMMY"} for the concatenation of two languages (it is also list-append for |
227 DUMMY"} for the concatenation of two languages (it is also list-append for |
223 strings). We use the star-notation for regular expressions and languages |
228 strings). We use the star-notation for regular expressions and for |
224 (in the last clause above). The star on languages is defined inductively |
229 languages (in the last clause above). The star for languages is defined |
225 by two clauses: @{text "(i)"} for the empty string being in the star of a |
230 inductively by two clauses: @{text "(i)"} the empty string being in |
226 language and @{text "(ii)"} if @{term "s\<^sub>1"} is in a language and @{term |
231 the star of a language and @{text "(ii)"} if @{term "s\<^sub>1"} is in a |
227 "s\<^sub>2"} in the star of this language, then also @{term "s\<^sub>1 @ s\<^sub>2"} is in |
232 language and @{term "s\<^sub>2"} in the star of this language, then also @{term |
228 the star of this language. It will also be convenient to use the following |
233 "s\<^sub>1 @ s\<^sub>2"} is in the star of this language. It will also be convenient |
229 notion of a \emph{semantic derivative} (or \emph{left quotient}) of a |
234 to use the following notion of a \emph{semantic derivative} (or \emph{left |
230 language, say @{text A}, defined as: |
235 quotient}) of a language defined as: |
231 |
236 |
232 \begin{center} |
237 \begin{center} |
233 \begin{tabular}{lcl} |
238 \begin{tabular}{lcl} |
234 @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\ |
239 @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\ |
235 \end{tabular} |
240 \end{tabular} |
304 \end{center} |
309 \end{center} |
305 |
310 |
306 \noindent gives a positive answer if and only if @{term "s \<in> L r"}. |
311 \noindent gives a positive answer if and only if @{term "s \<in> L r"}. |
307 Consequently, this regular expression matching algorithm satisfies the |
312 Consequently, this regular expression matching algorithm satisfies the |
308 usual specification. While the matcher above calculates a provably correct |
313 usual specification. While the matcher above calculates a provably correct |
309 a YES/NO answer for whether a regular expression matches a string, the |
314 YES/NO answer for whether a regular expression matches a string, the novel |
310 novel idea of Sulzmann and Lu \cite{Sulzmann2014} is to append another |
315 idea of Sulzmann and Lu \cite{Sulzmann2014} is to append another phase to |
311 phase to this algorithm in order to calculate a [lexical] value. We will |
316 this algorithm in order to calculate a [lexical] value. We will explain |
312 explain the details next. |
317 the details next. |
313 |
318 |
314 *} |
319 *} |
315 |
320 |
316 section {* POSIX Regular Expression Matching\label{posixsec} *} |
321 section {* POSIX Regular Expression Matching\label{posixsec} *} |
317 |
322 |
318 text {* |
323 text {* |
319 |
324 |
320 The clever idea in \cite{Sulzmann2014} is to introduce values for encoding |
325 The clever idea in \cite{Sulzmann2014} is to introduce values for encoding |
321 \emph{how} a regular expression matches a string and then define a function |
326 \emph{how} a regular expression matches a string and then define a |
322 on values that mirrors (but inverts) the construction of the derivative on |
327 function on values that mirrors (but inverts) the construction of the |
323 regular expressions. \emph{Values} are defined as the inductive datatype |
328 derivative on regular expressions. \emph{Values} are defined as the |
|
329 inductive datatype |
324 |
330 |
325 \begin{center} |
331 \begin{center} |
326 @{text "v :="} |
332 @{text "v :="} |
327 @{const "Void"} $\mid$ |
333 @{const "Void"} $\mid$ |
328 @{term "val.Char c"} $\mid$ |
334 @{term "val.Char c"} $\mid$ |
365 \end{tabular} |
371 \end{tabular} |
366 \end{center} |
372 \end{center} |
367 |
373 |
368 \noindent Note that no values are associated with the regular expression |
374 \noindent Note that no values are associated with the regular expression |
369 @{term ZERO}, and that the only value associated with the regular |
375 @{term ZERO}, and that the only value associated with the regular |
370 expression @{term ONE} is @{term Void}, pronounced (if one must) as {\em |
376 expression @{term ONE} is @{term Void}, pronounced (if one must) as @{text |
371 ``Void''}. It is routine to establish how values ``inhabiting'' a regular |
377 "Void"}. It is routine to establish how values ``inhabiting'' a regular |
372 expression correspond to the language of a regular expression, namely |
378 expression correspond to the language of a regular expression, namely |
373 |
379 |
374 \begin{proposition} |
380 \begin{proposition} |
375 @{thm L_flat_Prf} |
381 @{thm L_flat_Prf} |
376 \end{proposition} |
382 \end{proposition} |
377 |
383 |
378 In general there are more than one value associated with a regular |
384 In general there is more than one value associated with a regular |
379 expression. In case of POSIX matching the problem is to calculate the |
385 expression. In case of POSIX matching the problem is to calculate the |
380 unique value that satisfies the (informal) POSIX constraints from the |
386 unique value that satisfies the (informal) POSIX rules from the |
381 Introduction. Graphically the regular expression matching algorithm by |
387 Introduction. Graphically the POSIX value calculation algorithm by |
382 Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz} |
388 Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz} |
383 where the path from the left to the right involving @{const der}/@{const |
389 where the path from the left to the right involving @{const der}/@{const |
384 nullable} is the first phase of the algorithm (calculating successive |
390 nullable} is the first phase of the algorithm (calculating successive |
385 \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to |
391 \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to |
386 left, the second phase. This picture shows the steps required when a |
392 left, the second phase. This picture shows the steps required when a |
388 "[a,b,c]"}. We first build the three derivatives (according to @{term a}, |
394 "[a,b,c]"}. We first build the three derivatives (according to @{term a}, |
389 @{term b} and @{term c}). We then use @{const nullable} to find out |
395 @{term b} and @{term c}). We then use @{const nullable} to find out |
390 whether the resulting derivative regular expression @{term "r\<^sub>4"} |
396 whether the resulting derivative regular expression @{term "r\<^sub>4"} |
391 can match the empty string. If yes, we call the function @{const mkeps} |
397 can match the empty string. If yes, we call the function @{const mkeps} |
392 that produces a value @{term "v\<^sub>4"} for how @{term "r\<^sub>4"} can |
398 that produces a value @{term "v\<^sub>4"} for how @{term "r\<^sub>4"} can |
393 match the empty string (taking into account the POSIX constraints in case |
399 match the empty string (taking into account the POSIX rules in case |
394 there are several ways). This functions is defined by the clauses: |
400 there are several ways). This functions is defined by the clauses: |
395 |
401 |
396 \begin{figure}[t] |
402 \begin{figure}[t] |
397 \begin{center} |
403 \begin{center} |
398 \begin{tikzpicture}[scale=2,node distance=1.3cm, |
404 \begin{tikzpicture}[scale=2,node distance=1.3cm, |
437 \end{center} |
443 \end{center} |
438 |
444 |
439 \noindent Note that this function needs only to be partially defined, |
445 \noindent Note that this function needs only to be partially defined, |
440 namely only for regular expressions that are nullable. In case @{const |
446 namely only for regular expressions that are nullable. In case @{const |
441 nullable} fails, the string @{term "[a,b,c]"} cannot be matched by @{term |
447 nullable} fails, the string @{term "[a,b,c]"} cannot be matched by @{term |
442 "r\<^sub>1"} and an error is raised. Note also how this function makes |
448 "r\<^sub>1"} and an error is raised instead. Note also how this function |
443 some subtle choices leading to a POSIX value: for example if the |
449 makes some subtle choices leading to a POSIX value: for example if an |
444 alternative, say @{term "ALT r\<^sub>1 r\<^sub>2"}, can match the empty |
450 alternative regular expression, say @{term "ALT r\<^sub>1 r\<^sub>2"}, can |
445 string and furthermore @{term "r\<^sub>1"} can match the empty string, |
451 match the empty string and furthermore @{term "r\<^sub>1"} can match the |
446 then we return a @{const Left}-value. The @{const Right}-value will only |
452 empty string, then we return a @{text Left}-value. The @{text |
447 be returned if @{term "r\<^sub>1"} is not nullable. |
453 Right}-value will only be returned if @{term "r\<^sub>1"} is not nullable. |
448 |
454 |
449 The most interesting novelty from Sulzmann and Lu \cite{Sulzmann2014} is |
455 The most interesting novelty from Sulzmann and Lu \cite{Sulzmann2014} is |
450 the construction value for how @{term "r\<^sub>1"} can match the string |
456 the construction of a value for how @{term "r\<^sub>1"} can match the |
451 @{term "[a,b,c]"} from the value how the last derivative, @{term |
457 string @{term "[a,b,c]"} from the value how the last derivative, @{term |
452 "r\<^sub>4"} in Fig~\ref{Sulz}, can match the empty string. Sulzmann and |
458 "r\<^sub>4"} in Fig~\ref{Sulz}, can match the empty string. Sulzmann and |
453 Lu acchieve this by stepwise ``injecting back'' the characters into the |
459 Lu achieve this by stepwise ``injecting back'' the characters into the |
454 values thus inverting the operation of building derivatives on the level |
460 values thus inverting the operation of building derivatives on the level |
455 of values. The corresponding function, called @{term inj}, takes three |
461 of values. The corresponding function, called @{term inj}, takes three |
456 arguments, a regular expression, a character and a value. For example in |
462 arguments, a regular expression, a character and a value. For example in |
457 the first @{term inj}-step in Fig~\ref{Sulz} the regular expression @{term |
463 the first (or right-most) @{term inj}-step in Fig~\ref{Sulz} the regular |
458 "r\<^sub>3"}, the character @{term c} from the last derivative step and |
464 expression @{term "r\<^sub>3"}, the character @{term c} from the last |
459 @{term "v\<^sub>4"}, which is the value corresponding to the derivative |
465 derivative step and @{term "v\<^sub>4"}, which is the value corresponding |
460 regular expression @{term "r\<^sub>4"}. The result is the new value @{term |
466 to the derivative regular expression @{term "r\<^sub>4"}. The result is |
461 "v\<^sub>3"}. The final result of the algorithm is the value @{term |
467 the new value @{term "v\<^sub>3"}. The final result of the algorithm is |
462 "v\<^sub>1"} corresponding to the input regular expression. The @{term |
468 the value @{term "v\<^sub>1"} corresponding to the input regular |
463 inj} function is by recursion on the regular expression and by analysing |
469 expression. The @{term inj} function is by recursion on the regular |
464 the shape of values. |
470 expressions and by analysing the shape of values (corresponding to |
465 |
471 the derivative regular expressions). |
466 \begin{center} |
472 |
467 \begin{tabular}{llcl} |
473 \begin{center} |
|
474 \begin{tabular}{l@ {\hspace{5mm}}lcl} |
468 (1) & @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\ |
475 (1) & @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\ |
469 (2) & @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & |
476 (2) & @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & |
470 @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\ |
477 @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\ |
471 (3) & @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & |
478 (3) & @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & |
472 @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ |
479 @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ |
481 \end{tabular} |
488 \end{tabular} |
482 \end{center} |
489 \end{center} |
483 |
490 |
484 \noindent To better understand what is going on in this definition it |
491 \noindent To better understand what is going on in this definition it |
485 might be instructive to look first at the three sequence cases (clauses |
492 might be instructive to look first at the three sequence cases (clauses |
486 (4)--(6)). In each case we need to construct an ``injected value'' for @{term |
493 (4)--(6)). In each case we need to construct an ``injected value'' for |
487 "SEQ r\<^sub>1 r\<^sub>2"}. Recall the clause of the @{const der}-function |
494 @{term "SEQ r\<^sub>1 r\<^sub>2"}. This must be a value of the form @{term |
|
495 "Seq DUMMY DUMMY"}. Recall the clause of the @{const der}-function |
488 for sequence regular expressions: |
496 for sequence regular expressions: |
489 |
497 |
490 \begin{center} |
498 \begin{center} |
491 @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} $\dn$ @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} |
499 @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} $\dn$ @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} |
492 \end{center} |
500 \end{center} |
493 |
501 |
494 \noindent Consider first the else-branch where the derivative is @{term |
502 \noindent Consider first the else-branch where the derivative is @{term |
495 "SEQ (der c r\<^sub>1) r\<^sub>2"}. The corresponding value must therefore |
503 "SEQ (der c r\<^sub>1) r\<^sub>2"}. The corresponding value must therefore |
496 be the form @{term "Seq v\<^sub>1 v\<^sub>2"}, which matches clause (4) of |
504 be the form @{term "Seq v\<^sub>1 v\<^sub>2"}, which matches the left-hand |
497 @{term inj}. In the if-branch the derivative is an alternative, namely |
505 side in clause (4) of @{term inj}. In the if-branch the derivative is an |
498 @{term "ALT (SEQ (der c r\<^sub>1) r\<^sub>2) (der c r\<^sub>2)"}. This |
506 alternative, namely @{term "ALT (SEQ (der c r\<^sub>1) r\<^sub>2) (der c |
499 means we either have to consider a @{text Left}- or @{text Right}-value. |
507 r\<^sub>2)"}. This means we either have to consider a @{text Left}- or |
500 In case of the @{text Left}-value we know further it must be a value for a |
508 @{text Right}-value. In case of the @{text Left}-value we know further it |
501 sequence regular expression. Therefore the pattern we match in the clause |
509 must be a value for a sequence regular expression. Therefore the pattern |
502 (5) is @{term "Left (Seq v\<^sub>1 v\<^sub>2)"}, while in (6) it is just |
510 we match in the clause (5) is @{term "Left (Seq v\<^sub>1 v\<^sub>2)"}, |
503 @{term "Right v\<^sub>2"}. One more interesting point is in the right-hand |
511 while in (6) it is just @{term "Right v\<^sub>2"}. One more interesting |
504 side of clause (6): since in this case the regular expression @{text |
512 point is in the right-hand side of clause (6): since in this case the |
505 "r\<^sub>1"} does not ``contribute'' in matching the string, that is only |
513 regular expression @{text "r\<^sub>1"} does not ``contribute'' for |
506 matches the empty string, we need to call @{const mkeps} in order to |
514 matching the string, that is only matches the empty string, we need to |
507 construct a value how @{term "r\<^sub>1"} can match this empty string. A |
515 call @{const mkeps} in order to construct a value how @{term "r\<^sub>1"} |
508 similar argument applies for why we can expect in clause (7) that the |
516 can match this empty string. A similar argument applies for why we can |
509 value is of the form @{term "Seq v (Stars vs)"} (the derivative of a star |
517 expect in the left-hand side of clause (7) that the value is of the form |
510 is @{term "SEQ r (STAR r)"}). Finally, the reason for why we can ignore |
518 @{term "Seq v (Stars vs)"}---the derivative of a star is @{term "SEQ r |
511 the second argument in clause (1) of @{term inj} is that it will only ever |
519 (STAR r)"}. Finally, the reason for why we can ignore the second argument |
512 be called in cases where @{term "c=d"}, but the usual linearity |
520 in clause (1) of @{term inj} is that it will only ever be called in cases |
513 restrictions in pattern-matches do not allow is to build this constraint |
521 where @{term "c=d"}, but the usual linearity restrictions in patterns do |
514 explicitly into the pattern. |
522 not allow is to build this constraint explicitly into our function |
|
523 definition.\footnote{Sulzmann and Lu state this clause as @{thm (lhs) |
|
524 injval.simps(1)[of "c" "c"]} $\dn$ @{thm (rhs) injval.simps(1)[of "c"]}, |
|
525 but our deviation is harmless.} |
515 |
526 |
516 Having defined the @{const mkeps} and @{text inj} function we can extend |
527 Having defined the @{const mkeps} and @{text inj} function we can extend |
517 \Brz's matcher so that a [lexical] value is constructed (assuming the |
528 \Brz's matcher so that a [lexical] value is constructed (assuming the |
518 regular expression matches the string). The clauses of the lexer are |
529 regular expression matches the string). The clauses of the lexer are |
519 |
530 |
525 & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"} |
536 & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"} |
526 \end{tabular} |
537 \end{tabular} |
527 \end{center} |
538 \end{center} |
528 |
539 |
529 \noindent If the regular expression does not match, @{const None} is |
540 \noindent If the regular expression does not match, @{const None} is |
530 returned. If the regular expression does match the string, then @{const |
541 returned, indicating an error is raised. If the regular expression does |
531 Some} value is returned. Again the virtues of this algorithm is that it |
542 match the string, then @{const Some} value is returned. One important |
532 can be implemented with ease in a functional programming language and also |
543 virtue of this algorithm is that it can be implemented with ease in a |
533 in Isabelle/HOL. In the remaining part of this section we prove that |
544 functional programming language and also in Isabelle/HOL. In the remaining |
534 this algorithm is correct. |
545 part of this section we prove that this algorithm is correct. |
535 |
546 |
536 The well-known idea of POSIX matching is informally defined by the longest |
547 The well-known idea of POSIX matching is informally defined by the longest |
537 match and priority rule; as correctly argued in \cite{Sulzmann2014}, this |
548 match and priority rule; as correctly argued in \cite{Sulzmann2014}, this |
538 needs formal specification. |
549 needs formal specification. Sulzmann and Lu define a \emph{dominance} |
539 |
550 relation\footnote{Sulzmann and Lu call it an ordering relation, but |
540 We use a simple inductive definition to specify this notion, incorporating |
551 without giving evidence that it is transitive.} between values and argue that |
541 the POSIX-specific choices into the side-conditions for the rules $R tl |
552 there is a maximum value, as given by the derivative-based algorithm. In |
542 +_2$, $R tl\circ$ and $R tl*$ (as they are now called). By contrast, |
553 contrast, we shall next introduce a simple inductive definition to specify |
543 \cite{Sulzmann2014} defines a relation between values and argues that there is a |
554 what a \emph{POSIX value} is, incorporating the POSIX-specific choices |
544 maximum value, as given by the derivative-based algorithm yet to be spelt |
555 into the side-conditions of our rules. Our definition is inspired by the |
545 out. The relation we define is ternary, relating strings, values and regular |
556 matching relation given in \cite{Vansummeren2006}. The relation we define |
546 expressions. |
557 is ternary and written as \mbox{@{term "s \<in> r \<rightarrow> v"}}, relating strings, |
547 |
558 regular expressions and values. |
548 Our Posix relation @{term "s \<in> r \<rightarrow> v"} |
|
549 |
559 |
550 \begin{center} |
560 \begin{center} |
551 \begin{tabular}{c} |
561 \begin{tabular}{c} |
552 @{thm[mode=Axiom] PMatch.intros(1)} \qquad |
562 @{thm[mode=Axiom] PMatch.intros(1)} \qquad |
553 @{thm[mode=Axiom] PMatch.intros(2)}\medskip\\ |
563 @{thm[mode=Axiom] PMatch.intros(2)}\bigskip\\ |
554 @{thm[mode=Rule] PMatch.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\qquad |
564 @{thm[mode=Rule] PMatch.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\qquad |
555 @{thm[mode=Rule] PMatch.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\medskip\\ |
565 @{thm[mode=Rule] PMatch.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\bigskip\\ |
556 \multicolumn{1}{p{5cm}}{@{thm[mode=Rule] PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}\medskip\\ |
566 $\mprset{flushleft} |
557 @{thm[mode=Rule] PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}\medskip\\ |
567 \inferrule |
558 @{thm[mode=Axiom] PMatch.intros(7)}\medskip\\ |
568 {@{thm (prem 1) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad |
559 \end{tabular} |
569 @{thm (prem 2) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\ |
560 \end{center} |
570 @{thm (prem 3) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}} |
561 |
571 {@{thm (concl) PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\\ |
|
572 @{thm[mode=Axiom] PMatch.intros(7)}\bigskip\\ |
|
573 $\mprset{flushleft} |
|
574 \inferrule |
|
575 {@{thm (prem 1) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad |
|
576 @{thm (prem 2) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad |
|
577 @{thm (prem 3) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\ |
|
578 @{thm (prem 4) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}} |
|
579 {@{thm (concl) PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$ |
|
580 \end{tabular} |
|
581 \end{center} |
|
582 |
|
583 \noindent We claim that this relation captures the idea behind the two |
|
584 informal POSIX rules shown in the Introduction: Consider the second line |
|
585 where the POSIX values for an alternative regular expression is |
|
586 specified---it is always a @{text "Left"}-value, \emph{except} when the |
|
587 string to be matched is not in the language of @{term "r\<^sub>1"}; only then it |
|
588 is a @{text Right}-value (see the side-condition in the rule on the |
|
589 right). Interesting is also the rule for sequence regular expressions |
|
590 (third line). The first two premises state that @{term "v\<^sub>1"} and |
|
591 @{term "v\<^sub>2"} are the POSIX values for @{term "(s\<^sub>1, |
|
592 r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"} respectively. |
|
593 |
|
594 \begin{theorem} |
|
595 @{thm[mode=IfThen] PMatch_determ(1)[of _ _ "v\<^sub>1" "v\<^sub>2"]} |
|
596 \end{theorem} |
|
597 |
|
598 \begin{lemma} |
|
599 @{thm[mode=IfThen] PMatch_mkeps} |
|
600 \end{lemma} |
|
601 |
|
602 \begin{lemma} |
|
603 @{thm[mode=IfThen] Prf_injval_flat} |
|
604 \end{lemma} |
|
605 |
|
606 \begin{lemma} |
|
607 @{thm[mode=IfThen] PMatch2_roy_version} |
|
608 \end{lemma} |
|
609 |
|
610 \begin{theorem}\mbox{}\smallskip\\ |
|
611 \begin{tabular}{ll} |
|
612 (1) & @{thm (lhs) lex_correct1a} if and only if @{thm (rhs) lex_correct1a}\\ |
|
613 (2) & @{thm (lhs) lex_correct3a} if and only if @{thm (rhs) lex_correct3a}\\ |
|
614 \end{tabular} |
|
615 \end{theorem} |
562 *} |
616 *} |
563 |
617 |
564 section {* The Argument by Sulzmmann and Lu *} |
618 section {* The Argument by Sulzmmann and Lu *} |
565 |
619 |
566 section {* Conclusion *} |
620 section {* Conclusion *} |