7 |
7 |
8 abbreviation |
8 abbreviation |
9 "der_syn r c \<equiv> der c r" |
9 "der_syn r c \<equiv> der c r" |
10 |
10 |
11 notation (latex output) |
11 notation (latex output) |
12 If ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and |
12 If ("(\<^raw:\textrm{>if\<^raw:}> (_)/ \<^raw:\textrm{>then\<^raw:}> (_)/ \<^raw:\textrm{>else\<^raw:}> (_))" 10) and |
13 Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [78,77] 73) and |
13 Cons ("_\<^raw:\mbox{$\,$}>::\<^raw:\mbox{$\,$}>_" [78,77] 73) and |
|
14 |
14 ZERO ("\<^bold>0" 80) and |
15 ZERO ("\<^bold>0" 80) and |
15 ONE ("\<^bold>1" 80) and |
16 ONE ("\<^bold>1" 80) and |
16 CHAR ("_" [1000] 80) and |
17 CHAR ("_" [1000] 80) and |
17 ALT ("_ + _" [77,77] 78) and |
18 ALT ("_ + _" [77,77] 78) and |
18 SEQ ("_ \<cdot> _" [77,77] 78) and |
19 SEQ ("_ \<cdot> _" [77,77] 78) and |
19 STAR ("_\<^sup>\<star>" [1000] 78) and |
20 STAR ("_\<^sup>\<star>" [1000] 78) and |
|
21 |
|
22 val.Void ("'(')" 78) and |
20 val.Char ("Char _" [1000] 78) and |
23 val.Char ("Char _" [1000] 78) and |
21 val.Left ("Left _" [1000] 78) and |
24 val.Left ("Left _" [1000] 78) and |
22 val.Right ("Right _" [1000] 78) and |
25 val.Right ("Right _" [1000] 78) and |
|
26 |
23 L ("L'(_')" [10] 78) and |
27 L ("L'(_')" [10] 78) and |
24 der_syn ("_\\_" [79, 1000] 76) and |
28 der_syn ("_\\_" [79, 1000] 76) and |
25 flat ("|_|" [70] 73) and |
29 flat ("|_|" [70] 73) and |
26 Sequ ("_ @ _" [78,77] 63) and |
30 Sequ ("_ @ _" [78,77] 63) and |
27 injval ("inj _ _ _" [1000,77,1000] 77) and |
31 injval ("inj _ _ _" [1000,77,1000] 77) and |
151 |
155 |
152 *} |
156 *} |
153 |
157 |
154 section {* Preliminaries *} |
158 section {* Preliminaries *} |
155 |
159 |
156 text {* \noindent Strings in Isabelle/HOL are lists of characters with |
160 text {* \noindent Strings in Isabelle/HOL are lists of characters with the |
157 the empty string being represented by the empty list, written @{term |
161 empty string being represented by the empty list, written @{term "[]"}, and |
158 "[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}. |
162 list-cons being written as @{term "DUMMY # DUMMY"}. Often we use the usual |
159 Often we use the usual bracket notation for strings; for example a |
163 bracket notation for strings; for example a string consisting of just a |
160 string consisting of a single character is written @{term "[c]"}. By |
164 single character is written @{term "[c]"}. By using the type @{type char} |
161 using the type @{type char} for characters we have a supply of |
165 for characters we have a supply of finitely many characters roughly |
162 finitely many characters roughly corresponding to the ASCII |
166 corresponding to the ASCII character set. Regular expressions are defined as |
163 character set. Regular expressions are defined as usual as the |
167 usual as the following inductive datatype: |
164 following inductive datatype: |
|
165 |
168 |
166 \begin{center} |
169 \begin{center} |
167 @{text "r :="} |
170 @{text "r :="} |
168 @{const "ZERO"} $\mid$ |
171 @{const "ZERO"} $\mid$ |
169 @{const "ONE"} $\mid$ |
172 @{const "ONE"} $\mid$ |
246 \end{tabular} |
249 \end{tabular} |
247 \end{center} |
250 \end{center} |
248 *} |
251 *} |
249 |
252 |
250 section {* POSIX Regular Expression Matching *} |
253 section {* POSIX Regular Expression Matching *} |
|
254 |
|
255 text {* |
|
256 |
|
257 The clever idea in \cite{Sulzmann2014} is to define a function on values that mirrors |
|
258 (but inverts) the construction of the derivative on regular expressions. We |
|
259 begin with the case of a nullable regular expression: from the nullability |
|
260 we need to construct a value that witnesses the nullability. This is as |
|
261 follows. The @{const mkeps} function (from \cite{Sulzmann2014}) is a partial (but |
|
262 unambiguous) function from regular expressions to values, total on exactly |
|
263 the set of nullable regular expressions. |
|
264 |
|
265 \begin{center} |
|
266 \begin{tabular}{lcl} |
|
267 @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\ |
|
268 @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
|
269 @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
|
270 @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\ |
|
271 \end{tabular} |
|
272 \end{center} |
|
273 |
|
274 The well-known idea of POSIX lexing is informally defined in (for example) |
|
275 \cite{posix}; as correctly argued in \cite{Sulzmann2014}, this needs formal |
|
276 specification. The rough idea is that, in contrast to the so-called GREEDY |
|
277 algorithm, POSIX lexing chooses to match more deeply and using left choices |
|
278 rather than a right choices. For example, note that to match the string |
|
279 @{term "[a, b]"} with the regular expression $(a + \mts)\circ (b+ab)$ the matching |
|
280 will return $( Void, Right(ab))$ rather than $(Left\ a, Left\ b)$. [The |
|
281 regular expression $ab$ is short for $(Lit\ a) \circ (Lit\ b)$.] Similarly, |
|
282 to match {\em ``a''} with $(a+a)$ the leftmost $a$ will be chosen. |
|
283 |
|
284 We use a simple inductive definition to specify this notion, incorporating |
|
285 the POSIX-specific choices into the side-conditions for the rules $R tl |
|
286 +_2$, $R tl\circ$ and $R tl*$ (as they are now called). By contrast, |
|
287 \cite{Sulzmann2014} defines a relation between values and argues that there is a |
|
288 maximum value, as given by the derivative-based algorithm yet to be spelt |
|
289 out. The relation we define is ternary, relating strings, values and regular |
|
290 expressions. |
|
291 |
|
292 *} |
251 |
293 |
252 section {* The Argument by Sulzmmann and Lu *} |
294 section {* The Argument by Sulzmmann and Lu *} |
253 |
295 |
254 section {* Conclusion *} |
296 section {* Conclusion *} |
255 |
297 |
276 @{term "Right v"} $\mid$ |
318 @{term "Right v"} $\mid$ |
277 @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ |
319 @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ |
278 @{term "Stars vs"} |
320 @{term "Stars vs"} |
279 \end{center} |
321 \end{center} |
280 |
322 |
281 \noindent |
|
282 The language of a regular expression |
|
283 |
|
284 \begin{center} |
|
285 \begin{tabular}{lcl} |
|
286 @{thm (lhs) L.simps(1)} & $\dn$ & @{thm (rhs) L.simps(1)}\\ |
|
287 @{thm (lhs) L.simps(2)} & $\dn$ & @{thm (rhs) L.simps(2)}\\ |
|
288 @{thm (lhs) L.simps(3)} & $\dn$ & @{thm (rhs) L.simps(3)}\\ |
|
289 @{thm (lhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
|
290 @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
|
291 @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\ |
|
292 \end{tabular} |
|
293 \end{center} |
|
294 |
|
295 |
323 |
296 |
324 |
297 \noindent |
325 \noindent |
298 The @{const flat} function for values |
326 The @{const flat} function for values |
299 |
327 |