137 \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):} |
137 \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):} |
138 |
138 |
139 The longest initial substring matched by any regular expression is taken as |
139 The longest initial substring matched by any regular expression is taken as |
140 next token.\smallskip |
140 next token.\smallskip |
141 |
141 |
142 \item[$\bullet$] \underline{Rule Priority:} |
142 \item[$\bullet$] \underline{Priority Rule:} |
143 |
143 |
144 For a particular longest initial substring, the first regular expression |
144 For a particular longest initial substring, the first regular expression |
145 that can match determines the token. |
145 that can match determines the token. |
146 \end{itemize} |
146 \end{itemize} |
147 |
147 |
366 \end{center} |
366 \end{center} |
367 |
367 |
368 \noindent Note that no values are associated with the regular expression |
368 \noindent Note that no values are associated with the regular expression |
369 @{term ZERO}, and that the only value associated with the regular |
369 @{term ZERO}, and that the only value associated with the regular |
370 expression @{term ONE} is @{term Void}, pronounced (if one must) as {\em |
370 expression @{term ONE} is @{term Void}, pronounced (if one must) as {\em |
371 ``Void''}. It is routine to stablish how values ``inhabiting'' a regular |
371 ``Void''}. It is routine to establish how values ``inhabiting'' a regular |
372 expression correspond to the language of a regular expression, namely |
372 expression correspond to the language of a regular expression, namely |
373 |
373 |
374 \begin{proposition} |
374 \begin{proposition} |
375 @{thm L_flat_Prf} |
375 @{thm L_flat_Prf} |
376 \end{proposition} |
376 \end{proposition} |
524 & & \phantom{$|$} @{term "None"} @{text "\<Rightarrow>"} @{term None}\\ |
524 & & \phantom{$|$} @{term "None"} @{text "\<Rightarrow>"} @{term None}\\ |
525 & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"} |
525 & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"} |
526 \end{tabular} |
526 \end{tabular} |
527 \end{center} |
527 \end{center} |
528 |
528 |
529 |
529 \noindent If the regular expression does not match, @{const None} is |
530 |
530 returned. If the regular expression does match the string, then @{const |
531 NOT DONE YET |
531 Some} value is returned. Again the virtues of this algorithm is that it |
532 |
532 can be implemented with ease in a functional programming language and also |
533 Therefore there are, for example, three |
533 in Isabelle/HOL. In the remaining part of this section we prove that |
534 cases for sequence regular expressions (for all possible shapes of the |
534 this algorithm is correct. |
535 value). |
535 |
536 |
536 The well-known idea of POSIX matching is informally defined by the longest |
537 Again the virtues of this algorithm is that it can be |
537 match and priority rule; as correctly argued in \cite{Sulzmann2014}, this |
538 implemented with ease in a functional programming language and |
538 needs formal specification. |
539 also in Isabelle/HOL. |
539 |
540 |
540 We use a simple inductive definition to specify this notion, incorporating |
541 The well-known idea of POSIX lexing is informally defined in (for example) |
541 the POSIX-specific choices into the side-conditions for the rules $R tl |
542 \cite{posix}; as correctly argued in \cite{Sulzmann2014}, this needs formal |
542 +_2$, $R tl\circ$ and $R tl*$ (as they are now called). By contrast, |
543 specification. The rough idea is that, in contrast to the so-called GREEDY |
543 \cite{Sulzmann2014} defines a relation between values and argues that there is a |
544 algorithm, POSIX lexing chooses to match more deeply and using left choices |
544 maximum value, as given by the derivative-based algorithm yet to be spelt |
545 rather than a right choices. For example, note that to match the string |
545 out. The relation we define is ternary, relating strings, values and regular |
546 @{term "[a, b]"} with the regular expression $(a + \mts)\circ (b+ab)$ the matching |
546 expressions. |
547 will return $( Void, Right(ab))$ rather than $(Left\ a, Left\ b)$. [The |
|
548 regular expression $ab$ is short for $(Lit\ a) \circ (Lit\ b)$.] Similarly, |
|
549 to match {\em ``a''} with $(a+a)$ the leftmost $a$ will be chosen. |
|
550 |
|
551 We use a simple inductive definition to specify this notion, incorporating |
|
552 the POSIX-specific choices into the side-conditions for the rules $R tl |
|
553 +_2$, $R tl\circ$ and $R tl*$ (as they are now called). By contrast, |
|
554 \cite{Sulzmann2014} defines a relation between values and argues that there is a |
|
555 maximum value, as given by the derivative-based algorithm yet to be spelt |
|
556 out. The relation we define is ternary, relating strings, values and regular |
|
557 expressions. |
|
558 |
547 |
559 Our Posix relation @{term "s \<in> r \<rightarrow> v"} |
548 Our Posix relation @{term "s \<in> r \<rightarrow> v"} |
560 |
549 |
561 \begin{center} |
550 \begin{center} |
562 \begin{tabular}{c} |
551 \begin{tabular}{c} |