84 experience of doing our proofs has been that this mechanical checking was |
84 experience of doing our proofs has been that this mechanical checking was |
85 absolutely essential: this subject area has hidden snares. This was also |
85 absolutely essential: this subject area has hidden snares. This was also |
86 noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching |
86 noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching |
87 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}. |
87 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}. |
88 |
88 |
89 If a regular expression matches a string, then in general there are more |
89 If a regular expression matches a string, then in general there is more |
90 than one way of how the string is matched. There are two commonly used |
90 than one way of how the string is matched. There are two commonly used |
91 disambiguation strategies to generate a unique answer: one is called GREEDY |
91 disambiguation strategies to generate a unique answer: one is called GREEDY |
92 matching \cite{Frisch2004} and the other is POSIX |
92 matching \cite{Frisch2004} and the other is POSIX |
93 matching~\cite{Kuklewicz,Sulzmann2014}. For example consider the string |
93 matching~\cite{Kuklewicz,Sulzmann2014}. For example consider the string |
94 @{term xy} and the regular expression \mbox{@{term "STAR (ALT (ALT x y) xy)"}}. |
94 @{term xy} and the regular expression \mbox{@{term "STAR (ALT (ALT x y) |
95 Either the string can be matched in two `iterations' by the single |
95 xy)"}}. Either the string can be matched in two `iterations' by the single |
96 letter-regular expressions @{term x} and @{term y}, or directly in one |
96 letter-regular expressions @{term x} and @{term y}, or directly in one |
97 iteration by @{term xy}. The first case corresponds to GREEDY matching, |
97 iteration by @{term xy}. The first case corresponds to GREEDY matching, |
98 which first matches with the left-most symbol and only matches the next |
98 which first matches with the left-most symbol and only matches the next |
99 symbol in case of a mismatch (this is greedy in the sense of preferring |
99 symbol in case of a mismatch (this is greedy in the sense of preferring |
100 instant gratification to delayed repletion). The second case is POSIX |
100 instant gratification to delayed repletion). The second case is POSIX |
101 matching, which prefers the longest match. |
101 matching, which prefers the longest match. |
102 |
102 |
103 In the context of lexing, where an input string needs to be separated into a |
103 In the context of lexing, where an input string needs to be split up into a |
104 sequence of tokens, POSIX is the more natural disambiguation strategy for |
104 sequence of tokens, POSIX is the more natural disambiguation strategy for |
105 what programmers consider basic syntactic building blocks in their programs. |
105 what programmers consider basic syntactic building blocks in their programs. |
106 These building blocks are often specified by some regular expressions, say @{text |
106 These building blocks are often specified by some regular expressions, say |
107 "r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising |
107 @{text "r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising keywords and |
108 keywords and identifiers, respectively. There are two underlying rules |
108 identifiers, respectively. There are two underlying (informal) rules behind |
109 behind tokenising a string in a POSIX fashion: |
109 tokenising a string in a POSIX fashion: |
110 |
110 |
111 \begin{itemize} |
111 \begin{itemize} |
112 \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):} |
112 \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):} |
113 |
113 |
114 The longest initial substring matched by any regular expression is taken as |
114 The longest initial substring matched by any regular expression is taken as |
192 @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\ |
192 @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\ |
193 \end{tabular} |
193 \end{tabular} |
194 \end{center} |
194 \end{center} |
195 |
195 |
196 \noindent In the fourth clause we use @{term "DUMMY ;; DUMMY"} for the |
196 \noindent In the fourth clause we use @{term "DUMMY ;; DUMMY"} for the |
197 concatenation of two languages. We use the star-notation for regular |
197 concatenation of two languages (it is also list-append for strings). We |
198 expressions and sets of strings (in the last clause). The star on sets is |
198 use the star-notation for regular expressions and also for languages (in |
199 defined inductively as usual by two clauses for the empty string being in |
199 the last clause). The star for languages is defined inductively as usual |
200 the star of a language and is @{term "s\<^sub>1"} is in a language and |
200 by two clauses for the empty string being in the star of a language and if |
201 @{term "s\<^sub>2"} and in the star of this language then also @{term |
201 @{term "s\<^sub>1"} is in a language and @{term "s\<^sub>2"} in the star of this |
202 "s\<^sub>1 @ s\<^sub>2"} is in the star of this language. |
202 language, then also @{term "s\<^sub>1 @ s\<^sub>2"} is in the star of this language. |
203 |
203 |
204 \emph{Semantic derivatives} of sets of strings are defined as |
204 \emph{Semantic derivatives} of sets of strings are defined as |
205 |
205 |
206 \begin{center} |
206 \begin{center} |
207 \begin{tabular}{lcl} |
207 \begin{tabular}{lcl} |
208 @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\ |
208 @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\ |
209 \end{tabular} |
209 \end{tabular} |
210 \end{center} |
210 \end{center} |
211 |
211 |
212 \noindent where the second definitions lifts the notion of semantic |
212 |
213 derivatives from characters to strings. |
|
214 |
213 |
215 \noindent |
214 \noindent |
216 The nullable function |
215 The nullable function |
217 |
216 |
218 \begin{center} |
217 \begin{center} |
228 |
227 |
229 \noindent |
228 \noindent |
230 The derivative function for characters and strings |
229 The derivative function for characters and strings |
231 |
230 |
232 \begin{center} |
231 \begin{center} |
233 \begin{tabular}{lcp{7.5cm}} |
232 \begin{tabular}{lcp{8cm}} |
234 @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\ |
233 @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\ |
235 @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\ |
234 @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\ |
236 @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\ |
235 @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\ |
237 @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\ |
236 @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\ |
238 @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\ |
237 @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\ |
286 +_2$, $R tl\circ$ and $R tl*$ (as they are now called). By contrast, |
285 +_2$, $R tl\circ$ and $R tl*$ (as they are now called). By contrast, |
287 \cite{Sulzmann2014} defines a relation between values and argues that there is a |
286 \cite{Sulzmann2014} defines a relation between values and argues that there is a |
288 maximum value, as given by the derivative-based algorithm yet to be spelt |
287 maximum value, as given by the derivative-based algorithm yet to be spelt |
289 out. The relation we define is ternary, relating strings, values and regular |
288 out. The relation we define is ternary, relating strings, values and regular |
290 expressions. |
289 expressions. |
|
290 |
|
291 Our Posix relation @{term "s \<in> r \<rightarrow> v"} |
|
292 |
|
293 \begin{center} |
|
294 \begin{tabular}{c} |
|
295 @{thm[mode=Axiom] PMatch.intros(1)} \qquad |
|
296 @{thm[mode=Axiom] PMatch.intros(2)}\medskip\\ |
|
297 @{thm[mode=Rule] PMatch.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\qquad |
|
298 @{thm[mode=Rule] PMatch.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\medskip\\ |
|
299 \multicolumn{1}{p{5cm}}{@{thm[mode=Rule] PMatch.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}\medskip\\ |
|
300 @{thm[mode=Rule] PMatch.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}\medskip\\ |
|
301 @{thm[mode=Axiom] PMatch.intros(7)}\medskip\\ |
|
302 \end{tabular} |
|
303 \end{center} |
291 |
304 |
292 *} |
305 *} |
293 |
306 |
294 section {* The Argument by Sulzmmann and Lu *} |
307 section {* The Argument by Sulzmmann and Lu *} |
295 |
308 |