37 Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em |
37 Brzozowski \cite{Brzozowski1964} introduced the notion of the {\em |
38 derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a |
38 derivative} @{term "der c r"} of a regular expression @{text r} w.r.t.\ a |
39 character~@{text c}, and showed that it gave a simple solution to the |
39 character~@{text c}, and showed that it gave a simple solution to the |
40 problem of matching a string @{term s} with a regular expression @{term r}: |
40 problem of matching a string @{term s} with a regular expression @{term r}: |
41 if the derivative of @{term r} w.r.t.\ (in succession) all the characters of |
41 if the derivative of @{term r} w.r.t.\ (in succession) all the characters of |
42 the string matches the empty string $\mts$, then @{term r} matches @{term s} |
42 the string matches the empty string, then @{term r} matches @{term s} |
43 (and {\em vice versa}). The derivative has the property (which may be |
43 (and {\em vice versa}). The derivative has the property (which may be |
44 regarded as its specification) that, for every string @{term s} and regular |
44 regarded as its specification) that, for every string @{term s} and regular |
45 expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if |
45 expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if |
46 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's |
46 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's |
47 derivatives is that they are neatly expressible in any functional language, |
47 derivatives is that they are neatly expressible in any functional language, |
74 approach (of which some of the proofs are not published in |
74 approach (of which some of the proofs are not published in |
75 \cite{Sulzmann2014}); perhaps more importantly, we give a simple inductive |
75 \cite{Sulzmann2014}); perhaps more importantly, we give a simple inductive |
76 (and algorithm-independent) definition of what we call being a {\em POSIX |
76 (and algorithm-independent) definition of what we call being a {\em POSIX |
77 value} for a regular expression @{term r} and a string @{term s}; we show |
77 value} for a regular expression @{term r} and a string @{term s}; we show |
78 that the algorithm computes such a value and that such a value is unique. |
78 that the algorithm computes such a value and that such a value is unique. |
79 Proofs are both done by hand and checked in {\em Isabelle/HOL}. The |
79 Proofs are both done by hand and checked in Isabelle/HOL. The |
80 experience of doing our proofs has been that this mechanical checking was |
80 experience of doing our proofs has been that this mechanical checking was |
81 absolutely essential: this subject area has hidden snares. This was also |
81 absolutely essential: this subject area has hidden snares. This was also |
82 noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching |
82 noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching |
83 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}. |
83 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}. |
84 |
84 |
85 If a regular expression matches a string, then in general there are more |
85 If a regular expression matches a string, then in general there are more |
86 than one possibility of how the string is matched. There are two commonly |
86 than one way of how the string is matched. There are two commonly used |
87 used disambiguation strategies to generate a unique answer: one is called |
87 disambiguation strategies to generate a unique answer: one is called GREEDY |
88 GREEDY matching \cite{Frisch2004} and the other is POSIX |
88 matching \cite{Frisch2004} and the other is POSIX |
89 matching~\cite{Kuklewicz,Sulzmann2014}. For example consider the string |
89 matching~\cite{Kuklewicz,Sulzmann2014}. For example consider the string |
90 @{term xy} and the regular expression @{term "STAR (ALT (ALT x y) xy)"}. |
90 @{term xy} and the regular expression \mbox{@{term "STAR (ALT (ALT x y) xy)"}}. |
91 Either the string can be matched in two `iterations' by the single |
91 Either the string can be matched in two `iterations' by the single |
92 letter-regular expressions @{term x} and @{term y}, or directly in one |
92 letter-regular expressions @{term x} and @{term y}, or directly in one |
93 iteration by @{term xy}. The first case corresponds to GREEDY matching, |
93 iteration by @{term xy}. The first case corresponds to GREEDY matching, |
94 which first matches with the left-most symbol and only matches the next |
94 which first matches with the left-most symbol and only matches the next |
95 symbol in case of a mismatch. The second case is POSIX matching, which |
95 symbol in case of a mismatch (this is greedy in the sense of preferring |
96 prefers the longest match. |
96 instant gratification to delayed repletion). The second case is POSIX |
97 |
97 matching, which prefers the longest match. |
98 In the context of lexing, where an input string is separated into a sequence |
98 |
99 of tokens, POSIX is the more natural disambiguation strategy for what |
99 In the context of lexing, where an input string needs to be separated into a |
100 programmers consider basic syntactic building blocks in their programs. |
100 sequence of tokens, POSIX is the more natural disambiguation strategy for |
101 There are two underlying rules behind tokenising using POSIX matching: |
101 what programmers consider basic syntactic building blocks in their programs. |
|
102 These building blocks are often specified by some regular expressions, say @{text |
|
103 "r\<^bsub>key\<^esub>"} and @{text "r\<^bsub>id\<^esub>"} for recognising |
|
104 keywords and identifiers, respectively. There are two underlying rules |
|
105 behind tokenising a string in a POSIX fashion: |
102 |
106 |
103 \begin{itemize} |
107 \begin{itemize} |
104 \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):} |
108 \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):} |
105 |
109 |
106 The longest initial substring matched by any regular expression is taken as |
110 The longest initial substring matched by any regular expression is taken as |
110 |
114 |
111 For a particular longest initial substring, the first regular expression |
115 For a particular longest initial substring, the first regular expression |
112 that can match determines the token. |
116 that can match determines the token. |
113 \end{itemize} |
117 \end{itemize} |
114 |
118 |
115 \noindent Consider for example a regular expression @{text |
119 \noindent Consider for example @{text "r\<^bsub>key\<^esub>"} recognising |
116 "r\<^bsub>key\<^esub>"} that can recognise keywords, like @{text "if"}, |
120 keywords such as @{text "if"}, @{text "then"} and so on; and @{text |
117 @{text "then"} and so on; and another regular expression @{text |
121 "r\<^bsub>id\<^esub>"} recognising identifiers (a single character followed |
118 "r\<^bsub>id\<^esub>"} that can recognise identifiers (such as single |
122 by characters or numbers). Then we can form the regular expression @{text |
119 characters followed by characters or numbers). Then we can form the regular |
123 "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and use POSIX |
120 expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and |
124 matching to tokenise strings, say @{text "iffoo"} and @{text "if"}. In the |
121 use POSIX matching to, for example, tokenise the strings @{text "iffoo"} and |
125 first case we obtain by the longest match rule a single identifier token, |
122 @{text "if"}. In the first case we obtain by the longest match rule a |
126 not a keyword followed by identifier. In the second case we obtain by rule |
123 single identifier token, not a keyword and identifier. In the second case we |
127 priority a keyword token, not an identifier token---even if @{text |
124 obtain by rule priority a keyword token, not an identifier token. |
128 "r\<^bsub>id\<^esub>"} matches also.\bigskip |
125 \bigskip |
|
126 |
129 |
127 \noindent\textcolor{green}{Not Done Yet} |
130 \noindent\textcolor{green}{Not Done Yet} |
128 |
131 |
129 |
132 |
130 \medskip\noindent |
133 \medskip\noindent |
152 |
155 |
153 text {* \noindent Strings in Isabelle/HOL are lists of characters with |
156 text {* \noindent Strings in Isabelle/HOL are lists of characters with |
154 the empty string being represented by the empty list, written @{term |
157 the empty string being represented by the empty list, written @{term |
155 "[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}. |
158 "[]"}, and list-cons being written as @{term "DUMMY # DUMMY"}. |
156 Often we use the usual bracket notation for strings; for example a |
159 Often we use the usual bracket notation for strings; for example a |
157 string consiting of a single character is written @{term "[c]"}. By |
160 string consisting of a single character is written @{term "[c]"}. By |
158 using the type @{type char} for characters we have a supply of |
161 using the type @{type char} for characters we have a supply of |
159 finitely many characters roughly corresponding to the ASCII |
162 finitely many characters roughly corresponding to the ASCII |
160 character set. Regular expression are as usual and defined as the |
163 character set. Regular expressions are defined as usual as the |
161 following inductive datatype: |
164 following inductive datatype: |
162 |
165 |
163 \begin{center} |
166 \begin{center} |
164 @{text "r :="} |
167 @{text "r :="} |
165 @{const "ZERO"} $\mid$ |
168 @{const "ZERO"} $\mid$ |
184 @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
188 @{thm (lhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) L.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
185 @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\ |
189 @{thm (lhs) L.simps(6)} & $\dn$ & @{thm (rhs) L.simps(6)}\\ |
186 \end{tabular} |
190 \end{tabular} |
187 \end{center} |
191 \end{center} |
188 |
192 |
189 \noindent We use the star-notation for regular expressions and sets of strings. |
193 \noindent In the fourth clause we use @{term "DUMMY ;; DUMMY"} for the |
190 The Kleene-star on sets is defined inductively. |
194 concatenation of two languages. We use the star-notation for regular |
|
195 expressions and sets of strings (in the last clause). The star on sets is |
|
196 defined inductively as usual by two clauses for the empty string being in |
|
197 the star of a language and is @{term "s\<^sub>1"} is in a language and |
|
198 @{term "s\<^sub>2"} and in the star of this language then also @{term |
|
199 "s\<^sub>1 @ s\<^sub>2"} is in the star of this language. |
191 |
200 |
192 \emph{Semantic derivatives} of sets of strings are defined as |
201 \emph{Semantic derivatives} of sets of strings are defined as |
193 |
202 |
194 \begin{center} |
203 \begin{center} |
195 \begin{tabular}{lcl} |
204 \begin{tabular}{lcl} |
196 @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\ |
205 @{thm (lhs) Der_def} & $\dn$ & @{thm (rhs) Der_def}\\ |
197 @{thm (lhs) Ders_def} & $\dn$ & @{thm (rhs) Ders_def}\\ |
|
198 \end{tabular} |
206 \end{tabular} |
199 \end{center} |
207 \end{center} |
200 |
208 |
201 \noindent where the second definitions lifts the notion of semantic |
209 \noindent where the second definitions lifts the notion of semantic |
202 derivatives from characters to strings. |
210 derivatives from characters to strings. |
223 @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\ |
231 @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\ |
224 @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\ |
232 @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\ |
225 @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\ |
233 @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\ |
226 @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\ |
234 @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\ |
227 @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\ |
235 @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\ |
228 @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}\medskip\\ |
236 @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)} |
229 |
|
230 @{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}\\ |
|
231 @{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}\\ |
|
232 \end{tabular} |
237 \end{tabular} |
233 \end{center} |
238 \end{center} |
234 |
239 |
235 It is a relatively easy exercise to prove that |
240 It is a relatively easy exercise to prove that |
236 |
241 |