2 |
2 |
3 \chapter{Regular Expressions and POSIX Lexing} % Main chapter title |
3 \chapter{Regular Expressions and POSIX Lexing} % Main chapter title |
4 |
4 |
5 \label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts |
5 \label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts |
6 %and notations we |
6 %and notations we |
7 %use for describing the lexing algorithm by Sulzmann and Lu, |
7 % used for describing the lexing algorithm by Sulzmann and Lu, |
8 %and then give the algorithm and its variant, and discuss |
8 %and then give the algorithm and its variant and discuss |
9 %why more aggressive simplifications are needed. |
9 %why more aggressive simplifications are needed. |
10 |
10 |
11 In this chapter, we define the basic notions |
11 In this chapter, we define the basic notions |
12 for regular languages and regular expressions. |
12 for regular languages and regular expressions. |
13 This is essentially a description in "English" |
13 This is essentially a description in ``English" |
14 of your formalisation in Isabelle/HOL. |
14 of our formalisation in Isabelle/HOL. |
15 We also give the definition of what $\POSIX$ lexing means. |
15 We also give the definition of what $\POSIX$ lexing means, |
|
16 followed by an algorithm by Sulzmanna and Lu\parencite{Sulzmann2014} |
|
17 that produces the output conforming |
|
18 to the $\POSIX$ standard. |
|
19 It is also worth mentioning that |
|
20 we choose to use the ML-style notation |
|
21 for function applications, where |
|
22 the parameters of a function is not enclosed |
|
23 inside a pair of parentheses (e.g. $f \;x \;y$ |
|
24 instead of $f(x,\;y)$). This is mainly |
|
25 to make the text visually more concise. |
16 |
26 |
17 \section{Basic Concepts} |
27 \section{Basic Concepts} |
18 Usually formal language theory starts with an alphabet |
28 Usually, formal language theory starts with an alphabet |
19 denoting a set of characters. |
29 denoting a set of characters. |
20 Here we just use the datatype of characters from Isabelle, |
30 Here we just use the datatype of characters from Isabelle, |
21 which roughly corresponds to the ASCII characters. |
31 which roughly corresponds to the ASCII characters. |
22 In what follows we shall leave the information about the alphabet |
32 In what follows, we shall leave the information about the alphabet |
23 implicit. |
33 implicit. |
24 Then using the usual bracket notation for lists, |
34 Then using the usual bracket notation for lists, |
25 we can define strings made up of characters: |
35 we can define strings made up of characters: |
26 \begin{center} |
36 \begin{center} |
27 \begin{tabular}{lcl} |
37 \begin{tabular}{lcl} |
28 $\textit{s}$ & $\dn$ & $[] \; |\; c :: s$ |
38 $\textit{s}$ & $\dn$ & $[] \; |\; c :: s$ |
29 \end{tabular} |
39 \end{tabular} |
30 \end{center} |
40 \end{center} |
31 Where $c$ is a variable ranging over characters. |
41 Where $c$ is a variable ranging over characters. |
32 Strings can be concatenated to form longer strings in the same |
42 Strings can be concatenated to form longer strings in the same |
33 way as we concatenate two lists, which we write as @. |
43 way as we concatenate two lists, which we shall write as $s_1 @ s_2$. |
34 We omit the precise |
44 We omit the precise |
35 recursive definition here. |
45 recursive definition here. |
36 We overload this concatenation operator for two sets of strings: |
46 We overload this concatenation operator for two sets of strings: |
37 \begin{center} |
47 \begin{center} |
38 \begin{tabular}{lcl} |
48 \begin{tabular}{lcl} |
111 of $A$: |
121 of $A$: |
112 \begin{lemma} |
122 \begin{lemma} |
113 $\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\ |
123 $\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\ |
114 \end{lemma} |
124 \end{lemma} |
115 \begin{proof} |
125 \begin{proof} |
|
126 There are too inclusions to prove: |
116 \begin{itemize} |
127 \begin{itemize} |
117 \item{$\subseteq$} |
128 \item{$\subseteq$}:\\ |
118 \noindent |
|
119 The set |
129 The set |
120 \[ \{s \mid c :: s \in A*\} \] |
130 \[ \{s \mid c :: s \in A*\} \] |
121 is enclosed in the set |
131 is enclosed in the set |
122 \[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \] |
132 \[ \{s_1 @ s_2 \mid s_1 \, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \] |
123 because whenever you have a string starting with a character |
133 because whenever you have a string starting with a character |
124 in the language of a Kleene star $A*$, |
134 in the language of a Kleene star $A*$, |
125 then that character together with some sub-string |
135 then that character together with some sub-string |
126 immediately after it will form the first iteration, |
136 immediately after it will form the first iteration, |
127 and the rest of the string will |
137 and the rest of the string will |
128 be still in $A*$. |
138 be still in $A*$. |
129 \item{$\supseteq$} |
139 \item{$\supseteq$}:\\ |
130 Note that |
140 Note that |
131 \[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \] |
141 \[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \] |
132 and |
142 hold. |
|
143 Also this holds: |
133 \[ \Der \; c \; (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \] |
144 \[ \Der \; c \; (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \] |
134 where the $\textit{RHS}$ of the above equatioin can be rewritten |
145 where the $\textit{RHS}$ can be rewritten |
135 as \[ (\Der \; c\; A) @ A* \cup A' \], $A'$ being a possibly empty set. |
146 as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \] |
|
147 which of course contains $\Der \; c \; A @ A*$. |
136 \end{itemize} |
148 \end{itemize} |
137 \end{proof} |
149 \end{proof} |
138 |
150 |
139 \noindent |
151 \noindent |
140 Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart |
152 Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart |
141 for regular languages, we need to first give definitions for regular expressions. |
153 for regular languages, we need to first give definitions for regular expressions. |
142 |
154 |
143 \subsection{Regular Expressions and Their Meaning} |
155 \subsection{Regular Expressions and Their Meaning} |
144 The basic regular expressions are defined inductively |
156 The \emph{basic regular expressions} are defined inductively |
145 by the following grammar: |
157 by the following grammar: |
146 \[ r ::= \ZERO \mid \ONE |
158 \[ r ::= \ZERO \mid \ONE |
147 \mid c |
159 \mid c |
148 \mid r_1 \cdot r_2 |
160 \mid r_1 \cdot r_2 |
149 \mid r_1 + r_2 |
161 \mid r_1 + r_2 |
150 \mid r^* |
162 \mid r^* |
151 \] |
163 \] |
152 \noindent |
164 \noindent |
153 We call them basic because we might introduce |
165 We call them basic because we will introduce |
154 more constructs later such as negation |
166 additional constructors in later chapters such as negation |
155 and bounded repetitions. |
167 and bounded repetitions. |
156 We defined the regular expression containing |
168 We use $\ZERO$ for the regular expression that |
157 nothing as $\ZERO$, note that some authors |
169 matches no string, and $\ONE$ for the regular |
158 also use $\phi$ for that. |
170 expression that matches only the empty string\footnote{ |
159 Similarly, the regular expression denoting the |
171 some authors |
160 singleton set with only $[]$ is sometimes |
172 also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$ |
161 denoted by $\epsilon$, but we use $\ONE$ here. |
173 but we prefer our notation}. |
162 |
174 The sequence regular expression is written $r_1\cdot r_2$ |
163 The language or set of strings denoted |
175 and sometimes we omit the dot if it is clear which |
164 by regular expressions are defined as |
176 regular expression is meant; the alternative |
|
177 is written $r_1 + r_2$. |
|
178 The \emph{language} or meaning of |
|
179 a regular expression is defined recursively as |
|
180 a set of strings: |
165 %TODO: FILL in the other defs |
181 %TODO: FILL in the other defs |
166 \begin{center} |
182 \begin{center} |
167 \begin{tabular}{lcl} |
183 \begin{tabular}{lcl} |
168 $L \; (\ZERO)$ & $\dn$ & $\phi$\\ |
184 $L \; \ZERO$ & $\dn$ & $\phi$\\ |
169 $L \; (\ONE)$ & $\dn$ & $\{[]\}$\\ |
185 $L \; \ONE$ & $\dn$ & $\{[]\}$\\ |
170 $L \; (c)$ & $\dn$ & $\{[c]\}$\\ |
186 $L \; c$ & $\dn$ & $\{[c]\}$\\ |
171 $L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\ |
187 $L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\ |
172 $L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) @ L \; (r_2)$\\ |
188 $L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\ |
173 $L \; (r^*)$ & $\dn$ & $ (L(r))^*$ |
189 $L \; r^*$ & $\dn$ & $ (L\;r)*$ |
174 \end{tabular} |
190 \end{tabular} |
175 \end{center} |
191 \end{center} |
176 \noindent |
192 \noindent |
177 Which is also called the "language interpretation" of |
|
178 a regular expression. |
|
179 |
|
180 Now with semantic derivatives of a language and regular expressions and |
193 Now with semantic derivatives of a language and regular expressions and |
181 their language interpretations in place, we are ready to define derivatives on regexes. |
194 their language interpretations in place, we are ready to define derivatives on regexes. |
182 \subsection{Brzozowski Derivatives and a Regular Expression Matcher} |
195 \subsection{Brzozowski Derivatives and a Regular Expression Matcher} |
183 |
196 %Recall, the language derivative acts on a set of strings |
184 \ChristianComment{Hi this part I want to keep the ordering as is, so that it keeps the |
197 %and essentially chops off a particular character from |
185 readers engaged with a story how we got to the definition of $\backslash$, rather |
198 %all strings in that set, Brzozowski defined a derivative operation on regular expressions |
186 than first "overwhelming" them with the definition of $\nullable$.} |
199 %so that after derivative $L(r\backslash c)$ |
187 |
200 %will look as if it was obtained by doing a language derivative on $L(r)$: |
188 The language derivative acts on a string set and chops off a character from |
201 Recall that the semantic derivative acts on a set of |
189 all strings in that set, we want to define a derivative operation on regular expressions |
202 strings. Brzozowski noticed that this operation |
190 so that after derivative $L(r\backslash c)$ |
203 can be ``mirrored" on regular expressions which |
191 will look as if it was obtained by doing a language derivative on $L(r)$: |
204 he calls the derivative of a regular expression $r$ |
192 \begin{center} |
205 with respect to a character $c$, written |
|
206 $r \backslash c$. |
|
207 He defined this operation such that the following property holds: |
|
208 \begin{center} |
|
209 |
193 \[ |
210 \[ |
194 r\backslash c \dn ? |
211 L(r \backslash c) = \Der \; c \; L(r) |
195 \] |
212 \] |
196 so that |
213 \end{center} |
|
214 \noindent |
|
215 For example in the sequence case we have |
|
216 \begin{center} |
|
217 \begin{tabular}{lcl} |
|
218 $\Der \; c \; (A @ B)$ & $\dn$ & |
|
219 $ \textit{if} \;\, [] \in A \; |
|
220 \textit{then} \;\, ((\Der \; c \; A) @ B ) \cup |
|
221 \Der \; c\; B$\\ |
|
222 & & $\textit{else}\; (\Der \; c \; A) @ B$\\ |
|
223 \end{tabular} |
|
224 \end{center} |
|
225 \noindent |
|
226 This can be translated to |
|
227 regular expressions in the following |
|
228 manner: |
|
229 \begin{center} |
|
230 \begin{tabular}{lcl} |
|
231 $(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,([] \in L(r_1)) r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\ |
|
232 & & $\textit{else} \; (r_1 \backslash c) \cdot r_2$ |
|
233 \end{tabular} |
|
234 \end{center} |
|
235 |
|
236 \noindent |
|
237 And similarly, the Kleene star's semantic derivative |
|
238 can be expressed as |
197 \[ |
239 \[ |
198 L(r \backslash c) = \Der \; c \; L(r) ? |
240 \textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*) |
199 \] |
241 \] |
200 \end{center} |
242 which translates to |
201 So we mimic the equalities we have for $\Der$ on language concatenation |
|
202 |
|
203 \[ |
243 \[ |
204 \Der \; c \; (A @ B) = \textit{if} \; [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B\\ |
244 (r^*) \backslash c \dn (r \backslash c)\cdot r^*. |
205 \] |
245 \] |
206 to get the derivative for sequence regular expressions: |
246 In the above definition of $(r_1\cdot r_2) \backslash c$, |
207 \[ |
247 the $\textit{if}$ clause's |
208 (r_1 \cdot r_2 ) \backslash c = \textit{if}\,([] \in L(r_1)) r_1 \backslash c \cdot r_2 + r_2 \backslash c \textit{else} (r_1 \backslash c) \cdot r_2 |
248 boolean condition |
209 \] |
249 $[] \in L(r_1)$ needs to be |
210 |
250 somehow recursively computed. |
211 \noindent |
251 We call such a function that checks |
212 and language Kleene star: |
252 whether the empty string $[]$ is |
213 \[ |
253 in the language of a regular expression $\nullable$: |
214 \textit{Der} \;c \;A* = (\textit{Der}\; c A) @ (A*) |
254 \begin{center} |
215 \] |
255 \begin{tabular}{lcl} |
216 to get derivative of the Kleene star regular expression: |
256 $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ |
217 \[ |
257 $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ |
218 r^* \backslash c = (r \backslash c)\cdot r^* |
258 $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ |
219 \] |
259 $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ |
220 Note that although we can formalise the boolean predicate |
260 $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ |
221 $[] \in L(r_1)$ without problems, if we want a function that works |
261 $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ |
222 computationally, then we would have to define a function that tests |
262 \end{tabular} |
223 whether an empty string is in the language of a regular expression. |
263 \end{center} |
224 We call such a function $\nullable$: |
264 \noindent |
225 |
265 The $\ZERO$ regular expression |
226 |
266 does not contain any string and |
227 |
267 therefore is not \emph{nullable}. |
|
268 $\ONE$ is \emph{nullable} |
|
269 by definition. |
|
270 The character regular expression $c$ |
|
271 corresponds to the singleton set $\{c\}$, |
|
272 and therefore does not contain the empty string. |
|
273 The alternative regular expression is nullable |
|
274 if at least one of its children is nullable. |
|
275 The sequence regular expression |
|
276 would require both children to have the empty string |
|
277 to compose an empty string, and the Kleene star |
|
278 is always nullable because it naturally |
|
279 contains the empty string. |
|
280 |
|
281 The derivative function, written $r\backslash c$, |
|
282 defines how a regular expression evolves into |
|
283 a new one after all the string it contains is acted on: |
|
284 if it starts with $c$, then the character is chopped of, |
|
285 if not, that string is removed. |
228 \begin{center} |
286 \begin{center} |
229 \begin{tabular}{lcl} |
287 \begin{tabular}{lcl} |
230 $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ |
288 $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ |
231 $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ |
289 $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ |
232 $d \backslash c$ & $\dn$ & |
290 $d \backslash c$ & $\dn$ & |
237 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
295 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
238 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
296 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
239 \end{tabular} |
297 \end{tabular} |
240 \end{center} |
298 \end{center} |
241 \noindent |
299 \noindent |
242 The function derivative, written $r\backslash c$, |
300 The most involved cases are the sequence case |
243 defines how a regular expression evolves into |
301 and the star case. |
244 a new regular expression after all the string it contains |
|
245 is chopped off a certain head character $c$. |
|
246 The most involved cases are the sequence |
|
247 and star case. |
|
248 The sequence case says that if the first regular expression |
302 The sequence case says that if the first regular expression |
249 contains an empty string then the second component of the sequence |
303 contains an empty string, then the second component of the sequence |
250 might be chosen as the target regular expression to be chopped |
304 needs to be considered, as its derivative will contribute to the |
251 off its head character. |
305 result of this derivative. |
252 The star regular expression's derivative unwraps the iteration of |
306 The star regular expression $r^*$'s derivative |
253 regular expression and attaches the star regular expression |
307 unwraps one iteration of $r$, turns it into $r\backslash c$, |
254 to the sequence's second element to make sure a copy is retained |
308 and attaches the original $r^*$ |
255 for possible more iterations in later phases of lexing. |
309 after $r\backslash c$, so that |
256 |
310 we can further unfold it as many times as needed. |
257 |
311 We have the following correspondence between |
258 To test whether $[] \in L(r_1)$, we need the $\nullable$ function, |
312 derivatives on regular expressions and |
259 which tests whether the empty string $""$ |
313 derivatives on a set of strings: |
260 is in the language of $r$: |
314 \begin{lemma}\label{derDer} |
261 |
|
262 |
|
263 \begin{center} |
|
264 \begin{tabular}{lcl} |
|
265 $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ |
|
266 $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ |
|
267 $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ |
|
268 $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ |
|
269 $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ |
|
270 $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ |
|
271 \end{tabular} |
|
272 \end{center} |
|
273 \noindent |
|
274 The empty set does not contain any string and |
|
275 therefore not the empty string, the empty string |
|
276 regular expression contains the empty string |
|
277 by definition, the character regular expression |
|
278 is the singleton that contains character only, |
|
279 and therefore does not contain the empty string, |
|
280 the alternative regular expression (or "or" expression) |
|
281 might have one of its children regular expressions |
|
282 being nullable and any one of its children being nullable |
|
283 would suffice. The sequence regular expression |
|
284 would require both children to have the empty string |
|
285 to compose an empty string and the Kleene star |
|
286 operation naturally introduced the empty string. |
|
287 |
|
288 We have the following property where the derivative on regular |
|
289 expressions coincides with the derivative on a set of strings: |
|
290 |
|
291 \begin{lemma} |
|
292 $\textit{Der} \; c \; L(r) = L (r\backslash c)$ |
315 $\textit{Der} \; c \; L(r) = L (r\backslash c)$ |
293 \end{lemma} |
316 \end{lemma} |
294 |
317 |
295 \noindent |
318 \noindent |
296 The main property of the derivative operation |
319 The main property of the derivative operation |
297 that enables us to reason about the correctness of |
320 (that enables us to reason about the correctness of |
298 an algorithm using derivatives is |
321 derivative-based matching) |
|
322 is |
299 |
323 |
300 \begin{lemma}\label{derStepwise} |
324 \begin{lemma}\label{derStepwise} |
301 $c\!::\!s \in L(r)$ holds |
325 $c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$. |
302 if and only if $s \in L(r\backslash c)$. |
|
303 \end{lemma} |
326 \end{lemma} |
304 |
327 |
305 \noindent |
328 \noindent |
306 We can generalise the derivative operation shown above for single characters |
329 We can generalise the derivative operation shown above for single characters |
307 to strings as follows: |
330 to strings as follows: |
358 \addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data}; |
381 \addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data}; |
359 \end{axis} |
382 \end{axis} |
360 \end{tikzpicture} |
383 \end{tikzpicture} |
361 \caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher} |
384 \caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher} |
362 \end{figure} |
385 \end{figure} |
363 |
386 \end{center} |
364 \noindent |
387 \noindent |
365 For this we need to introduce certain |
388 If we implement the above algorithm naively, however, |
|
389 the algorithm can be excruciatingly slow, as shown in |
|
390 \ref{NaiveMatcher}. |
|
391 Note that both axes are in logarithmic scale. |
|
392 Around two dozens characters |
|
393 would already explode the matcher on regular expression |
|
394 $(a^*)^*b$. |
|
395 For this, we need to introduce certain |
366 rewrite rules for the intermediate results, |
396 rewrite rules for the intermediate results, |
367 such as $r + r \rightarrow r$, |
397 such as $r + r \rightarrow r$, |
368 and make sure those rules do not change the |
398 and make sure those rules do not change the |
369 language of the regular expression. |
399 language of the regular expression. |
370 We have a simplification function (that is as simple as possible |
400 One simpled-minded simplification function |
371 while having much power on making a regex simpler): |
401 that achieves these requirements is given below: |
372 \begin{verbatim} |
402 \begin{center} |
373 def simp(r: Rexp) : Rexp = r match { |
403 \begin{tabular}{lcl} |
374 case SEQ(r1, r2) => |
404 $\simp \; r_1 \cdot r_2 $ & $ \dn$ & |
375 (simp(r1), simp(r2)) match { |
405 $(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\ |
376 case (ZERO, _) => ZERO |
406 & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\ |
377 case (_, ZERO) => ZERO |
407 & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\ |
378 case (ONE, r2s) => r2s |
408 & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\ |
379 case (r1s, ONE) => r1s |
409 & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\ |
380 case (r1s, r2s) => SEQ(r1s, r2s) |
410 & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\ |
381 } |
411 $\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\ |
382 case ALTS(r1, r2) => { |
412 & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\ |
383 (simp(r1), simp(r2)) match { |
413 & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\ |
384 case (ZERO, r2s) => r2s |
414 & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\ |
385 case (r1s, ZERO) => r1s |
415 $\simp \; r$ & $\dn$ & $r$ |
386 case (r1s, r2s) => |
416 |
387 if(r1s == r2s) r1s else ALTS(r1s, r2s) |
417 \end{tabular} |
388 } |
418 \end{center} |
389 } |
419 If we repeatedly apply this simplification |
390 case r => r |
420 function during the matching algorithm, |
391 } |
421 we have a matcher with simplification: |
392 \end{verbatim} |
422 \begin{center} |
393 If we repeatedly incorporate these |
423 \begin{tabular}{lcl} |
394 rules during the matching algorithm, |
424 $\derssimp \; [] \; r$ & $\dn$ & $r$\\ |
395 we have a lexer with simplification: |
425 $\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\ |
396 \begin{verbatim} |
426 $\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$ |
397 def ders_simp(s: List[Char], r: Rexp) : Rexp = s match { |
427 \end{tabular} |
398 case Nil => simp(r) |
428 \end{center} |
399 case c :: cs => ders_simp(cs, simp(der(c, r))) |
429 \begin{figure} |
400 } |
|
401 |
|
402 def simp_matcher(s: String, r: Rexp) : Boolean = |
|
403 nullable(ders_simp(s.toList, r)) |
|
404 |
|
405 \end{verbatim} |
|
406 \noindent |
|
407 After putting in those rules, the example of \ref{NaiveMatcher} |
|
408 is now very tame in the length of inputs: |
|
409 |
|
410 |
|
411 \begin{tikzpicture} |
430 \begin{tikzpicture} |
412 \begin{axis}[ |
431 \begin{axis}[ |
413 xlabel={$n$}, |
432 xlabel={$n$}, |
414 ylabel={time in secs}, |
433 ylabel={time in secs}, |
415 ymode = log, |
434 ymode = log, |
416 xmode = log, |
435 xmode = log, |
|
436 grid = both, |
417 legend entries={Matcher With Simp}, |
437 legend entries={Matcher With Simp}, |
418 legend pos=north west, |
438 legend pos=north west, |
419 legend cell align=left] |
439 legend cell align=left] |
420 \addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data}; |
440 \addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data}; |
421 \end{axis} |
441 \end{axis} |
422 \end{tikzpicture} \label{fig:BetterMatcher} |
442 \end{tikzpicture} |
423 |
443 \caption{$(a^*)^*b$ |
424 |
444 against |
425 \noindent |
445 $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher} |
426 Note how the x-axis is in logarithmic scale. |
446 \end{figure} |
|
447 \noindent |
|
448 The running time of $\textit{ders}\_\textit{simp}$ |
|
449 on the same example of \ref{NaiveMatcher} |
|
450 is now very tame in terms of the length of inputs, |
|
451 as shown in \ref{BetterMatcher}. |
|
452 |
427 Building derivatives and then testing the existence |
453 Building derivatives and then testing the existence |
428 of empty string in the resulting regular expression's language, |
454 of empty string in the resulting regular expression's language, |
429 and add simplification rules when necessary. |
455 adding simplifications when necessary. |
430 So far, so good. But what if we want to |
456 So far, so good. But what if we want to |
431 do lexing instead of just getting a YES/NO answer? |
457 do lexing instead of just getting a YES/NO answer? |
432 \citeauthor{Sulzmann2014} first came up with a nice and |
458 Sulzmanna and Lu \cite{Sulzmann2014} first came up with a nice and |
433 elegant (arguably as beautiful as the definition of the original derivative) solution for this. |
459 elegant (arguably as beautiful as the definition of the original derivative) solution for this. |
434 |
460 |
435 \section{Values and the Lexing Algorithm by Sulzmann and Lu} |
461 \section{Values and the Lexing Algorithm by Sulzmann and Lu} |
436 Here we present the hybrid phases of a regular expression lexing |
462 In this section, we present a two-phase regular expression lexing |
437 algorithm using the function $\inj$, as given by Sulzmann and Lu. |
463 algorithm. |
438 They first defined the datatypes for storing the |
464 The first phase takes successive derivatives with |
439 lexing information called a \emph{value} or |
465 respect to the input string, |
440 sometimes also \emph{lexical value}. These values and regular |
466 and the second phase does the reverse, \emph{injecting} back |
|
467 characters, in the meantime constructing a lexing result. |
|
468 We will introduce the injection phase in detail slightly |
|
469 later, but as a preliminary we have to first define |
|
470 the datatype for lexing results, |
|
471 called \emph{value} or |
|
472 sometimes also \emph{lexical value}. Values and regular |
441 expressions correspond to each other as illustrated in the following |
473 expressions correspond to each other as illustrated in the following |
442 table: |
474 table: |
443 |
475 |
444 \begin{center} |
476 \begin{center} |
445 \begin{tabular}{c@{\hspace{20mm}}c} |
477 \begin{tabular}{c@{\hspace{20mm}}c} |
464 & $\mid$ & $\Right(v)$ \\ |
496 & $\mid$ & $\Right(v)$ \\ |
465 & $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\ |
497 & $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\ |
466 \end{tabular} |
498 \end{tabular} |
467 \end{tabular} |
499 \end{tabular} |
468 \end{center} |
500 \end{center} |
469 |
501 \noindent |
470 \noindent |
502 A value has an underlying string, which |
471 We have a formal binary relation for telling whether the structure |
503 can be calculated by the ``flatten" function $|\_|$: |
472 of a regular expression agrees with the value. |
504 \begin{center} |
|
505 \begin{tabular}{lcl} |
|
506 $|\Empty|$ & $\dn$ & $[]$\\ |
|
507 $|\Char \; c|$ & $ \dn$ & $ [c]$\\ |
|
508 $|\Seq(v_1, v_2)|$ & $ \dn$ & $ v_1| @ |v_2|$\\ |
|
509 $|\Left(v)|$ & $ \dn$ & $ |v|$\\ |
|
510 $|\Right(v)|$ & $ \dn$ & $ |v|$\\ |
|
511 $|\Stars([])|$ & $\dn$ & $[]$\\ |
|
512 $|\Stars(v::vs)|$ & $\dn$ & $ |v| @ |\Stars(vs)|$ |
|
513 \end{tabular} |
|
514 \end{center} |
|
515 Sulzmann and Lu used a binary predicate, written $\vdash v:r $, |
|
516 to indicate that a value $v$ could be generated from a lexing algorithm |
|
517 with input $r$. They call it the value inhabitation relation. |
473 \begin{mathpar} |
518 \begin{mathpar} |
474 \inferrule{}{\vdash \Char(c) : \mathbf{c}} \hspace{2em} |
519 \inferrule{\mbox{}}{\vdash \Char(c) : \mathbf{c}} \hspace{2em} |
475 \inferrule{}{\vdash \Empty : \ONE} \hspace{2em} |
520 |
476 \inferrule{\vdash v_1 : r_1 \\ \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)} |
521 \inferrule{\mbox{}}{\vdash \Empty : \ONE} \hspace{2em} |
|
522 |
|
523 \inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)} |
|
524 |
|
525 \inferrule{\vdash v_1 : r_1}{\vdash \Left(v_1):r_1+r_2} |
|
526 |
|
527 \inferrule{\vdash v_2 : r_2}{\vdash \Right(v_2):r_1 + r_2} |
|
528 |
|
529 \inferrule{\forall v \in vs. \vdash v:r \land |v| \neq []}{\vdash \Stars(vs):r^*} |
477 \end{mathpar} |
530 \end{mathpar} |
478 |
531 \noindent |
479 Building on top of Sulzmann and Lu's attempt to formalise the |
532 The condition $|v| \neq []$ in the premise of star's rule |
480 notion of POSIX lexing rules \parencite{Sulzmann2014}, |
533 is to make sure that for a given pair of regular |
481 Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled |
534 expression $r$ and string $s$, the number of values |
482 POSIX matching as a ternary relation recursively defined in a |
535 satisfying $|v| = s$ and $\vdash v:r$ is finite. |
483 natural deduction style. |
536 Given the same string and regular expression, there can be |
484 The formal definition of a $\POSIX$ value $v$ for a regular expression |
537 multiple values for it. For example, both |
|
538 $\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and |
|
539 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold |
|
540 and the values both flatten to $abc$. |
|
541 Lexers therefore have to disambiguate and choose only |
|
542 one of the values to output. $\POSIX$ is one of the |
|
543 disambiguation strategies that is widely adopted. |
|
544 |
|
545 Ausaf and Urban\parencite{AusafDyckhoffUrban2016} |
|
546 formalised the property |
|
547 as a ternary relation. |
|
548 The $\POSIX$ value $v$ for a regular expression |
485 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified |
549 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified |
486 in the following set of rules: |
550 in the following set of rules\footnote{The names of the rules are used |
487 \ChristianComment{Will complete later} |
551 as they were originally given in \cite{AusafDyckhoffUrban2016}}: |
488 \newcommand*{\inference}[3][t]{% |
552 \noindent |
489 \begingroup |
553 \begin{figure} |
490 \def\and{\\}% |
554 \begin{mathpar} |
491 \begin{tabular}[#1]{@{\enspace}c@{\enspace}} |
555 \inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty} |
492 #2 \\ |
556 |
493 \hline |
557 \inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c} |
494 #3 |
558 |
495 \end{tabular}% |
559 \inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1} |
496 \endgroup |
560 |
497 } |
561 \inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2} |
498 \begin{center} |
562 |
499 \inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$ }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ } |
563 \inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\ |
500 \end{center} |
564 \nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land |
501 \noindent |
565 s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow |
502 The above $\POSIX$ rules could be explained intuitionally as |
566 \Seq \; v_1 \; v_2} |
|
567 |
|
568 \inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])} |
|
569 |
|
570 \inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\ |
|
571 |v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land |
|
572 s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \; |
|
573 (v::vs)} |
|
574 \end{mathpar} |
|
575 \caption{POSIX Lexing Rules} |
|
576 \end{figure} |
|
577 \noindent |
|
578 The above $\POSIX$ rules follows the intuition described below: |
503 \begin{itemize} |
579 \begin{itemize} |
504 \item |
580 \item (Left Priority)\\ |
505 match the leftmost regular expression when multiple options of matching |
581 Match the leftmost regular expression when multiple options of matching |
506 are available |
582 are available. |
507 \item |
583 \item (Maximum munch)\\ |
508 always match a subpart as much as possible before proceeding |
584 Always match a subpart as much as possible before proceeding |
509 to the next token. |
585 to the next token. |
510 \end{itemize} |
586 \end{itemize} |
511 |
587 \noindent |
512 The reason why we are interested in $\POSIX$ values is that they can |
588 These disambiguation strategies can be |
513 be practically used in the lexing phase of a compiler front end. |
589 quite practical. |
514 For instance, when lexing a code snippet |
590 For instance, when lexing a code snippet |
515 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized |
591 \[ |
516 as an identifier rather than a keyword. |
592 \textit{iffoo} = 3 |
517 \ChristianComment{Do I also introduce lexical values $LV$ here?} |
593 \] |
518 We know that $\POSIX$ values are also part of the normal values: |
594 using the regular expression (with |
|
595 keyword and identifier having their |
|
596 usualy definitions on any formal |
|
597 language textbook, for instance |
|
598 keyword is a nonempty string starting with letters |
|
599 followed by alphanumeric characters or underscores): |
|
600 \[ |
|
601 \textit{keyword} + \textit{identifier}, |
|
602 \] |
|
603 we want $\textit{iffoo}$ to be recognized |
|
604 as an identifier rather than a keyword (if) |
|
605 followed by |
|
606 an identifier (foo). |
|
607 POSIX lexing achieves this. |
|
608 |
|
609 We know that a $\POSIX$ value is also a normal underlying |
|
610 value: |
519 \begin{lemma} |
611 \begin{lemma} |
520 $(r, s) \rightarrow v \implies \vdash v: r$ |
612 $(r, s) \rightarrow v \implies \vdash v: r$ |
521 \end{lemma} |
613 \end{lemma} |
522 \noindent |
614 \noindent |
523 The good property about a $\POSIX$ value is that |
615 The good property about a $\POSIX$ value is that |
542 $(s_1, r_1) \rightarrow v_1'$ and $(s_2, r_2) \rightarrow v_2'$, |
637 $(s_1, r_1) \rightarrow v_1'$ and $(s_2, r_2) \rightarrow v_2'$, |
543 Then by induction hypothesis $v_1' = v_1$ and $v_2'= v_2$, |
638 Then by induction hypothesis $v_1' = v_1$ and $v_2'= v_2$, |
544 which means this "different" $\POSIX$ value $\Seq(v_1', v_2')$ |
639 which means this "different" $\POSIX$ value $\Seq(v_1', v_2')$ |
545 is the same as $\Seq(v_1, v_2)$. |
640 is the same as $\Seq(v_1, v_2)$. |
546 \end{proof} |
641 \end{proof} |
547 Now we know what a $\POSIX$ value is, the problem is how do we achieve |
642 Now we know what a $\POSIX$ value is; the problem is how do we achieve |
548 such a value in a lexing algorithm, using derivatives? |
643 such a value in a lexing algorithm, using derivatives? |
549 |
644 |
550 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm} |
645 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm} |
551 |
646 |
552 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
647 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
553 algorithm by a second phase (the first phase being building successive |
648 algorithm by a second phase (the first phase being building successive |
554 derivatives---see \ref{graph:successive_ders}). In this second phase, a POSIX value |
649 derivatives---see \ref{graph:successive_ders}). This second phase generates a POSIX value if the regular expression matches the string. |
555 is generated if the regular expression matches the string. |
|
556 Two functions are involved: $\inj$ and $\mkeps$. |
650 Two functions are involved: $\inj$ and $\mkeps$. |
557 The function $\mkeps$ constructs a value from the last |
651 The function $\mkeps$ constructs a value from the last |
558 one of all the successive derivatives: |
652 one of all the successive derivatives: |
559 \begin{ceqn} |
653 \begin{ceqn} |
560 \begin{equation}\label{graph:mkeps} |
654 \begin{equation}\label{graph:mkeps} |