% Chapter Template

\chapter{Regular Expressions and POSIX Lexing} % Main chapter title

\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
%and notations we
%use for describing the lexing algorithm by Sulzmann and Lu,
%and then give the algorithm and its variant, and discuss
%why more aggressive simplifications are needed.

\section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions}
We have a primitive datatype \textit{char}, denoting characters:
\begin{center}
\begin{tabular}{lcl}
$\textit{char}$ & $\dn$ & $a \mid b \mid c \mid \ldots \mid z$\\
\end{tabular}
\end{center}
Characters can be formed into strings via lists:
\begin{center}
\begin{tabular}{lcl}
$\textit{string}$ & $\dn$ & $[] \mid c :: cs$\\
& & $(c\; \text{has type char})$
\end{tabular}
\end{center}
Strings can be concatenated to form longer strings:
\begin{center}
\begin{tabular}{lcl}
$[] @ s_2$ & $\dn$ & $s_2$\\
$(c :: s_1) @ s_2$ & $\dn$ & $c :: (s_1 @ s_2)$
\end{tabular}
\end{center}

This operation lifts to sets of strings: two languages can be
concatenated element-wise:
\begin{center}
\begin{tabular}{lcl}
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
\end{tabular}
\end{center}
We call the above operation ``language concatenation''.
The power of a language is defined recursively, using the
concatenation operator $@$:
\begin{center}
\begin{tabular}{lcl}
$A^0 $ & $\dn$ & $\{ [] \}$\\
$A^{n+1}$ & $\dn$ & $A^n @ A$
\end{tabular}
\end{center}
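To make these definitions concrete, here is a small sketch in Python, representing strings as Python strings and languages as sets (the names \texttt{concat} and \texttt{power} are our own, for illustration only):

```python
def concat(A, B):
    # language concatenation: A @ B = { s_A @ s_B | s_A in A, s_B in B }
    return {sa + sb for sa in A for sb in B}

def power(A, n):
    # A^0 = { [] } (just the empty string); A^(n+1) = A^n @ A
    result = {""}
    for _ in range(n):
        result = concat(result, A)
    return result

A = {"a", "ab"}
print(power(A, 2))  # {'aa', 'aab', 'aba', 'abab'}
```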
The union of all natural-number powers of a language
is denoted by the Kleene star operator:
\begin{center}
\begin{tabular}{lcl}
$\bigcup_{i \geq 0} A^i$ & $\denote$ & $A^*$\\
\end{tabular}
\end{center}

In Isabelle, however, we cannot easily express the
$\bigcup_{i \geq 0}$ operator, so we instead define the Kleene star
as an inductive set:
\begin{center}
\begin{tabular}{lcl}
$[] \in A^*$ & &\\
$s_1 \in A \land \; s_2 \in A^* $ & $\implies$ & $s_1 @ s_2 \in A^*$\\
\end{tabular}
\end{center}
\subsection{Language Derivatives}
We also define an operation that chops off a leading character from
all strings in a language:
\begin{center}
\begin{tabular}{lcl}
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
\end{tabular}
\end{center}

This can be generalised to chopping off an entire string from all strings in $A$:
\begin{center}
\begin{tabular}{lcl}
$\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w @ s \in A \}$\\
\end{tabular}
\end{center}

\noindent
which is essentially the left quotient $A \backslash L'$ of $A$ against
the singleton language $L' = \{w\}$
in formal language theory.
For this dissertation the $\textit{Ders}$ notation suffices; there is
no need for a more general derivative definition.
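As a small illustration, both operations are easy to compute for finite languages; the following Python sketch (function names our own) encodes them directly:

```python
def Der(c, A):
    # Der c A = { s | c::s in A }
    return {s[1:] for s in A if s[:1] == c}

def Ders(w, A):
    # Ders w A = { s | w @ s in A }
    return {s[len(w):] for s in A if s.startswith(w)}

A = {"foo", "far", "bar"}
print(Der("f", A))    # {'oo', 'ar'}
print(Ders("fo", A))  # {'o'}
```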

With concatenation, Kleene star, and the $\textit{Der}$ operation on languages
in place, we can state some properties describing how language derivatives
interact with these constructions.
\begin{lemma}
$\Der \; c \; (A @ B) = \textit{if} \; [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B$
\end{lemma}
\noindent
This lemma states that if $A$ contains the empty string, $\Der$ can ``pierce'' through it
and get to $B$.

The derivative of the language $A^*$ can be described using the derivative
of $A$:
\begin{lemma}
$\textit{Der} \;c \;A^* = (\textit{Der}\; c\; A) @ A^*$
\end{lemma}
\begin{proof}
\begin{itemize}
\item[$\subseteq$]
The set
\[ \{s \mid c :: s \in A^*\} \]
is included in the set
\[ \{s_1 @ s_2 \mid s_1 \, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A^* \} \]
because whenever a string in the language of the Kleene star $A^*$ starts with the
character $c$, then $c$ together with some prefix of the remaining string
forms the first iteration, and the rest of the string is still in $A^*$.
\item[$\supseteq$]
Note that
\[ \Der \; c \; A^* = \Der \; c \; (\{ [] \} \cup (A @ A^*) ) \]
and
\[ \Der \; c \; (\{ [] \} \cup (A @ A^*) ) = \Der\; c \; (A @ A^*) \]
where the right-hand side of the above equation can be rewritten
as $(\Der \; c\; A) @ A^* \cup A'$, with $A'$ a possibly empty set.
\end{itemize}
\end{proof}
Before we define the $\textit{Der}$ and $\textit{Ders}$ counterparts
on regular expressions, we first need to give a definition of regular expressions.

\subsection{Regular Expressions and Their Language Interpretation}
Suppose we have an alphabet $\Sigma$; the set of all strings whose characters
are drawn from $\Sigma$
is written $\Sigma^*$.

We use patterns to define sets of strings concisely, and regular expressions
are one such pattern system.
The basic regular expressions are defined inductively
by the following grammar:
\[ r ::= \ZERO \mid \ONE
\mid c
\mid r_1 \cdot r_2
\mid r_1 + r_2
\mid r^*
\]

The language, or set of strings, denoted
by a regular expression is defined as
\begin{center}
\begin{tabular}{lcl}
$L \; (\ZERO)$ & $\dn$ & $\emptyset$\\
$L \; (\ONE)$ & $\dn$ & $\{ [] \}$\\
$L \; (c)$ & $\dn$ & $\{ [c] \}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) \,@\, L \; (r_2)$\\
$L \; (r^*)$ & $\dn$ & $ (L \; (r))^*$\\
\end{tabular}
\end{center}
This function is also called the ``language interpretation'' of
a regular expression.

Now with semantic derivatives of languages and the language interpretation of
regular expressions in place, we are ready to define derivatives on regular expressions.
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
The language derivative acts on a set of strings and chops off a character from
all strings in that set; we want to define a corresponding derivative operation on
regular expressions so that the language $L(r\backslash c)$
looks as if it was obtained by taking the language derivative of $L(r)$:
\[
L(r \backslash c) = \Der \; c \; L(r)
\]
So we mimic the equality we have for $\Der$ on language concatenation

\[
\Der \; c \; (A @ B) = \textit{if} \; [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B
\]
to get the derivative for sequence regular expressions:
\[
(r_1 \cdot r_2 ) \backslash c = \textit{if}\;[] \in L(r_1)\; \textit{then}\; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c \;\textit{else}\; (r_1 \backslash c) \cdot r_2
\]

\noindent
and the equality for the Kleene star
\[
\textit{Der} \;c \;A^* = (\textit{Der}\; c\; A) @ A^*
\]
to get the derivative of the Kleene star regular expression:
\[
r^* \backslash c = (r \backslash c)\cdot r^*
\]
Note that although we can formalise the boolean predicate
$[] \in L(r_1)$ without problems, if we want a function that works
computationally, then we have to define a function that tests
whether the empty string is in the language of a regular expression.
We call this function $\nullable$ and define it further below.
With it, the complete definition of the derivative is:

\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
$d \backslash c$ & $\dn$ &
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
\noindent
The derivative function, written $r\backslash c$,
transforms a regular expression into one whose language contains
exactly the strings of $r$ with the head character $c$
chopped off.
The most involved cases are the sequence
and star cases.
The sequence case says that if the first regular expression
contains the empty string, then the second component of the sequence
may also be chosen as the regular expression whose
head character is chopped off.
The derivative of a star regular expression unwraps one iteration of
the regular expression and attaches the star regular expression
to the second element of the resulting sequence, making sure a copy is retained
for possibly more iterations in later phases of lexing.

The $\nullable$ function tests whether the empty string $[]$
is in the language of $r$:

\begin{center}
\begin{tabular}{lcl}
$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
\end{tabular}
\end{center}
\noindent
The empty set does not contain any string, and
therefore not the empty string. The empty string
regular expression contains the empty string
by definition. The character regular expression
is the singleton containing a single one-character string
and therefore does not contain the empty string.
An alternative regular expression (or ``or'' expression)
is nullable if either of its children is nullable.
A sequence regular expression
requires both children to contain the empty string
in order to compose the empty string, and the Kleene star
naturally contains the empty string.

We have the following property stating that the derivative on regular
expressions coincides with the derivative on sets of strings:

\begin{lemma}
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
\end{lemma}

\noindent
The main property of the derivative operation
that enables us to reason about the correctness of
an algorithm using derivatives is

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
$r \backslash_s [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
When there is no ambiguity we will use $\backslash$ to denote
string derivatives for brevity,
and we can then define Brzozowski's regular-expression matching algorithm as:

\begin{definition}
$match\;s\;r \;\dn\; nullable(r\backslash s)$
\end{definition}

\noindent
Assuming the string is given as a sequence of characters, say $c_0 c_1 \ldots c_{n-1}$,
this algorithm can be presented graphically as follows:

\begin{equation}\label{graph:successive_ders}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
relatively easily shown that this matcher is correct (that is, given
a string $s = c_0...c_{n-1}$ and a regular expression $r_0$, it generates YES if and only if $s \in L(r_0)$).
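As a concrete (if naive) illustration, here is a sketch of this matcher in Python; the encoding of regular expressions as tagged tuples is our own, but the clauses follow the definitions of $\backslash$ and $\nullable$ given above:

```python
# Regular expressions as tagged tuples:
# ("zero",), ("one",), ("chr", c), ("alt", r1, r2), ("seq", r1, r2), ("star", r)

def nullable(r):
    tag = r[0]
    if tag == "zero": return False
    if tag == "one":  return True
    if tag == "chr":  return False
    if tag == "alt":  return nullable(r[1]) or nullable(r[2])
    if tag == "seq":  return nullable(r[1]) and nullable(r[2])
    if tag == "star": return True

def der(c, r):
    tag = r[0]
    if tag in ("zero", "one"):
        return ("zero",)
    if tag == "chr":
        return ("one",) if r[1] == c else ("zero",)
    if tag == "alt":
        return ("alt", der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        if nullable(r[1]):
            return ("alt", ("seq", der(c, r[1]), r[2]), der(c, r[2]))
        return ("seq", der(c, r[1]), r[2])
    if tag == "star":
        return ("seq", der(c, r[1]), r)

def matches(s, r):
    # build successive derivatives, then test nullability
    for c in s:
        r = der(c, r)
    return nullable(r)

# (a + aa)* matches any string of a's
r = ("star", ("alt", ("chr", "a"), ("seq", ("chr", "a"), ("chr", "a"))))
print(matches("aaaa", r), matches("aab", r))  # True False
```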

This definition is beautifully simple.

If we implement the above algorithm naively, however,
it can be excruciatingly slow.

\begin{figure}
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=10000,
    ytick={0,1000,...,10000},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={JavaScript},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
\caption{EightThousandNodes} \label{fig:EightThousandNodes}
\end{figure}

For example, when starting with the regular
expression $(a + aa)^*$ and building a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree); see Figure~\ref{fig:EightThousandNodes}.
The reason why $(a + aa) ^*$ explodes so drastically is that without
pruning, the algorithm keeps a record of all possible ways of matching:
\begin{center}
$(a + aa) ^* \backslash [aa] = (\ZERO + (\ZERO \cdot a + \ONE))\cdot(a + aa)^* \;+\; (\ONE + \ONE \cdot a) \cdot (a + aa)^*$
\end{center}

\noindent
The branches of the above alternatives correspond to the different ways the
two characters could have been matched: $aa$ as a single iteration, $a$ and $a$
as two separate iterations, and the second $a$ as the (still incomplete)
beginning of an $aa$ iteration.
These different ways of matching grow exponentially with the string length,
and without simplifications that throw away some of these very similar matchings,
it is no surprise that the expressions grow so quickly.
Operations like
$\backslash$ and $\nullable$ need to traverse such trees and
consequently the bigger the size of the derivative, the slower the
algorithm.

Brzozowski was quick to notice that this process generates a lot of useless
$\ONE$s and $\ZERO$s and is therefore not optimal.
He introduced some ``similarity rules'', such
as $P+(Q+R) = (P+Q)+R$, to merge syntactically
different but language-equivalent sub-expressions and further decrease the size
of the intermediate regular expressions.

More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step and compose
these two operations into an atomic one: $r \backslash_{simp}\,c \dn
\textit{simp}(r \backslash c)$. Then we can build
a matcher that works with smaller regular expressions.
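A minimal sketch of such a simplification step, using the same tuple encoding as before (this is our own illustration; it removes only useless $\ZERO$s and $\ONE$s bottom-up and does not yet delete duplicates):

```python
def simp(r):
    # remove useless ZEROs and ONEs bottom-up (a minimal rule set)
    tag = r[0]
    if tag == "alt":
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ("zero",): return r2            # 0 + r  =  r
        if r2 == ("zero",): return r1            # r + 0  =  r
        return ("alt", r1, r2)
    if tag == "seq":
        r1, r2 = simp(r[1]), simp(r[2])
        if ("zero",) in (r1, r2): return ("zero",)   # 0 . r = r . 0 = 0
        if r1 == ("one",): return r2             # 1 . r  =  r
        if r2 == ("one",): return r1             # r . 1  =  r
        return ("seq", r1, r2)
    return r

r = ("alt", ("seq", ("zero",), ("chr", "a")), ("seq", ("one",), ("chr", "b")))
print(simp(r))  # ('chr', 'b')
```

Composing `simp` with the derivative after every step keeps the intermediate expressions from exploding in many common cases.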

If we want the size of derivatives in the algorithm to
stay even lower, we need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. These more aggressive simplification rules are needed for
a very tight size bound, possibly as low
as that of the \emph{partial derivatives}\parencite{Antimirov1995}.

Building derivatives and then simplifying them: so far, so good. But what if we want to
do lexing instead of just getting a YES/NO answer?
This requires us to go back, for a moment, to the world
without simplification.
Sulzmann and Lu~\cite{Sulzmann2014} came up with a nice and
elegant solution for this (arguably as beautiful as the original
derivatives definition).

\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}

They first defined the datatype for storing the
lexing information, called a \emph{value} or
sometimes also a \emph{lexical value}. Values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
$r$ & $::=$ & $\ZERO$\\
& $\mid$ & $\ONE$ \\
& $\mid$ & $c$ \\
& $\mid$ & $r_1 \cdot r_2$\\
& $\mid$ & $r_1 + r_2$ \\
\\
& $\mid$ & $r^*$ \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
$v$ & $::=$ & \\
& & $\Empty$ \\
& $\mid$ & $\Char(c)$ \\
& $\mid$ & $\Seq\,v_1\, v_2$\\
& $\mid$ & $\Left(v)$ \\
& $\mid$ & $\Right(v)$ \\
& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}

\noindent
Building on top of Sulzmann and Lu's attempt to formalise the
notion of POSIX lexing rules \parencite{Sulzmann2014},
Ausaf and Urban~\parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
natural deduction style.
With the formally specified rules for what a POSIX match is,
they proved in Isabelle/HOL that the algorithm gives correct results.

But having a correct result is still not enough;
we also want at least some degree of \emph{efficiency}.

One regular expression can have multiple lexical values. For example,
the regular expression $(a+b)^*$ has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
$\ldots$
Even for a regular expression matching one particular string, there can
be more than one value corresponding to it.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
If we do not allow any empty iterations in its lexical values,
there are $n - 1$ ``splitting points'' in $s$, each of which we can choose to
split at or not, so that the sub-strings
segmented by the chosen splitting points form the different iterations:
\begin{center}
\begin{tabular}{lcr}
$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\, v_{iteration \,aaa}]$\\
$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\, v_{iteration \, aa}]$\\
$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, aa}, \, v_{iteration \, a}]$\\
& $\textit{etc}.$ &
\end{tabular}
\end{center}

And for each iteration, there are still multiple ways to split
between the two $a^*$s.
It is not surprising that there are exponentially many distinct lexical values
for the regular expression and string pair $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.

A lexer that keeps all possible values would naturally
have an exponential runtime on ambiguous regular expressions;
a lexing algorithm somehow has to decide which lexical value to keep and
output.
In practice, we are usually
interested in POSIX values, which intuitively always
\begin{itemize}
\item
match the leftmost regular expression when multiple options for matching
are available, and
\item
match a sub-part as much as possible before proceeding
to the next token.
\end{itemize}
The formal definition of a $\POSIX$ value $v$ for a regular expression
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, is given
by a set of inference rules, of which we show here only the rule for sequences:
\newcommand*{\inference}[3][t]{%
\begingroup
\def\and{\\}%
\begin{tabular}[#1]{@{\enspace}c@{\enspace}}
#2 \\
\hline
#3
\end{tabular}%
\endgroup
}
\begin{center}
\inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$ }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
\end{center}

The reason why we are interested in $\POSIX$ values is that they can
be put to practical use in the lexing phase of a compiler front end.
For instance, when lexing a code snippet
$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
as an identifier rather than a keyword.

For example, the above pair $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
The algorithm we want should output such a POSIX match
encoded as a value.

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value
is generated if the regular expression matches the string.
How can we construct a value out of regular expressions and character
sequences only?
Two functions are involved: $\inj$ and $\mkeps$.
The function $\mkeps$ constructs a value from the last
of the successive derivatives:
\begin{ceqn}
\begin{equation}\label{graph:mkeps}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
& & & v_n
\end{tikzcd}
\end{equation}
\end{ceqn}

It tells us how the empty string can be matched by a (nullable)
regular expression, in a $\POSIX$ way:

\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
$\mkeps(r_{1}+r_{2})$ & $\dn$
& \textit{if} $\nullable(r_{1})$\\
& & \textit{then} $\Left(\mkeps(r_{1}))$\\
& & \textit{else} $\Right(\mkeps(r_{2}))$\\
$\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
$\mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
\end{tabular}
\end{center}

\noindent
We favour the left child to match the empty string if there is a choice.
When a star has to match the empty string,
we give the $\Stars$ constructor an empty list, meaning
no iterations are taken.

After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regular expression $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. The POSIX property is maintained throughout the process.
For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off; the second
is the character ${c_{i-1}}$ we want to inject; and the
third argument is the value ${v_i}$ into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value.
\begin{ceqn}
\begin{equation}\label{graph:inj}
\begin{tikzcd}
r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_1 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}

\noindent
The
definition of $\textit{inj}$ is as follows:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
$\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$ & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$ & $\dn$ & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}

\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.
Each clause does one thing: it identifies the ``hole'' in a
value into which the character is injected back.
For instance, in the last clause, for injecting back into a value
that corresponds to a star, we know the value must be a sequence value:
its first component corresponds to the child regular expression of the star
with the first character chopped off, that is, the iteration of the star
that has just been unfolded. This is followed by the already
matched star iterations collected before. So we inject the character
back into the first component and form a new value with this latest iteration
added to the front of the previous list of iterations, all under a $\Stars$
constructor at the top level.

Putting the functions $\inj$, $\mkeps$ and $\backslash$ together,
we obtain a lexer with the following recursive definition:
\begin{center}
\begin{tabular}{lcl}
$\lexer \; r \; [] $ & $=$ & $\mkeps \; r$\\
$\lexer \; r \; (c::s)$ & $=$ & $\inj \; r \; c \; (\lexer \; (r\backslash c) \; s)$
\end{tabular}
\end{center}
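To illustrate how $\mkeps$, $\inj$ and the derivative fit together, here is a self-contained sketch in Python (the tuple encoding of regular expressions and values is our own; like the simplified definition above, this lexer assumes the final derivative is nullable and does not report failure):

```python
# Regexes: ("zero",), ("one",), ("chr", c), ("alt", r1, r2), ("seq", r1, r2), ("star", r)
# Values:  ("Empty",), ("Char", c), ("Seq", v1, v2), ("Left", v), ("Right", v),
#          ("Stars", [v1, ...])

def nullable(r):
    tag = r[0]
    if tag in ("zero", "chr"): return False
    if tag in ("one", "star"): return True
    if tag == "alt": return nullable(r[1]) or nullable(r[2])
    if tag == "seq": return nullable(r[1]) and nullable(r[2])

def der(c, r):
    tag = r[0]
    if tag in ("zero", "one"): return ("zero",)
    if tag == "chr": return ("one",) if r[1] == c else ("zero",)
    if tag == "alt": return ("alt", der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        if nullable(r[1]):
            return ("alt", ("seq", der(c, r[1]), r[2]), der(c, r[2]))
        return ("seq", der(c, r[1]), r[2])
    if tag == "star": return ("seq", der(c, r[1]), r)

def mkeps(r):
    # POSIX value for the empty string (only defined on nullable regexes)
    tag = r[0]
    if tag == "one": return ("Empty",)
    if tag == "alt":
        return ("Left", mkeps(r[1])) if nullable(r[1]) else ("Right", mkeps(r[2]))
    if tag == "seq": return ("Seq", mkeps(r[1]), mkeps(r[2]))
    if tag == "star": return ("Stars", [])

def inj(r, c, v):
    # reverse one derivative step by injecting c back into v
    if r[0] == "chr" and v[0] == "Empty": return ("Char", c)
    if r[0] == "alt" and v[0] == "Left":  return ("Left", inj(r[1], c, v[1]))
    if r[0] == "alt" and v[0] == "Right": return ("Right", inj(r[2], c, v[1]))
    if r[0] == "seq" and v[0] == "Seq":   return ("Seq", inj(r[1], c, v[1]), v[2])
    if r[0] == "seq" and v[0] == "Left":  # v = Left(Seq(v1, v2))
        return ("Seq", inj(r[1], c, v[1][1]), v[1][2])
    if r[0] == "seq" and v[0] == "Right": return ("Seq", mkeps(r[1]), inj(r[2], c, v[1]))
    if r[0] == "star":                    # v = Seq(v1, Stars vs)
        return ("Stars", [inj(r[1], c, v[1])] + v[2][1])

def lexer(r, s):
    if s == "": return mkeps(r)
    return inj(r, s[0], lexer(der(s[0], r), s[1:]))

print(lexer(("star", ("chr", "a")), "aa"))  # ('Stars', [('Char', 'a'), ('Char', 'a')])
```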

Pictorially, the algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}

\noindent
For convenience, we shall employ the following notation: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase, from
left to right, we build the derivatives $r_1$, $r_2$, \ldots according
to the characters $c_0$, $c_1$ until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r$, and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm, from right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$.

We have mentioned before that derivatives without simplification
can get clumsy, and this is true for values as well: by definition they
reflect the size of the regular expression.

One can introduce simplification of regular expressions and values, but one has to
be careful not to break correctness, as the injection
function relies heavily on the structures of the regular expressions and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications
during the injection phase.
One can then prove that the POSIX value for how a
regular expression matches a string is not affected---although this is much harder
to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.

683 |
|
684 |
|
685 %Brzozowski, after giving the derivatives and simplification, |
|
686 %did not explore lexing with simplification, or he may well be |
|
687 %stuck on an efficient simplification with proof. |
|
688 %He went on to examine the use of derivatives together with |
|
689 %automaton, and did not try lexing using products. |
|
690 |
|
691 We want to get rid of the complex and fragile rectification of values. |
|
692 Can we not create those intermediate values $v_1,\ldots v_n$, |
|
693 and get the lexing information that should be already there while |
|
694 doing derivatives in one pass, without a second injection phase? |
|
695 In the meantime, can we make sure that simplifications |
|
696 are easily handled without breaking the correctness of the algorithm? |
|
697 |
|
698 Sulzmann and Lu solved this problem by |
|
699 introducing additional information to the |
|
700 regular expressions called \emph{bitcodes}. |
|
701 |
|
702 |
|
703 |
|
704 |
|
705 |
|
706 |
|
707 |