% Chapter Template

\chapter{Regular Expressions and Sulzmann and Lu's Lexing Algorithm Without Bitcodes} % Main chapter title

\label{Chapter2} % In chapter 2 \ref{Chapter2} we will introduce the concepts
|
6 %and notations we |
|
7 %use for describing the lexing algorithm by Sulzmann and Lu, |
|
8 %and then give the algorithm and its variant, and discuss |
|
9 %why more aggressive simplifications are needed. |
|
10 |
|
11 |
|
12 \section{Preliminaries} |
|
13 |
|
Suppose we have an alphabet $\Sigma$; the set of all strings whose characters
are drawn from $\Sigma$
is written $\Sigma^*$.
|
17 |
|
We use patterns to define sets of strings concisely. Regular expressions
are one such pattern system:
|
20 The basic regular expressions are defined inductively |
|
21 by the following grammar: |
|
22 \[ r ::= \ZERO \mid \ONE |
|
23 \mid c |
|
24 \mid r_1 \cdot r_2 |
|
25 \mid r_1 + r_2 |
|
26 \mid r^* |
|
27 \] |
|
28 |
|
The language, that is the set of strings, denoted by a regular expression is defined as
\begin{center}
\begin{tabular}{lcl}
$L \; (\ZERO)$ & $\dn$ & $\emptyset$\\
$L \; (\ONE)$ & $\dn$ & $\{[]\}$\\
$L \; (c)$ & $\dn$ & $\{[c]\}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $\{s_1 @ s_2 \mid s_1 \in L \; (r_1) \wedge s_2 \in L \; (r_2)\}$\\
$L \; (r^*)$ & $\dn$ & $(L \; (r))^*$
\end{tabular}
\end{center}
This mapping is often called the ``language interpretation'' of regular expressions.
|
38 |
|
39 |
|
40 |
|
The Brzozowski derivative w.r.t.~a character $c$ is an operation on regular
expressions: it transforms a regular expression into a new one whose language
contains exactly the original strings with the leading character $c$ removed.

Formally, we first define such a transformation on an arbitrary set of
strings $A$, which we call the \emph{semantic derivative}:
\begin{center}
$\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$
\end{center}
If the set $A$ happens to have some structure, for example if it is regular
(that is, definable by a regular expression $r$), then the semantic derivative
$\Der \; c\; A$ is again regular and can be computed syntactically on $r$ itself.

|
55 % Derivatives of a |
|
56 %regular expression, written $r \backslash c$, give a simple solution |
|
57 %to the problem of matching a string $s$ with a regular |
|
58 %expression $r$: if the derivative of $r$ w.r.t.\ (in |
|
59 %succession) all the characters of the string matches the empty string, |
|
60 %then $r$ matches $s$ (and {\em vice versa}). |
|
61 |
|
The derivative of a regular expression, written
$r \backslash c$, is a function that takes a regular expression
$r$ and a character $c$ and returns another regular expression $r'$,
computed by the following recursive function:
|
66 |
|
67 \begin{center} |
|
68 \begin{tabular}{lcl} |
|
69 $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ |
|
70 $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ |
|
71 $d \backslash c$ & $\dn$ & |
|
72 $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ |
|
73 $(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
|
74 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\ |
|
75 & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ |
|
76 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
|
77 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
|
78 \end{tabular} |
|
79 \end{center} |
|
\noindent
The $\nullable$ function tests whether the empty string
is in the language of $r$:
|
85 |
|
86 |
|
87 \begin{center} |
|
88 \begin{tabular}{lcl} |
|
89 $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ |
|
90 $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ |
|
91 $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ |
|
92 $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ |
|
93 $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ |
|
94 $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ |
|
95 \end{tabular} |
|
96 \end{center} |
|
97 \noindent |
|
The empty set does not contain any string and
therefore not the empty string; the empty string
regular expression contains the empty string
by definition; the character regular expression
is the singleton set containing just that one-character string
and therefore does not contain the empty string.
For the alternative regular expression (or ``or'' expression)
it suffices that one of its children is nullable.
The sequence regular expression
requires both children to contain the empty string
in order to compose the empty string, and the Kleene star
operation naturally introduces the empty string.
|
111 |
|
We can give the meaning of the derivative operation in terms of the
language interpretation: the language of the derivative $r\backslash c$
is exactly the semantic derivative of the language of $r$, that is

\begin{center}
$L \; (r\backslash c) = \Der \; c \; (L \; (r))$.
\end{center}

\noindent
|
In other words, the derivative operation, written $\backslash c$,
describes how a regular expression evolves into
a new regular expression when the head character $c$ is
chopped off all the strings in its language.
The most involved cases are the sequence
and star cases.
The sequence case says that if the first regular expression
contains the empty string, then the second component of the sequence
may also be chosen as the regular expression from which the head character
is chopped off.
The derivative of a star regular expression unwraps one iteration of
the regular expression and attaches the star regular expression
as the second element of a sequence, so that a copy is retained
for possibly more iterations in later phases of lexing.
|
147 |
|
148 |
|
149 The main property of the derivative operation |
|
150 that enables us to reason about the correctness of |
|
151 an algorithm using derivatives is |
|
152 |
|
153 \begin{center} |
|
154 $c\!::\!s \in L(r)$ holds |
|
155 if and only if $s \in L(r\backslash c)$. |
|
156 \end{center} |
|
157 |
|
158 \noindent |
|
159 We can generalise the derivative operation shown above for single characters |
|
160 to strings as follows: |
|
161 |
|
162 \begin{center} |
|
163 \begin{tabular}{lcl} |
|
164 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\ |
|
165 $r \backslash [\,] $ & $\dn$ & $r$ |
|
166 \end{tabular} |
|
167 \end{center} |
|
168 |
|
169 \noindent |
|
170 and then define Brzozowski's regular-expression matching algorithm as: |
|
171 |
|
172 \[ |
|
173 match\;s\;r \;\dn\; nullable(r\backslash s) |
|
174 \] |
|
175 |
|
176 \noindent |
|
Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_{n-1}$,
the algorithm can be presented graphically as follows:
|
179 |
|
180 \begin{equation}\label{graph:*} |
|
181 \begin{tikzcd} |
|
182 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO} |
|
183 \end{tikzcd} |
|
184 \end{equation} |
|
185 |
|
186 \noindent |
|
187 where we start with a regular expression $r_0$, build successive |
|
188 derivatives until we exhaust the string and then use \textit{nullable} |
|
189 to test whether the result can match the empty string. It can be |
|
190 relatively easily shown that this matcher is correct (that is given |
|
191 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$). |
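
For concreteness, the matcher can be written out in a few lines of Scala;
the following is a minimal sketch where the datatype and function names are
our own choices for illustration (not necessarily those of any particular
implementation):

\begin{verbatim}
// A naive Scala sketch of Brzozowski's matcher as described above.
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// derivative w.r.t. a whole string, character by character
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
\end{verbatim}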
|
192 |
|
This is a beautifully simple definition.
|
194 |
|
195 If we implement the above algorithm naively, however, |
|
196 the algorithm can be excruciatingly slow. |
|
197 |
|
198 |
|
199 \begin{figure} |
|
200 \centering |
|
201 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
202 \begin{tikzpicture} |
|
203 \begin{axis}[ |
|
204 xlabel={$n$}, |
|
205 x label style={at={(1.05,-0.05)}}, |
|
206 ylabel={time in secs}, |
|
207 enlargelimits=false, |
|
208 xtick={0,5,...,30}, |
|
209 xmax=33, |
|
210 ymax=10000, |
|
211 ytick={0,1000,...,10000}, |
|
212 scaled ticks=false, |
|
213 axis lines=left, |
|
214 width=5cm, |
|
215 height=4cm, |
|
216 legend entries={JavaScript}, |
|
217 legend pos=north west, |
|
218 legend cell align=left] |
|
219 \addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data}; |
|
220 \end{axis} |
|
221 \end{tikzpicture}\\ |
|
222 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings |
|
223 of the form $\underbrace{aa..a}_{n}$.} |
|
224 \end{tabular} |
|
225 \caption{EightThousandNodes} \label{fig:EightThousandNodes} |
|
226 \end{figure} |
|
227 |
|
228 |
|
|
230 For example, when starting with the regular |
|
231 expression $(a + aa)^*$ and building a few successive derivatives (around 10) |
|
232 w.r.t.~the character $a$, one obtains a derivative regular expression |
|
with more than 8000 nodes (when viewed as a tree), see Figure~\ref{fig:EightThousandNodes}.
|
234 The reason why $(a + aa) ^*$ explodes so drastically is that without |
|
235 pruning, the algorithm will keep records of all possible ways of matching: |
|
236 \begin{center} |
|
237 $(a + aa) ^* \backslash (aa) = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$ |
|
238 \end{center} |
|
239 |
|
240 \noindent |
|
Each of the above alternative branches corresponds to one way of matching:
$aa$, $a \quad a$, and $a \quad a \cdot (a)$ (the last one being an incomplete match).
|
243 These different ways of matching will grow exponentially with the string length, |
|
244 and without simplifications that throw away some of these very similar matchings, |
|
245 it is no surprise that these expressions grow so quickly. |
|
246 Operations like |
|
247 $\backslash$ and $\nullable$ need to traverse such trees and |
|
248 consequently the bigger the size of the derivative the slower the |
|
249 algorithm. |
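
The blow-up can be observed concretely by measuring the size of successive
derivatives, for instance with the following small experiment on top of the
matcher sketch above (only an illustration; the exact numbers depend on the
chosen representation):

\begin{verbatim}
// node count of a regular expression, viewed as a tree
def rsize(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALT(r1, r2) => 1 + rsize(r1) + rsize(r2)
  case SEQ(r1, r2) => 1 + rsize(r1) + rsize(r2)
  case STAR(r1) => 1 + rsize(r1)
}

// sizes of the derivatives of (a + aa)* w.r.t. "a", "aa", "aaa", ...
val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
(1 to 12).foreach(n =>
  println(s"$n: " + rsize(ders(List.fill(n)('a'), r))))
\end{verbatim}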
|
250 |
|
Brzozowski was quick to observe that during this process a lot of useless
$\ONE$s and $\ZERO$s are generated, which makes the derivatives far from optimal.
He therefore introduced ``similarity rules'', such
as $P+(Q+R) = (P+Q)+R$, to merge syntactically
different but language-equivalent sub-regexes and thereby further decrease the size
of the intermediate regexes.
|
257 |
|
More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step and compose
the two operations into a single atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can build
a matcher that avoids cumbersome regular expressions.
|
264 |
|
265 |
|
266 If we want the size of derivatives in the algorithm to |
|
267 stay even lower, we would need more aggressive simplifications. |
|
268 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as |
|
269 deleting duplicates whenever possible. For example, the parentheses in |
|
270 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b |
|
271 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another |
|
272 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just |
|
273 $a^*+a+\ONE$. Adding these more aggressive simplification rules help us |
|
274 to achieve a very tight size bound, namely, |
|
275 the same size bound as that of the \emph{partial derivatives}. |
|
276 |
|
Building derivatives and then simplifying them:
so far so good. But what if we want to
do lexing instead of just obtaining a YES/NO answer?
This requires us to go back, for the moment, to the world
without simplification.
Sulzmann and Lu~\cite{Sulzmann2014} came up with a nice and
elegant (arguably as beautiful as the original
derivatives definition) solution for this.
|
285 |
|
286 \subsection*{Values and the Lexing Algorithm by Sulzmann and Lu} |
|
287 |
|
288 |
|
289 They first defined the datatypes for storing the |
|
290 lexing information called a \emph{value} or |
|
291 sometimes also \emph{lexical value}. These values and regular |
|
292 expressions correspond to each other as illustrated in the following |
|
293 table: |
|
294 |
|
295 \begin{center} |
|
296 \begin{tabular}{c@{\hspace{20mm}}c} |
|
297 \begin{tabular}{@{}rrl@{}} |
|
298 \multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\ |
|
299 $r$ & $::=$ & $\ZERO$\\ |
|
300 & $\mid$ & $\ONE$ \\ |
|
301 & $\mid$ & $c$ \\ |
|
302 & $\mid$ & $r_1 \cdot r_2$\\ |
|
303 & $\mid$ & $r_1 + r_2$ \\ |
|
304 \\ |
|
305 & $\mid$ & $r^*$ \\ |
|
306 \end{tabular} |
|
307 & |
|
308 \begin{tabular}{@{\hspace{0mm}}rrl@{}} |
|
309 \multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\ |
|
310 $v$ & $::=$ & \\ |
|
311 & & $\Empty$ \\ |
|
312 & $\mid$ & $\Char(c)$ \\ |
|
313 & $\mid$ & $\Seq\,v_1\, v_2$\\ |
|
314 & $\mid$ & $\Left(v)$ \\ |
|
315 & $\mid$ & $\Right(v)$ \\ |
|
316 & $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\ |
|
317 \end{tabular} |
|
318 \end{tabular} |
|
319 \end{center} |
|
320 |
|
321 \noindent |
|
322 |
|
Building on top of Sulzmann and Lu's attempt to formalise the
notion of POSIX lexing rules \parencite{Sulzmann2014},
Ausaf and Urban~\parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
natural deduction style.
With the formally specified rules for what a POSIX match is,
they proved in Isabelle/HOL that the algorithm gives correct results.

But having correct results is still not enough;
we also want at least some degree of \emph{efficiency}.
|
333 |
|
334 |
|
335 |
|
One regular expression can have multiple lexical values. For example,
the regular expression $(a+b)^*$ has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
$\ldots$, and vice versa.
Even for a regular expression matching one particular string, there can
still be more than one value corresponding to that match.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
The number of different ways of matching,
without allowing any value under a star to be flattened
to an empty string, can be given by the following formula:
|
348 \begin{equation} |
|
349 C_n = (n+1)+n C_1+\ldots + 2 C_{n-1} |
|
350 \end{equation} |
|
351 and a closed form formula can be calculated to be |
|
352 \begin{equation} |
|
353 C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}} |
|
354 \end{equation} |
|
which clearly grows exponentially in $n$.
|
356 |
|
A lexer aimed at generating all possible values has an exponential
worst-case runtime. Therefore it is impractical to try to generate
all possible matches in a single run. In practice, we are usually
interested in POSIX values, which intuitively always
|
361 \begin{itemize} |
|
362 \item |
|
363 match the leftmost regular expression when multiple options of matching |
|
364 are available |
|
365 \item |
|
366 always match a subpart as much as possible before proceeding |
|
367 to the next token. |
|
368 \end{itemize} |
|
369 |
|
370 |
|
For example, the $(a^*\cdot a^*)^*$ example above has the POSIX value
$\Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}],\, \Stars\,[])]$.
|
373 The output of an algorithm we want would be a POSIX matching |
|
374 encoded as a value. |
|
375 The reason why we are interested in $\POSIX$ values is that they can |
|
376 be practically used in the lexing phase of a compiler front end. |
|
377 For instance, when lexing a code snippet |
|
378 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized |
|
379 as an identifier rather than a keyword. |
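
To see why a simple matcher is not enough for this task, consider the
following toy instance of the keyword-versus-identifier situation, built on
the matcher sketch from earlier (the concrete regular expressions here are
made up purely for illustration; a real lexer would use character ranges and
many more rules):

\begin{verbatim}
val IF = SEQ(CHAR('i'), CHAR('f'))                  // the keyword "if"
val AL = ALT(CHAR('i'), ALT(CHAR('f'), CHAR('o')))  // letters i, f, o
val ID = SEQ(AL, STAR(AL))                          // identifiers over i, f, o

matcher(ALT(IF, ID), "iffoo")  // true -- but which alternative matched?
\end{verbatim}

The boolean answer alone cannot tell us which alternative was used; this is
exactly the information a lexical value records.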
|
380 |
|
381 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
|
382 algorithm by a second phase (the first phase being building successive |
|
383 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value |
|
384 is generated in case the regular expression matches the string. |
|
385 Pictorially, the Sulzmann and Lu algorithm is as follows: |
|
386 |
|
387 \begin{ceqn} |
|
388 \begin{equation}\label{graph:2} |
|
389 \begin{tikzcd} |
|
390 r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
|
391 v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] |
|
392 \end{tikzcd} |
|
393 \end{equation} |
|
394 \end{ceqn} |
|
395 |
|
396 |
|
397 \noindent |
|
398 For convenience, we shall employ the following notations: the regular |
|
399 expression we start with is $r_0$, and the given string $s$ is composed |
|
400 of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the |
|
401 left to right, we build the derivatives $r_1$, $r_2$, \ldots according |
|
402 to the characters $c_0$, $c_1$ until we exhaust the string and obtain |
|
403 the derivative $r_n$. We test whether this derivative is |
|
404 $\textit{nullable}$ or not. If not, we know the string does not match |
|
405 $r$ and no value needs to be generated. If yes, we start building the |
|
406 values incrementally by \emph{injecting} back the characters into the |
|
407 earlier values $v_n, \ldots, v_0$. This is the second phase of the |
|
408 algorithm from the right to left. For the first value $v_n$, we call the |
|
409 function $\textit{mkeps}$, which builds a POSIX lexical value |
|
410 for how the empty string has been matched by the (nullable) regular |
|
411 expression $r_n$. This function is defined as |
|
412 |
|
413 \begin{center} |
|
414 \begin{tabular}{lcl} |
|
415 $\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\ |
|
416 $\mkeps(r_{1}+r_{2})$ & $\dn$ |
|
417 & \textit{if} $\nullable(r_{1})$\\ |
|
418 & & \textit{then} $\Left(\mkeps(r_{1}))$\\ |
|
419 & & \textit{else} $\Right(\mkeps(r_{2}))$\\ |
|
420 $\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\ |
|
$\mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
|
422 \end{tabular} |
|
423 \end{center} |
|
424 |
|
425 |
|
426 \noindent |
|
427 After the $\mkeps$-call, we inject back the characters one by one in order to build |
|
428 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$ |
|
429 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. |
|
430 After injecting back $n$ characters, we get the lexical value for how $r_0$ |
|
matches $s$. The POSIX value is maintained throughout the process.
|
432 For this Sulzmann and Lu defined a function that reverses |
|
433 the ``chopping off'' of characters during the derivative phase. The |
|
434 corresponding function is called \emph{injection}, written |
|
435 $\textit{inj}$; it takes three arguments: the first one is a regular |
|
436 expression ${r_{i-1}}$, before the character is chopped off, the second |
|
437 is a character ${c_{i-1}}$, the character we want to inject and the |
|
438 third argument is the value ${v_i}$, into which one wants to inject the |
|
439 character (it corresponds to the regular expression after the character |
|
440 has been chopped off). The result of this function is a new value. The |
|
441 definition of $\textit{inj}$ is as follows: |
|
442 |
|
443 \begin{center} |
|
444 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
|
$\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$ & $\Seq(\mkeps(r_1),\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$ & $\dn$ & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
|
452 \end{tabular} |
|
453 \end{center} |
|
454 |
|
455 \noindent This definition is by recursion on the ``shape'' of regular |
|
456 expressions and values. |
|
The clauses basically do one thing: they identify the ``hole'' in the
value into which the character is injected back.
For instance, consider the last clause, which injects back into a value
whose result will be a star value.
We know the given value must be a sequence value, whose first
component corresponds to the child regex of the star
with the first character chopped off, that is, the iteration of the star
that has just been unfolded. This component is followed by the already
matched star iterations collected before. So we inject the character
back into the first component and form a new value with this new iteration
added to the previous list of iterations, all under the $\Stars$
constructor at the top level.
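
The following Scala sketch, continuing the earlier matcher sketch, spells out
values, $\mkeps$, $\textit{inj}$ and the resulting two-phase lexer. The names
are again our own; in particular the constructors \texttt{Lft} and \texttt{Rgt}
are chosen to avoid clashing with Scala's built-in \texttt{Left}/\texttt{Right}:

\begin{verbatim}
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Lft(v: Val) extends Val
case class Rgt(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) => if (nullable(r1)) Lft(mkeps(r1)) else Rgt(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}

def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(d), Empty) => Chr(d)
  case (ALT(r1, _), Lft(v1)) => Lft(inj(r1, c, v1))
  case (ALT(_, r2), Rgt(v2)) => Rgt(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Lft(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Rgt(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
}

// two phases: derivatives going forward, injections going backward
def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lexer(der(c, r), cs).map(v => inj(r, c, v))
}

// e.g. lexer(ALT(IF, ID), "iffoo".toList) yields a Rgt(...) value,
// i.e. the identifier branch, as the POSIX rules require.
\end{verbatim}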
|
469 |
|
We have mentioned before that derivatives without simplification
can get clumsy, and this is true for values as well: by definition they mirror
the size of the regular expressions.

One can introduce simplification on the regexes and values, but one has to
be careful not to break correctness, as the injection
function heavily relies on the structure of the regexes and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps and applying these rectifications
during the injection phase.
One can then prove that the POSIX value for how a
regular expression matches a string is not affected, although this is much harder
to establish.
|
484 Some initial results in this regard have been |
|
485 obtained in \cite{AusafDyckhoffUrban2016}. |
|
486 |
|
487 |
|
488 |
|
489 %Brzozowski, after giving the derivatives and simplification, |
|
490 %did not explore lexing with simplification or he may well be |
|
491 %stuck on an efficient simplificaiton with a proof. |
|
492 %He went on to explore the use of derivatives together with |
|
493 %automaton, and did not try lexing using derivatives. |
|
494 |
|
We want to get rid of this complex and fragile rectification of values.
Can we avoid creating the intermediate values $v_1,\ldots, v_n$ altogether,
and obtain the lexing information, which is already there, in a single pass
while building the derivatives, without a second injection phase?
At the same time, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
adding additional information to the
regular expressions, called \emph{bitcodes}.
|
505 |
|
506 \subsection*{Bit-coded Algorithm} |
|
507 Bits and bitcodes (lists of bits) are defined as: |
|
508 |
|
509 \begin{center} |
|
510 $b ::= 1 \mid 0 \qquad |
|
511 bs ::= [] \mid b::bs |
|
512 $ |
|
513 \end{center} |
|
514 |
|
515 \noindent |
|
516 The $1$ and $0$ are not in bold in order to avoid |
|
517 confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or |
|
518 bit-lists) can be used to encode values (or potentially incomplete values) in a |
|
519 compact form. This can be straightforwardly seen in the following |
|
520 coding function from values to bitcodes: |
|
521 |
|
522 \begin{center} |
|
523 \begin{tabular}{lcl} |
|
524 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ |
|
525 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ |
|
526 $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\ |
|
527 $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\ |
|
528 $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\ |
|
529 $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\ |
|
530 $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\; |
|
531 code(\Stars\,vs)$ |
|
532 \end{tabular} |
|
533 \end{center} |
|
534 |
|
535 \noindent |
|
Here $\textit{code}$ encodes a value into a bitcode by converting
|
537 $\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty |
|
538 star iteration by $1$. The border where a local star terminates |
|
539 is marked by $0$. This coding is lossy, as it throws away the information about |
|
540 characters, and also does not encode the ``boundary'' between two |
|
541 sequence values. Moreover, with only the bitcode we cannot even tell |
|
542 whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The |
|
543 reason for choosing this compact way of storing information is that the |
|
544 relatively small size of bits can be easily manipulated and ``moved |
|
545 around'' in a regular expression. In order to recover values, we will |
|
need the corresponding regular expression as extra information. This
|
547 means the decoding function is defined as: |
|
548 |
|
549 |
|
550 %\begin{definition}[Bitdecoding of Values]\mbox{} |
|
551 \begin{center} |
|
552 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} |
|
553 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ |
|
554 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ |
|
555 $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & |
|
556 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\; |
|
557 (\Left\,v, bs_1)$\\ |
|
558 $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & |
|
559 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\; |
|
560 (\Right\,v, bs_1)$\\ |
|
561 $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ & |
|
562 $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\ |
|
563 & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\ |
|
564 & & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\ |
|
565 $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\ |
|
566 $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & |
|
567 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\ |
|
568 & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\ |
|
569 & & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\ |
|
570 |
|
571 $\textit{decode}\,bs\,r$ & $\dn$ & |
|
572 $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\ |
|
573 & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; |
|
574 \textit{else}\;\textit{None}$ |
|
575 \end{tabular} |
|
576 \end{center} |
|
577 %\end{definition} |
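
For concreteness, here is a sketch of $\textit{code}$ and $\textit{decode}$ in
Scala, continuing the earlier sketches (again only an illustration of the
definitions above, not the authors' code):

\begin{verbatim}
type Bits = List[Int]   // bitcodes as lists of 0s and 1s

def code(v: Val): Bits = v match {
  case Empty => Nil
  case Chr(_) => Nil
  case Lft(v1) => 0 :: code(v1)
  case Rgt(v1) => 1 :: code(v1)
  case Sequ(v1, v2) => code(v1) ::: code(v2)
  case Stars(Nil) => List(0)
  case Stars(v1 :: vs) => 1 :: code(v1) ::: code(Stars(vs))
}

def decodeAux(r: Rexp, bs: Bits): (Val, Bits) = (r, bs) match {
  case (ONE, _) => (Empty, bs)
  case (CHAR(c), _) => (Chr(c), bs)
  case (ALT(r1, _), 0 :: rest) =>
    val (v, bs1) = decodeAux(r1, rest); (Lft(v), bs1)
  case (ALT(_, r2), 1 :: rest) =>
    val (v, bs1) = decodeAux(r2, rest); (Rgt(v), bs1)
  case (SEQ(r1, r2), _) =>
    val (v1, bs1) = decodeAux(r1, bs)
    val (v2, bs2) = decodeAux(r2, bs1)
    (Sequ(v1, v2), bs2)
  case (STAR(_), 0 :: rest) => (Stars(Nil), rest)
  case (STAR(r1), 1 :: rest) =>
    val (v, bs1) = decodeAux(r1, rest)
    decodeAux(STAR(r1), bs1) match {
      case (Stars(vs), bs2) => (Stars(v :: vs), bs2)
    }
}

def decode(r: Rexp, bs: Bits): Option[Val] =
  decodeAux(r, bs) match {
    case (v, Nil) => Some(v)
    case _ => None
  }
\end{verbatim}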
|
578 |
|
Sulzmann and Lu integrated the bitcodes into regular expressions to
|
580 create annotated regular expressions \cite{Sulzmann2014}. |
|
581 \emph{Annotated regular expressions} are defined by the following |
|
582 grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$} |
|
583 |
|
584 \begin{center} |
|
585 \begin{tabular}{lcl} |
|
586 $\textit{a}$ & $::=$ & $\ZERO$\\ |
|
587 & $\mid$ & $_{bs}\ONE$\\ |
|
588 & $\mid$ & $_{bs}{\bf c}$\\ |
|
589 & $\mid$ & $_{bs}\sum\,as$\\ |
|
590 & $\mid$ & $_{bs}a_1\cdot a_2$\\ |
|
591 & $\mid$ & $_{bs}a^*$ |
|
592 \end{tabular} |
|
593 \end{center} |
|
594 %(in \textit{ALTS}) |
|
595 |
|
596 \noindent |
|
597 where $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regular |
|
598 expressions and $as$ for a list of annotated regular expressions. |
|
The alternative constructor ($\sum$) has been generalised to
accept a list of annotated regular expressions rather than just two.
|
601 We will show that these bitcodes encode information about |
|
602 the (POSIX) value that should be generated by the Sulzmann and Lu |
|
603 algorithm. |
|
604 |
|
605 |
|
606 To do lexing using annotated regular expressions, we shall first |
|
607 transform the usual (un-annotated) regular expressions into annotated |
|
608 regular expressions. This operation is called \emph{internalisation} and |
|
609 defined as follows: |
|
610 |
|
611 %\begin{definition} |
|
612 \begin{center} |
|
613 \begin{tabular}{lcl} |
|
614 $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\ |
|
615 $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\ |
|
616 $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\ |
|
617 $(r_1 + r_2)^\uparrow$ & $\dn$ & |
|
618 $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\, |
|
619 \textit{fuse}\,[1]\,r_2^\uparrow]$\\ |
|
620 $(r_1\cdot r_2)^\uparrow$ & $\dn$ & |
|
621 $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\ |
|
622 $(r^*)^\uparrow$ & $\dn$ & |
|
623 $_{[]}(r^\uparrow)^*$\\ |
|
624 \end{tabular} |
|
625 \end{center} |
|
626 %\end{definition} |
|
627 |
|
628 \noindent |
|
629 We use up arrows here to indicate that the basic un-annotated regular |
|
630 expressions are ``lifted up'' into something slightly more complex. In the |
|
631 fourth clause, $\textit{fuse}$ is an auxiliary function that helps to |
|
632 attach bits to the front of an annotated regular expression. Its |
|
633 definition is as follows: |
|
634 |
|
635 \begin{center} |
|
636 \begin{tabular}{lcl} |
|
637 $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\ |
|
638 $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ & |
|
639 $_{bs @ bs'}\ONE$\\ |
|
640 $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ & |
|
641 $_{bs@bs'}{\bf c}$\\ |
|
642 $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ & |
|
643 $_{bs@bs'}\sum\textit{as}$\\ |
|
644 $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ & |
|
645 $_{bs@bs'}a_1 \cdot a_2$\\ |
|
646 $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ & |
|
647 $_{bs @ bs'}a^*$ |
|
648 \end{tabular} |
|
649 \end{center} |
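
A Scala sketch of annotated regular expressions together with
$\textit{fuse}$ and internalisation might look as follows (our own
illustrative rendering of the definitions above):

\begin{verbatim}
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp
case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: Bits, a: ARexp) extends ARexp

// attach bits to the front of an annotated regular expression
def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ::: bs1)
  case ACHAR(bs1, c) => ACHAR(bs ::: bs1, c)
  case AALTS(bs1, as) => AALTS(bs ::: bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ::: bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ::: bs1, a1)
}

// lift an ordinary regular expression into an annotated one
def internalise(r: Rexp): ARexp = r match {
  case ZERO => AZERO
  case ONE => AONE(Nil)
  case CHAR(c) => ACHAR(Nil, c)
  case ALT(r1, r2) =>
    AALTS(Nil, List(fuse(List(0), internalise(r1)),
                    fuse(List(1), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1) => ASTAR(Nil, internalise(r1))
}
\end{verbatim}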
|
650 |
|
651 \noindent |
|
652 After internalising the regular expression, we perform successive |
|
653 derivative operations on the annotated regular expressions. This |
|
654 derivative operation is the same as what we had previously for the |
|
basic regular expressions, except that we need to take care of
|
656 the bitcodes: |
|
657 |
|
658 |
|
659 \iffalse |
|
660 %\begin{definition}{bder} |
|
661 \begin{center} |
|
662 \begin{tabular}{@{}lcl@{}} |
|
663 $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
|
664 $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
|
665 $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ & |
|
666 $\textit{if}\;c=d\; \;\textit{then}\; |
|
667 \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\ |
|
668 $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ & |
|
669 $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\ |
|
670 $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & |
|
671 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
|
672 & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\ |
|
673 & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ |
|
674 & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\ |
|
675 $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ & |
|
676 $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\, |
|
677 (\textit{STAR}\,[]\,r)$ |
|
678 \end{tabular} |
|
679 \end{center} |
|
680 %\end{definition} |
|
681 |
|
682 \begin{center} |
|
683 \begin{tabular}{@{}lcl@{}} |
|
684 $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
|
685 $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
|
686 $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ & |
|
687 $\textit{if}\;c=d\; \;\textit{then}\; |
|
688 _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\ |
|
689 $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ & |
|
690 $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\ |
|
691 $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ & |
|
692 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
|
693 & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\ |
|
694 & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ |
|
695 & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\ |
|
696 $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ & |
|
697 $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\, |
|
698 (_{bs}\textit{STAR}\,[]\,r)$ |
|
699 \end{tabular} |
|
700 \end{center} |
|
701 %\end{definition} |
|
702 \fi |
|
703 |
|
704 \begin{center} |
|
705 \begin{tabular}{@{}lcl@{}} |
|
706 $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
|
707 $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
|
708 $(_{bs}{\bf d})\,\backslash c$ & $\dn$ & |
|
709 $\textit{if}\;c=d\; \;\textit{then}\; |
|
710 _{bs}\ONE\;\textit{else}\;\ZERO$\\ |
|
711 $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ & |
|
712 $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\ |
|
713 $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ & |
|
714 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
|
715 & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\ |
|
716 & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ |
|
717 & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\ |
|
$(_{bs}a^*)\,\backslash c$ & $\dn$ &
$_{bs}(\textit{fuse}\, [0] \; (a\,\backslash c))\cdot
(_{[]}a^*)$
|
721 \end{tabular} |
|
722 \end{center} |
|
723 |
|
724 %\end{definition} |
|
725 \noindent |
|
For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
we need to unfold it into a sequence,
and attach an additional bit $0$ to the front of $a \backslash c$
to indicate one more star iteration. Also the sequence clause
|
is more subtle. When $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it applies to annotated regular expressions; we therefore omit its
definition), the derivative produces an alternative.
Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later);
then the right branch of this alternative, which is $\textit{fuse} \; (\textit{bmkeps} \; a_1)\, (a_2
\backslash c)$, collapses the regular expression $a_1$ (as it has
already been fully matched) and stores the parsing information at the
head of the regular expression $a_2 \backslash c$ by fusing it to the front. The
bitsequence $\textit{bs}$, which was initially attached to the
|
740 first element of the sequence $a_1 \cdot a_2$, has |
|
741 now been elevated to the top-level of $\sum$, as this information will be |
|
742 needed whichever way the sequence is matched---no matter whether $c$ belongs |
|
743 to $a_1$ or $ a_2$. After building these derivatives and maintaining all |
|
744 the lexing information, we complete the lexing by collecting the |
|
745 bitcodes using a generalised version of the $\textit{mkeps}$ function |
|
746 for annotated regular expressions, called $\textit{bmkeps}$: |
|
747 |
|
748 |
|
749 %\begin{definition}[\textit{bmkeps}]\mbox{} |
|
750 \begin{center} |
|
751 \begin{tabular}{lcl} |
|
752 $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\ |
|
753 $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ & |
|
754 $\textit{if}\;\textit{bnullable}\,a$\\ |
|
755 & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\ |
|
756 & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\ |
|
757 $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ & |
|
758 $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\ |
|
759 $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ & |
|
760 $bs \,@\, [0]$ |
|
761 \end{tabular} |
|
762 \end{center} |
|
763 %\end{definition} |
|
764 |
|
765 \noindent |
|
766 This function completes the value information by travelling along the |
|
767 path of the regular expression that corresponds to a POSIX value and |
|
collecting all the bitcodes, using $0$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
|
770 decode them, we get the value we expect. The corresponding lexing |
|
771 algorithm looks as follows: |
|
772 |
|
773 \begin{center} |
|
774 \begin{tabular}{lcl} |
|
775 $\textit{blexer}\;r\,s$ & $\dn$ & |
|
776 $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\ |
|
777 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
|
778 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
|
779 & & $\;\;\textit{else}\;\textit{None}$ |
|
780 \end{tabular} |
|
781 \end{center} |
|
782 |
|
783 \noindent |
|
784 In this definition $\_\backslash s$ is the generalisation of the derivative |
|
785 operation from characters to strings (just like the derivatives for un-annotated |
|
786 regular expressions). |
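
The following Scala sketch puts these pieces together
($\textit{bnullable}$, $\textit{bmkeps}$, the bit-coded derivative and
$\textit{blexer}$), continuing the annotated-regular-expression sketch above;
it is only meant to mirror the mathematical definitions, not any particular
implementation:

\begin{verbatim}
def bnullable(a: ARexp): Boolean = a match {
  case AZERO => false
  case AONE(_) => true
  case ACHAR(_, _) => false
  case AALTS(_, as) => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _) => true
}

// collect the bits along the nullable path
def bmkeps(a: ARexp): Bits = a match {
  case AONE(bs) => bs
  case AALTS(bs, as) => bs ::: bmkeps(as.find(bnullable).get)
  case ASEQ(bs, a1, a2) => bs ::: bmkeps(a1) ::: bmkeps(a2)
  case ASTAR(bs, _) => bs ::: List(0)
}

def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(_) => AZERO
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1) =>
    ASEQ(bs, fuse(List(0), bder(c, a1)), ASTAR(Nil, a1))
}

def bders(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bders(cs, bder(c, a))
}

def blexer(r: Rexp, s: String): Option[Val] = {
  val a = bders(s.toList, internalise(r))
  if (bnullable(a)) decode(r, bmkeps(a)) else None
}
\end{verbatim}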
|
787 |
|
788 Now we introduce the simplifications, which is why we introduce the |
|
789 bitcodes in the first place. |
|
790 |
|
791 \subsection*{Simplification Rules} |
|
792 |
|
793 This section introduces aggressive (in terms of size) simplification rules |
|
794 on annotated regular expressions |
|
to keep derivatives small. Such simplifications are promising,
as the test data we have
generated indicate
that a good tight size bound can be achieved. We could only
partially cover the search space, as there are infinitely many regular
expressions and strings.
|
801 |
|
802 One modification we introduced is to allow a list of annotated regular |
|
803 expressions in the $\sum$ constructor. This allows us to not just |
|
804 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but |
|
805 also unnecessary ``copies'' of regular expressions (very similar to |
|
806 simplifying $r + r$ to just $r$, but in a more general setting). Another |
|
807 modification is that we use simplification rules inspired by Antimirov's |
|
808 work on partial derivatives. They maintain the idea that only the first |
|
809 ``copy'' of a regular expression in an alternative contributes to the |
|
810 calculation of a POSIX value. All subsequent copies can be pruned away from |
|
811 the regular expression. A recursive definition of our simplification function |
|
812 that looks somewhat similar to our Scala code is given below: |
|
813 %\comment{Use $\ZERO$, $\ONE$ and so on. |
|
814 %Is it $ALTS$ or $ALTS$?}\\ |
|
815 |
|
816 \begin{center} |
|
817 \begin{tabular}{@{}lcl@{}} |
|
818 |
|
819 $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ |
|
820 &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ |
|
821 &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ |
|
822 &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ |
|
823 &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ |
|
824 &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\ |
|
825 |
|
826 $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\ |
|
827 &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ |
|
828 &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
|
829 &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ |
|
830 |
|
831 $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ |
|
832 \end{tabular} |
|
833 \end{center} |
|
834 |
|
835 \noindent |
|
The simplification function pattern-matches on the regular expression.
When it detects that the regular expression is an alternative or a
sequence, it first simplifies its child regular expressions
recursively and then checks whether one of the children has turned into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\sum$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
alternatives and remove as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurring copy of a duplicate and removes all
later ones. Function $\textit{flatten}$ opens up nested $\sum$s.
Its recursive definition is given below:
|
847 |
|
848 \begin{center} |
|
849 \begin{tabular}{@{}lcl@{}} |
|
850 $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; |
|
851 (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ |
|
852 $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\ |
|
853 $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) |
|
854 \end{tabular} |
|
855 \end{center} |
|
856 |
|
857 \noindent |
|
858 Here $\textit{flatten}$ behaves like the traditional functional programming flatten |
|
859 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it |
|
860 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$. |
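
A corresponding Scala sketch of $\textit{simp}$ and $\textit{flatten}$ is
given below. It mirrors the clauses above, except that in the clause
eliminating a leading $\ONE$ we also keep the bits of that $\ONE$, and we omit
the clause for a trailing $\ONE$, whose bit bookkeeping is more delicate; how
these details are handled in the actual implementation may differ:

\begin{verbatim}
// open up nested alternatives, dropping ZEROs along the way
def flatten(as: List[ARexp]): List[ARexp] = as match {
  case Nil => Nil
  case AZERO :: rest => flatten(rest)
  case AALTS(bs, as1) :: rest => as1.map(fuse(bs, _)) ::: flatten(rest)
  case a :: rest => a :: flatten(rest)
}

def simp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
    case (AZERO, _) => AZERO
    case (_, AZERO) => AZERO
    case (AONE(bs1), a2s) => fuse(bs ::: bs1, a2s) // drop the ONE, keep bits
    case (a1s, a2s) => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) => flatten(as.map(simp)).distinct match {
    case Nil => AZERO
    case a1 :: Nil => fuse(bs, a1)
    case as1 => AALTS(bs, as1)
  }
  case _ => a
}
\end{verbatim}

Note that \texttt{List.distinct} keeps exactly the first occurrence of each
duplicate, matching the behaviour of $\textit{distinct}$ described above.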
|
861 |
|
862 Having defined the $\simp$ function, |
|
863 we can use the previous notation of natural |
|
864 extension from derivative w.r.t.~character to derivative |
|
865 w.r.t.~string:%\comment{simp in the [] case?} |
|
866 |
|
867 \begin{center} |
|
868 \begin{tabular}{lcl} |
|
869 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ |
|
870 $r \backslash_{simp} [\,] $ & $\dn$ & $r$ |
|
871 \end{tabular} |
|
872 \end{center} |
|
873 |
|
874 \noindent |
|
875 to obtain an optimised version of the algorithm: |
|
876 |
|
877 \begin{center} |
|
878 \begin{tabular}{lcl} |
|
879 $\textit{blexer\_simp}\;r\,s$ & $\dn$ & |
|
880 $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ |
|
881 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
|
882 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
|
883 & & $\;\;\textit{else}\;\textit{None}$ |
|
884 \end{tabular} |
|
885 \end{center} |
|
886 |
|
887 \noindent |
|
This algorithm keeps the regular expression size small: for example,
with this simplification the derivative of our previous $(a + aa)^*$ example,
which previously grew to more than 8000 nodes,
is reduced to just 6 nodes and stays constant, no matter how long the
input string is.
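
With $\textit{simp}$ in place, the optimised lexer is a small change to the
earlier sketch; we also add a size function that can be used to reproduce
experiments like the $(a+aa)^*$ one (again, an illustration only):

\begin{verbatim}
// simplify after every derivative step
def bdersSimp(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bdersSimp(cs, simp(bder(c, a)))
}

def blexerSimp(r: Rexp, s: String): Option[Val] = {
  val a = bdersSimp(s.toList, internalise(r))
  if (bnullable(a)) decode(r, bmkeps(a)) else None
}

// node count of an annotated regular expression
def asize(a: ARexp): Int = a match {
  case AZERO | AONE(_) | ACHAR(_, _) => 1
  case AALTS(_, as) => 1 + as.map(asize).sum
  case ASEQ(_, a1, a2) => 1 + asize(a1) + asize(a2)
  case ASTAR(_, a1) => 1 + asize(a1)
}
\end{verbatim}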
|
892 |
|
893 |
6 |
894 |
7 |
895 |
8 |
896 |
9 |
897 |
10 |
898 |
15 To be completed. |
903 To be completed. |
16 |
904 |
17 |
905 |
18 |
906 |
19 |
907 |
20 %----------------------------------- |
908 %---------------------------------------------------------------------------------------- |
21 % SECTION ? |
909 % SECTION correctness proof |
22 %----------------------------------- |
910 %---------------------------------------------------------------------------------------- |
23 \section{preparatory properties for getting a finiteness bound} |
911 \section{Correctness of Bit-coded Algorithm (Without Simplification) |
24 Before we get to the proof that says the intermediate result of our lexer will |
912 We now give the proof the correctness of the algorithm with bit-codes. |
25 remain finitely bounded, which is an important efficiency/liveness guarantee, |
913 |
26 we shall first develop a few preparatory properties and definitions to |
914 Ausaf and Urban cleverly defined an auxiliary function called $\flex$, |
27 make the process of proving that a breeze. |
915 defined as |
28 |
916 \[ |
29 We define rewriting relations for $\rrexp$s, which allows us to do the |
917 \flex \; r \; f \; [] \; v \; = \; f\; v |
30 same trick as we did for the correctness proof, |
918 \flex \; r \; f \; c :: s \; v = \flex r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v |
31 but this time we will have stronger equalities established. |
919 \] |
32 \subsection{"hrewrite" relation} |
920 which accumulates the characters that needs to be injected back, |
33 List of 1-step rewrite rules for regular expressions simplification without bitcodes: |
921 and does the injection in a stack-like manner (last taken derivative first injected). |
34 \begin{center} |
922 $\flex$ is connected to the $\lexer$: |
35 \begin{tabular}{lcl} |
|
36 $r_1 \cdot \ZERO$ & $\rightarrow$ & $\ZERO$ |
|
37 \end{tabular} |
|
38 \end{center} |
|
39 |
|
40 And we define an "grewrite" relation that works on lists: |
|
41 \begin{center} |
|
42 \begin{tabular}{lcl} |
|
43 $ \ZERO :: rs$ & $\rightarrow_g$ & $rs$ |
|
44 \end{tabular} |
|
45 \end{center} |
|
46 |
|
47 |
|
48 |
|
49 With these relations we prove |
|
50 \begin{lemma} |
923 \begin{lemma} |
51 $rs \rightarrow rs' \implies \RALTS{rs} \rightarrow \RALTS{rs'}$ |
924 $\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$ |
52 \end{lemma} |
925 \end{lemma} |
53 which enables us to prove "closed forms" equalities such as |
926 $\flex$ provides us a bridge between $\lexer$ and $\blexer$. |
54 \begin{lemma} |
927 What is even better about $\flex$ is that it allows us to |
55 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$ |
928 directly operate on the value $\mkeps (r\backslash v)$, |
|
929 which is pivotal in the definition of $\lexer $ and $\blexer$, but not visible as an argument. |
|
930 When the value created by $\mkeps$ becomes available, one can |
|
931 prove some stepwise properties of lexing nicely: |
|
932 \begin{lemma}\label{flexStepwise} |
|
933 $\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $ |
56 \end{lemma} |
934 \end{lemma} |
57 |
935 |
58 But the most involved part of the above lemma is proving the following: |
936 And for $\blexer$ we have something with stepwise properties like $\flex$ as well, |
59 \begin{lemma} |
937 called $\retrieve$: |
60 $\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ |
938 \[ |
|
939 \retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs} |
|
940 \] |
|
941 $\retrieve$ takes bit-codes from annotated regular expressions |
|
942 guided by a value. |
|
943 $\retrieve$ is connected to the $\blexer$ in the following way: |
|
944 \begin{lemma}\label{blexer_retrieve} |
|
945 $\blexer \; r \; s = \decode \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$ |
61 \end{lemma} |
946 \end{lemma} |
62 which is needed in proving things like |
947 If you take derivative of an annotated regular expression, |
63 \begin{lemma} |
948 you can $\retrieve$ the same bit-codes as before the derivative took place, |
64 $r \rightarrow_f r' \implies \rsimp{r} \rightarrow \rsimp{r'}$ |
949 provided that you use the corresponding value: |
|
950 |
|
951 \begin{lemma}\label{retrieveStepwise} |
|
952 $\retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$ |
65 \end{lemma} |
953 \end{lemma} |
66 |
954 The other good thing about $\retrieve$ is that it can be connected to $\flex$: |
67 Fortunately this is provable by induction on the list $rs$ |
955 %centralLemma1 |
68 |
956 \begin{lemma}\label{flex_retrieve} |
69 |
957 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$ |
70 |
958 \end{lemma} |
71 %----------------------------------- |
959 \begin{proof} |
72 % SECTION 2 |
960 By induction on $s$. The induction tactic is reverse induction on strings. |
73 %----------------------------------- |
961 $v$ is allowed to be arbitrary. |
74 |
962 The crucial point is to rewrite |
75 \section{Finiteness Property} |
963 \[ |
76 In this section let us describe our argument for why the size of the simplified |
964 \retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) |
77 derivatives with the aggressive simplification function is finitely bounded. |
|
78 Suppose |
|
79 we have a size function for bitcoded regular expressions, written |
|
80 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree |
|
81 |
|
82 \begin{center} |
|
83 \begin{tabular}{ccc} |
|
84 $\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\ |
|
85 \end{tabular} |
|
86 \end{center} |
|
87 (TODO: COMPLETE this defn and for $rs$) |
|
88 |
|
89 The size is based on a recursive function on the structure of the regex, |
|
90 not the bitcodes. |
|
91 Therefore we may as well talk about size of an annotated regular expression |
|
92 in their un-annotated form: |
|
93 \begin{center} |
|
94 $\asize(a) = \size(\erase(a))$. |
|
95 \end{center} |
|
96 (TODO: turn equals to roughly equals) |
|
97 |
|
98 But there is a minor nuisance here: |
|
99 the erase function actually messes with the structure of the regular expression: |
|
100 \begin{center} |
|
101 \begin{tabular}{ccc} |
|
102 $\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\ |
|
103 \end{tabular} |
|
104 \end{center} |
|
105 (TODO: complete this definition with singleton r in alts) |
|
106 |
|
107 An alternative regular expression with an empty list of children |
|
108 is turned into an $\ZERO$ during the |
|
109 $\erase$ function, thereby changing the size and structure of the regex. |
|
110 These will likely be fixable if we really want to use plain $\rexp$s for dealing |
|
111 with size, but we choose a more straightforward (or stupid) method by |
|
112 defining a new datatype that is similar to plain $\rexp$s but can take |
|
113 non-binary arguments for its alternative constructor, |
|
114 which we call $\rrexp$ to denote |
|
115 the difference between it and plain regular expressions. |
|
116 \[ \rrexp ::= \RZERO \mid \RONE |
|
117 \mid \RCHAR{c} |
|
118 \mid \RSEQ{r_1}{r_2} |
|
119 \mid \RALTS{rs} |
|
120 \mid \RSTAR{r} |
|
121 \] |
965 \] |
122 For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, |
966 as |
123 but keep everything else intact. |
967 \[ |
124 It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved |
968 \retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\; \mkeps (r \backslash s@[c])) |
125 (the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton |
969 \]. |
126 $\ALTS$). |
970 This enables us to equate |
127 We denote the operation of erasing the bits and turning an annotated regular expression |
971 \[ |
128 into an $\rrexp{}$ as $\rerase{}$. |
972 \retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) |
129 \begin{center} |
973 \] |
130 \begin{tabular}{lcl} |
974 with |
131 $\rerase{\AZERO}$ & $=$ & $\RZERO$\\ |
975 \[ |
132 $\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\ |
976 \flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c]))) |
133 $\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$ |
977 \], |
134 \end{tabular} |
978 which in turn can be rewritten as |
135 \end{center} |
979 \[ |
136 %TODO: FINISH definition of rerase |
980 \flex \; r \; \textit{id} \; s@[c] \; (\mkeps (r\backslash s@[c])) |
137 Similarly we could define the derivative and simplification on |
981 \]. |
138 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, |
982 \end{proof} |
139 except that now they can operate on alternatives taking multiple arguments. |
983 |
140 |
984 With the above lemma we can now link $\flex$ and $\blexer$. |
141 \begin{center} |
985 |
142 \begin{tabular}{lcr} |
986 \begin{lemma}\label{flex_blexer} |
143 $\RALTS{rs} \backslash c$ & $=$ & $\RALTS{map\; (\_ \backslash c) \;rs}$\\ |
987 $\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s) = \blexer \; r \; s$ |
144 (other clauses omitted) |
|
145 \end{tabular} |
|
146 \end{center} |
|
147 |
|
148 Now that $\rrexp$s do not have bitcodes on them, we can do the |
|
149 duplicate removal without an equivalence relation: |
|
150 \begin{center} |
|
151 \begin{tabular}{lcl} |
|
152 $\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\ |
|
153 & & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$ |
|
154 \end{tabular} |
|
155 \end{center} |
|
156 %TODO: definition of rsimp (maybe only the alternative clause) |
|
157 |
|
158 |
|
159 The reason why these definitions mirror precisely the corresponding operations |
|
160 on annotated regualar expressions is that we can calculate sizes more easily: |
|
161 |
|
162 \begin{lemma} |
|
163 $\rsize{\rerase a} = \asize a$ |
|
164 \end{lemma} |
988 \end{lemma} |
165 This lemma says that the size of an r-erased regex is the same as the annotated regex. |
|
166 this does not hold for plain $\rexp$s due to their ways of how alternatives are represented. |
|
167 \begin{lemma} |
|
168 $\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$ |
|
169 \end{lemma} |
|
170 Putting these two together we get a property that allows us to estimate the |
|
171 size of an annotated regular expression derivative using its un-annotated counterpart: |
|
172 \begin{lemma} |
|
173 $\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$ |
|
174 \end{lemma} |
|
175 Unless stated otherwise in this chapter all $\textit{rexp}$s without |
|
176 bitcodes are seen as $\rrexp$s. |
|
177 We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably |
|
178 as the former suits people's intuitive way of stating a binary alternative |
|
179 regular expression. |
|
180 |
|
181 |
|
182 \begin{theorem} |
|
183 For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$ |
|
184 \end{theorem} |
|
185 |
|
186 \noindent |
|
187 \begin{proof} |
989 \begin{proof} |
188 We prove this by induction on $r$. The base cases for $\AZERO$, |
990 Using two of the above lemmas: \ref{flex_retrieve} and \ref{blexer_retrieve}. |
189 $\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is |
|
190 for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction |
|
191 hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and |
|
192 $\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows |
|
193 % |
|
194 \begin{center} |
|
195 \begin{tabular}{lcll} |
|
196 & & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\ |
|
197 & $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} :: |
|
198 [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\ |
|
199 & $\leq$ & |
|
200 $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) :: |
|
201 [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\ |
|
202 & $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket + |
|
203 \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\ |
|
204 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + |
|
205 \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\ |
|
206 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5) |
|
207 \end{tabular} |
|
208 \end{center} |
|
209 |
|
210 |
|
211 \noindent |
|
212 where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$). |
|
213 The reason why we could write the derivatives of a sequence this way is described in section 2. |
|
214 The term (2) is used to control (1). |
|
215 That is because one can obtain an overall |
|
216 smaller regex list |
|
217 by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it. |
|
218 Section 3 is dedicated to its proof. |
|
219 In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is |
|
220 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller |
|
221 than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands |
|
222 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). |
|
223 We reason similarly for $\STAR$.\medskip |
|
224 \end{proof} |
225 |
What guarantee does this bound give us?
It tells us that whatever the regular expression is, the size of its simplified derivatives can never grow beyond the bound $N_r$, no matter how long the input string becomes.
Take our previous regular expression $(a + aa)^*$ as an example:
|
230 \begin{center} |
|
231 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
232 \begin{tikzpicture} |
|
233 \begin{axis}[ |
|
234 xlabel={number of $a$'s}, |
|
235 x label style={at={(1.05,-0.05)}}, |
|
236 ylabel={regex size}, |
|
237 enlargelimits=false, |
|
238 xtick={0,5,...,30}, |
|
239 xmax=33, |
|
240 ymax= 40, |
|
241 ytick={0,10,...,40}, |
|
242 scaled ticks=false, |
|
243 axis lines=left, |
|
244 width=5cm, |
|
245 height=4cm, |
|
246 legend entries={$(a + aa)^*$}, |
|
247 legend pos=north west, |
|
248 legend cell align=left] |
|
249 \addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data}; |
|
250 \end{axis} |
|
251 \end{tikzpicture} |
|
252 \end{tabular} |
|
253 \end{center} |
|
254 We are able to limit the size of the regex $(a + aa)^*$'s derivatives |
|
255 with our simplification |
|
256 rules very effectively. |
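To make this concrete, below is a small Scala sketch, meant for a Scala REPL or worksheet, that measures the sizes of simplified derivatives of $(a + aa)^*$ for increasing numbers of $a$'s. It is only an illustration: the names (\texttt{der}, \texttt{flts}, \texttt{simp}, \texttt{dersSimp}, \texttt{size}) are ours, the regular expressions carry no bitcodes, and plain structural equality stands in for $\distinctWith$.
\begin{verbatim}
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALTS(rs: List[Rexp]) extends Rexp   // list-based alternatives
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALTS(rs)    => rs.exists(nullable)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALTS(rs)    => ALTS(rs.map(der(c, _)))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALTS(List(SEQ(der(c, r1), r2), der(c, r2)))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}

// flatten nested alternatives and drop ZEROs (the role of flts)
def flts(rs: List[Rexp]): List[Rexp] = rs.flatMap {
  case ZERO      => Nil
  case ALTS(rs1) => rs1
  case r         => List(r)
}

// simplification; .distinct (structural equality) stands in for distinctWith
def simp(r: Rexp): Rexp = r match {
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _)  => ZERO
    case (_, ZERO)  => ZERO
    case (ONE, r2s) => r2s
    case (r1s, r2s) => SEQ(r1s, r2s)
  }
  case ALTS(rs) => flts(rs.map(simp)).distinct match {
    case Nil       => ZERO
    case r1 :: Nil => r1
    case rs1       => ALTS(rs1)
  }
  case _ => r
}

def dersSimp(r: Rexp, s: List[Char]): Rexp =
  s.foldLeft(r)((acc, c) => simp(der(c, acc)))

def size(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALTS(rs)    => 1 + rs.map(size).sum
  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
  case STAR(r1)    => 1 + size(r1)
}

val aaaStar = STAR(ALTS(List(CHAR('a'), SEQ(CHAR('a'), CHAR('a')))))
for (n <- 0 to 30)
  println(s"$n a's: size ${size(dersSimp(aaaStar, List.fill(n)('a')))}")
\end{verbatim}
The printed sizes stabilise after a few characters, matching the plateau in the graph above; the exact numbers depend on how aggressive the chosen \texttt{simp} is, so they need not coincide with the plotted data.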
|
257 |
|
258 |
|
As the graphs of some more randomly generated regular expressions show, the size of
the derivative might grow quickly for the first few characters of the input,
but the growth levels off once the input string is sufficiently long.
|
262 |
|
263 |
|
264 %a few sample regular experessions' derivatives |
|
265 %size change |
|
266 %TODO: giving graphs showing a few regexes' size changes |
|
267 %w;r;t the input characters number |
|
268 %a*, aa*, aaa*, ..... |
|
269 %randomly generated regexes |
|
270 \begin{center} |
|
271 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
272 \begin{tikzpicture} |
|
273 \begin{axis}[ |
|
274 xlabel={number of $a$'s}, |
|
275 x label style={at={(1.05,-0.05)}}, |
|
276 ylabel={regex size}, |
|
277 enlargelimits=false, |
|
278 xtick={0,5,...,30}, |
|
279 xmax=33, |
|
280 ymax=1000, |
|
281 ytick={0,100,...,1000}, |
|
282 scaled ticks=false, |
|
283 axis lines=left, |
|
284 width=5cm, |
|
285 height=4cm, |
|
286 legend entries={regex1}, |
|
287 legend pos=north west, |
|
288 legend cell align=left] |
|
289 \addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data}; |
|
290 \end{axis} |
|
291 \end{tikzpicture} |
|
292 & |
|
293 \begin{tikzpicture} |
|
294 \begin{axis}[ |
|
295 xlabel={$n$}, |
|
296 x label style={at={(1.05,-0.05)}}, |
|
297 %ylabel={time in secs}, |
|
298 enlargelimits=false, |
|
299 xtick={0,5,...,30}, |
|
300 xmax=33, |
|
301 ymax=1000, |
|
302 ytick={0,100,...,1000}, |
|
303 scaled ticks=false, |
|
304 axis lines=left, |
|
305 width=5cm, |
|
306 height=4cm, |
|
307 legend entries={regex2}, |
|
308 legend pos=north west, |
|
309 legend cell align=left] |
|
310 \addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data}; |
|
311 \end{axis} |
|
312 \end{tikzpicture} |
|
313 & |
|
314 \begin{tikzpicture} |
|
315 \begin{axis}[ |
|
316 xlabel={$n$}, |
|
317 x label style={at={(1.05,-0.05)}}, |
|
318 %ylabel={time in secs}, |
|
319 enlargelimits=false, |
|
320 xtick={0,5,...,30}, |
|
321 xmax=33, |
|
322 ymax=1000, |
|
323 ytick={0,100,...,1000}, |
|
324 scaled ticks=false, |
|
325 axis lines=left, |
|
326 width=5cm, |
|
327 height=4cm, |
|
328 legend entries={regex3}, |
|
329 legend pos=north west, |
|
330 legend cell align=left] |
|
331 \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data}; |
|
332 \end{axis} |
|
333 \end{tikzpicture}\\ |
|
\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes w.r.t.\ input string length.}
|
335 \end{tabular} |
|
336 \end{center} |
|
337 |
|
338 |
|
339 |
|
340 |
|
341 |
|
342 \noindent |
|
Clearly, the bound we give in this finiteness argument (Step (5)) is very loose and
far from the actual bound one can expect.
In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound
is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.
Given that $l_{N_2}$ is roughly of the order $4^{N_2}$, the bound for $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
inflates the bound for $\llbracket \bderssimp{r_2}{s} \rrbracket$ by (at least) the function
$f(x) = x * 4^x$.
This means the bound we obtain surges up at least
tower-exponentially with a linear increase of the depth:
for a regular expression of depth $n$, the bound
is approximately a tower of exponentials of height $n$, rather than a single exponential such as $4^n$.
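Spelling out one step of this estimate, and writing $N_{r_1 \cdot r_2}$ for the bound obtained for the sequence, Step (5) together with the rough estimate $l_{N} \approx 4^{N}$ gives
\begin{center}
$N_{r_1 \cdot r_2} \;\leq\; N_1 + \llbracket r_2 \rrbracket + 2 + l_{N_2} * N_2 \;\approx\; N_1 + \llbracket r_2 \rrbracket + 2 + 4^{N_2} * N_2$,
\end{center}
so each additional layer of sequence nesting applies (at least) the map $x \mapsto x * 4^{x}$ to the bound obtained one level below, which is what produces a tower of exponentials in the depth.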
|
355 |
|
We could do better than this, but a tighter bound does not improve
the finiteness property we are proving. If one is interested in a polynomial bound,
one would hope to obtain a bound similar to the tight one for partial
derivatives introduced by Antimirov \cite{Antimirov95}. After all, the idea with
|
360 $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in |
|
361 partial derivatives). |
|
362 To obtain the exact same bound would mean |
|
363 we need to introduce simplifications, such as |
|
364 $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$, |
|
365 which exist for partial derivatives. |
|
366 |
|
367 However, if we introduce them in our |
|
368 setting we would lose the POSIX property of our calculated values. |
|
A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$ matched against the string $abc$.
If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer
would give back $\Left(\Seq(\Char(a), \Left(\Seq(\Char(b), \Char(c)))))$ instead of
what we want: $\Seq(\Right(\Seq(\Char(a), \Char(b))), \Right(\Char(c)))$. Unless we can store the structural information
in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$
occurs, and replay this information in the right order once we obtain
a result from the ``aggressively simplified'' regex, it would be impossible to still get a $\POSIX$ value.
This is unlike the simplifications we had before, such as the rewriting rule
$\ONE \cdot r \rightsquigarrow r$, under which our lexer still gives back the same value.
|
378 We will discuss better |
|
379 bounds in the last section of this chapter.\\[-6.5mm] |
|
380 |
|
381 |
|
382 |
|
383 |
|
384 %---------------------------------------------------------------------------------------- |
|
385 % SECTION ?? |
|
386 %---------------------------------------------------------------------------------------- |
|
387 |
|
\section{``Closed Forms'' of Regular Expressions' Derivatives w.r.t.\ Strings}
|
To work out the ``closed forms'' of regular expressions' derivatives, we first
introduce a few auxiliary definitions that make expressing them easier.
|
391 |
|
392 \begin{center} |
|
393 \begin{tabular}{ccc} |
|
394 $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ |
|
395 $\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\ |
|
396 $\sflataux r$ & $=$ & $ [r]$ |
|
397 \end{tabular} |
|
398 \end{center} |
|
The intuition behind $\sflataux{\_}$ is to break up nested regular expressions
of the left-associated shape $(\ldots((r_1 + r_2) + r_3) + \ldots )$
into a more ``balanced'' list of alternatives: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
On any other regular expression it returns the singleton list $[r]$.
|
403 |
|
404 $\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps |
|
405 the output type a regular expression, not a list: |
|
406 \begin{center} |
|
407 \begin{tabular}{ccc} |
|
$\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\AALTS{ }{\sflataux{r} @ rs}$\\
$\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\
$\sflat r$ & $=$ & $ r$
|
411 \end{tabular} |
|
412 \end{center} |
|
$\sflataux{\_}$ and $\sflat{\_}$ are only recursive in the
first element of the list of children of
an alternative regex ($\AALTS{ }{rs}$), and are purposefully built for dealing with the derivative
$(r_1 \cdot r_2) \backslash s$.
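As an illustration, $\sflataux{\_}$ and $\sflat{\_}$ can be sketched in Scala on the list-based \texttt{ALTS} representation used in the earlier size-measuring sketch (the function names are again ours and carry no bitcodes):
\begin{verbatim}
// only the head of an alternative list is opened up, mirroring the
// left-nested shape of (r1 . r2) \ s
def sflatAux(r: Rexp): List[Rexp] = r match {
  case ALTS(r1 :: rs) => sflatAux(r1) ::: rs
  case ALTS(Nil)      => Nil
  case _              => List(r)
}

// same, but the result is kept as a regular expression
def sflat(r: Rexp): Rexp = r match {
  case ALTS(r1 :: rs) => ALTS(sflatAux(r1) ::: rs)
  case _              => r
}
\end{verbatim}
For example, \texttt{sflat(ALTS(List(ALTS(List(r1, r2)), r3)))} evaluates to \texttt{ALTS(List(r1, r2, r3))} for any \texttt{r1}, \texttt{r2} and \texttt{r3}.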
|
417 |
|
With $\sflat{\_}$ and $\sflataux{\_}$ we make
precise what ``closed forms'' we have for the sequence derivatives and their simplifications;
in other words, how can we construct $(r_1 \cdot r_2) \backslash s$
and $\bderssimp{(r_1\cdot r_2)}{s}$,
if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)
and $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$ ranges over
the substrings of $s$?
|
First let's look at a series of derivative steps on a sequence
regular expression, assuming that at each step the first
component of the sequence is nullable:
|
428 \begin{center} |
|
429 |
|
430 $r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\ |
|
$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \quad \longrightarrow_{\backslash c'''} \quad
|
432 \ldots$ |
|
433 |
|
434 \end{center} |
|
435 %TODO: cite indian paper |
|
436 Indianpaper have come up with a slightly more formal way of putting the above process: |
|
437 \begin{center} |
|
$(r_1 \cdot r_2) \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv ((r_1 \backslash (c_1 :: c_2:: \ldots c_n)) \cdot r_2) +
\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots
+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$
|
441 \end{center} |
|
442 where $\delta(b, r)$ will produce $r$ |
|
443 if $b$ evaluates to true, |
|
444 and $\ZERO$ otherwise. |
|
445 |
|
446 But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical |
|
447 equivalence. To make this intuition useful |
|
448 for a formal proof, we need something |
|
449 stronger than language equivalence. |
|
450 With the help of $\sflat{\_}$ we can state the equation in Indianpaper |
|
451 more rigorously: |
|
452 \begin{lemma} |
|
453 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$ |
|
454 \end{lemma} |
|
455 |
|
456 The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string: |
|
457 |
|
458 \begin{center} |
|
459 \begin{tabular}{lcl} |
|
460 $\vsuf{[]}{\_} $ & $=$ & $[]$\\ |
|
461 $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ |
|
462 && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $ |
|
463 \end{tabular} |
|
464 \end{center} |
|
It takes into account which prefixes $s'$ of $s$ make $r_1 \backslash s'$ nullable,
and keeps a list of the corresponding suffixes $s''$ with $s' @ s'' = s$. The list is sorted in
the order in which $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.
In essence, $\vsuf{\_}{\_}$ is doing a ``virtual derivative'' of $r_1 \cdot r_2$, but instead of producing
the entire result of $(r_1 \cdot r_2) \backslash s$,
it only stores those $s''$ for which $r_2 \backslash s''$ is a term occurring in $(r_1\cdot r_2)\backslash s$.
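A Scala rendering of $\vsuf{\_}{\_}$, reusing \texttt{nullable} and \texttt{der} from the sketch given earlier in this chapter (again only an illustration of the definition above, not the Isabelle version):
\begin{verbatim}
// collects those suffixes c::cs of s whose preceding prefix leaves r1
// nullable; the accumulated derivative of r1 is threaded through
def vsuf(s: List[Char], r1: Rexp): List[List[Char]] = s match {
  case Nil => Nil
  case c :: cs =>
    if (nullable(r1)) vsuf(cs, der(c, r1)) :+ (c :: cs)
    else vsuf(cs, der(c, r1))
}
\end{verbatim}
For instance, \texttt{vsuf(List('a','b'), STAR(CHAR('a')))} returns \texttt{List(List('b'), List('a','b'))}, since both the empty prefix and the prefix \texttt{"a"} leave $a^*$ nullable.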
|
471 |
|
472 With this we can also add simplifications to both sides and get |
|
473 \begin{lemma} |
|
474 $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ |
|
475 \end{lemma} |
|
476 Together with the idempotency property of $\rsimp$, |
|
477 %TODO: state the idempotency property of rsimp |
|
478 \begin{lemma} |
|
479 $\rsimp(r) = \rsimp (\rsimp(r))$ |
|
480 \end{lemma} |
|
it is possible to convert the above lemma to obtain a ``closed form''
|
482 for derivatives nested with simplification: |
|
483 \begin{lemma} |
|
484 $\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ |
|
485 \end{lemma} |
|
And now the reason we have (1) in Section 1 is clear:
$\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$
is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r_1, s))))}}}$,
as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}(r_1, s)$.
|
490 |
|
491 %---------------------------------------------------------------------------------------- |
|
492 % SECTION 3 |
|
493 %---------------------------------------------------------------------------------------- |
|
494 |
|
\section{Interaction between $\distinctWith$ and $\flts$}
|
Note that it is not immediately obvious that
\begin{center}
$\llbracket \distinctWith \; (\flts \; \textit{rs}) \; \approx \; \phi \rrbracket \leq \llbracket \distinctWith \; \textit{rs} \; \approx \; \phi \rrbracket $.
\end{center}
The intuition is that every duplicate that $\distinctWith$ removes from $\textit{rs}$ on the right-hand side
is also removed on the left-hand side, where the flattening may expose even more duplicates.
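To fix intuitions, here is a Scala sketch of a $\distinctWith$-style function (a generic de-duplication with an explicit equivalence and accumulator; the name and signature are ours, not the Isabelle ones):
\begin{verbatim}
// keeps the first representative of every equivalence class (w.r.t. eq)
// that is not already covered by the accumulator acc
def distinctWith[A](xs: List[A], eq: (A, A) => Boolean,
                    acc: List[A]): List[A] = xs match {
  case Nil => Nil
  case x :: rest =>
    if (acc.exists(eq(x, _))) distinctWith(rest, eq, acc)
    else x :: distinctWith(rest, eq, x :: acc)
}
\end{verbatim}
With structural equality as \texttt{eq} and an empty accumulator this behaves like the \texttt{.distinct} used in the earlier sketch; applying it after \texttt{flts} can only lead to a smaller overall result, which is what the inequality above asserts.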
|
502 |
|
503 |
|
504 %---------------------------------------------------------------------------------------- |
|
505 % SECTION ALTS CLOSED FORM |
|
506 %---------------------------------------------------------------------------------------- |
|
507 \section{A Closed Form for \textit{ALTS}} |
|
508 Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. |
|
509 |
|
510 |
|
511 There are a few key steps, one of these steps is |
|
512 $rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {}))) |
|
513 = rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {}))$ |
|
514 |
|
One might want to prove this by a simple statement like:
$map \; (rder \; x) \; (rdistinct \; rs \; rset) = rdistinct \; (map \; (rder \; x) \; rs) \; ((rder \; x) \; ` \; rset)$.

For this to hold we would need the $\textit{distinct}$ function to pick up
the elements before and after taking derivatives consistently:
$r \in rset \equiv (rder \; x \; r) \in ((rder \; x) \; ` \; rset)$,
which essentially requires that the function $\backslash$ is an injective mapping.
|
522 |
|
523 Unfortunately the function $\backslash c$ is not an injective mapping. |
|
524 |
|
\subsection{Function $\backslash c$ is not injective (1-to-1)}
|
526 \begin{center} |
|
The derivative w.r.t.\ a character $c$ is not one-to-one.
Formally,
$\exists r_1 \;r_2. \; r_1 \neq r_2 \;\textit{and}\; r_1 \backslash c = r_2 \backslash c$
|
530 \end{center} |
|
531 This property is trivially true for the |
|
532 character regex example: |
|
533 \begin{center} |
|
534 $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ |
|
535 \end{center} |
|
But apart from the cases where the derivative
output is $\ZERO$, are there non-trivial results
of derivatives, i.e.\ ones that still match some strings?
The answer is yes.
For example,
|
541 \begin{center} |
|
Let $r_1 = a^*\cdot b$ \quad and \quad $r_2 = (a\cdot a^*)\cdot b + b$,\\
where $a$ is not nullable.\\
$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot b + b \backslash c$\\
$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot b + b \backslash c$
|
546 \end{center} |
|
We start with two syntactically different regexes
and end up with the same derivative result.
This is not surprising, as we have the following
equality in the style of Arden's lemma:\\
|
551 \begin{center} |
|
552 $L(A^*B) = L(A\cdot A^* \cdot B + B)$ |
|
553 \end{center} |
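Instantiating $a$, $b$ and $c$ with the character regular expressions \texttt{CHAR('a')}, \texttt{CHAR('b')} and the character \texttt{'a'}, the earlier Scala sketch can be used to confirm this concrete instance of non-injectivity:
\begin{verbatim}
val a = CHAR('a'); val b = CHAR('b')
val r1 = SEQ(STAR(a), b)                         // a* . b
val r2 = ALTS(List(SEQ(SEQ(a, STAR(a)), b), b))  // (a . a*) . b + b
// both (un-simplified) derivatives are structurally identical
assert(der('a', r1) == der('a', r2))
\end{verbatim}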
|
554 |
|
555 %---------------------------------------------------------------------------------------- |
|
556 % SECTION 4 |
|
557 %---------------------------------------------------------------------------------------- |
|
558 \section{A Bound for the Star Regular Expression} |
|
559 We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using |
|
560 the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then |
|
561 the property of the $\distinct$ function. |
|
562 Now we try to get a bound on $r^* \backslash s$ as well. |
|
563 Again, we first look at how a star's derivatives evolve, if they grow maximally: |
|
564 \begin{center} |
|
565 |
|
566 $r^* \quad \longrightarrow_{\backslash c} \quad (r\backslash c) \cdot r^* \quad \longrightarrow_{\backslash c'} \quad |
|
567 r \backslash cc' \cdot r^* + r \backslash c' \cdot r^* \quad \longrightarrow_{\backslash c''} \quad |
|
(r \backslash cc'c'' \cdot r^* + r \backslash c'' \cdot r^*) + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) \quad \longrightarrow_{\backslash c'''}
|
569 \quad \ldots$ |
|
570 |
|
571 \end{center} |
|
572 When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, |
|
573 $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable, |
|
574 the number of terms in $r^* \backslash s$ will grow exponentially, causing the size |
|
575 of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not |
|
576 count the possible size explosions of $r \backslash c$ themselves. |
|
577 |
|
Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like
$(r \backslash cc'c'' \cdot r^* + r \backslash c'' \cdot r^*) + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $
into $\RALTS{[r \backslash cc'c'' \cdot r^*,\; r \backslash c'' \cdot r^*,\; r \backslash c'c'' \cdot r^*,\; r \backslash c'' \cdot r^*]}$
and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
|
582 For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$: |
|
583 %TODO: definitions of and \hflataux \hflat |
|
584 \begin{center} |
|
585 \begin{tabular}{ccc} |
|
586 $\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\ |
|
587 $\hflataux r$ & $=$ & $ [r]$ |
|
588 \end{tabular} |
|
589 \end{center} |
|
590 |
|
591 \begin{center} |
|
592 \begin{tabular}{ccc} |
|
593 $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\ |
|
594 $\hflat r$ & $=$ & $ r$ |
|
595 \end{tabular} |
|
\end{center}
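In the Scala sketch from before (where a binary alternative $r_1 + r_2$ is represented as a two-element \texttt{ALTS}), $\hflataux{\_}$ and $\hflat{\_}$ can be rendered as follows; in contrast to \texttt{sflatAux}, both children of a binary alternative are opened up:
\begin{verbatim}
def hflatAux(r: Rexp): List[Rexp] = r match {
  case ALTS(List(r1, r2)) => hflatAux(r1) ::: hflatAux(r2)
  case _                  => List(r)
}

def hflat(r: Rexp): Rexp = r match {
  case ALTS(List(r1, r2)) => ALTS(hflatAux(r1) ::: hflatAux(r2))
  case _                  => r
}
\end{verbatim}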
|
597 Again these definitions are tailor-made for dealing with alternatives that have |
|
598 originated from a star's derivatives, so we don't attempt to open up all possible |
|
599 regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2 |
|
600 elements. |
|
We give a predicate for such ``star-created'' regular expressions:
|
602 \begin{center} |
|
603 \begin{tabular}{lcr} |
|
604 & & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\ |
|
605 $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$ |
|
606 \end{tabular} |
|
607 \end{center} |
|
608 |
|
These definitions allow us the flexibility to talk about
regular expressions in their most convenient format,
for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $
instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.
They also help express that certain classes of syntactically
distinct regular expressions are actually the same under simplification.
|
This is not entirely true for annotated regular expressions; a counterexample is
%TODO: bsimp bders \neq bderssimp
\begin{center}
$(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
\end{center}
For bit-codes, the order in which simplification is applied
might cause a difference in where they end up being placed.
If we want something like
\begin{center}
$\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
\end{center}
then some ``canonicalization'' procedure is required,
which either pushes all the common bitcodes to nodes
as high up in the tree as possible:
|
629 \begin{center} |
|
630 $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ |
|
631 \end{center} |
|
or does the reverse. However, bitcodes are not of interest if we are only talking about
the size $\llbracket r \rrbracket$ of a regex.
Therefore, for the ease and simplicity of producing a
proof for a size bound, we are happy to restrict ourselves to
unannotated regular expressions and obtain equalities such as
|
637 \begin{lemma} |
|
638 $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$ |
|
639 \end{lemma} |
|
640 |
|
641 \begin{proof} |
|
By using the rewriting relation $\rightsquigarrow$.
|
643 \end{proof} |
|
644 %TODO: rsimp sflat |
|
And from this we obtain a proof that a star's derivative will be the same
as if all the nested alternatives created during the derivation had been flattened out.
For example,
|
648 \begin{lemma} |
|
649 $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ |
|
650 \end{lemma} |
|
651 \begin{proof} |
|
By structural induction on $r$, where the induction rules are those of $\createdByStar{\_}$.
|
653 \end{proof} |
|
654 % The simplification of a flattened out regular expression, provided it comes |
|
655 %from the derivative of a star, is the same as the one nested. |
|
656 |
|
657 |
|
658 |
|
659 |
|
660 |
|
661 |
|
662 |
|
663 |
|
664 |
|
One might wonder what the actual bound is, rather than the loose bound we gave
for the convenience of an easier proof.
How much can the regex $r^* \backslash s$ grow?
As earlier graphs have shown,
%TODO: reference that graph where size grows quickly
they can grow at a speed that is at worst
exponential w.r.t.\ the number of characters,
but will eventually level off when the string $s$ is long enough.
If they grow to a size exponential w.r.t.\ the original regex, our algorithm
would still be slow.
|
And unfortunately, we have concrete examples
where such regexes grow exponentially large before levelling off:
$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
(\underbrace{a \ldots a}_{\text{n a's}})^*)^*$ will already have a maximum
size that is exponential in the number $n$
under our current simplification rules:
|
681 %TODO: graph of a regex whose size increases exponentially. |
|
682 \begin{center} |
|
683 \begin{tikzpicture} |
|
684 \begin{axis}[ |
|
685 height=0.5\textwidth, |
|
686 width=\textwidth, |
|
687 xlabel=number of a's, |
|
688 xtick={0,...,9}, |
|
689 ylabel=maximum size, |
|
690 ymode=log, |
|
691 log basis y={2} |
|
692 ] |
|
693 \addplot[mark=*,blue] table {re-chengsong.data}; |
|
694 \end{axis} |
|
695 \end{tikzpicture} |
|
696 \end{center} |
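The data behind such a graph can be generated, for instance, with the earlier Scala sketch (this is only an illustrative reconstruction, not the code actually used for the measurements above):
\begin{verbatim}
// (a* + (aa)* + ... + (a^n)*)*  and the maximum size its simplified
// derivatives reach on inputs consisting only of a's
def aPow(i: Int): Rexp =
  List.fill(i)(CHAR('a'): Rexp).reduceRight(SEQ(_, _))

def bigRegex(n: Int): Rexp =
  STAR(ALTS((1 to n).toList.map(i => STAR(aPow(i)))))

def maxDerSize(r: Rexp, steps: Int): Int =
  (1 to steps).scanLeft(r)((acc, _) => simp(der('a', acc))).map(size).max

for (n <- 1 to 6)
  println(s"n = $n: max size ${maxDerSize(bigRegex(n), 60)}")
\end{verbatim}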
|
697 |
|
For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
(\underbrace{a \ldots a}_{\text{n a's}})^*)^*$ in the discussion below.
The exponential size is triggered by the regex
$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
inside the $(\ldots) ^*$ having exponentially many
different derivatives, despite those differences being minor.
|
705 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$ |
|
706 will therefore contain the following terms (after flattening out all nested |
|
707 alternatives): |
|
708 \begin{center} |
|
$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
|
710 $(1 \leq m' \leq m )$ |
|
711 \end{center} |
|
These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (this will be explained in the appendix).
|
713 With each new input character taking the derivative against the intermediate result, more and more such distinct |
|
714 terms will accumulate, |
|
715 until the length reaches $L.C.M.(1, \ldots, n)$. |
|
$\textit{distinctBy}$ will not be able to de-duplicate any two such terms
$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
and\\
$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
with $m' \neq m''$,
as they are syntactically (if only slightly) different.
|
722 This means that with our current simplification methods, |
|
723 we will not be able to control the derivative so that |
|
724 $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$ |
|
725 as there are already exponentially many terms. |
|
These terms are similar in the sense that their heads
all consist of sub-terms of the form
$(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $.
|
729 For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most |
|
730 $n * (n + 1) / 2$ such terms. |
|
731 For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives |
|
732 can be described by 6 terms: |
|
733 $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, |
|
734 $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$. |
|
735 The total number of different "head terms", $n * (n + 1) / 2$, |
|
736 is proportional to the number of characters in the regex |
|
737 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$. |
|
This suggests a slightly different notion of size, which we call the
alphabetic width. The alphabetic width $\textit{awidth}(r)$ counts the number of
occurrences of characters in $r$:
\begin{center}
\begin{tabular}{lcl}
$\textit{awidth}(\ZERO)$ & $\dn$ & $0$\\
$\textit{awidth}(\ONE)$ & $\dn$ & $0$\\
$\textit{awidth}(c)$ & $\dn$ & $1$\\
$\textit{awidth}(r_1 + r_2)$ & $\dn$ & $\textit{awidth}(r_1) + \textit{awidth}(r_2)$\\
$\textit{awidth}(r_1 \cdot r_2)$ & $\dn$ & $\textit{awidth}(r_1) + \textit{awidth}(r_2)$\\
$\textit{awidth}(r^*)$ & $\dn$ & $\textit{awidth}(r)$
\end{tabular}
\end{center}
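On the Rexp sketch from earlier in the chapter this is simply (again only an illustration):
\begin{verbatim}
def awidth(r: Rexp): Int = r match {
  case ZERO | ONE  => 0
  case CHAR(_)     => 1
  case ALTS(rs)    => rs.map(awidth).sum
  case SEQ(r1, r2) => awidth(r1) + awidth(r2)
  case STAR(r1)    => awidth(r1)
}
\end{verbatim}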
|
742 |
|
743 |
|
Antimirov\parencite{Antimirov95} has proven that
$| \textit{PDER}_{UNIV}(r) | \leq \textit{awidth}(r)$,
where $\textit{PDER}_{UNIV}(r)$ is the set of all possible subterms
created by taking partial derivatives of $r$ against all possible strings.
If we can make sure that at any moment in our lexing algorithm our
intermediate result holds at most one copy of each of these
subterms, then we can get the same bound as Antimirov's.
This leads to the algorithm in the next section.
|
752 |
|
753 %---------------------------------------------------------------------------------------- |
|
754 % SECTION 5 |
|
755 %---------------------------------------------------------------------------------------- |
|
756 \section{A Stronger Version of Simplification Inspired by Antimirov} |
|
757 %TODO: search for isabelle proofs of algorithms that check equivalence |
|
758 |
|
759 |
|
760 |
|
761 |
|
762 |
|
763 |
993 |
764 |
994 |
765 |
995 |
766 |
996 |