|
1 % Chapter Template |
|
2 |
|
3 \chapter{Finiteness Bound} % Main chapter title |
|
4 |
|
5 \label{Finite} |
|
6 % In Chapter 4 \ref{Chapter4} we give the second guarantee |
|
7 %of our bitcoded algorithm, that is a finite bound on the size of any |
|
8 %regex's derivatives. |
|
9 |
|
10 In this chapter we give a guarantee in terms of time complexity: |
|
11 given a regular expression $r$, for any string $s$ |
|
12 our algorithm's internal data structure is finitely bounded. |
|
13 To obtain such a proof, we need to |
|
14 \begin{itemize} |
|
15 \item |
|
16 Define an new datatype for regular expressions that makes it easy |
|
17 to reason about the size of an annotated regular expression. |
|
18 \item |
|
19 A set of equalities for this new datatype that enables one to |
|
20 rewrite $\bderssimp{r_1 \cdot r_2}{s}$ and $\bderssimp{r^*}{s}$ etc. |
|
21 by their children regexes $r_1$, $r_2$, and $r$. |
|
22 \item |
|
23 Using those equalities to actually get those rewriting equations, which we call |
|
24 "closed forms". |
|
25 \item |
|
26 Bound the closed forms, thereby bounding the original |
|
27 $\blexersimp$'s internal data structures. |
|
28 \end{itemize} |
|
29 |
|
30 \section{the $\mathbf{r}$-rexp datatype and the size functions} |
|
31 |
|
32 We have a size function for bitcoded regular expressions, written |
|
33 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree |
|
34 |
|
35 \begin{center} |
|
36 \begin{tabular}{ccc} |
|
37 $\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\ |
|
38 \end{tabular} |
|
39 \end{center} |
|
40 (TODO: COMPLETE this defn and for $rs$) |
|
41 |
|
42 The size is based on a recursive function on the structure of the regex, |
|
43 not the bitcodes. |
|
44 Therefore we may as well talk about size of an annotated regular expression |
|
45 in their un-annotated form: |
|
46 \begin{center} |
|
47 $\asize(a) = \size(\erase(a))$. |
|
48 \end{center} |
|
49 (TODO: turn equals to roughly equals) |
|
50 |
|
51 But there is a minor nuisance here: |
|
52 the erase function actually messes with the structure of the regular expression: |
|
53 \begin{center} |
|
54 \begin{tabular}{ccc} |
|
55 $\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\ |
|
56 \end{tabular} |
|
57 \end{center} |
|
58 (TODO: complete this definition with singleton r in alts) |
|
59 |
|
60 An alternative regular expression with an empty list of children |
|
61 is turned into an $\ZERO$ during the |
|
62 $\erase$ function, thereby changing the size and structure of the regex. |
|
63 These will likely be fixable if we really want to use plain $\rexp$s for dealing |
|
64 with size, but we choose a more straightforward (or stupid) method by |
|
65 defining a new datatype that is similar to plain $\rexp$s but can take |
|
66 non-binary arguments for its alternative constructor, |
|
67 which we call $\rrexp$ to denote |
|
68 the difference between it and plain regular expressions. |
|
69 \[ \rrexp ::= \RZERO \mid \RONE |
|
70 \mid \RCHAR{c} |
|
71 \mid \RSEQ{r_1}{r_2} |
|
72 \mid \RALTS{rs} |
|
73 \mid \RSTAR{r} |
|
74 \] |
|
75 For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, |
|
76 but keep everything else intact. |
|
77 It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved |
|
78 (the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton |
|
79 $\ALTS$). |
|
80 We denote the operation of erasing the bits and turning an annotated regular expression |
|
81 into an $\rrexp{}$ as $\rerase{}$. |
|
82 \begin{center} |
|
83 \begin{tabular}{lcl} |
|
84 $\rerase{\AZERO}$ & $=$ & $\RZERO$\\ |
|
85 $\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\ |
|
86 $\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$ |
|
87 \end{tabular} |
|
88 \end{center} |
|
89 %TODO: FINISH definition of rerase |
|
90 Similarly we could define the derivative and simplification on |
|
91 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, |
|
92 except that now they can operate on alternatives taking multiple arguments. |
|
93 |
|
94 \begin{center} |
|
95 \begin{tabular}{lcr} |
|
96 $\RALTS{rs} \backslash c$ & $=$ & $\RALTS{map\; (\_ \backslash c) \;rs}$\\ |
|
97 (other clauses omitted) |
|
98 \end{tabular} |
|
99 \end{center} |
|
100 |
|
101 Now that $\rrexp$s do not have bitcodes on them, we can do the |
|
102 duplicate removal without an equivalence relation: |
|
103 \begin{center} |
|
104 \begin{tabular}{lcl} |
|
105 $\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\ |
|
106 & & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$ |
|
107 \end{tabular} |
|
108 \end{center} |
|
109 %TODO: definition of rsimp (maybe only the alternative clause) |
|
110 |
|
111 |
|
112 The reason why these definitions mirror precisely the corresponding operations |
|
113 on annotated regualar expressions is that we can calculate sizes more easily: |
|
114 |
|
115 \begin{lemma} |
|
116 $\rsize{\rerase a} = \asize a$ |
|
117 \end{lemma} |
|
118 This lemma says that the size of an r-erased regex is the same as the annotated regex. |
|
119 this does not hold for plain $\rexp$s due to their ways of how alternatives are represented. |
|
120 \begin{lemma} |
|
121 $\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$ |
|
122 \end{lemma} |
|
123 Putting these two together we get a property that allows us to estimate the |
|
124 size of an annotated regular expression derivative using its un-annotated counterpart: |
|
125 \begin{lemma} |
|
126 $\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$ |
|
127 \end{lemma} |
|
128 Unless stated otherwise in this chapter all $\textit{rexp}$s without |
|
129 bitcodes are seen as $\rrexp$s. |
|
130 We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably |
|
131 as the former suits people's intuitive way of stating a binary alternative |
|
132 regular expression. |
|
133 |
|
134 |
|
135 |
|
136 %----------------------------------- |
|
137 % SECTION ? |
|
138 %----------------------------------- |
|
139 \section{preparatory properties for getting a finiteness bound} |
|
140 Before we get to the proof that says the intermediate result of our lexer will |
|
141 remain finitely bounded, which is an important efficiency/liveness guarantee, |
|
142 we shall first develop a few preparatory properties and definitions to |
|
143 make the process of proving that a breeze. |
|
144 |
|
145 We define rewriting relations for $\rrexp$s, which allows us to do the |
|
146 same trick as we did for the correctness proof, |
|
147 but this time we will have stronger equalities established. |
|
148 \subsection{"hrewrite" relation} |
|
149 List of 1-step rewrite rules for regular expressions simplification without bitcodes: |
|
150 \begin{figure} |
|
151 \caption{the "h-rewrite" rules} |
|
152 \[ |
|
153 r_1 \cdot \ZERO \rightarrow_h \ZERO \] |
|
154 |
|
155 \[ |
|
156 \ZERO \cdot r_2 \rightarrow \ZERO |
|
157 \] |
|
158 \end{figure} |
|
159 And we define an "grewrite" relation that works on lists: |
|
160 \begin{center} |
|
161 \begin{tabular}{lcl} |
|
162 $ \ZERO :: rs$ & $\rightarrow_g$ & $rs$ |
|
163 \end{tabular} |
|
164 \end{center} |
|
165 |
|
166 |
|
167 |
|
168 With these relations we prove |
|
169 \begin{lemma} |
|
170 $rs \rightarrow rs' \implies \RALTS{rs} \rightarrow \RALTS{rs'}$ |
|
171 \end{lemma} |
|
172 which enables us to prove "closed forms" equalities such as |
|
173 \begin{lemma} |
|
174 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\suffix \; s \; r_1 ))}$ |
|
175 \end{lemma} |
|
176 |
|
177 But the most involved part of the above lemma is proving the following: |
|
178 \begin{lemma} |
|
179 $\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ |
|
180 \end{lemma} |
|
181 which is needed in proving things like |
|
182 \begin{lemma} |
|
183 $r \rightarrow_f r' \implies \rsimp{r} \rightarrow \rsimp{r'}$ |
|
184 \end{lemma} |
|
185 |
|
186 Fortunately this is provable by induction on the list $rs$ |
|
187 |
|
188 |
|
189 |
|
190 %----------------------------------- |
|
191 % SECTION 2 |
|
192 %----------------------------------- |
|
193 |
|
194 \begin{theorem} |
|
195 For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$ |
|
196 \end{theorem} |
|
197 |
|
198 \noindent |
|
199 \begin{proof} |
|
200 We prove this by induction on $r$. The base cases for $\AZERO$, |
|
201 $\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is |
|
202 for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction |
|
203 hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and |
|
204 $\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows |
|
205 % |
|
206 \begin{center} |
|
207 \begin{tabular}{lcll} |
|
208 & & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\ |
|
209 & $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} :: |
|
210 [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\ |
|
211 & $\leq$ & |
|
212 $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) :: |
|
213 [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\ |
|
214 & $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket + |
|
215 \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\ |
|
216 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + |
|
217 \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\ |
|
218 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5) |
|
219 \end{tabular} |
|
220 \end{center} |
|
221 |
|
222 |
|
223 \noindent |
|
224 where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$). |
|
225 The reason why we could write the derivatives of a sequence this way is described in section 2. |
|
226 The term (2) is used to control (1). |
|
227 That is because one can obtain an overall |
|
228 smaller regex list |
|
229 by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it. |
|
230 Section 3 is dedicated to its proof. |
|
231 In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is |
|
232 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller |
|
233 than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands |
|
234 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). |
|
235 We reason similarly for $\STAR$.\medskip |
|
236 \end{proof} |
|
237 |
|
238 What guarantee does this bound give us? |
|
239 |
|
240 Whatever the regex is, it will not grow indefinitely. |
|
241 Take our previous example $(a + aa)^*$ as an example: |
|
242 \begin{center} |
|
243 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
244 \begin{tikzpicture} |
|
245 \begin{axis}[ |
|
246 xlabel={number of $a$'s}, |
|
247 x label style={at={(1.05,-0.05)}}, |
|
248 ylabel={regex size}, |
|
249 enlargelimits=false, |
|
250 xtick={0,5,...,30}, |
|
251 xmax=33, |
|
252 ymax= 40, |
|
253 ytick={0,10,...,40}, |
|
254 scaled ticks=false, |
|
255 axis lines=left, |
|
256 width=5cm, |
|
257 height=4cm, |
|
258 legend entries={$(a + aa)^*$}, |
|
259 legend pos=north west, |
|
260 legend cell align=left] |
|
261 \addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data}; |
|
262 \end{axis} |
|
263 \end{tikzpicture} |
|
264 \end{tabular} |
|
265 \end{center} |
|
266 We are able to limit the size of the regex $(a + aa)^*$'s derivatives |
|
267 with our simplification |
|
268 rules very effectively. |
|
269 |
|
270 |
|
271 In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound |
|
272 is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$. |
|
273 Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$ |
|
274 inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function |
|
275 $f(x) = x * 2^x$. |
|
276 This means the bound we have will surge up at least |
|
277 tower-exponentially with a linear increase of the depth. |
|
278 For a regex of depth $n$, the bound |
|
279 would be approximately $4^n$. |
|
280 |
|
281 Test data in the graphs from randomly generated regular expressions |
|
282 shows that the giant bounds are far from being hit. |
|
283 %a few sample regular experessions' derivatives |
|
284 %size change |
|
285 %TODO: giving regex1_size_change.data showing a few regexes' size changes |
|
286 %w;r;t the input characters number, where the size is usually cubic in terms of original size |
|
287 %a*, aa*, aaa*, ..... |
|
288 %randomly generated regexes |
|
289 \begin{center} |
|
290 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
291 \begin{tikzpicture} |
|
292 \begin{axis}[ |
|
293 xlabel={number of $a$'s}, |
|
294 x label style={at={(1.05,-0.05)}}, |
|
295 ylabel={regex size}, |
|
296 enlargelimits=false, |
|
297 xtick={0,5,...,30}, |
|
298 xmax=33, |
|
299 ymax=1000, |
|
300 ytick={0,100,...,1000}, |
|
301 scaled ticks=false, |
|
302 axis lines=left, |
|
303 width=5cm, |
|
304 height=4cm, |
|
305 legend entries={regex1}, |
|
306 legend pos=north west, |
|
307 legend cell align=left] |
|
308 \addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data}; |
|
309 \end{axis} |
|
310 \end{tikzpicture} |
|
311 & |
|
312 \begin{tikzpicture} |
|
313 \begin{axis}[ |
|
314 xlabel={$n$}, |
|
315 x label style={at={(1.05,-0.05)}}, |
|
316 %ylabel={time in secs}, |
|
317 enlargelimits=false, |
|
318 xtick={0,5,...,30}, |
|
319 xmax=33, |
|
320 ymax=1000, |
|
321 ytick={0,100,...,1000}, |
|
322 scaled ticks=false, |
|
323 axis lines=left, |
|
324 width=5cm, |
|
325 height=4cm, |
|
326 legend entries={regex2}, |
|
327 legend pos=north west, |
|
328 legend cell align=left] |
|
329 \addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data}; |
|
330 \end{axis} |
|
331 \end{tikzpicture} |
|
332 & |
|
333 \begin{tikzpicture} |
|
334 \begin{axis}[ |
|
335 xlabel={$n$}, |
|
336 x label style={at={(1.05,-0.05)}}, |
|
337 %ylabel={time in secs}, |
|
338 enlargelimits=false, |
|
339 xtick={0,5,...,30}, |
|
340 xmax=33, |
|
341 ymax=1000, |
|
342 ytick={0,100,...,1000}, |
|
343 scaled ticks=false, |
|
344 axis lines=left, |
|
345 width=5cm, |
|
346 height=4cm, |
|
347 legend entries={regex3}, |
|
348 legend pos=north west, |
|
349 legend cell align=left] |
|
350 \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data}; |
|
351 \end{axis} |
|
352 \end{tikzpicture}\\ |
|
353 \multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.} |
|
354 \end{tabular} |
|
355 \end{center} |
|
356 |
|
357 |
|
358 |
|
359 |
|
360 |
|
361 \noindent |
|
362 Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the |
|
363 original size. |
|
364 This suggests a link towrads "partial derivatives" |
|
365 introduced by Antimirov \cite{Antimirov95}. |
|
366 |
|
367 \section{Antimirov's partial derivatives} |
|
368 The idea behind Antimirov's partial derivatives |
|
369 is to do derivatives in a similar way as suggested by Brzozowski, |
|
370 but maintain a set of regular expressions instead of a single one: |
|
371 |
|
372 %TODO: antimirov proposition 3.1, needs completion |
|
373 \begin{center} |
|
374 \begin{tabular}{ccc} |
|
375 $\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\ |
|
376 $\partial_x(\ONE)$ & $=$ & $\phi$ |
|
377 \end{tabular} |
|
378 \end{center} |
|
379 |
|
380 Rather than joining the calculated derivatives $\partial_x a$ and $\partial_x b$ together |
|
381 using the alternatives constructor, Antimirov cleverly chose to put them into |
|
382 a set instead. This breaks the terms in a derivative regular expression up, |
|
383 allowing us to understand what are the "atomic" components of it. |
|
384 For example, To compute what regular expression $x^*(xx + y)^*$'s |
|
385 derivative against $x$ is made of, one can do a partial derivative |
|
386 of it and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$ |
|
387 from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$. |
|
388 To get all the "atomic" components of a regular expression's possible derivatives, |
|
389 there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes |
|
390 whatever character is available at the head of the string inside the language of a |
|
391 regular expression, and gives back the character and the derivative regular expression |
|
392 as a pair (which he called "monomial"): |
|
393 \begin{center} |
|
394 \begin{tabular}{ccc} |
|
395 $\lf(\ONE)$ & $=$ & $\phi$\\ |
|
396 $\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\ |
|
397 $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\ |
|
398 $\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\ |
|
399 \end{tabular} |
|
400 \end{center} |
|
401 %TODO: completion |
|
402 |
|
403 There is a slight difference in the last three clauses compared |
|
404 with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular |
|
405 expression $r$ with every element inside $\textit{rset}$ to create a set of |
|
406 sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates |
|
407 on a set of monomials (which Antimirov called "linear form") and a regular |
|
408 expression, and returns a linear form: |
|
409 \begin{center} |
|
410 \begin{tabular}{ccc} |
|
411 $l \bigodot (\ZERO)$ & $=$ & $\phi$\\ |
|
412 $l \bigodot (\ONE)$ & $=$ & $l$\\ |
|
413 $\phi \bigodot t$ & $=$ & $\phi$\\ |
|
414 $\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\ |
|
415 $\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\ |
|
416 $\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\ |
|
417 $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\ |
|
418 $\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\ |
|
419 \end{tabular} |
|
420 \end{center} |
|
421 %TODO: completion |
|
422 |
|
423 Some degree of simplification is applied when doing $\bigodot$, for example, |
|
424 $l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$, |
|
425 and $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and |
|
426 $\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$, |
|
427 and so on. |
|
428 |
|
429 With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regex $r$ with |
|
430 an iterative procedure: |
|
431 \begin{center} |
|
432 \begin{tabular}{llll} |
|
433 $\textit{while}$ & $(\Delta_i \neq \phi)$ & & \\ |
|
434 & $\Delta_{i+1}$ & $ =$ & $\lf(\Delta_i) - \PD_i$ \\ |
|
435 & $\PD_{i+1}$ & $ =$ & $\Delta_{i+1} \cup \PD_i$ \\ |
|
436 $\partial_{UNIV}(r)$ & $=$ & $\PD$ & |
|
437 \end{tabular} |
|
438 \end{center} |
|
439 |
|
440 |
|
441 $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$, |
|
442 |
|
443 |
|
444 However, if we introduce them in our |
|
445 setting we would lose the POSIX property of our calculated values. |
|
446 A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$. |
|
447 If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer |
|
448 would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of |
|
449 what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information |
|
450 in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$ |
|
451 occurs, and apply them in the right order once we get |
|
452 a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value. |
|
453 This is unlike the simplification we had before, where the rewriting rules |
|
454 such as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value. |
|
455 We will discuss better |
|
456 bounds in the last section of this chapter.\\[-6.5mm] |
|
457 |
|
458 |
|
459 |
|
460 |
|
461 %---------------------------------------------------------------------------------------- |
|
462 % SECTION ?? |
|
463 %---------------------------------------------------------------------------------------- |
|
464 |
|
465 \section{"Closed Forms" of regular expressions' derivatives w.r.t strings} |
|
466 To embark on getting the "closed forms" of regexes, we first |
|
467 define a few auxiliary definitions to make expressing them smoothly. |
|
468 |
|
469 \begin{center} |
|
470 \begin{tabular}{ccc} |
|
471 $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ |
|
472 $\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\ |
|
473 $\sflataux r$ & $=$ & $ [r]$ |
|
474 \end{tabular} |
|
475 \end{center} |
|
476 The intuition behind $\sflataux{\_}$ is to break up nested regexes |
|
477 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape |
|
478 into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$. |
|
479 It will return the singleton list $[r]$ otherwise. |
|
480 |
|
481 $\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps |
|
482 the output type a regular expression, not a list: |
|
483 \begin{center} |
|
484 \begin{tabular}{ccc} |
|
485 $\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ |
|
486 $\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\ |
|
487 $\sflat r$ & $=$ & $ [r]$ |
|
488 \end{tabular} |
|
489 \end{center} |
|
490 $\sflataux{\_}$ and $\sflat{\_}$ is only recursive in terms of the |
|
491 first element of the list of children of |
|
492 an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for dealing with the regular expression |
|
493 $r_1 \cdot r_2 \backslash s$. |
|
494 |
|
495 With $\sflat{\_}$ and $\sflataux{\_}$ we make |
|
496 precise what "closed forms" we have for the sequence derivatives and their simplifications, |
|
497 in other words, how can we construct $(r_1 \cdot r_2) \backslash s$ |
|
498 and $\bderssimp{(r_1\cdot r_2)}{s}$, |
|
499 if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$) |
|
500 and $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$ ranges over |
|
501 the substring of $s$? |
|
502 First let's look at a series of derivatives steps on a sequence |
|
503 regular expression, (assuming) that each time the first |
|
504 component of the sequence is always nullable): |
|
505 \begin{center} |
|
506 |
|
507 $r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\ |
|
508 $((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad |
|
509 \ldots$ |
|
510 |
|
511 \end{center} |
|
512 %TODO: cite indian paper |
|
513 Indianpaper have come up with a slightly more formal way of putting the above process: |
|
514 \begin{center} |
|
515 $r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) + |
|
516 \delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots |
|
517 + \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$ |
|
518 \end{center} |
|
519 where $\delta(b, r)$ will produce $r$ |
|
520 if $b$ evaluates to true, |
|
521 and $\ZERO$ otherwise. |
|
522 |
|
523 But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical |
|
524 equivalence. To make this intuition useful |
|
525 for a formal proof, we need something |
|
526 stronger than language equivalence. |
|
527 With the help of $\sflat{\_}$ we can state the equation in Indianpaper |
|
528 more rigorously: |
|
529 \begin{lemma} |
|
530 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$ |
|
531 \end{lemma} |
|
532 |
|
533 The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string: |
|
534 |
|
535 \begin{center} |
|
536 \begin{tabular}{lcl} |
|
537 $\vsuf{[]}{\_} $ & $=$ & $[]$\\ |
|
538 $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ |
|
539 && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $ |
|
540 \end{tabular} |
|
541 \end{center} |
|
542 It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable, |
|
543 and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in |
|
544 the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$. |
|
545 In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing |
|
546 the entire result of $(r_1 \cdot r_2) \backslash s$, |
|
547 it only stores all the $s''$ such that $r_2 \backslash s''$ are occurring terms in $(r_1\cdot r_2)\backslash s$. |
|
548 |
|
549 With this we can also add simplifications to both sides and get |
|
550 \begin{lemma} |
|
551 $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ |
|
552 \end{lemma} |
|
553 Together with the idempotency property of $\rsimp$, |
|
554 %TODO: state the idempotency property of rsimp |
|
555 \begin{lemma} |
|
556 $\rsimp{r} = \rsimp{\rsimp{r}}$ |
|
557 \end{lemma} |
|
558 it is possible to convert the above lemma to obtain a "closed form" |
|
559 for derivatives nested with simplification: |
|
560 \begin{lemma} |
|
561 $\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ |
|
562 \end{lemma} |
|
563 And now the reason we have (1) in section 1 is clear: |
|
564 $\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$, |
|
565 is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$ |
|
566 as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$. |
|
567 |
|
568 %---------------------------------------------------------------------------------------- |
|
569 % SECTION 3 |
|
570 %---------------------------------------------------------------------------------------- |
|
571 |
|
572 \section{interaction between $\distinctWith$ and $\flts$} |
|
573 Note that it is not immediately obvious that |
|
574 \begin{center} |
|
575 $\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket $. |
|
576 \end{center} |
|
577 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of |
|
578 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. |
|
579 |
|
580 |
|
581 %----------------------------------- |
|
582 % SECTION syntactic equivalence under simp |
|
583 %----------------------------------- |
|
584 \section{Syntactic Equivalence Under $\simp$} |
|
585 We prove that minor differences can be annhilated |
|
586 by $\simp$. |
|
587 For example, |
|
588 \begin{center} |
|
589 $\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = |
|
590 \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$ |
|
591 \end{center} |
|
592 |
|
593 |
|
594 %---------------------------------------------------------------------------------------- |
|
595 % SECTION ALTS CLOSED FORM |
|
596 %---------------------------------------------------------------------------------------- |
|
597 \section{A Closed Form for \textit{ALTS}} |
|
598 Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. |
|
599 |
|
600 |
|
601 There are a few key steps, one of these steps is |
|
602 \[ |
|
603 rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {}))) |
|
604 = rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {})) |
|
605 \] |
|
606 |
|
607 |
|
608 One might want to prove this by something a simple statement like: |
|
609 $map (rder x) (rdistinct rs rset) = rdistinct (map (rder x) rs) (rder x) ` rset)$. |
|
610 |
|
611 For this to hold we want the $\textit{distinct}$ function to pick up |
|
612 the elements before and after derivatives correctly: |
|
613 $r \in rset \equiv (rder x r) \in (rder x rset)$. |
|
614 which essentially requires that the function $\backslash$ is an injective mapping. |
|
615 |
|
616 Unfortunately the function $\backslash c$ is not an injective mapping. |
|
617 |
|
618 \subsection{function $\backslash c$ is not injective (1-to-1)} |
|
619 \begin{center} |
|
620 The derivative $w.r.t$ character $c$ is not one-to-one. |
|
621 Formally, |
|
622 $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ |
|
623 \end{center} |
|
624 This property is trivially true for the |
|
625 character regex example: |
|
626 \begin{center} |
|
627 $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ |
|
628 \end{center} |
|
629 But apart from the cases where the derivative |
|
630 output is $\ZERO$, are there non-trivial results |
|
631 of derivatives which contain strings? |
|
632 The answer is yes. |
|
633 For example, |
|
634 \begin{center} |
|
635 Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ |
|
636 where $a$ is not nullable.\\ |
|
637 $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ |
|
638 $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ |
|
639 \end{center} |
|
640 We start with two syntactically different regexes, |
|
641 and end up with the same derivative result. |
|
642 This is not surprising as we have such |
|
643 equality as below in the style of Arden's lemma:\\ |
|
644 \begin{center} |
|
645 $L(A^*B) = L(A\cdot A^* \cdot B + B)$ |
|
646 \end{center} |
|
647 |
|
648 %---------------------------------------------------------------------------------------- |
|
649 % SECTION 4 |
|
650 %---------------------------------------------------------------------------------------- |
|
651 \section{A Bound for the Star Regular Expression} |
|
652 We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using |
|
653 the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then |
|
654 the property of the $\distinct$ function. |
|
655 Now we try to get a bound on $r^* \backslash s$ as well. |
|
656 Again, we first look at how a star's derivatives evolve, if they grow maximally: |
|
657 \begin{center} |
|
658 |
|
659 $r^* \quad \longrightarrow_{\backslash c} \quad (r\backslash c) \cdot r^* \quad \longrightarrow_{\backslash c'} \quad |
|
660 r \backslash cc' \cdot r^* + r \backslash c' \cdot r^* \quad \longrightarrow_{\backslash c''} \quad |
|
661 (r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) \quad \longrightarrow_{\backslash c'''} |
|
662 \quad \ldots$ |
|
663 |
|
664 \end{center} |
|
665 When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, |
|
666 $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable, |
|
667 the number of terms in $r^* \backslash s$ will grow exponentially, causing the size |
|
668 of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not |
|
669 count the possible size explosions of $r \backslash c$ themselves. |
|
670 |
|
671 Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like |
|
672 $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ |
|
673 into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$ |
|
674 and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$). |
|
675 For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$: |
|
676 %TODO: definitions of and \hflataux \hflat |
|
677 \begin{center} |
|
678 \begin{tabular}{ccc} |
|
679 $\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\ |
|
680 $\hflataux r$ & $=$ & $ [r]$ |
|
681 \end{tabular} |
|
682 \end{center} |
|
683 |
|
684 \begin{center} |
|
685 \begin{tabular}{ccc} |
|
686 $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\ |
|
687 $\hflat r$ & $=$ & $ r$ |
|
688 \end{tabular} |
|
689 \end{center}s |
|
690 Again these definitions are tailor-made for dealing with alternatives that have |
|
691 originated from a star's derivatives, so we don't attempt to open up all possible |
|
692 regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2 |
|
693 elements. |
|
694 We give a predicate for such "star-created" regular expressions: |
|
695 \begin{center} |
|
696 \begin{tabular}{lcr} |
|
697 & & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\ |
|
698 $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$ |
|
699 \end{tabular} |
|
700 \end{center} |
|
701 |
|
702 These definitions allows us the flexibility to talk about |
|
703 regular expressions in their most convenient format, |
|
704 for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $ |
|
705 instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$. |
|
706 These definitions help express that certain classes of syntatically |
|
707 distinct regular expressions are actually the same under simplification. |
|
708 This is not entirely true for annotated regular expressions: |
|
709 %TODO: bsimp bders \neq bderssimp |
|
710 \begin{center} |
|
711 $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$ |
|
712 \end{center} |
|
713 For bit-codes, the order in which simplification is applied |
|
714 might cause a difference in the location they are placed. |
|
715 If we want something like |
|
716 \begin{center} |
|
717 $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$ |
|
718 \end{center} |
|
719 Some "canonicalization" procedure is required, |
|
720 which either pushes all the common bitcodes to nodes |
|
721 as senior as possible: |
|
722 \begin{center} |
|
723 $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ |
|
724 \end{center} |
|
725 or does the reverse. However bitcodes are not of interest if we are talking about |
|
726 the $\llbracket r \rrbracket$ size of a regex. |
|
727 Therefore for the ease and simplicity of producing a |
|
728 proof for a size bound, we are happy to restrict ourselves to |
|
729 unannotated regular expressions, and obtain such equalities as |
|
730 \begin{lemma} |
|
731 $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$ |
|
732 \end{lemma} |
|
733 |
|
734 \begin{proof} |
|
735 By using the rewriting relation $\rightsquigarrow$ |
|
736 \end{proof} |
|
737 %TODO: rsimp sflat |
|
738 And from this we obtain a proof that a star's derivative will be the same |
|
739 as if it had all its nested alternatives created during deriving being flattened out: |
|
740 For example, |
|
741 \begin{lemma} |
|
742 $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ |
|
743 \end{lemma} |
|
744 \begin{proof} |
|
745 By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$. |
|
746 \end{proof} |
|
747 % The simplification of a flattened out regular expression, provided it comes |
|
748 %from the derivative of a star, is the same as the one nested. |
|
749 |
|
750 |
|
751 |
|
752 |
|
753 |
|
754 |
|
755 |
|
756 |
|
757 |
|
758 One might wonder the actual bound rather than the loose bound we gave |
|
759 for the convenience of an easier proof. |
|
760 How much can the regex $r^* \backslash s$ grow? |
|
761 As earlier graphs have shown, |
|
762 %TODO: reference that graph where size grows quickly |
|
763 they can grow at a maximum speed |
|
764 exponential $w.r.t$ the number of characters, |
|
765 but will eventually level off when the string $s$ is long enough. |
|
766 If they grow to a size exponential $w.r.t$ the original regex, our algorithm |
|
767 would still be slow. |
|
768 And unfortunately, we have concrete examples |
|
769 where such regexes grew exponentially large before levelling off: |
|
770 $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + |
|
771 (\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum |
|
772 size that is exponential on the number $n$ |
|
773 under our current simplification rules: |
|
774 %TODO: graph of a regex whose size increases exponentially. |
|
775 \begin{center} |
|
776 \begin{tikzpicture} |
|
777 \begin{axis}[ |
|
778 height=0.5\textwidth, |
|
779 width=\textwidth, |
|
780 xlabel=number of a's, |
|
781 xtick={0,...,9}, |
|
782 ylabel=maximum size, |
|
783 ymode=log, |
|
784 log basis y={2} |
|
785 ] |
|
786 \addplot[mark=*,blue] table {re-chengsong.data}; |
|
787 \end{axis} |
|
788 \end{tikzpicture} |
|
789 \end{center} |
|
790 |
|
791 For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$ |
|
792 to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + |
|
793 (\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion. |
|
794 The exponential size is triggered by that the regex |
|
795 $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$ |
|
796 inside the $(\ldots) ^*$ having exponentially many |
|
797 different derivatives, despite those difference being minor. |
|
798 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$ |
|
799 will therefore contain the following terms (after flattening out all nested |
|
800 alternatives): |
|
801 \begin{center} |
|
802 $(\oplus_{i = 1]{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\ |
|
803 $(1 \leq m' \leq m )$ |
|
804 \end{center} |
|
805 These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix). |
|
806 With each new input character taking the derivative against the intermediate result, more and more such distinct |
|
807 terms will accumulate, |
|
808 until the length reaches $L.C.M.(1, \ldots, n)$. |
|
809 $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms |
|
810 $(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ |
|
811 |
|
812 $(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ |
|
813 where $m' \neq m''$ \\ |
|
814 as they are slightly different. |
|
815 This means that with our current simplification methods, |
|
816 we will not be able to control the derivative so that |
|
817 $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$ |
|
818 as there are already exponentially many terms. |
|
819 These terms are similar in the sense that the head of those terms |
|
820 are all consisted of sub-terms of the form: |
|
821 $(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $. |
|
822 For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most |
|
823 $n * (n + 1) / 2$ such terms. |
|
824 For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives |
|
825 can be described by 6 terms: |
|
826 $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, |
|
827 $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$. |
|
828 The total number of different "head terms", $n * (n + 1) / 2$, |
|
829 is proportional to the number of characters in the regex |
|
830 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$. |
|
831 This suggests a slightly different notion of size, which we call the |
|
832 alphabetic width: |
|
833 %TODO: |
|
834 (TODO: Alphabetic width def.) |
|
835 |
|
836 |
|
837 Antimirov\parencite{Antimirov95} has proven that |
|
838 $\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$. |
|
839 where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms |
|
840 created by doing derivatives of $r$ against all possible strings. |
|
841 If we can make sure that at any moment in our lexing algorithm our |
|
842 intermediate result hold at most one copy of each of the |
|
843 subterms then we can get the same bound as Antimirov's. |
|
844 This leads to the algorithm in the next chapter. |
|
845 |
|
846 |
|
847 |
|
848 |
|
849 |
|
850 %---------------------------------------------------------------------------------------- |
|
851 % SECTION 1 |
|
852 %---------------------------------------------------------------------------------------- |
|
853 |
|
854 \section{Idempotency of $\simp$} |
|
855 |
|
856 \begin{equation} |
|
857 \simp \;r = \simp\; \simp \; r |
|
858 \end{equation} |
|
859 This property means we do not have to repeatedly |
|
860 apply simplification in each step, which justifies |
|
861 our definition of $\blexersimp$. |
|
862 It will also be useful in future proofs where properties such as |
|
863 closed forms are needed. |
|
864 The proof is by structural induction on $r$. |
|
865 |
|
866 %----------------------------------- |
|
867 % SUBSECTION 1 |
|
868 %----------------------------------- |
|
869 \subsection{Syntactic Equivalence Under $\simp$} |
|
870 We prove that minor differences can be annhilated |
|
871 by $\simp$. |
|
872 For example, |
|
873 \begin{center} |
|
874 $\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = |
|
875 \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$ |
|
876 \end{center} |
|
877 |