|
1 % Chapter Template |
|
2 |
|
3 \chapter{Finiteness Bound} % Main chapter title |
|
4 |
|
5 \label{ChapterFinite} |
|
6 % In Chapter 4 \ref{Chapter4} we give the second guarantee |
|
7 %of our bitcoded algorithm, that is a finite bound on the size of any |
|
8 %regex's derivatives. |
|
9 |
|
10 |
|
11 %----------------------------------- |
|
12 % SECTION ? |
|
13 %----------------------------------- |
|
14 \section{preparatory properties for getting a finiteness bound} |
|
15 Before we get to the proof that says the intermediate result of our lexer will |
|
16 remain finitely bounded, which is an important efficiency/liveness guarantee, |
|
17 we shall first develop a few preparatory properties and definitions to |
|
18 make the process of proving that a breeze. |
|
19 |
|
20 We define rewriting relations for $\rrexp$s, which allows us to do the |
|
21 same trick as we did for the correctness proof, |
|
22 but this time we will have stronger equalities established. |
|
23 \subsection{"hrewrite" relation} |
|
24 List of 1-step rewrite rules for regular expressions simplification without bitcodes: |
|
25 \begin{center} |
|
26 \begin{tabular}{lcl} |
|
27 $r_1 \cdot \ZERO$ & $\rightarrow$ & $\ZERO$ |
|
28 \end{tabular} |
|
29 \end{center} |
|
30 |
|
31 And we define an "grewrite" relation that works on lists: |
|
32 \begin{center} |
|
33 \begin{tabular}{lcl} |
|
34 $ \ZERO :: rs$ & $\rightarrow_g$ & $rs$ |
|
35 \end{tabular} |
|
36 \end{center} |
|
37 |
|
38 |
|
39 |
|
40 With these relations we prove |
|
41 \begin{lemma} |
|
42 $rs \rightarrow rs' \implies \RALTS{rs} \rightarrow \RALTS{rs'}$ |
|
43 \end{lemma} |
|
44 which enables us to prove "closed forms" equalities such as |
|
45 \begin{lemma} |
|
46 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$ |
|
47 \end{lemma} |
|
48 |
|
49 But the most involved part of the above lemma is proving the following: |
|
50 \begin{lemma} |
|
51 $\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ |
|
52 \end{lemma} |
|
53 which is needed in proving things like |
|
54 \begin{lemma} |
|
55 $r \rightarrow_f r' \implies \rsimp{r} \rightarrow \rsimp{r'}$ |
|
56 \end{lemma} |
|
57 |
|
58 Fortunately this is provable by induction on the list $rs$ |
|
59 |
|
60 |
|
61 |
|
62 %----------------------------------- |
|
63 % SECTION 2 |
|
64 %----------------------------------- |
|
65 |
|
66 \section{Finiteness Property} |
|
67 In this section let us describe our argument for why the size of the simplified |
|
68 derivatives with the aggressive simplification function is finitely bounded. |
|
69 Suppose |
|
70 we have a size function for bitcoded regular expressions, written |
|
71 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree |
|
72 |
|
73 \begin{center} |
|
74 \begin{tabular}{ccc} |
|
75 $\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\ |
|
76 \end{tabular} |
|
77 \end{center} |
|
78 (TODO: COMPLETE this defn and for $rs$) |
|
79 |
|
80 The size is based on a recursive function on the structure of the regex, |
|
81 not the bitcodes. |
|
82 Therefore we may as well talk about size of an annotated regular expression |
|
83 in their un-annotated form: |
|
84 \begin{center} |
|
85 $\asize(a) = \size(\erase(a))$. |
|
86 \end{center} |
|
87 (TODO: turn equals to roughly equals) |
|
88 |
|
89 But there is a minor nuisance here: |
|
90 the erase function actually messes with the structure of the regular expression: |
|
91 \begin{center} |
|
92 \begin{tabular}{ccc} |
|
93 $\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\ |
|
94 \end{tabular} |
|
95 \end{center} |
|
96 (TODO: complete this definition with singleton r in alts) |
|
97 |
|
98 An alternative regular expression with an empty list of children |
|
99 is turned into an $\ZERO$ during the |
|
100 $\erase$ function, thereby changing the size and structure of the regex. |
|
101 These will likely be fixable if we really want to use plain $\rexp$s for dealing |
|
102 with size, but we choose a more straightforward (or stupid) method by |
|
103 defining a new datatype that is similar to plain $\rexp$s but can take |
|
104 non-binary arguments for its alternative constructor, |
|
105 which we call $\rrexp$ to denote |
|
106 the difference between it and plain regular expressions. |
|
107 \[ \rrexp ::= \RZERO \mid \RONE |
|
108 \mid \RCHAR{c} |
|
109 \mid \RSEQ{r_1}{r_2} |
|
110 \mid \RALTS{rs} |
|
111 \mid \RSTAR{r} |
|
112 \] |
|
113 For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, |
|
114 but keep everything else intact. |
|
115 It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved |
|
116 (the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton |
|
117 $\ALTS$). |
|
118 We denote the operation of erasing the bits and turning an annotated regular expression |
|
119 into an $\rrexp{}$ as $\rerase{}$. |
|
120 \begin{center} |
|
121 \begin{tabular}{lcl} |
|
122 $\rerase{\AZERO}$ & $=$ & $\RZERO$\\ |
|
123 $\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\ |
|
124 $\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$ |
|
125 \end{tabular} |
|
126 \end{center} |
|
127 %TODO: FINISH definition of rerase |
|
128 Similarly we could define the derivative and simplification on |
|
129 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, |
|
130 except that now they can operate on alternatives taking multiple arguments. |
|
131 |
|
132 \begin{center} |
|
133 \begin{tabular}{lcr} |
|
134 $\RALTS{rs} \backslash c$ & $=$ & $\RALTS{map\; (\_ \backslash c) \;rs}$\\ |
|
135 (other clauses omitted) |
|
136 \end{tabular} |
|
137 \end{center} |
|
138 |
|
139 Now that $\rrexp$s do not have bitcodes on them, we can do the |
|
140 duplicate removal without an equivalence relation: |
|
141 \begin{center} |
|
142 \begin{tabular}{lcl} |
|
143 $\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\ |
|
144 & & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$ |
|
145 \end{tabular} |
|
146 \end{center} |
|
147 %TODO: definition of rsimp (maybe only the alternative clause) |
|
148 |
|
149 |
|
150 The reason why these definitions mirror precisely the corresponding operations |
|
151 on annotated regualar expressions is that we can calculate sizes more easily: |
|
152 |
|
153 \begin{lemma} |
|
154 $\rsize{\rerase a} = \asize a$ |
|
155 \end{lemma} |
|
156 This lemma says that the size of an r-erased regex is the same as the annotated regex. |
|
157 this does not hold for plain $\rexp$s due to their ways of how alternatives are represented. |
|
158 \begin{lemma} |
|
159 $\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$ |
|
160 \end{lemma} |
|
161 Putting these two together we get a property that allows us to estimate the |
|
162 size of an annotated regular expression derivative using its un-annotated counterpart: |
|
163 \begin{lemma} |
|
164 $\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$ |
|
165 \end{lemma} |
|
166 Unless stated otherwise in this chapter all $\textit{rexp}$s without |
|
167 bitcodes are seen as $\rrexp$s. |
|
168 We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably |
|
169 as the former suits people's intuitive way of stating a binary alternative |
|
170 regular expression. |
|
171 |
|
172 |
|
173 \begin{theorem} |
|
174 For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$ |
|
175 \end{theorem} |
|
176 |
|
177 \noindent |
|
178 \begin{proof} |
|
179 We prove this by induction on $r$. The base cases for $\AZERO$, |
|
180 $\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is |
|
181 for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction |
|
182 hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and |
|
183 $\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows |
|
184 % |
|
185 \begin{center} |
|
186 \begin{tabular}{lcll} |
|
187 & & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\ |
|
188 & $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} :: |
|
189 [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\ |
|
190 & $\leq$ & |
|
191 $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) :: |
|
192 [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\ |
|
193 & $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket + |
|
194 \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\ |
|
195 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + |
|
196 \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\ |
|
197 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5) |
|
198 \end{tabular} |
|
199 \end{center} |
|
200 |
|
201 |
|
202 \noindent |
|
203 where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$). |
|
204 The reason why we could write the derivatives of a sequence this way is described in section 2. |
|
205 The term (2) is used to control (1). |
|
206 That is because one can obtain an overall |
|
207 smaller regex list |
|
208 by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it. |
|
209 Section 3 is dedicated to its proof. |
|
210 In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is |
|
211 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller |
|
212 than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands |
|
213 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). |
|
214 We reason similarly for $\STAR$.\medskip |
|
215 \end{proof} |
|
216 |
|
217 What guarantee does this bound give us? |
|
218 |
|
219 Whatever the regex is, it will not grow indefinitely. |
|
220 Take our previous example $(a + aa)^*$ as an example: |
|
221 \begin{center} |
|
222 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
223 \begin{tikzpicture} |
|
224 \begin{axis}[ |
|
225 xlabel={number of $a$'s}, |
|
226 x label style={at={(1.05,-0.05)}}, |
|
227 ylabel={regex size}, |
|
228 enlargelimits=false, |
|
229 xtick={0,5,...,30}, |
|
230 xmax=33, |
|
231 ymax= 40, |
|
232 ytick={0,10,...,40}, |
|
233 scaled ticks=false, |
|
234 axis lines=left, |
|
235 width=5cm, |
|
236 height=4cm, |
|
237 legend entries={$(a + aa)^*$}, |
|
238 legend pos=north west, |
|
239 legend cell align=left] |
|
240 \addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data}; |
|
241 \end{axis} |
|
242 \end{tikzpicture} |
|
243 \end{tabular} |
|
244 \end{center} |
|
245 We are able to limit the size of the regex $(a + aa)^*$'s derivatives |
|
246 with our simplification |
|
247 rules very effectively. |
|
248 |
|
249 |
|
250 As the graphs of some more randomly generated regexes show, the size of |
|
251 the derivative might grow quickly at the start of the input, |
|
252 but after a sufficiently long input string, the trend will stop. |
|
253 |
|
254 |
|
255 %a few sample regular experessions' derivatives |
|
256 %size change |
|
257 %TODO: giving graphs showing a few regexes' size changes |
|
258 %w;r;t the input characters number |
|
259 %a*, aa*, aaa*, ..... |
|
260 %randomly generated regexes |
|
261 \begin{center} |
|
262 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
263 \begin{tikzpicture} |
|
264 \begin{axis}[ |
|
265 xlabel={number of $a$'s}, |
|
266 x label style={at={(1.05,-0.05)}}, |
|
267 ylabel={regex size}, |
|
268 enlargelimits=false, |
|
269 xtick={0,5,...,30}, |
|
270 xmax=33, |
|
271 ymax=1000, |
|
272 ytick={0,100,...,1000}, |
|
273 scaled ticks=false, |
|
274 axis lines=left, |
|
275 width=5cm, |
|
276 height=4cm, |
|
277 legend entries={regex1}, |
|
278 legend pos=north west, |
|
279 legend cell align=left] |
|
280 \addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data}; |
|
281 \end{axis} |
|
282 \end{tikzpicture} |
|
283 & |
|
284 \begin{tikzpicture} |
|
285 \begin{axis}[ |
|
286 xlabel={$n$}, |
|
287 x label style={at={(1.05,-0.05)}}, |
|
288 %ylabel={time in secs}, |
|
289 enlargelimits=false, |
|
290 xtick={0,5,...,30}, |
|
291 xmax=33, |
|
292 ymax=1000, |
|
293 ytick={0,100,...,1000}, |
|
294 scaled ticks=false, |
|
295 axis lines=left, |
|
296 width=5cm, |
|
297 height=4cm, |
|
298 legend entries={regex2}, |
|
299 legend pos=north west, |
|
300 legend cell align=left] |
|
301 \addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data}; |
|
302 \end{axis} |
|
303 \end{tikzpicture} |
|
304 & |
|
305 \begin{tikzpicture} |
|
306 \begin{axis}[ |
|
307 xlabel={$n$}, |
|
308 x label style={at={(1.05,-0.05)}}, |
|
309 %ylabel={time in secs}, |
|
310 enlargelimits=false, |
|
311 xtick={0,5,...,30}, |
|
312 xmax=33, |
|
313 ymax=1000, |
|
314 ytick={0,100,...,1000}, |
|
315 scaled ticks=false, |
|
316 axis lines=left, |
|
317 width=5cm, |
|
318 height=4cm, |
|
319 legend entries={regex3}, |
|
320 legend pos=north west, |
|
321 legend cell align=left] |
|
322 \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data}; |
|
323 \end{axis} |
|
324 \end{tikzpicture}\\ |
|
325 \multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.} |
|
326 \end{tabular} |
|
327 \end{center} |
|
328 |
|
329 |
|
330 |
|
331 |
|
332 |
|
333 \noindent |
|
334 Clearly we give in this finiteness argument (Step (5)) a very loose bound that is |
|
335 far from the actual bound we can expect. |
|
336 In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound |
|
337 is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$. |
|
338 Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$ |
|
339 inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function |
|
340 $f(x) = x * 2^x$. |
|
341 This means the bound we have will surge up at least |
|
342 tower-exponentially with a linear increase of the depth. |
|
343 For a regex of depth $n$, the bound |
|
344 would be approximately $4^n$. |
|
345 %TODO: change this to tower exponential! |
|
346 |
|
347 We can do better than this, but this does not improve |
|
348 the finiteness property we are proving. If we are interested in a polynomial bound, |
|
349 one would hope to obtain a similar tight bound as for partial |
|
350 derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with |
|
351 $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in |
|
352 partial derivatives). |
|
353 To obtain the exact same bound would mean |
|
354 we need to introduce simplifications, such as |
|
355 $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$, |
|
356 which exist for partial derivatives. |
|
357 |
|
358 However, if we introduce them in our |
|
359 setting we would lose the POSIX property of our calculated values. |
|
360 A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$. |
|
361 If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer |
|
362 would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of |
|
363 what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information |
|
364 in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$ |
|
365 occurs, and apply them in the right order once we get |
|
366 a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value. |
|
367 This is unlike the simplification we had before, where the rewriting rules |
|
368 such as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value. |
|
369 We will discuss better |
|
370 bounds in the last section of this chapter.\\[-6.5mm] |
|
371 |
|
372 |
|
373 |
|
374 |
|
375 %---------------------------------------------------------------------------------------- |
|
376 % SECTION ?? |
|
377 %---------------------------------------------------------------------------------------- |
|
378 |
|
379 \section{"Closed Forms" of regular expressions' derivatives w.r.t strings} |
|
380 To embark on getting the "closed forms" of regexes, we first |
|
381 define a few auxiliary definitions to make expressing them smoothly. |
|
382 |
|
383 \begin{center} |
|
384 \begin{tabular}{ccc} |
|
385 $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ |
|
386 $\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\ |
|
387 $\sflataux r$ & $=$ & $ [r]$ |
|
388 \end{tabular} |
|
389 \end{center} |
|
390 The intuition behind $\sflataux{\_}$ is to break up nested regexes |
|
391 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape |
|
392 into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$. |
|
393 It will return the singleton list $[r]$ otherwise. |
|
394 |
|
395 $\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps |
|
396 the output type a regular expression, not a list: |
|
397 \begin{center} |
|
398 \begin{tabular}{ccc} |
|
399 $\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ |
|
400 $\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\ |
|
401 $\sflat r$ & $=$ & $ [r]$ |
|
402 \end{tabular} |
|
403 \end{center} |
|
404 $\sflataux{\_}$ and $\sflat{\_}$ is only recursive in terms of the |
|
405 first element of the list of children of |
|
406 an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for dealing with the regular expression |
|
407 $r_1 \cdot r_2 \backslash s$. |
|
408 |
|
409 With $\sflat{\_}$ and $\sflataux{\_}$ we make |
|
410 precise what "closed forms" we have for the sequence derivatives and their simplifications, |
|
411 in other words, how can we construct $(r_1 \cdot r_2) \backslash s$ |
|
412 and $\bderssimp{(r_1\cdot r_2)}{s}$, |
|
413 if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$) |
|
414 and $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$ ranges over |
|
415 the substring of $s$? |
|
416 First let's look at a series of derivatives steps on a sequence |
|
417 regular expression, (assuming) that each time the first |
|
418 component of the sequence is always nullable): |
|
419 \begin{center} |
|
420 |
|
421 $r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\ |
|
422 $((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad |
|
423 \ldots$ |
|
424 |
|
425 \end{center} |
|
426 %TODO: cite indian paper |
|
427 Indianpaper have come up with a slightly more formal way of putting the above process: |
|
428 \begin{center} |
|
429 $r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) + |
|
430 \delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots |
|
431 + \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$ |
|
432 \end{center} |
|
433 where $\delta(b, r)$ will produce $r$ |
|
434 if $b$ evaluates to true, |
|
435 and $\ZERO$ otherwise. |
|
436 |
|
437 But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical |
|
438 equivalence. To make this intuition useful |
|
439 for a formal proof, we need something |
|
440 stronger than language equivalence. |
|
441 With the help of $\sflat{\_}$ we can state the equation in Indianpaper |
|
442 more rigorously: |
|
443 \begin{lemma} |
|
444 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$ |
|
445 \end{lemma} |
|
446 |
|
447 The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string: |
|
448 |
|
449 \begin{center} |
|
450 \begin{tabular}{lcl} |
|
451 $\vsuf{[]}{\_} $ & $=$ & $[]$\\ |
|
452 $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ |
|
453 && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $ |
|
454 \end{tabular} |
|
455 \end{center} |
|
456 It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable, |
|
457 and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in |
|
458 the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$. |
|
459 In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing |
|
460 the entire result of $(r_1 \cdot r_2) \backslash s$, |
|
461 it only stores all the $s''$ such that $r_2 \backslash s''$ are occurring terms in $(r_1\cdot r_2)\backslash s$. |
|
462 |
|
463 With this we can also add simplifications to both sides and get |
|
464 \begin{lemma} |
|
465 $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ |
|
466 \end{lemma} |
|
467 Together with the idempotency property of $\rsimp$, |
|
468 %TODO: state the idempotency property of rsimp |
|
469 \begin{lemma} |
|
470 $\rsimp{r} = \rsimp{\rsimp{r}}$ |
|
471 \end{lemma} |
|
472 it is possible to convert the above lemma to obtain a "closed form" |
|
473 for derivatives nested with simplification: |
|
474 \begin{lemma} |
|
475 $\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ |
|
476 \end{lemma} |
|
477 And now the reason we have (1) in section 1 is clear: |
|
478 $\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$, |
|
479 is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$ |
|
480 as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$. |
|
481 |
|
482 %---------------------------------------------------------------------------------------- |
|
483 % SECTION 3 |
|
484 %---------------------------------------------------------------------------------------- |
|
485 |
|
486 \section{interaction between $\distinctWith$ and $\flts$} |
|
487 Note that it is not immediately obvious that |
|
488 \begin{center} |
|
489 $\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket $. |
|
490 \end{center} |
|
491 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of |
|
492 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. |
|
493 |
|
494 |
|
495 %----------------------------------- |
|
496 % SECTION syntactic equivalence under simp |
|
497 %----------------------------------- |
|
498 \section{Syntactic Equivalence Under $\simp$} |
|
499 We prove that minor differences can be annhilated |
|
500 by $\simp$. |
|
501 For example, |
|
502 \begin{center} |
|
503 $\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = |
|
504 \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$ |
|
505 \end{center} |
|
506 |
|
507 |
|
508 %---------------------------------------------------------------------------------------- |
|
509 % SECTION ALTS CLOSED FORM |
|
510 %---------------------------------------------------------------------------------------- |
|
511 \section{A Closed Form for \textit{ALTS}} |
|
512 Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. |
|
513 |
|
514 |
|
515 There are a few key steps, one of these steps is |
|
516 \[ |
|
517 rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {}))) |
|
518 = rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {})) |
|
519 \] |
|
520 |
|
521 |
|
522 One might want to prove this by something a simple statement like: |
|
523 $map (rder x) (rdistinct rs rset) = rdistinct (map (rder x) rs) (rder x) ` rset)$. |
|
524 |
|
525 For this to hold we want the $\textit{distinct}$ function to pick up |
|
526 the elements before and after derivatives correctly: |
|
527 $r \in rset \equiv (rder x r) \in (rder x rset)$. |
|
528 which essentially requires that the function $\backslash$ is an injective mapping. |
|
529 |
|
530 Unfortunately the function $\backslash c$ is not an injective mapping. |
|
531 |
|
532 \subsection{function $\backslash c$ is not injective (1-to-1)} |
|
533 \begin{center} |
|
534 The derivative $w.r.t$ character $c$ is not one-to-one. |
|
535 Formally, |
|
536 $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ |
|
537 \end{center} |
|
538 This property is trivially true for the |
|
539 character regex example: |
|
540 \begin{center} |
|
541 $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ |
|
542 \end{center} |
|
543 But apart from the cases where the derivative |
|
544 output is $\ZERO$, are there non-trivial results |
|
545 of derivatives which contain strings? |
|
546 The answer is yes. |
|
547 For example, |
|
548 \begin{center} |
|
549 Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ |
|
550 where $a$ is not nullable.\\ |
|
551 $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ |
|
552 $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ |
|
553 \end{center} |
|
554 We start with two syntactically different regexes, |
|
555 and end up with the same derivative result. |
|
556 This is not surprising as we have such |
|
557 equality as below in the style of Arden's lemma:\\ |
|
558 \begin{center} |
|
559 $L(A^*B) = L(A\cdot A^* \cdot B + B)$ |
|
560 \end{center} |
|
561 |
|
562 %---------------------------------------------------------------------------------------- |
|
563 % SECTION 4 |
|
564 %---------------------------------------------------------------------------------------- |
|
565 \section{A Bound for the Star Regular Expression} |
|
566 We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using |
|
567 the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then |
|
568 the property of the $\distinct$ function. |
|
569 Now we try to get a bound on $r^* \backslash s$ as well. |
|
570 Again, we first look at how a star's derivatives evolve, if they grow maximally: |
|
571 \begin{center} |
|
572 |
|
573 $r^* \quad \longrightarrow_{\backslash c} \quad (r\backslash c) \cdot r^* \quad \longrightarrow_{\backslash c'} \quad |
|
574 r \backslash cc' \cdot r^* + r \backslash c' \cdot r^* \quad \longrightarrow_{\backslash c''} \quad |
|
575 (r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) \quad \longrightarrow_{\backslash c'''} |
|
576 \quad \ldots$ |
|
577 |
|
578 \end{center} |
|
579 When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, |
|
580 $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable, |
|
581 the number of terms in $r^* \backslash s$ will grow exponentially, causing the size |
|
582 of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not |
|
583 count the possible size explosions of $r \backslash c$ themselves. |
|
584 |
|
585 Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like |
|
586 $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ |
|
587 into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$ |
|
588 and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$). |
|
589 For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$: |
|
590 %TODO: definitions of and \hflataux \hflat |
|
591 \begin{center} |
|
592 \begin{tabular}{ccc} |
|
593 $\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\ |
|
594 $\hflataux r$ & $=$ & $ [r]$ |
|
595 \end{tabular} |
|
596 \end{center} |
|
597 |
|
598 \begin{center} |
|
599 \begin{tabular}{ccc} |
|
600 $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\ |
|
601 $\hflat r$ & $=$ & $ r$ |
|
602 \end{tabular} |
|
603 \end{center}s |
|
604 Again these definitions are tailor-made for dealing with alternatives that have |
|
605 originated from a star's derivatives, so we don't attempt to open up all possible |
|
606 regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2 |
|
607 elements. |
|
608 We give a predicate for such "star-created" regular expressions: |
|
609 \begin{center} |
|
610 \begin{tabular}{lcr} |
|
611 & & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\ |
|
612 $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$ |
|
613 \end{tabular} |
|
614 \end{center} |
|
615 |
|
616 These definitions allows us the flexibility to talk about |
|
617 regular expressions in their most convenient format, |
|
618 for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $ |
|
619 instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$. |
|
620 These definitions help express that certain classes of syntatically |
|
621 distinct regular expressions are actually the same under simplification. |
|
622 This is not entirely true for annotated regular expressions: |
|
623 %TODO: bsimp bders \neq bderssimp |
|
624 \begin{center} |
|
625 $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$ |
|
626 \end{center} |
|
627 For bit-codes, the order in which simplification is applied |
|
628 might cause a difference in the location they are placed. |
|
629 If we want something like |
|
630 \begin{center} |
|
631 $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$ |
|
632 \end{center} |
|
633 Some "canonicalization" procedure is required, |
|
634 which either pushes all the common bitcodes to nodes |
|
635 as senior as possible: |
|
636 \begin{center} |
|
637 $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ |
|
638 \end{center} |
|
639 or does the reverse. However bitcodes are not of interest if we are talking about |
|
640 the $\llbracket r \rrbracket$ size of a regex. |
|
641 Therefore for the ease and simplicity of producing a |
|
642 proof for a size bound, we are happy to restrict ourselves to |
|
643 unannotated regular expressions, and obtain such equalities as |
|
644 \begin{lemma} |
|
645 $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$ |
|
646 \end{lemma} |
|
647 |
|
648 \begin{proof} |
|
649 By using the rewriting relation $\rightsquigarrow$ |
|
650 \end{proof} |
|
651 %TODO: rsimp sflat |
|
652 And from this we obtain a proof that a star's derivative will be the same |
|
653 as if it had all its nested alternatives created during deriving being flattened out: |
|
654 For example, |
|
655 \begin{lemma} |
|
656 $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ |
|
657 \end{lemma} |
|
658 \begin{proof} |
|
659 By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$. |
|
660 \end{proof} |
|
661 % The simplification of a flattened out regular expression, provided it comes |
|
662 %from the derivative of a star, is the same as the one nested. |
|
663 |
|
664 |
|
665 |
|
666 |
|
667 |
|
668 |
|
669 |
|
670 |
|
671 |
|
672 One might wonder the actual bound rather than the loose bound we gave |
|
673 for the convenience of an easier proof. |
|
674 How much can the regex $r^* \backslash s$ grow? |
|
675 As earlier graphs have shown, |
|
676 %TODO: reference that graph where size grows quickly |
|
677 they can grow at a maximum speed |
|
678 exponential $w.r.t$ the number of characters, |
|
679 but will eventually level off when the string $s$ is long enough. |
|
680 If they grow to a size exponential $w.r.t$ the original regex, our algorithm |
|
681 would still be slow. |
|
682 And unfortunately, we have concrete examples |
|
683 where such regexes grew exponentially large before levelling off: |
|
684 $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + |
|
685 (\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum |
|
686 size that is exponential on the number $n$ |
|
687 under our current simplification rules: |
|
688 %TODO: graph of a regex whose size increases exponentially. |
|
689 \begin{center} |
|
690 \begin{tikzpicture} |
|
691 \begin{axis}[ |
|
692 height=0.5\textwidth, |
|
693 width=\textwidth, |
|
694 xlabel=number of a's, |
|
695 xtick={0,...,9}, |
|
696 ylabel=maximum size, |
|
697 ymode=log, |
|
698 log basis y={2} |
|
699 ] |
|
700 \addplot[mark=*,blue] table {re-chengsong.data}; |
|
701 \end{axis} |
|
702 \end{tikzpicture} |
|
703 \end{center} |
|
704 |
|
705 For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$ |
|
706 to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + |
|
707 (\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion. |
|
708 The exponential size is triggered by that the regex |
|
709 $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$ |
|
710 inside the $(\ldots) ^*$ having exponentially many |
|
711 different derivatives, despite those difference being minor. |
|
712 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$ |
|
713 will therefore contain the following terms (after flattening out all nested |
|
714 alternatives): |
|
715 \begin{center} |
|
716 $(\oplus_{i = 1]{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\ |
|
717 $(1 \leq m' \leq m )$ |
|
718 \end{center} |
|
719 These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix). |
|
720 With each new input character taking the derivative against the intermediate result, more and more such distinct |
|
721 terms will accumulate, |
|
722 until the length reaches $L.C.M.(1, \ldots, n)$. |
|
723 $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms |
|
724 $(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ |
|
725 |
|
726 $(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ |
|
727 where $m' \neq m''$ \\ |
|
728 as they are slightly different. |
|
729 This means that with our current simplification methods, |
|
730 we will not be able to control the derivative so that |
|
731 $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$ |
|
732 as there are already exponentially many terms. |
|
733 These terms are similar in the sense that the head of those terms |
|
734 are all consisted of sub-terms of the form: |
|
735 $(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $. |
|
736 For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most |
|
737 $n * (n + 1) / 2$ such terms. |
|
738 For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives |
|
739 can be described by 6 terms: |
|
740 $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, |
|
741 $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$. |
|
742 The total number of different "head terms", $n * (n + 1) / 2$, |
|
743 is proportional to the number of characters in the regex |
|
744 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$. |
|
745 This suggests a slightly different notion of size, which we call the |
|
746 alphabetic width: |
|
747 %TODO: |
|
748 (TODO: Alphabetic width def.) |
|
749 |
|
750 |
|
751 Antimirov\parencite{Antimirov95} has proven that |
|
752 $\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$. |
|
753 where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms |
|
754 created by doing derivatives of $r$ against all possible strings. |
|
755 If we can make sure that at any moment in our lexing algorithm our |
|
756 intermediate result hold at most one copy of each of the |
|
757 subterms then we can get the same bound as Antimirov's. |
|
758 This leads to the algorithm in the next chapter. |
|
759 |
|
760 |
|
761 |
|
762 |
|
763 |
|
764 %---------------------------------------------------------------------------------------- |
|
765 % SECTION 1 |
|
766 %---------------------------------------------------------------------------------------- |
|
767 |
|
768 \section{Idempotency of $\simp$} |
|
769 |
|
770 \begin{equation} |
|
771 \simp \;r = \simp\; \simp \; r |
|
772 \end{equation} |
|
773 This property means we do not have to repeatedly |
|
774 apply simplification in each step, which justifies |
|
775 our definition of $\blexersimp$. |
|
776 It will also be useful in future proofs where properties such as |
|
777 closed forms are needed. |
|
778 The proof is by structural induction on $r$. |
|
779 |
|
780 %----------------------------------- |
|
781 % SUBSECTION 1 |
|
782 %----------------------------------- |
|
783 \subsection{Syntactic Equivalence Under $\simp$} |
|
784 We prove that minor differences can be annhilated |
|
785 by $\simp$. |
|
786 For example, |
|
787 \begin{center} |
|
788 $\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = |
|
789 \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$ |
|
790 \end{center} |
|
791 |