5 \label{ChapterX} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX} |
5 \label{ChapterX} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX} |
6 |
6 |
7 |
7 |
8 |
8 |
9 |
9 |
10 %---------------------------------------------------------------------------------------- |
10 |
11 % SECTION 1 |
|
12 %---------------------------------------------------------------------------------------- |
|
13 |
|
14 \section{Properties of $\backslash c$} |
|
15 |
|
16 To have a clear idea of what we can and |
|
17 need to prove about the algorithms involving |
|
18 Brzozowski's derivatives, there are a few |
|
19 properties we need to be clear about . |
|
20 \subsection{function $\backslash c$ is not 1-to-1} |
|
21 \begin{center} |
|
22 The derivative $w.r.t$ character $c$ is not one-to-one. |
|
23 Formally, |
|
24 $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ |
|
25 \end{center} |
|
26 This property is trivially true for the |
|
27 character regex example: |
|
28 \begin{center} |
|
29 $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ |
|
30 \end{center} |
|
31 But apart from the cases where the derivative |
|
32 output is $\ZERO$, are there non-trivial results |
|
33 of derivatives which contain strings? |
|
34 The answer is yes. |
|
35 For example, |
|
36 \begin{center} |
|
37 Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ |
|
38 where $a$ is not nullable.\\ |
|
39 $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ |
|
40 $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ |
|
41 \end{center} |
|
42 We start with two syntactically different regexes, |
|
43 and end up with the same derivative result, which is |
|
44 a "meaningful" regex because it contains strings. |
|
45 We have rediscovered Arden's lemma:\\ |
|
46 \begin{center} |
|
47 $A^*B = A\cdot A^* \cdot B + B$ |
|
48 \end{center} |
|
49 |
|
50 |
|
51 %----------------------------------- |
11 %----------------------------------- |
52 % SUBSECTION 1 |
12 % SUBSECTION 1 |
53 %----------------------------------- |
13 %----------------------------------- |
54 \subsection{Subsection 1} |
14 \section{Specifications of Certain Functions to be Used} |
55 To be completed. |
15 To be completed. |
56 |
16 |
57 |
17 |
58 |
18 |
59 |
19 |
|
20 %----------------------------------- |
|
21 % SECTION ? |
|
22 %----------------------------------- |
|
23 \section{preparatory properties for getting a finiteness bound} |
|
24 Before we get to the proof that says the intermediate result of our lexer will |
|
25 remain finitely bounded, which is an important efficiency/liveness guarantee, |
|
26 we shall first develop a few preparatory properties and definitions to |
|
27 make the process of proving that a breeze. |
|
28 |
|
29 We define rewriting relations for $\rrexp$s, which allows us to do the |
|
30 same trick as we did for the correctness proof, |
|
31 but this time we will have stronger equalities established. |
|
32 \subsection{"hrewrite" relation} |
|
33 List of 1-step rewrite rules for regular expressions simplification without bitcodes: |
|
34 \begin{center} |
|
35 \begin{tabular}{lcl} |
|
36 $r_1 \cdot \ZERO$ & $\rightarrow$ & $\ZERO$ |
|
37 \end{tabular} |
|
38 \end{center} |
|
39 |
|
40 And we define an "grewrite" relation that works on lists: |
|
41 \begin{center} |
|
42 \begin{tabular}{lcl} |
|
43 $ \ZERO :: rs$ & $\rightarrow_g$ & $rs$ |
|
44 \end{tabular} |
|
45 \end{center} |
|
46 |
|
47 |
|
48 |
|
49 With these relations we prove |
|
50 \begin{lemma} |
|
51 $rs \rightarrow rs' \implies \RALTS{rs} \rightarrow \RALTS{rs'}$ |
|
52 \end{lemma} |
|
53 which enables us to prove "closed forms" equalities such as |
|
54 \begin{lemma} |
|
55 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$ |
|
56 \end{lemma} |
|
57 |
|
58 But the most involved part of the above lemma is proving the following: |
|
59 \begin{lemma} |
|
60 $\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ |
|
61 \end{lemma} |
|
62 which is needed in proving things like |
|
63 \begin{lemma} |
|
64 $r \rightarrow_f r' \implies \rsimp{r} \rightarrow \rsimp{r'}$ |
|
65 \end{lemma} |
|
66 |
|
67 Fortunately this is provable by induction on the list $rs$ |
60 |
68 |
61 |
69 |
62 |
70 |
63 %----------------------------------- |
71 %----------------------------------- |
64 % SECTION 2 |
72 % SECTION 2 |
65 %----------------------------------- |
73 %----------------------------------- |
66 |
74 |
67 \section{Finiteness Property} |
75 \section{Finiteness Property} |
68 In this section let us sketch our argument for why the size of the simplified |
76 In this section let us describe our argument for why the size of the simplified |
69 derivatives with the aggressive simplification function can be finitely bounded. |
77 derivatives with the aggressive simplification function is finitely bounded. |
70 |
78 Suppose |
71 For convenience, we use a new datatype which we call $\rrexp$ to denote |
79 we have a size function for bitcoded regular expressions, written |
72 the difference between it and annotated regular expressions. |
80 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree |
|
81 |
|
82 \begin{center} |
|
83 \begin{tabular}{ccc} |
|
84 $\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\ |
|
85 \end{tabular} |
|
86 \end{center} |
|
87 (TODO: COMPLETE this defn and for $rs$) |
|
88 |
|
89 The size is based on a recursive function on the structure of the regex, |
|
90 not the bitcodes. |
|
91 Therefore we may as well talk about size of an annotated regular expression |
|
92 in their un-annotated form: |
|
93 \begin{center} |
|
94 $\asize(a) = \size(\erase(a))$. |
|
95 \end{center} |
|
96 (TODO: turn equals to roughly equals) |
|
97 |
|
98 But there is a minor nuisance here: |
|
99 the erase function actually messes with the structure of the regular expression: |
|
100 \begin{center} |
|
101 \begin{tabular}{ccc} |
|
102 $\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\ |
|
103 \end{tabular} |
|
104 \end{center} |
|
105 (TODO: complete this definition with singleton r in alts) |
|
106 |
|
107 An alternative regular expression with an empty list of children |
|
108 is turned into an $\ZERO$ during the |
|
109 $\erase$ function, thereby changing the size and structure of the regex. |
|
110 These will likely be fixable if we really want to use plain $\rexp$s for dealing |
|
111 with size, but we choose a more straightforward (or stupid) method by |
|
112 defining a new datatype that is similar to plain $\rexp$s but can take |
|
113 non-binary arguments for its alternative constructor, |
|
114 which we call $\rrexp$ to denote |
|
115 the difference between it and plain regular expressions. |
73 \[ \rrexp ::= \RZERO \mid \RONE |
116 \[ \rrexp ::= \RZERO \mid \RONE |
74 \mid \RCHAR{c} |
117 \mid \RCHAR{c} |
75 \mid \RSEQ{r_1}{r_2} |
118 \mid \RSEQ{r_1}{r_2} |
76 \mid \RALTS{rs} |
119 \mid \RALTS{rs} |
77 \mid \RSTAR{r} |
120 \mid \RSTAR{r} |
182 In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is |
219 In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is |
183 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller |
220 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller |
184 than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands |
221 than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands |
185 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). |
222 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). |
186 We reason similarly for $\STAR$.\medskip |
223 We reason similarly for $\STAR$.\medskip |
|
224 \end{proof} |
|
225 |
|
226 What guarantee does this bound give us? |
|
227 |
|
228 Whatever the regex is, it will not grow indefinitely. |
|
229 Take our previous example $(a + aa)^*$ as an example: |
|
230 \begin{center} |
|
231 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
232 \begin{tikzpicture} |
|
233 \begin{axis}[ |
|
234 xlabel={number of $a$'s}, |
|
235 x label style={at={(1.05,-0.05)}}, |
|
236 ylabel={regex size}, |
|
237 enlargelimits=false, |
|
238 xtick={0,5,...,30}, |
|
239 xmax=33, |
|
240 ymax= 40, |
|
241 ytick={0,10,...,40}, |
|
242 scaled ticks=false, |
|
243 axis lines=left, |
|
244 width=5cm, |
|
245 height=4cm, |
|
246 legend entries={$(a + aa)^*$}, |
|
247 legend pos=north west, |
|
248 legend cell align=left] |
|
249 \addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data}; |
|
250 \end{axis} |
|
251 \end{tikzpicture} |
|
252 \end{tabular} |
|
253 \end{center} |
|
254 We are able to limit the size of the regex $(a + aa)^*$'s derivatives |
|
255 with our simplification |
|
256 rules very effectively. |
|
257 |
|
258 |
|
259 As the graphs of some more randomly generated regexes show, the size of |
|
260 the derivative might grow quickly at the start of the input, |
|
261 but after a sufficiently long input string, the trend will stop. |
|
262 |
|
263 |
|
264 %a few sample regular experessions' derivatives |
|
265 %size change |
|
266 %TODO: giving graphs showing a few regexes' size changes |
|
267 %w;r;t the input characters number |
|
268 %a*, aa*, aaa*, ..... |
|
269 %randomly generated regexes |
|
270 \begin{center} |
|
271 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
272 \begin{tikzpicture} |
|
273 \begin{axis}[ |
|
274 xlabel={number of $a$'s}, |
|
275 x label style={at={(1.05,-0.05)}}, |
|
276 ylabel={regex size}, |
|
277 enlargelimits=false, |
|
278 xtick={0,5,...,30}, |
|
279 xmax=33, |
|
280 ymax=1000, |
|
281 ytick={0,100,...,1000}, |
|
282 scaled ticks=false, |
|
283 axis lines=left, |
|
284 width=5cm, |
|
285 height=4cm, |
|
286 legend entries={regex1}, |
|
287 legend pos=north west, |
|
288 legend cell align=left] |
|
289 \addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data}; |
|
290 \end{axis} |
|
291 \end{tikzpicture} |
|
292 & |
|
293 \begin{tikzpicture} |
|
294 \begin{axis}[ |
|
295 xlabel={$n$}, |
|
296 x label style={at={(1.05,-0.05)}}, |
|
297 %ylabel={time in secs}, |
|
298 enlargelimits=false, |
|
299 xtick={0,5,...,30}, |
|
300 xmax=33, |
|
301 ymax=1000, |
|
302 ytick={0,100,...,1000}, |
|
303 scaled ticks=false, |
|
304 axis lines=left, |
|
305 width=5cm, |
|
306 height=4cm, |
|
307 legend entries={regex2}, |
|
308 legend pos=north west, |
|
309 legend cell align=left] |
|
310 \addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data}; |
|
311 \end{axis} |
|
312 \end{tikzpicture} |
|
313 & |
|
314 \begin{tikzpicture} |
|
315 \begin{axis}[ |
|
316 xlabel={$n$}, |
|
317 x label style={at={(1.05,-0.05)}}, |
|
318 %ylabel={time in secs}, |
|
319 enlargelimits=false, |
|
320 xtick={0,5,...,30}, |
|
321 xmax=33, |
|
322 ymax=1000, |
|
323 ytick={0,100,...,1000}, |
|
324 scaled ticks=false, |
|
325 axis lines=left, |
|
326 width=5cm, |
|
327 height=4cm, |
|
328 legend entries={regex3}, |
|
329 legend pos=north west, |
|
330 legend cell align=left] |
|
331 \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data}; |
|
332 \end{axis} |
|
333 \end{tikzpicture}\\ |
|
334 \multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.} |
|
335 \end{tabular} |
|
336 \end{center} |
|
337 |
|
338 |
|
339 |
|
340 |
187 |
341 |
188 \noindent |
342 \noindent |
189 Clearly we give in this finiteness argument (Step (5)) a very loose bound that is |
343 Clearly we give in this finiteness argument (Step (5)) a very loose bound that is |
190 far from the actual bound we can expect. We can do better than this, but this does not improve |
344 far from the actual bound we can expect. |
|
345 In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound |
|
346 is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$. |
|
347 Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$ |
|
348 inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function |
|
349 $f(x) = x * 2^x$. |
|
350 This means the bound we have will surge up at least |
|
351 tower-exponentially with a linear increase of the depth. |
|
352 For a regex of depth $n$, the bound |
|
353 would be approximately $4^n$. |
|
354 %TODO: change this to tower exponential! |
|
355 |
|
356 We can do better than this, but this does not improve |
191 the finiteness property we are proving. If we are interested in a polynomial bound, |
357 the finiteness property we are proving. If we are interested in a polynomial bound, |
192 one would hope to obtain a similar tight bound as for partial |
358 one would hope to obtain a similar tight bound as for partial |
193 derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with |
359 derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with |
194 $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in |
360 $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in |
195 partial derivatives). Unfortunately to obtain the exact same bound would mean |
361 partial derivatives). |
|
362 To obtain the exact same bound would mean |
196 we need to introduce simplifications, such as |
363 we need to introduce simplifications, such as |
197 $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$, |
364 $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$, |
198 which exist for partial derivatives. However, if we introduce them in our |
365 which exist for partial derivatives. |
199 setting we would lose the POSIX property of our calculated values. We leave better |
366 |
200 bounds for future work.\\[-6.5mm] |
367 However, if we introduce them in our |
201 |
368 setting we would lose the POSIX property of our calculated values. |
202 |
369 A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$. |
203 |
370 If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer |
204 %----------------------------------- |
371 would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of |
205 % SECTION ? |
372 what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information |
206 %----------------------------------- |
373 in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$ |
207 \section{preparatory properties for getting a finiteness bound} |
374 occurs, and apply them in the right order once we get |
208 Before we get to the proof that says the intermediate result of our lexer will |
375 a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value. |
209 remain finitely bounded, which is an important efficiency/liveness guarantee, |
376 This is unlike the simplification we had before, where the rewriting rules |
210 we shall first develop a few preparatory properties and definitions to |
377 such as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value. |
211 make the process of proving that a breeze. |
378 We will discuss better |
212 |
379 bounds in the last section of this chapter.\\[-6.5mm] |
213 We define rewriting relations for $\rrexp$s, which allows us to do the |
380 |
214 same trick as we did for the correctness proof, |
381 |
215 but this time we will have stronger equalities established. |
382 |
216 \subsection{"hrewrite" relation} |
|
217 List of 1-step rewrite rules for regular expressions simplification without bitcodes: |
|
218 \begin{center} |
|
219 \begin{tabular}{lcl} |
|
220 $r_1 \cdot \ZERO$ & $\rightarrow$ & $\ZERO$ |
|
221 \end{tabular} |
|
222 \end{center} |
|
223 |
|
224 And we define an "grewrite" relation that works on lists: |
|
225 \begin{center} |
|
226 \begin{tabular}{lcl} |
|
227 $ \ZERO :: rs$ & $\rightarrow_g$ & $rs$ |
|
228 \end{tabular} |
|
229 \end{center} |
|
230 |
|
231 |
|
232 |
|
233 With these relations we prove |
|
234 \begin{lemma} |
|
235 $rs \rightarrow rs' \implies \RALTS{rs} \rightarrow \RALTS{rs'}$ |
|
236 \end{lemma} |
|
237 which enables us to prove "closed forms" equalities such as |
|
238 \begin{lemma} |
|
239 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$ |
|
240 \end{lemma} |
|
241 |
|
242 But the most involved part of the above lemma is proving the following: |
|
243 \begin{lemma} |
|
244 $\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ |
|
245 \end{lemma} |
|
246 which is needed in proving things like |
|
247 \begin{lemma} |
|
248 $r \rightarrow_f r' \implies \rsimp{r} \rightarrow \rsimp{r'}$ |
|
249 \end{lemma} |
|
250 . |
|
251 Fortunately this is provable by induction on the list $rs$ |
|
252 |
383 |
253 %---------------------------------------------------------------------------------------- |
384 %---------------------------------------------------------------------------------------- |
254 % SECTION ?? |
385 % SECTION ?? |
255 %---------------------------------------------------------------------------------------- |
386 %---------------------------------------------------------------------------------------- |
256 |
387 |
367 $\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket $. |
498 $\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket $. |
368 \end{center} |
499 \end{center} |
369 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of |
500 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of |
370 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. |
501 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. |
371 |
502 |
|
503 |
|
504 %---------------------------------------------------------------------------------------- |
|
505 % SECTION ALTS CLOSED FORM |
|
506 %---------------------------------------------------------------------------------------- |
|
507 \section{A Closed Form for \textit{ALTS}} |
|
508 Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. |
|
509 |
|
510 |
|
511 There are a few key steps, one of these steps is |
|
512 $rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {}))) |
|
513 = rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {}))$ |
|
514 |
|
515 One might want to prove this by something a simple statement like: |
|
516 $map (rder x) (rdistinct rs rset) = rdistinct (map (rder x) rs) (rder x) ` rset)$. |
|
517 |
|
518 For this to hold we want the $\textit{distinct}$ function to pick up |
|
519 the elements before and after derivatives correctly: |
|
520 $r \in rset \equiv (rder x r) \in (rder x rset)$. |
|
521 which essentially requires that the function $\backslash$ is an injective mapping. |
|
522 |
|
523 Unfortunately the function $\backslash c$ is not an injective mapping. |
|
524 |
|
525 \subsection{function $\backslash c$ is not injective (1-to-1)} |
|
526 \begin{center} |
|
527 The derivative $w.r.t$ character $c$ is not one-to-one. |
|
528 Formally, |
|
529 $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ |
|
530 \end{center} |
|
531 This property is trivially true for the |
|
532 character regex example: |
|
533 \begin{center} |
|
534 $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ |
|
535 \end{center} |
|
536 But apart from the cases where the derivative |
|
537 output is $\ZERO$, are there non-trivial results |
|
538 of derivatives which contain strings? |
|
539 The answer is yes. |
|
540 For example, |
|
541 \begin{center} |
|
542 Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ |
|
543 where $a$ is not nullable.\\ |
|
544 $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ |
|
545 $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ |
|
546 \end{center} |
|
547 We start with two syntactically different regexes, |
|
548 and end up with the same derivative result. |
|
549 This is not surprising as we have such |
|
550 equality as below in the style of Arden's lemma:\\ |
|
551 \begin{center} |
|
552 $L(A^*B) = L(A\cdot A^* \cdot B + B)$ |
|
553 \end{center} |
|
554 |
372 %---------------------------------------------------------------------------------------- |
555 %---------------------------------------------------------------------------------------- |
373 % SECTION 4 |
556 % SECTION 4 |
374 %---------------------------------------------------------------------------------------- |
557 %---------------------------------------------------------------------------------------- |
375 \section{a bound for the star regular expression} |
558 \section{A Bound for the Star Regular Expression} |
376 We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using |
559 We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using |
377 the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then |
560 the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then |
378 the property of the $\distinct$ function. |
561 the property of the $\distinct$ function. |
379 Now we try to get a bound on $r^* \backslash s$ as well. |
562 Now we try to get a bound on $r^* \backslash s$ as well. |
380 Again, we first look at how a star's derivatives evolve, if they grow maximally: |
563 Again, we first look at how a star's derivatives evolve, if they grow maximally: |
478 |
661 |
479 |
662 |
480 |
663 |
481 |
664 |
482 One might wonder the actual bound rather than the loose bound we gave |
665 One might wonder the actual bound rather than the loose bound we gave |
483 for the convenience of a easier proof. |
666 for the convenience of an easier proof. |
484 How much can the regex $r^* \backslash s$ grow? As hinted earlier, |
667 How much can the regex $r^* \backslash s$ grow? |
|
668 As earlier graphs have shown, |
|
669 %TODO: reference that graph where size grows quickly |
485 they can grow at a maximum speed |
670 they can grow at a maximum speed |
486 exponential $w.r.t$ the number of characters, |
671 exponential $w.r.t$ the number of characters, |
487 but will eventually level off when the string $s$ is long enough, |
672 but will eventually level off when the string $s$ is long enough. |
488 as suggested by the finiteness bound proof. |
673 If they grow to a size exponential $w.r.t$ the original regex, our algorithm |
489 And unfortunately we have concrete examples |
674 would still be slow. |
|
675 And unfortunately, we have concrete examples |
490 where such regexes grew exponentially large before levelling off: |
676 where such regexes grew exponentially large before levelling off: |
491 $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + |
677 $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + |
492 (\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum |
678 (\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum |
493 size that is exponential on the number $n$. |
679 size that is exponential on the number $n$ |
494 %TODO: give out a graph showing how the size of the regex grows and levels off at a size exponential to the original regex |
680 under our current simplification rules: |
495 |
681 %TODO: graph of a regex whose size increases exponentially. |
496 |
682 \begin{center} |
497 While the tight upper bound will definitely be a lot lower than |
683 \begin{tikzpicture} |
498 the one we gave for the finiteness proof, |
684 \begin{axis}[ |
499 it will still be at least expoential, under our current simplification rules. |
685 height=0.5\textwidth, |
500 |
686 width=\textwidth, |
501 This suggests stronger simplifications that keep the tight bound polynomial. |
687 xlabel=number of a's, |
|
688 xtick={0,...,9}, |
|
689 ylabel=maximum size, |
|
690 ymode=log, |
|
691 log basis y={2} |
|
692 ] |
|
693 \addplot[mark=*,blue] table {re-chengsong.data}; |
|
694 \end{axis} |
|
695 \end{tikzpicture} |
|
696 \end{center} |
|
697 |
|
698 For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$ |
|
699 to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + |
|
700 (\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion. |
|
701 The exponential size is triggered by that the regex |
|
702 $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$ |
|
703 inside the $(\ldots) ^*$ having exponentially many |
|
704 different derivatives, despite those difference being minor. |
|
705 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$ |
|
706 will therefore contain the following terms (after flattening out all nested |
|
707 alternatives): |
|
708 \begin{center} |
|
709 $(\oplus_{i = 1]{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\ |
|
710 $(1 \leq m' \leq m )$ |
|
711 \end{center} |
|
712 These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix). |
|
713 With each new input character taking the derivative against the intermediate result, more and more such distinct |
|
714 terms will accumulate, |
|
715 until the length reaches $L.C.M.(1, \ldots, n)$. |
|
716 $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms |
|
717 $(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ |
|
718 |
|
719 $(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ |
|
720 where $m' \neq m''$ \\ |
|
721 as they are slightly different. |
|
722 This means that with our current simplification methods, |
|
723 we will not be able to control the derivative so that |
|
724 $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$ |
|
725 as there are already exponentially many terms. |
|
726 These terms are similar in the sense that the head of those terms |
|
727 are all consisted of sub-terms of the form: |
|
728 $(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $. |
|
729 For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most |
|
730 $n * (n + 1) / 2$ such terms. |
|
731 For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives |
|
732 can be described by 6 terms: |
|
733 $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, |
|
734 $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$. |
|
735 The total number of different "head terms", $n * (n + 1) / 2$, |
|
736 is proportional to the number of characters in the regex |
|
737 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$. |
|
738 This suggests a slightly different notion of size, which we call the |
|
739 alphabetic width: |
|
740 %TODO: |
|
741 (TODO: Alphabetic width def.) |
|
742 |
|
743 |
|
744 Antimirov\parencite{Antimirov95} has proven that |
|
745 $\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$. |
|
746 where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms |
|
747 created by doing derivatives of $r$ against all possible strings. |
|
748 If we can make sure that at any moment in our lexing algorithm our |
|
749 intermediate result hold at most one copy of each of the |
|
750 subterms then we can get the same bound as Antimirov's. |
|
751 This leads to the algorithm in the next section. |
502 |
752 |
503 %---------------------------------------------------------------------------------------- |
753 %---------------------------------------------------------------------------------------- |
504 % SECTION 5 |
754 % SECTION 5 |
505 %---------------------------------------------------------------------------------------- |
755 %---------------------------------------------------------------------------------------- |
506 \section{A Stronger Version of Simplification} |
756 \section{A Stronger Version of Simplification Inspired by Antimirov} |
507 %TODO: search for isabelle proofs of algorithms that check equivalence |
757 %TODO: search for isabelle proofs of algorithms that check equivalence |
508 |
758 |
509 |
759 |
|
760 |
|
761 |
|
762 |
|
763 |
|
764 |
|
765 |
|
766 |