9 %its correctness proof in |
9 %its correctness proof in |
10 %Chapter 3\ref{Chapter3}. |
10 %Chapter 3\ref{Chapter3}. |
11 |
11 |
12 |
12 |
13 |
13 |
14 In this chapter we introduce the simplifications |
14 In this chapter we introduce simplifications |
15 on annotated regular expressions that can be applied to |
15 on annotated regular expressions that can be applied to |
16 each intermediate derivative result. This allows |
16 each intermediate derivative result. This allows |
17 us to make $\blexer$ much more efficient. |
17 us to make $\blexer$ much more efficient. |
18 We contrast this simplification function |
18 Sulzmann and Lu already had some bit-coded simplifications, |
19 with Sulzmann and Lu's original |
19 but their simplification functions were inefficient. |
20 simplifications, indicating the simplicity of our algorithm and |
20 We contrast our simplification function |
21 improvements we made, demostrating |
21 with Sulzmann and Lu's, indicating the simplicity of our algorithm. |
22 the usefulness and reliability of formal proofs on algorithms. |
22 This is another case for the usefulness |
|
23 and reliability of formal proofs on algorithms. |
23 These ``aggressive'' simplifications would not be possible in the injection-based |
24 These ``aggressive'' simplifications would not be possible in the injection-based |
24 lexing we introduced in chapter \ref{Inj}. |
25 lexing we introduced in chapter \ref{Inj}. |
25 We then go on to prove the correctness with the improved version of |
26 We then prove the correctness with the improved version of |
26 $\blexer$, called $\blexersimp$, by establishing |
27 $\blexer$, called $\blexersimp$, by establishing |
27 $\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system. |
28 $\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system. |
28 |
29 |
29 \section{Simplifications by Sulzmann and Lu} |
30 \section{Simplifications by Sulzmann and Lu} |
30 The first thing we notice in the fast growth of examples such as $(a^*a^*)^*$'s |
31 Consider the derivatives of examples such as $(a^*a^*)^*$ |
31 and $(a^* + (aa)^*)^*$'s derivatives is that a lot of duplicated sub-patterns |
32 and $(a^* + (aa)^*)^*$: |
32 are scattered around different levels, and therefore requires |
|
33 de-duplication at different levels: |
|
34 \begin{center} |
33 \begin{center} |
35 $(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} $\\ |
34 $(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} $\\ |
36 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} \ldots$ |
35 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} \ldots$ |
37 \end{center} |
36 \end{center} |
38 \noindent |
37 \noindent |
39 As we have already mentioned in \ref{eqn:growth2}, |
38 As can be seen, there is a lot of duplication |
40 a simple-minded simplification function cannot simplify |
39 in the example we have already mentioned in |
|
40 \ref{eqn:growth2}. |
|
41 A simple-minded simplification function cannot simplify |
41 the third regular expression in the above chain of derivative |
42 the third regular expression in the above chain of derivative |
42 regular expressions: |
43 regular expressions, namely |
43 \begin{center} |
44 \begin{center} |
44 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$ |
45 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$ |
45 \end{center} |
46 \end{center} |
46 one would expect a better simplification function to work in the |
47 because the duplicates are |
|
48 not next to each other and therefore the rule |
|
49 $r+ r \rightarrow r$ does not fire. |
|
50 One would expect a better simplification function to work in the |
47 following way: |
51 following way: |
48 \begin{gather*} |
52 \begin{gather*} |
49 ((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + |
53 ((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + |
50 \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\ |
54 \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\ |
51 \bigg\downarrow \\ |
55 \bigg\downarrow \\ |
59 \bigg\downarrow \\ |
63 \bigg\downarrow \\ |
60 (a^*a^* + a^* |
64 (a^*a^* + a^* |
61 )\cdot(a^*a^*)^* |
65 )\cdot(a^*a^*)^* |
62 \end{gather*} |
66 \end{gather*} |
63 \noindent |
67 \noindent |
64 This motivating example came from testing Sulzmann and Lu's |
68 In the first step, the nested alternative regular expression |
|
69 $(a^*a^* + a^*) + a^*$ is flattened into $a^*a^* + a^* + a^*$. |
|
70 Now the third term $a^*$ is clearly identified as a duplicate |
|
71 and therefore removed in the second step. This causes the two |
|
72 top-level terms to become the same and the second $(a^*a^*+a^*)\cdot(a^*a^*)^*$ |
|
73 removed in the final step.\\ |
|
74 This motivating example is from testing Sulzmann and Lu's |
65 algorithm: their simplification does |
75 algorithm: their simplification does |
66 not work! |
76 not work! |
67 We quote their $\textit{simp}$ function verbatim here: |
77 Consider their simplification (using our notations): |
68 \begin{center} |
78 \begin{center} |
69 \begin{tabular}{lcl} |
79 \begin{tabular}{lcl} |
70 $\simpsulz \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ & |
80 $\simpsulz \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ & |
71 $\textit{if} \; (\textit{zeroable} \; r)\; \textit{then} \;\; \ZERO$\\ |
81 $\textit{if} \; (\textit{zeroable} \; r)\; \textit{then} \;\; \ZERO$\\ |
72 & &$\textit{else}\;\; \fuse \; (bs@ bs') \; r$\\ |
82 & &$\textit{else}\;\; \fuse \; (bs@ bs') \; r$\\ |
83 (\nub \; (\filter \; (\not \circ \zeroable)\;((\simpsulz \; r) :: \map \; \simpsulz \; rs)))$\\ |
93 (\nub \; (\filter \; (\not \circ \zeroable)\;((\simpsulz \; r) :: \map \; \simpsulz \; rs)))$\\ |
84 |
94 |
85 \end{tabular} |
95 \end{tabular} |
86 \end{center} |
96 \end{center} |
87 \noindent |
97 \noindent |
88 the $\textit{zeroable}$ predicate |
98 where the $\textit{zeroable}$ predicate |
89 which tests whether the regular expression |
99 tests whether the regular expression |
90 is equivalent to $\ZERO$, |
100 is equivalent to $\ZERO$, |
91 is defined as: |
101 can be defined as: |
92 \begin{center} |
102 \begin{center} |
93 \begin{tabular}{lcl} |
103 \begin{tabular}{lcl} |
94 $\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\; |
104 $\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\; |
95 \zeroable \;_{[]}\sum\;rs $\\ |
105 \zeroable \;_{[]}\sum\;rs $\\ |
96 $\zeroable\;_{bs}(r_1 \cdot r_2)$ & $\dn$ & $\zeroable\; r_1 \;\; \lor \;\; \zeroable \; r_2$\\ |
106 $\zeroable\;_{bs}(r_1 \cdot r_2)$ & $\dn$ & $\zeroable\; r_1 \;\; \lor \;\; \zeroable \; r_2$\\ |
188 \end{quote} |
198 \end{quote} |
189 \noindent |
199 \noindent |
190 The assumption that the size of the regular expressions |
200 The assumption that the size of the regular expressions |
191 in the algorithm |
201 in the algorithm |
192 would stay below a finite constant is not ture. |
202 would stay below a finite constant is not ture. |
|
203 The main reason behind this is that (i) The $\textit{nub}$ |
|
204 function requires identical annotations between two |
|
205 annotated regular expressions to qualify as duplicates, |
|
206 and cannot simplify the cases like $_{SZZ}a^*+_{SZS}a^*$ |
|
207 even if both $a^*$ denote the same language. |
|
208 (ii) The ``flattening'' only applies to the head of the list |
|
209 in the |
|
210 \begin{center} |
|
211 \begin{tabular}{lcl} |
|
212 |
|
213 $\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & |
|
214 $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ |
|
215 \end{tabular} |
|
216 \end{center} |
|
217 \noindent |
|
218 clause, and therefore is not thorough enough to simplify all |
|
219 needed parts of the regular expression.\\ |
193 In addition to that, even if the regular expressions size |
220 In addition to that, even if the regular expressions size |
194 do stay finite, one has to take into account that |
221 do stay finite, one has to take into account that |
195 the $\simpsulz$ function is applied many times |
222 the $\simpsulz$ function is applied many times |
196 in each derivative step, and that number is not necessarily |
223 in each derivative step, and that number is not necessarily |
197 a constant with respect to the size of the regular expression. |
224 a constant with respect to the size of the regular expression. |
207 the ideas behind components in their algorithm |
234 the ideas behind components in their algorithm |
208 and why they fail to achieve the desired effect, followed |
235 and why they fail to achieve the desired effect, followed |
209 by our solution. These solutions come with correctness |
236 by our solution. These solutions come with correctness |
210 statements that are backed up by formal proofs. |
237 statements that are backed up by formal proofs. |
211 \subsection{Flattening Nested Alternatives} |
238 \subsection{Flattening Nested Alternatives} |
212 The idea behind the |
239 The idea behind the clause |
213 \begin{center} |
240 \begin{center} |
214 $\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad |
241 $\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad |
215 _{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$ |
242 _{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$ |
216 \end{center} |
243 \end{center} |
217 clause is that it allows |
244 is that it allows |
218 duplicate removal of regular expressions at different |
245 duplicate removal of regular expressions at different |
219 levels. |
246 ``levels'' of alternatives. |
220 For example, this would help with the |
247 For example, this would help with the |
221 following simplification: |
248 following simplification: |
222 |
249 |
223 \begin{center} |
250 \begin{center} |
224 $(a+r)+r \longrightarrow a+r$ |
251 $(a+r)+r \longrightarrow a+r$ |
225 \end{center} |
252 \end{center} |
226 The problem here is that only the head element |
253 The problem here is that only the head element |
227 is ``spilled out'', |
254 is ``spilled out'', |
228 whereas we would want to flatten |
255 whereas we would want to flatten |
229 an entire list to open up possibilities for further simplifications.\\ |
256 an entire list to open up possibilities for further simplifications. |
230 Not flattening the rest of the elements also means that |
257 Not flattening the rest of the elements also means that |
231 the later de-duplication processs |
258 the later de-duplication processs |
232 does not fully remove apparent duplicates. |
259 does not fully remove further duplicates. |
233 For example, |
260 For example, |
234 using $\simpsulz$ we could not |
261 using $\simpsulz$ we could not |
235 simplify |
262 simplify |
236 \begin{center} |
263 \begin{center} |
237 $((a^* a^*)+ (a^* + a^*))\cdot (a^*a^*)^*+ |
264 $((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*+ |
238 ((a^*a^*)+a^*)\cdot (a^*a^*)^*$ |
265 ((a^*a^*)+a^*)\cdot (a^*a^*)^*$ |
239 \end{center} |
266 \end{center} |
240 due to the underlined part not in the first element |
267 due to the underlined part not in the first element |
241 of the alternative.\\ |
268 of the alternative.\\ |
242 We define a flatten operation that flattens not only |
269 We define a flatten operation that flattens not only |
425 |
452 |
426 \noindent |
453 \noindent |
427 This algorithm keeps the regular expression size small. |
454 This algorithm keeps the regular expression size small. |
428 |
455 |
429 |
456 |
430 \subsection{$(a+aa)^*$ and $(a^*\cdot a^*)^*$ against |
457 \subsection{Examples $(a+aa)^*$ and $(a^*\cdot a^*)^*$ |
431 $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification} |
458 After Simplification} |
432 For example, |
459 Recall the |
433 with our simplification the |
|
434 previous $(a^*a^*)^*$ example |
460 previous $(a^*a^*)^*$ example |
435 where $\simpsulz$ could not |
461 where $\simpsulz$ could not |
436 stop the fast growth (over |
462 prevent the fast growth (over |
437 3 million nodes just below $20$ input length) |
463 3 million nodes just below $20$ input length) |
438 will be reduced to just 15 and stays constant, no matter how long the |
464 will be reduced to just 15 and stays constant no matter how long the |
439 input string is. |
465 input string is. |
440 This is demonstrated in the graphs below. |
466 This is shown in the graphs below. |
441 \begin{figure}[H] |
467 \begin{figure}[H] |
442 \begin{center} |
468 \begin{center} |
443 \begin{tabular}{ll} |
469 \begin{tabular}{ll} |
444 \begin{tikzpicture} |
470 \begin{tikzpicture} |
445 \begin{axis}[ |
471 \begin{axis}[ |
480 % SECTION rewrite relation |
506 % SECTION rewrite relation |
481 %---------------------------------------------------------------------------------------- |
507 %---------------------------------------------------------------------------------------- |
482 \section{Correctness of $\blexersimp$} |
508 \section{Correctness of $\blexersimp$} |
483 In this section we give details |
509 In this section we give details |
484 of the correctness proof of $\blexersimp$, |
510 of the correctness proof of $\blexersimp$, |
485 an important contribution of this thesis.\\ |
511 one of the contributions of this thesis.\\ |
486 We first introduce the rewriting relation \emph{rrewrite} |
512 We first introduce the rewriting relation \emph{rrewrite} |
487 ($\rrewrite$) between two regular expressions, |
513 ($\rrewrite$) between two regular expressions, |
488 which expresses an atomic |
514 which expresses an atomic |
489 simplification step from the left-hand-side |
515 simplification. |
490 to the right-hand-side. |
|
491 We then prove properties about |
516 We then prove properties about |
492 this rewriting relation and its reflexive transitive closure. |
517 this rewriting relation and its reflexive transitive closure. |
493 Finally we leverage these properties to show |
518 Finally we leverage these properties to show |
494 an equivalence between the internal data structures of |
519 an equivalence between the internal data structures of |
495 $\blexer$ and $\blexersimp$. |
520 $\blexer$ and $\blexersimp$. |
496 |
521 |
497 \subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)} |
522 \subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)} |
498 In the $\blexer$'s correctness proof, we |
523 In the $\blexer$'s correctness proof, we |
499 did not directly derive the fact that $\blexer$ gives out the POSIX value, |
524 did not directly derive the fact that $\blexer$ generates the POSIX value, |
500 but first proved that $\blexer$ is linked with $\lexer$. |
525 but first proved that $\blexer$ is linked with $\lexer$. |
501 Then we re-use |
526 Then we re-use |
502 the correctness of $\lexer$ |
527 the correctness of $\lexer$ |
503 to obtain |
528 to obtain |
504 \begin{center} |
529 \begin{center} |
509 by first proving that |
534 by first proving that |
510 $\blexersimp \; r \; s $ |
535 $\blexersimp \; r \; s $ |
511 produces the same output as $\blexer \; r\; s$, |
536 produces the same output as $\blexer \; r\; s$, |
512 and then piecing it together with |
537 and then piecing it together with |
513 $\blexer$'s correctness to achieve our main |
538 $\blexer$'s correctness to achieve our main |
514 theorem:\footnote{ the case when |
539 theorem:\footnote{ The case when |
515 $s$ is not in $L \; r$, is routine to establish } |
540 $s$ is not in $L \; r$, is routine to establish.} |
516 \begin{center} |
541 \begin{center} |
517 $(r, s) \rightarrow v \; \; \textit{iff} \;\; \blexersimp \; r \; s = v$ |
542 $(r, s) \rightarrow v \; \; \textit{iff} \;\; \blexersimp \; r \; s = v$ |
518 \end{center} |
543 \end{center} |
519 \noindent |
544 \noindent |
520 The overall idea for the proof |
545 The overall idea for the proof |
590 \inferrule{rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \land \; rs_2 \stackrel{s*}{\rightsquigarrow} rs_3}{rs_1 \stackrel{s*}{\rightsquigarrow} rs_3} |
615 \inferrule{rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \land \; rs_2 \stackrel{s*}{\rightsquigarrow} rs_3}{rs_1 \stackrel{s*}{\rightsquigarrow} rs_3} |
591 \end{mathpar} |
616 \end{mathpar} |
592 \caption{The Reflexive Transitive Closure of |
617 \caption{The Reflexive Transitive Closure of |
593 $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$}\label{transClosure} |
618 $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$}\label{transClosure} |
594 \end{figure} |
619 \end{figure} |
595 Two rewritable terms will remain rewritable to each other |
620 %Two rewritable terms will remain rewritable to each other |
596 even after a derivative is taken: |
621 %even after a derivative is taken: |
|
622 Rewriting is preserved under derivatives, |
|
623 namely |
597 \begin{center} |
624 \begin{center} |
598 $r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$ |
625 $r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$ |
599 \end{center} |
626 \end{center} |
600 And finally, if two terms are rewritable to each other, |
627 And finally, if two terms are rewritable to each other, |
601 then they produce the same bitcodes: |
628 then they produce the same bitcodes: |
954 $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s $ |
981 $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s $ |
955 \end{corollary} |
982 \end{corollary} |
956 |
983 |
957 \subsection{Comments on the Proof Techniques Used} |
984 \subsection{Comments on the Proof Techniques Used} |
958 Straightforward and simple as the proof may seem, |
985 Straightforward and simple as the proof may seem, |
959 the efforts we spent obtaining it was far from trivial.\\ |
986 the efforts we spent obtaining it were far from trivial.\\ |
960 We initially attempted to re-use the argument |
987 We initially attempted to re-use the argument |
961 in \cref{flex_retrieve}. |
988 in \cref{flex_retrieve}. |
962 The problem was that both functions $\inj$ and $\retrieve$ require |
989 The problem was that both functions $\inj$ and $\retrieve$ require |
963 that the annotated regular expressions stay unsimplified, |
990 that the annotated regular expressions stay unsimplified, |
964 so that one can |
991 so that one can |
1005 $_{Z}(_{Z} \ONE + _{S} c)$ |
1032 $_{Z}(_{Z} \ONE + _{S} c)$ |
1006 |
1033 |
1007 \end{center} |
1034 \end{center} |
1008 as equal, because they were both re-written |
1035 as equal, because they were both re-written |
1009 from the same expression.\\ |
1036 from the same expression.\\ |
|
1037 The simplification rewriting rules |
|
1038 given in \ref{rrewriteRules} are by no means |
|
1039 final, |
|
1040 one could come up new rules |
|
1041 such as |
|
1042 $\SEQ r_1 \cdot (\SEQ r_1 \cdot r_3) \rightarrow |
|
1043 \SEQs [r_1, r_2, r_3]$. |
|
1044 This does not fit with the proof technique |
|
1045 of our main theorem, but seem to not violate the POSIX |
|
1046 property.\\ |
1010 Having correctness property is good. |
1047 Having correctness property is good. |
1011 But we would also a guarantee that the lexer is not slow in |
1048 But we would also a guarantee that the lexer is not slow in |
1012 some sense, for exampe, not grinding to a halt regardless of the input. |
1049 some sense, for exampe, not grinding to a halt regardless of the input. |
1013 As we have already seen, Sulzmann and Lu's simplification function |
1050 As we have already seen, Sulzmann and Lu's simplification function |
1014 $\simpsulz$ cannot achieve this, because their claim that |
1051 $\simpsulz$ cannot achieve this, because their claim that |