12 |
12 |
13 |
13 |
14 Now we introduce the simplifications, which is why we introduce the |
14 Now we introduce the simplifications, which is why we introduce the |
15 bitcodes in the first place. |
15 bitcodes in the first place. |
16 |
16 |
17 \subsection*{Simplification Rules} |
17 \section{Simplification for Annotated Regular Expressions} |
18 |
18 The first thing we notice in the fast growth of examples such as $(a^*a^*)^*$'s |
19 This section introduces aggressive (in terms of size) simplification rules |
19 and $(a^* + (aa)^*)^*$'s derivatives is that a lot of duplicated sub-patterns |
20 on annotated regular expressions |
20 are scattered around different levels: |
21 to keep derivatives small. Such simplifications are promising |
21 |
22 as we have |
22 \begin{center} |
23 generated test data that show |
23 $(a^*a^*)^* \rightarrow (a^*a^* + a^*)\cdot(a^*a^*)^*$\\ |
24 that a good tight bound can be achieved. We could only |
24 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$ |
25 partially cover the search space as there are infinitely many regular |
25 \end{center} |
26 expressions and strings. |
26 \noindent |
27 |
27 Despite that we have already implemented the simple-minded simplifications |
28 One modification we introduced is to allow a list of annotated regular |
28 such as throwing away useless $\ONE$s and $\ZERO$s. |
29 expressions in the $\sum$ constructor. This allows us to not just |
29 The simplification rule $r + r \rightarrow $ cannot make a difference either |
30 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but |
30 since it only removes duplicates on the same level, not something like |
31 also unnecessary ``copies'' of regular expressions (very similar to |
31 $(r+a)+r \rightarrow r+a$. |
32 simplifying $r + r$ to just $r$, but in a more general setting). Another |
32 This requires us to break up nested alternatives into lists, for example |
33 modification is that we use simplification rules inspired by Antimirov's |
33 using the flatten operation similar to the one defined for any function language's |
34 work on partial derivatives. They maintain the idea that only the first |
34 list datatype: |
35 ``copy'' of a regular expression in an alternative contributes to the |
35 |
36 calculation of a POSIX value. All subsequent copies can be pruned away from |
36 \begin{center} |
37 the regular expression. A recursive definition of our simplification function |
37 \begin{tabular}{@{}lcl@{}} |
|
38 $\textit{flts} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; |
|
39 (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flts} \; as' $ \\ |
|
40 $\textit{flts} \; \ZERO :: as'$ & $\dn$ & $ \textit{flts} \; \textit{as'} $ \\ |
|
41 $\textit{flts} \; a :: as'$ & $\dn$ & $a :: \textit{flts} \; \textit{as'}$ \quad(otherwise) |
|
42 \end{tabular} |
|
43 \end{center} |
|
44 |
|
45 \noindent |
|
46 There is a minor difference though, in that our $\flts$ operation defined by us |
|
47 also throws away $\ZERO$s. |
|
48 For a flattened list of regular expressions, a de-duplication can be done efficiently: |
|
49 |
|
50 |
|
51 \begin{center} |
|
52 \begin{tabular}{@{}lcl@{}} |
|
53 $\distinctBy \; [] \; f\; acc $ & $ =$ & $ []$\\ |
|
54 $\distinctBy \; (x :: xs) \; f \; acc$ & $=$ & \\ |
|
55 & & $\quad \textit{if} (f \; x) \in acc \textit{then} \distinctBy \; xs \; f \; acc$\\ |
|
56 & & $\quad \textit{else} x :: (\distinctBy \; xs \; f \; (\{f \; x\} \cup acc))$ |
|
57 \end{tabular} |
|
58 \end{center} |
|
59 \noindent |
|
60 The reason we define a distinct function under a mapping $f$ is because |
|
61 we want to eliminate regular expressions that are the same syntactically, |
|
62 but having different bit-codes (more on the reason why we can do this later). |
|
63 To achieve this, we call $\erase$ as the function $f$ during the distinction |
|
64 operation. |
|
65 A recursive definition of our simplification function |
38 that looks somewhat similar to our Scala code is given below: |
66 that looks somewhat similar to our Scala code is given below: |
39 %\comment{Use $\ZERO$, $\ONE$ and so on. |
|
40 %Is it $ALTS$ or $ALTS$?}\\ |
|
41 |
67 |
42 \begin{center} |
68 \begin{center} |
43 \begin{tabular}{@{}lcl@{}} |
69 \begin{tabular}{@{}lcl@{}} |
44 |
70 |
45 $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ |
71 $\textit{bsimp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ \textit{bsimp}_{ASEQ} \; bs \;(\textit{bsimp} \; a_1) \; (\textit{bsimp} \; a_2) $ \\ |
|
72 $\textit{bsimp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{bsimp}_{ALTS} \; \textit{bs} \; (\textit{distinctBy} \; ( \textit{flatten} ( \textit{map} \; bsimp \; as)) \; \erase \; \phi) $ \\ |
|
73 $\textit{bsimp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ |
|
74 \end{tabular} |
|
75 \end{center} |
|
76 |
|
77 \noindent |
|
78 The simplification (named $\bsimp$ for \emph{b}it-coded) does a pattern matching on the regular expression. |
|
79 When it detected that the regular expression is an alternative or |
|
80 sequence, it will try to simplify its children regular expressions |
|
81 recursively and then see if one of the children turns into $\ZERO$ or |
|
82 $\ONE$, which might trigger further simplification at the current level. |
|
83 Current level simplifications are handled by the function $\textit{bsimp}_{ASEQ}$, |
|
84 using rules such as $\ZERO \cdot r \rightarrow \ZERO$ and $\ONE \cdot r \rightarrow r$. |
|
85 \begin{center} |
|
86 \begin{tabular}{@{}lcl@{}} |
|
87 $\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\ |
46 &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ |
88 &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ |
47 &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ |
89 &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ |
48 &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ |
90 &&$\quad\textit{case} \; (_{bs1}\ONE, a_2') \Rightarrow \textit{fuse} \; (bs@bs_1) \; a_2'$ \\ |
49 &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ |
91 &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ |
50 &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\ |
92 \end{tabular} |
51 |
93 \end{center} |
52 $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; simp \; as)) \; \textit{match} $ \\ |
94 \noindent |
|
95 The most involved part is the $\sum$ clause, where we first call $\flts$ on |
|
96 the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$. |
|
97 and then call $\distinctBy$ on that list, the predicate determining whether two |
|
98 elements are the same is $\erase \; r_1 = \erase\; r_2$. |
|
99 Finally, depending on whether the regular expression list $as'$ has turned into a |
|
100 singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{AALTS}$ |
|
101 decides whether to keep the current level constructor $\sum$ as it is, and |
|
102 removes it when there are less than two elements: |
|
103 \begin{center} |
|
104 \begin{tabular}{lcl} |
|
105 $\textit{bsimp}_{AALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\ |
53 &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ |
106 &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ |
54 &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
107 &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
55 &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ |
108 &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ |
56 |
109 \end{tabular} |
57 $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ |
110 |
58 \end{tabular} |
111 \end{center} |
59 \end{center} |
112 Having defined the $\bsimp$ function, |
60 |
113 we add it as a phase after a derivative is taken, |
61 \noindent |
114 so it stays small: |
62 The simplification does a pattern matching on the regular expression. |
115 \begin{center} |
63 When it detected that the regular expression is an alternative or |
116 \begin{tabular}{lcl} |
64 sequence, it will try to simplify its child regular expressions |
117 $r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$ |
65 recursively and then see if one of the children turns into $\ZERO$ or |
118 \end{tabular} |
66 $\ONE$, which might trigger further simplification at the current level. |
119 \end{center} |
67 The most involved part is the $\sum$ clause, where we use two |
120 Following previous notation of natural |
68 auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested |
|
69 alternatives and reduce as many duplicates as possible. Function |
|
70 $\textit{distinct}$ keeps the first occurring copy only and removes all later ones |
|
71 when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s. |
|
72 Its recursive definition is given below: |
|
73 |
|
74 \begin{center} |
|
75 \begin{tabular}{@{}lcl@{}} |
|
76 $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; |
|
77 (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ |
|
78 $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\ |
|
79 $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) |
|
80 \end{tabular} |
|
81 \end{center} |
|
82 |
|
83 \noindent |
|
84 Here $\textit{flatten}$ behaves like the traditional functional programming flatten |
|
85 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it |
|
86 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$. |
|
87 |
|
88 Having defined the $\simp$ function, |
|
89 we can use the previous notation of natural |
|
90 extension from derivative w.r.t.~character to derivative |
121 extension from derivative w.r.t.~character to derivative |
91 w.r.t.~string:%\comment{simp in the [] case?} |
122 w.r.t.~string, we define the derivative that nests simplifications with derivatives:%\comment{simp in the [] case?} |
92 |
123 |
93 \begin{center} |
124 \begin{center} |
94 \begin{tabular}{lcl} |
125 \begin{tabular}{lcl} |
95 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ |
126 $r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\ |
96 $r \backslash_{simp} [\,] $ & $\dn$ & $r$ |
127 $r \backslash_{bsimps} [\,] $ & $\dn$ & $r$ |
97 \end{tabular} |
128 \end{tabular} |
98 \end{center} |
129 \end{center} |
99 |
130 |
100 \noindent |
131 \noindent |
101 to obtain an optimised version of the algorithm: |
132 Extracting bit-codes from the derivatives that had been simplified repeatedly after |
102 |
133 each derivative run, the simplified $\blexer$, called $\blexersimp$, |
|
134 is defined as |
103 \begin{center} |
135 \begin{center} |
104 \begin{tabular}{lcl} |
136 \begin{tabular}{lcl} |
105 $\textit{blexer\_simp}\;r\,s$ & $\dn$ & |
137 $\textit{blexer\_simp}\;r\,s$ & $\dn$ & |
106 $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ |
138 $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ |
107 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
139 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
140 \begin{axis}[ |
204 \begin{axis}[ |
141 xlabel={$n$}, |
205 xlabel={$n$}, |
142 ylabel={derivative size}, |
206 ylabel={derivative size}, |
143 width = 7cm, |
207 width = 7cm, |
144 height = 4cm, |
208 height = 4cm, |
145 legend entries={Lexer without $\bsimp$}, |
209 legend entries={Lexer without $\textit{bsimp}$}, |
146 legend pos= north west, |
210 legend pos= north west, |
147 legend cell align=left] |
211 legend cell align=left] |
148 \addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data}; |
212 \addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data}; |
149 \end{axis} |
213 \end{axis} |
150 \end{tikzpicture} |
214 \end{tikzpicture} |
151 \end{tabular} |
215 \end{tabular} |
152 |
216 \end{center} |
153 |
|
154 |
|
155 |
|
156 |
217 |
157 %---------------------------------------------------------------------------------------- |
218 %---------------------------------------------------------------------------------------- |
158 % SECTION common identities |
219 % SECTION rewrite relation |
159 %---------------------------------------------------------------------------------------- |
220 %---------------------------------------------------------------------------------------- |
160 \section{Common Identities In Simplification-Related Functions} |
221 \section{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)} |
161 Need to prove these before starting on the big correctness proof. |
222 The overall idea for the correctness |
162 |
223 \begin{conjecture} |
163 %----------------------------------- |
224 $\blexersimp \; r \; s = \blexer \; r\; s$ |
164 % SUBSECTION |
225 \end{conjecture} |
165 %----------------------------------- |
226 is that the $\textit{bsimp}$ will not change the regular expressions so much that |
166 \subsection{Idempotency of $\simp$} |
227 it becomes impossible to extract the $\POSIX$ values. |
167 |
228 To capture this "similarity" between unsimplified regular expressions and simplified ones, |
168 \begin{equation} |
229 we devise the rewriting relation $\rrewrite$, written infix as $\rightsquigarrow$: |
169 \simp \;r = \simp\; \simp \; r |
230 |
170 \end{equation} |
231 \begin{center} |
171 This property means we do not have to repeatedly |
232 \begin{mathpar} |
172 apply simplification in each step, which justifies |
233 \inferrule{}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\} |
173 our definition of $\blexersimp$. |
234 |
174 It will also be useful in future proofs where properties such as |
235 \inferrule{}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\} |
175 closed forms are needed. |
236 |
176 The proof is by structural induction on $r$. |
237 \inferrule{}{_{bs1} ((_{bs2} \ONE) \cdot r) \rightsquigarrow \fuse \; (bs_1 @ bs_2) \; r\\}\\ |
177 |
238 |
178 %----------------------------------- |
239 |
179 % SUBSECTION |
240 |
180 %----------------------------------- |
241 \inferrule{\\ r_1 \rightsquigarrow r_2}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_2 \cdot r_3\\} |
181 \subsection{Syntactic Equivalence Under $\simp$} |
242 |
182 We prove that minor differences can be annhilated |
243 \inferrule{\\ r_3 \rightsquigarrow r_4}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_1 \cdot r_4\\}\\ |
183 by $\simp$. |
244 |
184 For example, |
245 \inferrule{}{ _{bs}\sum [] \rightsquigarrow \ZERO\\} |
185 \begin{center} |
246 |
186 $\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = |
247 \inferrule{}{ _{bs}\sum [a] \rightsquigarrow \fuse \; bs \; a\\} |
187 \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$ |
248 |
188 \end{center} |
249 \inferrule{\\ rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{_{bs}\sum rs_1 \rightsquigarrow rs_2\\} |
189 |
250 |
190 |
251 \inferrule{}{\\ [] \stackrel{s}{\rightsquigarrow} []} |
191 |
252 |
192 |
253 \inferrule{rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{\\ r :: rs_1 \rightsquigarrow r :: rs_2 \rightsquigarrow} |
193 |
254 |
194 |
255 \inferrule{r_1 \rightsquigarrow r_2}{ r_1 :: rs \stackrel{s}{\rightsquigarrow} r_2 :: rs\\} |
195 |
256 |
196 %---------------------------------------------------------------------------------------- |
257 \inferrule{}{\ZERO :: rs \stackrel{s}{\rightsquigarrow} rs} |
197 % SECTION corretness proof |
258 |
198 %---------------------------------------------------------------------------------------- |
259 \inferrule{}{_{bs} \sum (rs_1 :: rs_b) \stackrel{s}{\rightsquigarrow} ((\map \; (\fuse \; bs_1) \; rs_1) @ rsb) } |
199 \section{Proof Technique of Correctness of Bit-coded Algorithm with Simplification} |
260 |
|
261 \inferrule{\\ \rerase{a_1} = \rerase{a_2}}{rs_a @ [a_1] @ rs_b @ [a_2] @ rsc \stackrel{s}{\rightsquigarrow} rs_a @ [a_1] @ rs_b @ rs_c} |
|
262 |
|
263 \end{mathpar} |
|
264 \end{center} |
|
265 These "rewrite" steps define the atomic simplifications we could impose on regular expressions |
|
266 under our simplification algorithm. |
|
267 For convenience, we define a relation $\stackrel{s}{\rightsquigarrow}$ for rewriting |
|
268 a list of regular exression to another list. |
|
269 The $\rerase{}$ function is used instead of $_\downarrow$ for the finiteness bound proof of next chapter, |
|
270 which we will discuss later. For the moment the reader can assume they basically do the same thing as $\erase$. |
|
271 |
|
272 Usually more than one steps are taking place during the simplification of a regular expression, |
|
273 therefore we define the reflexive transitive closure of the $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$ |
|
274 relations: |
|
275 |
|
276 \begin{center} |
|
277 \begin{mathpar} |
|
278 \inferrule{}{ r \stackrel{*}{\rightsquigarrow} r \\} |
|
279 \inferrule{}{\\ rs \stackrel{s*}{\rightsquigarrow} rs \\} |
|
280 \inferrule{\\r_1 \stackrel{*}{\rightsquigarrow} r_2 \land \; r_2 \stackrel{*}{\rightsquigarrow} r_3}{r_1 \stackrel{*}{\rightsquigarrow} r_3\\} |
|
281 \inferrule{\\rs_1 \stackrel{*}{\rightsquigarrow} rs_2 \land \; rs_2 \stackrel{*}{\rightsquigarrow} rs_3}{rs_1 \stackrel{*}{\rightsquigarrow} rs_3} |
|
282 \end{mathpar} |
|
283 \end{center} |
|
284 Now that we have modelled the rewriting behaviour of our simplifier $\bsimp$, we prove mainly |
|
285 three properties about how these relations connect to $\blexersimp$: |
|
286 \begin{itemize} |
|
287 \item |
|
288 $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$ |
|
289 The algorithm $\bsimp$ only transforms the regex $r$ using steps specified by |
|
290 $\stackrel{*}{\rightsquigarrow}$ and nothing else. |
|
291 \item |
|
292 $r \rightsquigarrow r' \implies r \backslash c \rightsquigarrow r'\backslash c$. |
|
293 The relation $\stackrel{*}{rightsquigarrow}$ is preserved under derivative. |
|
294 \item |
|
295 $(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r_1) \implies \bmkeps \; r = \bmkeps \; r'$. If we reach another |
|
296 expression in finitely many atomic simplification steps, then these two regular expressions |
|
297 will produce the same bit-codes under the bit collection function $\bmkeps$ used by our $\blexer$. |
|
298 |
|
299 \end{itemize} |
|
300 \section{Three Important properties} |
|
301 These properties would work together towards the correctness theorem. |
|
302 We start proving each of these lemmas below. |
|
303 \subsection{$(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r_1) \implies \bmkeps \; r = \bmkeps \; r'$ and $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$} |
|
304 The first few properties we establish is that the inference rules we gave for $\rightsquigarrow$ |
|
305 and $\stackrel{s}{\rightsquigarrow}$ also hold as implications for $\stackrel{*}{\rightsquigarrow}$ and |
|
306 $\stackrel{s*}{\rightsquigarrow}$. |
|
307 \begin{lemma} |
|
308 $rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies _{bs} \sum rs_1 \stackrel{*}{\rightsquigarrow} _{bs} \sum rs_2$ |
|
309 \end{lemma} |
|
310 \begin{proof} |
|
311 By rule induction of $\stackrel{s*}{\rightsquigarrow}$. |
|
312 \end{proof} |
|
313 \begin{lemma} |
|
314 $r \stackrel{*}{\rightsquigarrow} r' \implies _{bs} \sum r :: rs \stackrel{*}{\rightsquigarrow} _{bs} \sum r' :: rs$ |
|
315 \end{lemma} |
|
316 \begin{proof} |
|
317 By rule induction of $\stackrel{*}{\rightsquigarrow} $. |
|
318 \end{proof} |
|
319 \noindent |
|
320 Then we establish that the $\stackrel{s}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ relation is preserved w.r.t appending and prepending of a list: |
|
321 \begin{lemma} |
|
322 $rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s}{\rightsquigarrow} rs @ rs_2$ |
|
323 \end{lemma} |
|
324 \begin{proof} |
|
325 By induction on the list $rs$. |
|
326 \end{proof} |
|
327 |
|
328 \begin{lemma} |
|
329 $rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s*}{\rightsquigarrow} rs @ rs_2$ |
|
330 \end{lemma} |
|
331 \begin{proof} |
|
332 By rule induction of $\stackrel{s*}{\rightsquigarrow}$. |
|
333 \end{proof} |
|
334 |
|
335 \noindent |
|
336 The $\stackrel{s}{\rightsquigarrow} $ relation after appending a list becomes $\stackrel{s*}{\rightsquigarrow}$: |
|
337 \begin{lemma}\label{ssgqTossgs} |
|
338 $rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$ |
|
339 \end{lemma} |
|
340 \begin{proof} |
|
341 By rule induction of $\stackrel{s}{\rightsquigarrow}$. |
|
342 \end{proof} |
|
343 \begin{lemma} |
|
344 $rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$ |
|
345 \end{lemma} |
|
346 \begin{proof} |
|
347 By rule induction of $\stackrel{s*}{\rightsquigarrow}$ and using \ref{ssgqTossgs}. |
|
348 \end{proof} |
|
349 Here are two lemmas relating $\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$: |
|
350 \begin{lemma}\label{singleton} |
|
351 $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies [r_1] \stackrel{s*}{\rightsquigarrow} [r_2]$ |
|
352 \end{lemma} |
|
353 \begin{proof} |
|
354 By rule induction of $ \stackrel{*}{\rightsquigarrow} $. |
|
355 \end{proof} |
|
356 \begin{lemma} |
|
357 $rs_3 \stackrel{s*}{\rightsquigarrow} rs_4 \land r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies |
|
358 r_2 :: rs_3 \stackrel{s*}{\rightsquigarrow} r_2 :: rs_4$ |
|
359 \end{lemma} |
|
360 \begin{proof} |
|
361 By using \ref{singleton}. |
|
362 \end{proof} |
|
363 Now we get to the "meaty" part of the proof, which relates the relations $\stackrel{s*}{\rightsquigarrow}$ and |
|
364 $\stackrel{*}{\rightsquigarrow} $ with our simplification components such $\distinctBy$ and $\flts$. |
|
365 The first lemma below says that for a list made of two parts $rs_1 @ rs_2$, one can throw away the duplicate |
|
366 elements in $rs_2$, as well as those that have appeared in $rs_1$: |
|
367 \begin{lemma} |
|
368 $rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} (rs_1 @ (\distinctBy \; rs_2 \; \; \rerase{\_}\; \; (\map\;\; \rerase{\_}\; \; rs_1)))$ |
|
369 \end{lemma} |
|
370 \begin{proof} |
|
371 By induction on $rs_2$, where $rs_1$ is allowed to be arbitrary. |
|
372 \end{proof} |
|
373 The above h as the corollary that is suitable for the actual way $\distinctBy$ is called in $\bsimp$: |
|
374 \begin{lemma}\label{dBPreserves} |
|
375 $rs_ 1 \rightarrow \distinctBy \; rs_1 \; \phi$ |
|
376 \end{lemma} |
|
377 |
|
378 |
|
379 |
|
380 The flatten function $\flts$ works within the $\rightsquigarrow$ relation: |
|
381 \begin{lemma}\label{fltsPreserves} |
|
382 $rs \stackrel{s*}{\rightsquigarrow} \flts \; rs$ |
|
383 \end{lemma} |
|
384 |
|
385 The rewriting in many steps property is composible in terms of regular expression constructors: |
|
386 \begin{lemma} |
|
387 $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies _{bs} r_1 \cdot r_3 \stackrel{*}{\rightsquigarrow} \; _{bs} r_2 \cdot r_3 \quad $ and |
|
388 $r_3 \stackrel{*}{\rightsquigarrow} r_4 \implies _{bs} r_1 \cdot r_3 \stackrel{*}{\rightsquigarrow} _{bs} \; r_1 \cdot r_4$ |
|
389 \end{lemma} |
|
390 |
|
391 The rewriting in many steps properties $\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ is preserved under the function $\fuse$: |
|
392 \begin{lemma} |
|
393 $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies \fuse \; bs \; r_1 \stackrel{*}{\rightsquigarrow} \; \fuse \; bs \; r_2 \quad $ and |
|
394 $rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \map \; (\fuse \; bs) \; rs_1 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs) \; rs_2$ |
|
395 \end{lemma} |
|
396 \begin{proof} |
|
397 By the properties $r_1 \rightsquigarrow r_2 \implies \fuse \; bs \; r_1 \implies \fuse \; bs \; r_2$ and |
|
398 $rs_2 \stackrel{s}{\rightsquigarrow} rs_3 \implies \map \; (\fuse \; bs) rs_2 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs)\; rs_3$. |
|
399 \end{proof} |
|
400 \noindent |
|
401 If we could rewrite a regular expression in many steps to $\ZERO$, then |
|
402 we could also rewrite any sequence containing it to $\ZERO$: |
|
403 \begin{lemma} |
|
404 $r_1 \stackrel{*}{\rightsquigarrow} \ZERO \implies _{bs}r_1\cdot r_2 \stackrel{*}{\rightsquigarrow} \ZERO$ |
|
405 \end{lemma} |
|
406 \begin{lemma} |
|
407 $\bmkeps \; (r \backslash s) = \bmkeps \; \bderssimp{r}{s}$ |
|
408 \end{lemma} |
|
409 \noindent |
|
410 The function $\bsimpalts$ preserves rewritability: |
|
411 \begin{lemma}\label{bsimpaltsPreserves} |
|
412 $_{bs} \sum rs \stackrel{*}{\rightsquigarrow} \bsimpalts \; _{bs} \; rs$ |
|
413 \end{lemma} |
|
414 \noindent |
|
415 Before we give out the next lemmas, we define a predicate for a list of regular expressions |
|
416 having at least one nullable regular expressions: |
|
417 \begin{definition} |
|
418 $\textit{bnullables} \; rs \dn \exists r \in rs. \bnullable r$ |
|
419 \end{definition} |
|
420 The rewriting relation $\rightsquigarrow$ preserves nullability: |
|
421 \begin{lemma} |
|
422 $r_1 \rightsquigarrow r_2 \implies \bnullable \; r_1 = \bnullable \; r_2$ and |
|
423 $rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \textit{bnullables} \; rs_1 = \textit{bnullables} \; rs_2$ |
|
424 \end{lemma} |
|
425 \begin{proof} |
|
426 By rule induction of $\rightarrow$ and $\stackrel{s}{\rightsquigarrow}$. |
|
427 \end{proof} |
|
428 So does the many steps rewriting: |
|
429 \begin{lemma}\label{rewritesBnullable} |
|
430 $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies \bnullable \; r_1 = \bnullable \; r_2$ |
|
431 \end{lemma} |
|
432 \begin{proof} |
|
433 By rule induction of $\stackrel{*}{\rightsquigarrow} $. |
|
434 \end{proof} |
|
435 \noindent |
|
436 And if both regular expressions in a rewriting relation are nullable, then they |
|
437 produce the same bit-codes: |
|
438 |
|
439 \begin{lemma}\label{rewriteBmkepsAux} |
|
440 $r_1 \rightsquigarrow r_2 \implies (\bnullable \; r_1 \land \bnullable \; r_2 \implies \bmkeps \; r_1 = |
|
441 \bmkeps \; r_2)$ and |
|
442 $rs_ 1 \stackrel{s}{\rightsquigarrow} rs_2 \implies (\bnullables \; rs_1 \land \bnullables \; rs_2 \implies |
|
443 \bmkepss \; rs_1 = \bmkepss \; rs2)$ |
|
444 \end{lemma} |
|
445 \noindent |
|
446 The definition of $\bmkepss$ on list $rs$ is just to extract the bit-codes on the first element in $rs$ that |
|
447 is $bnullable$: |
|
448 \begin{center} |
|
449 \begin{tabular}{lcl} |
|
450 $\bmkepss \; [] $ & $\dn$ & $[]$\\ |
|
451 $\bmkepss \; r :: rs$ & $\dn$ & $\textit{if} \; \bnullable \; r then \; (\bmkeps \; r) \; \textit{else} \; \bmkepss \; rs$ |
|
452 \end{tabular} |
|
453 \end{center} |
|
454 \noindent |
|
455 And now we are ready to prove the key property that if you |
|
456 have two regular expressions, one rewritable in many steps to the other, |
|
457 and one of them is $\bnullable$, then they will both yield the same bits under $\bmkeps$: |
|
458 \begin{lemma} |
|
459 $(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r_1) \implies \bmkeps \; r = \bmkeps \; r'$ |
|
460 \end{lemma} |
|
461 \begin{proof} |
|
462 By rule induction of $\stackrel{*}{\rightsquigarrow} $, using \ref{rewriteBmkepsAux} and $\ref{rewritesBnullable}$. |
|
463 \end{proof} |
|
464 \noindent |
|
465 the other property is also ready: |
|
466 \begin{lemma} |
|
467 $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$ |
|
468 \end{lemma} |
|
469 \begin{proof} |
|
470 By an induction on $r$. |
|
471 The most difficult case would be the alternative case, where we using properties such as \ref{bsimpaltsPreserves} and \ref{fltsPreserves} and \ref{dBPreserves}, we could continuously rewrite a list like:\\ |
|
472 $rs \stackrel{s*}{\rightsquigarrow} \map \; \textit{bsimp} \; rs$\\ |
|
473 $\ldots \stackrel{s*}{\rightsquigarrow} \flts \; (\map \; \textit{bsimp} \; rs)$\\ |
|
474 $\ldots \;\stackrel{s*}{\rightsquigarrow} \distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase \; \phi$\\ |
|
475 Then we could do the following regular expresssion many steps rewrite:\\ |
|
476 $ _{bs} \sum \distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase \; \phi \stackrel{*}{\rightsquigarrow} \bsimpalts \; bs \; (\distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase \; \phi)$ |
|
477 \\ |
|
478 |
|
479 \end{proof} |
|
480 \section{Proof for the Property: $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$} |
|
481 The first thing we prove is that if we could rewrite in one step, then after derivative |
|
482 we could rewrite in many steps: |
|
483 \begin{lemma}\label{rewriteBder} |
|
484 $r_1 \rightsquigarrow r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$ and |
|
485 $rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \map \; (\_\backslash c) \; rs_1 \stackrel{s*}{\rightsquigarrow} \map \; (\_ \backslash c) \; rs_2$ |
|
486 \end{lemma} |
|
487 \begin{proof} |
|
488 By induction on $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$, using a number of the previous lemmas. |
|
489 \end{proof} |
|
490 Now we can prove that once we could rewrite from one expression to another in many steps, |
|
491 then after a derivative on both sides we could still rewrite one to another in many steps: |
|
492 \begin{lemma} |
|
493 $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$ |
|
494 \end{lemma} |
|
495 \begin{proof} |
|
496 By rule induction of $\stackrel{*}{\rightsquigarrow} $ and using the previous lemma :\ref{rewriteBder}. |
|
497 \end{proof} |
|
498 \noindent |
|
499 This can be extended and combined with the previous two important properties |
|
500 so that a regular expression's successivve derivatives can be rewritten in many steps |
|
501 to its simplified counterpart: |
|
502 \begin{lemma}\label{bderBderssimp} |
|
503 $a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $ |
|
504 \end{lemma} |
|
505 \subsection{Main Theorem} |
|
506 Now with \ref{bdersBderssimp} we are ready for the main theorem. |
|
507 To link $\blexersimp$ and $\blexer$, |
|
508 we first say that they give out the same bits, if the lexing result is a match: |
|
509 \begin{lemma} |
|
510 $\bnullable \; (a \backslash s) \implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$ |
|
511 \end{lemma} |
|
512 \noindent |
|
513 Now that they give out the same bits, we know that they give the same value after decoding, |
|
514 which we know is correct value as $\blexer$ is correct: |
|
515 \begin{theorem} |
|
516 $\blexer \; r \; s = \blexersimp{r}{s}$ |
|
517 \end{theorem} |
|
518 \noindent |
|
519 |
|
520 \subsection{Comments on the Proof Techniques Used} |
200 The non-trivial part of proving the correctness of the algorithm with simplification |
521 The non-trivial part of proving the correctness of the algorithm with simplification |
201 compared with not having simplification is that we can no longer use the argument |
522 compared with not having simplification is that we can no longer use the argument |
202 in \cref{flex_retrieve}. |
523 in \cref{flex_retrieve}. |
203 The function \retrieve needs the structure of the annotated regular expression to |
524 The function \retrieve needs the cumbersome structure of the (umsimplified) |
|
525 annotated regular expression to |
204 agree with the structure of the value, but simplification will always mess with the |
526 agree with the structure of the value, but simplification will always mess with the |
205 structure: |
527 structure. |
206 %TODO: after simp does not agree with each other: (a + 0) --> a v.s. Left(Char(a)) |
528 |
|
529 We also tried to prove $\bsimp{\bderssimp{a}{s}} = \bsimp{a\backslash s}$, |
|
530 but this turns out to be not true, A counterexample of this being |
|
531 \[ r = [(1+c)\cdot [aa \cdot (1+c)]] \land s = aa |
|
532 \] |
|
533 |
|
534 Then we would have $\bsimp{a \backslash s}$ being |
|
535 $_{[]}(_{ZZ}\ONE + _{ZS}c ) $ |
|
536 whereas $\bsimp{\bderssimp{a}{s}}$ would be |
|
537 $_{Z}(_{Z} \ONE + _{S} c)$. |
|
538 Unfortunately if we apply $\textit{bsimp}$ at different |
|
539 stages we will always have this discrepancy, due to |
|
540 whether the $\map \; (\fuse\; bs) \; as$ operation in $\textit{bsimp}$ |
|
541 is taken at some points will be entirely dependant on when the simplification |
|
542 take place whether there is a larger alternative structure surrounding the |
|
543 alternative being simplified. |
|
544 The good thing about $\stackrel{*}{\rightsquigarrow} $ is that it allows |
|
545 us not specify how exactly the "atomic" simplification steps $\rightsquigarrow$ |
|
546 are taken, but simply say that they can be taken to make two similar |
|
547 regular expressions equal, and can be done after interleaving derivatives |
|
548 and simplifications. |
|
549 |
|
550 |
|
551 Having correctness property is good. But we would also like the lexer to be efficient in |
|
552 some sense, for exampe, not grinding to a halt at certain cases. |
|
553 In the next chapter we shall prove that for a given $r$, the internal derivative size is always |
|
554 finitely bounded by a constant. |