532
|
1 |
% Chapter Template
|
|
2 |
|
|
3 |
% Main chapter title
|
|
4 |
\chapter{Correctness of Bit-coded Algorithm with Simplification}
|
|
5 |
|
|
6 |
\label{Bitcoded2} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
|
|
7 |
%Then we illustrate how the algorithm without bitcodes falls short for such aggressive
|
|
8 |
%simplifications and therefore introduce our version of the bitcoded algorithm and
|
|
9 |
%its correctness proof in
|
|
10 |
%Chapter 3\ref{Chapter3}.
|
|
11 |
|
|
12 |
|
|
13 |
|
538
|
14 |
Now we introduce the simplifications, which is why we introduce the
|
|
15 |
bitcodes in the first place.
|
|
16 |
|
543
|
17 |
\section{Simplification for Annotated Regular Expressions}
|
|
18 |
The first thing we notice in the fast growth of examples such as $(a^*a^*)^*$'s
|
|
19 |
and $(a^* + (aa)^*)^*$'s derivatives is that a lot of duplicated sub-patterns
|
|
20 |
are scattered around different levels:
|
538
|
21 |
|
543
|
22 |
\begin{center}
|
|
23 |
$(a^*a^*)^* \rightarrow (a^*a^* + a^*)\cdot(a^*a^*)^*$\\
|
|
24 |
$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
|
|
25 |
\end{center}
|
|
26 |
\noindent
|
|
27 |
Despite that we have already implemented the simple-minded simplifications
|
|
28 |
such as throwing away useless $\ONE$s and $\ZERO$s.
|
|
29 |
The simplification rule $r + r \rightarrow $ cannot make a difference either
|
|
30 |
since it only removes duplicates on the same level, not something like
|
|
31 |
$(r+a)+r \rightarrow r+a$.
|
|
32 |
This requires us to break up nested alternatives into lists, for example
|
|
33 |
using the flatten operation similar to the one defined for any function language's
|
|
34 |
list datatype:
|
538
|
35 |
|
543
|
36 |
\begin{center}
|
|
37 |
\begin{tabular}{@{}lcl@{}}
|
|
38 |
$\textit{flts} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
|
|
39 |
(\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flts} \; as' $ \\
|
|
40 |
$\textit{flts} \; \ZERO :: as'$ & $\dn$ & $ \textit{flts} \; \textit{as'} $ \\
|
|
41 |
$\textit{flts} \; a :: as'$ & $\dn$ & $a :: \textit{flts} \; \textit{as'}$ \quad(otherwise)
|
|
42 |
\end{tabular}
|
|
43 |
\end{center}
|
|
44 |
|
|
45 |
\noindent
|
|
46 |
There is a minor difference though, in that our $\flts$ operation defined by us
|
|
47 |
also throws away $\ZERO$s.
|
|
48 |
For a flattened list of regular expressions, a de-duplication can be done efficiently:
|
|
49 |
|
|
50 |
|
|
51 |
\begin{center}
|
|
52 |
\begin{tabular}{@{}lcl@{}}
|
|
53 |
$\distinctBy \; [] \; f\; acc $ & $ =$ & $ []$\\
|
|
54 |
$\distinctBy \; (x :: xs) \; f \; acc$ & $=$ & \\
|
|
55 |
& & $\quad \textit{if} (f \; x) \in acc \textit{then} \distinctBy \; xs \; f \; acc$\\
|
|
56 |
& & $\quad \textit{else} x :: (\distinctBy \; xs \; f \; (\{f \; x\} \cup acc))$
|
|
57 |
\end{tabular}
|
|
58 |
\end{center}
|
|
59 |
\noindent
|
|
60 |
The reason we define a distinct function under a mapping $f$ is because
|
|
61 |
we want to eliminate regular expressions that are the same syntactically,
|
|
62 |
but having different bit-codes (more on the reason why we can do this later).
|
|
63 |
To achieve this, we call $\erase$ as the function $f$ during the distinction
|
|
64 |
operation.
|
|
65 |
A recursive definition of our simplification function
|
538
|
66 |
that looks somewhat similar to our Scala code is given below:
|
|
67 |
|
|
68 |
\begin{center}
|
|
69 |
\begin{tabular}{@{}lcl@{}}
|
|
70 |
|
543
|
71 |
$\textit{bsimp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ \textit{bsimp}_{ASEQ} \; bs \;(\textit{bsimp} \; a_1) \; (\textit{bsimp} \; a_2) $ \\
|
|
72 |
$\textit{bsimp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{bsimp}_{ALTS} \; \textit{bs} \; (\textit{distinctBy} \; ( \textit{flatten} ( \textit{map} \; bsimp \; as)) \; \erase \; \phi) $ \\
|
|
73 |
$\textit{bsimp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
|
538
|
74 |
\end{tabular}
|
|
75 |
\end{center}
|
|
76 |
|
|
77 |
\noindent
|
543
|
78 |
The simplification (named $\bsimp$ for \emph{b}it-coded) does a pattern matching on the regular expression.
|
538
|
79 |
When it detected that the regular expression is an alternative or
|
543
|
80 |
sequence, it will try to simplify its children regular expressions
|
538
|
81 |
recursively and then see if one of the children turns into $\ZERO$ or
|
|
82 |
$\ONE$, which might trigger further simplification at the current level.
|
543
|
83 |
Current level simplifications are handled by the function $\textit{bsimp}_{ASEQ}$,
|
|
84 |
using rules such as $\ZERO \cdot r \rightarrow \ZERO$ and $\ONE \cdot r \rightarrow r$.
|
|
85 |
\begin{center}
|
|
86 |
\begin{tabular}{@{}lcl@{}}
|
|
87 |
$\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\
|
|
88 |
&&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
|
|
89 |
&&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
|
|
90 |
&&$\quad\textit{case} \; (_{bs1}\ONE, a_2') \Rightarrow \textit{fuse} \; (bs@bs_1) \; a_2'$ \\
|
|
91 |
&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$
|
|
92 |
\end{tabular}
|
|
93 |
\end{center}
|
538
|
94 |
\noindent
|
543
|
95 |
The most involved part is the $\sum$ clause, where we first call $\flts$ on
|
|
96 |
the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$.
|
|
97 |
and then call $\distinctBy$ on that list, the predicate determining whether two
|
|
98 |
elements are the same is $\erase \; r_1 = \erase\; r_2$.
|
|
99 |
Finally, depending on whether the regular expression list $as'$ has turned into a
|
|
100 |
singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{AALTS}$
|
|
101 |
decides whether to keep the current level constructor $\sum$ as it is, and
|
|
102 |
removes it when there are less than two elements:
|
|
103 |
\begin{center}
|
|
104 |
\begin{tabular}{lcl}
|
|
105 |
$\textit{bsimp}_{AALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\
|
|
106 |
&&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
|
|
107 |
&&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\
|
|
108 |
&&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\
|
|
109 |
\end{tabular}
|
|
110 |
|
|
111 |
\end{center}
|
|
112 |
Having defined the $\bsimp$ function,
|
|
113 |
we add it as a phase after a derivative is taken,
|
|
114 |
so it stays small:
|
|
115 |
\begin{center}
|
|
116 |
\begin{tabular}{lcl}
|
|
117 |
$r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$
|
|
118 |
\end{tabular}
|
|
119 |
\end{center}
|
|
120 |
Following previous notation of natural
|
538
|
121 |
extension from derivative w.r.t.~character to derivative
|
543
|
122 |
w.r.t.~string, we define the derivative that nests simplifications with derivatives:%\comment{simp in the [] case?}
|
538
|
123 |
|
|
124 |
\begin{center}
|
|
125 |
\begin{tabular}{lcl}
|
543
|
126 |
$r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\
|
|
127 |
$r \backslash_{bsimps} [\,] $ & $\dn$ & $r$
|
538
|
128 |
\end{tabular}
|
|
129 |
\end{center}
|
|
130 |
|
|
131 |
\noindent
|
543
|
132 |
Extracting bit-codes from the derivatives that had been simplified repeatedly after
|
|
133 |
each derivative run, the simplified $\blexer$, called $\blexersimp$,
|
|
134 |
is defined as
|
538
|
135 |
\begin{center}
|
|
136 |
\begin{tabular}{lcl}
|
|
137 |
$\textit{blexer\_simp}\;r\,s$ & $\dn$ &
|
|
138 |
$\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
|
|
139 |
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
|
|
140 |
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
|
|
141 |
& & $\;\;\textit{else}\;\textit{None}$
|
|
142 |
\end{tabular}
|
|
143 |
\end{center}
|
|
144 |
|
|
145 |
\noindent
|
|
146 |
This algorithm keeps the regular expression size small, for example,
|
543
|
147 |
with this simplification our previous $(a + aa)^*$ example's fast growth (over
|
|
148 |
$10^5$ nodes at around $20$ input length)
|
538
|
149 |
will be reduced to just 17 and stays constant, no matter how long the
|
|
150 |
input string is.
|
543
|
151 |
We show some graphs to better demonstrate this imrpovement.
|
538
|
152 |
|
|
153 |
|
543
|
154 |
\section{$(a+aa)^*$ and $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification}
|
|
155 |
For $(a+aa)^*$, it used to grow to over $9000$ nodes with simple-minded simplification
|
|
156 |
at only around $15$ input characters:
|
|
157 |
\begin{center}
|
539
|
158 |
\begin{tabular}{ll}
|
|
159 |
\begin{tikzpicture}
|
|
160 |
\begin{axis}[
|
|
161 |
xlabel={$n$},
|
|
162 |
ylabel={derivative size},
|
|
163 |
width=7cm,
|
|
164 |
height=4cm,
|
543
|
165 |
legend entries={Lexer with $\textit{bsimp}$},
|
|
166 |
legend pos= south east,
|
|
167 |
legend cell align=left]
|
|
168 |
\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star_bsimp.data};
|
|
169 |
\end{axis}
|
|
170 |
\end{tikzpicture} %\label{fig:BitcodedLexer}
|
|
171 |
&
|
|
172 |
\begin{tikzpicture}
|
|
173 |
\begin{axis}[
|
|
174 |
xlabel={$n$},
|
|
175 |
ylabel={derivative size},
|
|
176 |
width = 7cm,
|
|
177 |
height = 4cm,
|
|
178 |
legend entries={Lexer without $\textit{bsimp}$},
|
|
179 |
legend pos= north west,
|
|
180 |
legend cell align=left]
|
|
181 |
\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star_easysimp.data};
|
|
182 |
\end{axis}
|
|
183 |
\end{tikzpicture}
|
|
184 |
\end{tabular}
|
|
185 |
\end{center}
|
|
186 |
And for $(a^*\cdot a^*)^*$, unlike in \ref{fig:BetterWaterloo}, the size with simplification now stay nicely constant with current
|
|
187 |
simplification rules (we put the graphs together to show the contrast)
|
|
188 |
\begin{center}
|
|
189 |
\begin{tabular}{ll}
|
|
190 |
\begin{tikzpicture}
|
|
191 |
\begin{axis}[
|
|
192 |
xlabel={$n$},
|
|
193 |
ylabel={derivative size},
|
|
194 |
width=7cm,
|
|
195 |
height=4cm,
|
|
196 |
legend entries={Lexer with $\textit{bsimp}$},
|
539
|
197 |
legend pos= south east,
|
|
198 |
legend cell align=left]
|
|
199 |
\addplot[red,mark=*, mark options={fill=white}] table {BitcodedLexer.data};
|
|
200 |
\end{axis}
|
|
201 |
\end{tikzpicture} %\label{fig:BitcodedLexer}
|
|
202 |
&
|
|
203 |
\begin{tikzpicture}
|
|
204 |
\begin{axis}[
|
|
205 |
xlabel={$n$},
|
|
206 |
ylabel={derivative size},
|
|
207 |
width = 7cm,
|
|
208 |
height = 4cm,
|
543
|
209 |
legend entries={Lexer without $\textit{bsimp}$},
|
539
|
210 |
legend pos= north west,
|
|
211 |
legend cell align=left]
|
|
212 |
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
|
|
213 |
\end{axis}
|
|
214 |
\end{tikzpicture}
|
|
215 |
\end{tabular}
|
543
|
216 |
\end{center}
|
538
|
217 |
|
543
|
218 |
%----------------------------------------------------------------------------------------
|
|
219 |
% SECTION rewrite relation
|
|
220 |
%----------------------------------------------------------------------------------------
|
|
221 |
\section{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)}
|
|
222 |
The overall idea for the correctness
|
|
223 |
\begin{conjecture}
|
|
224 |
$\blexersimp \; r \; s = \blexer \; r\; s$
|
|
225 |
\end{conjecture}
|
|
226 |
is that the $\textit{bsimp}$ will not change the regular expressions so much that
|
|
227 |
it becomes impossible to extract the $\POSIX$ values.
|
|
228 |
To capture this "similarity" between unsimplified regular expressions and simplified ones,
|
|
229 |
we devise the rewriting relation $\rrewrite$, written infix as $\rightsquigarrow$:
|
538
|
230 |
|
543
|
231 |
\begin{center}
|
|
232 |
\begin{mathpar}
|
|
233 |
\inferrule{}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\}
|
|
234 |
|
|
235 |
\inferrule{}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\}
|
|
236 |
|
|
237 |
\inferrule{}{_{bs1} ((_{bs2} \ONE) \cdot r) \rightsquigarrow \fuse \; (bs_1 @ bs_2) \; r\\}\\
|
|
238 |
|
|
239 |
|
|
240 |
|
|
241 |
\inferrule{\\ r_1 \rightsquigarrow r_2}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_2 \cdot r_3\\}
|
|
242 |
|
|
243 |
\inferrule{\\ r_3 \rightsquigarrow r_4}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_1 \cdot r_4\\}\\
|
|
244 |
|
|
245 |
\inferrule{}{ _{bs}\sum [] \rightsquigarrow \ZERO\\}
|
|
246 |
|
|
247 |
\inferrule{}{ _{bs}\sum [a] \rightsquigarrow \fuse \; bs \; a\\}
|
|
248 |
|
|
249 |
\inferrule{\\ rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{_{bs}\sum rs_1 \rightsquigarrow rs_2\\}
|
|
250 |
|
|
251 |
\inferrule{}{\\ [] \stackrel{s}{\rightsquigarrow} []}
|
|
252 |
|
|
253 |
\inferrule{rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{\\ r :: rs_1 \rightsquigarrow r :: rs_2 \rightsquigarrow}
|
|
254 |
|
|
255 |
\inferrule{r_1 \rightsquigarrow r_2}{ r_1 :: rs \stackrel{s}{\rightsquigarrow} r_2 :: rs\\}
|
|
256 |
|
|
257 |
\inferrule{}{\ZERO :: rs \stackrel{s}{\rightsquigarrow} rs}
|
|
258 |
|
|
259 |
\inferrule{}{_{bs} \sum (rs_1 :: rs_b) \stackrel{s}{\rightsquigarrow} ((\map \; (\fuse \; bs_1) \; rs_1) @ rsb) }
|
|
260 |
|
|
261 |
\inferrule{\\ \rerase{a_1} = \rerase{a_2}}{rs_a @ [a_1] @ rs_b @ [a_2] @ rsc \stackrel{s}{\rightsquigarrow} rs_a @ [a_1] @ rs_b @ rs_c}
|
|
262 |
|
|
263 |
\end{mathpar}
|
|
264 |
\end{center}
|
|
265 |
These "rewrite" steps define the atomic simplifications we could impose on regular expressions
|
|
266 |
under our simplification algorithm.
|
|
267 |
For convenience, we define a relation $\stackrel{s}{\rightsquigarrow}$ for rewriting
|
|
268 |
a list of regular exression to another list.
|
|
269 |
The $\rerase{}$ function is used instead of $_\downarrow$ for the finiteness bound proof of next chapter,
|
|
270 |
which we will discuss later. For the moment the reader can assume they basically do the same thing as $\erase$.
|
|
271 |
|
|
272 |
Usually more than one steps are taking place during the simplification of a regular expression,
|
|
273 |
therefore we define the reflexive transitive closure of the $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$
|
|
274 |
relations:
|
|
275 |
|
|
276 |
\begin{center}
|
|
277 |
\begin{mathpar}
|
|
278 |
\inferrule{}{ r \stackrel{*}{\rightsquigarrow} r \\}
|
|
279 |
\inferrule{}{\\ rs \stackrel{s*}{\rightsquigarrow} rs \\}
|
|
280 |
\inferrule{\\r_1 \stackrel{*}{\rightsquigarrow} r_2 \land \; r_2 \stackrel{*}{\rightsquigarrow} r_3}{r_1 \stackrel{*}{\rightsquigarrow} r_3\\}
|
|
281 |
\inferrule{\\rs_1 \stackrel{*}{\rightsquigarrow} rs_2 \land \; rs_2 \stackrel{*}{\rightsquigarrow} rs_3}{rs_1 \stackrel{*}{\rightsquigarrow} rs_3}
|
|
282 |
\end{mathpar}
|
|
283 |
\end{center}
|
|
284 |
Now that we have modelled the rewriting behaviour of our simplifier $\bsimp$, we prove mainly
|
|
285 |
three properties about how these relations connect to $\blexersimp$:
|
|
286 |
\begin{itemize}
|
|
287 |
\item
|
|
288 |
$r \stackrel{*}{\rightsquigarrow} \bsimp{r}$
|
|
289 |
The algorithm $\bsimp$ only transforms the regex $r$ using steps specified by
|
|
290 |
$\stackrel{*}{\rightsquigarrow}$ and nothing else.
|
|
291 |
\item
|
|
292 |
$r \rightsquigarrow r' \implies r \backslash c \rightsquigarrow r'\backslash c$.
|
|
293 |
The relation $\stackrel{*}{rightsquigarrow}$ is preserved under derivative.
|
|
294 |
\item
|
|
295 |
$(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r_1) \implies \bmkeps \; r = \bmkeps \; r'$. If we reach another
|
|
296 |
expression in finitely many atomic simplification steps, then these two regular expressions
|
|
297 |
will produce the same bit-codes under the bit collection function $\bmkeps$ used by our $\blexer$.
|
|
298 |
|
|
299 |
\end{itemize}
|
|
300 |
\section{Three Important properties}
|
|
301 |
These properties would work together towards the correctness theorem.
|
|
302 |
We start proving each of these lemmas below.
|
|
303 |
\subsection{$(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r_1) \implies \bmkeps \; r = \bmkeps \; r'$ and $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$}
|
|
304 |
The first few properties we establish is that the inference rules we gave for $\rightsquigarrow$
|
|
305 |
and $\stackrel{s}{\rightsquigarrow}$ also hold as implications for $\stackrel{*}{\rightsquigarrow}$ and
|
|
306 |
$\stackrel{s*}{\rightsquigarrow}$.
|
|
307 |
\begin{lemma}
|
|
308 |
$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies _{bs} \sum rs_1 \stackrel{*}{\rightsquigarrow} _{bs} \sum rs_2$
|
|
309 |
\end{lemma}
|
|
310 |
\begin{proof}
|
|
311 |
By rule induction of $\stackrel{s*}{\rightsquigarrow}$.
|
|
312 |
\end{proof}
|
|
313 |
\begin{lemma}
|
|
314 |
$r \stackrel{*}{\rightsquigarrow} r' \implies _{bs} \sum r :: rs \stackrel{*}{\rightsquigarrow} _{bs} \sum r' :: rs$
|
|
315 |
\end{lemma}
|
|
316 |
\begin{proof}
|
|
317 |
By rule induction of $\stackrel{*}{\rightsquigarrow} $.
|
|
318 |
\end{proof}
|
|
319 |
\noindent
|
|
320 |
Then we establish that the $\stackrel{s}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ relation is preserved w.r.t appending and prepending of a list:
|
|
321 |
\begin{lemma}
|
|
322 |
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s}{\rightsquigarrow} rs @ rs_2$
|
|
323 |
\end{lemma}
|
|
324 |
\begin{proof}
|
|
325 |
By induction on the list $rs$.
|
|
326 |
\end{proof}
|
|
327 |
|
|
328 |
\begin{lemma}
|
|
329 |
$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s*}{\rightsquigarrow} rs @ rs_2$
|
|
330 |
\end{lemma}
|
|
331 |
\begin{proof}
|
|
332 |
By rule induction of $\stackrel{s*}{\rightsquigarrow}$.
|
|
333 |
\end{proof}
|
|
334 |
|
|
335 |
\noindent
|
|
336 |
The $\stackrel{s}{\rightsquigarrow} $ relation after appending a list becomes $\stackrel{s*}{\rightsquigarrow}$:
|
|
337 |
\begin{lemma}\label{ssgqTossgs}
|
|
338 |
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$
|
|
339 |
\end{lemma}
|
|
340 |
\begin{proof}
|
|
341 |
By rule induction of $\stackrel{s}{\rightsquigarrow}$.
|
|
342 |
\end{proof}
|
|
343 |
\begin{lemma}
|
|
344 |
$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$
|
|
345 |
\end{lemma}
|
|
346 |
\begin{proof}
|
|
347 |
By rule induction of $\stackrel{s*}{\rightsquigarrow}$ and using \ref{ssgqTossgs}.
|
|
348 |
\end{proof}
|
|
349 |
Here are two lemmas relating $\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$:
|
|
350 |
\begin{lemma}\label{singleton}
|
|
351 |
$r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies [r_1] \stackrel{s*}{\rightsquigarrow} [r_2]$
|
|
352 |
\end{lemma}
|
|
353 |
\begin{proof}
|
|
354 |
By rule induction of $ \stackrel{*}{\rightsquigarrow} $.
|
|
355 |
\end{proof}
|
|
356 |
\begin{lemma}
|
|
357 |
$rs_3 \stackrel{s*}{\rightsquigarrow} rs_4 \land r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies
|
|
358 |
r_2 :: rs_3 \stackrel{s*}{\rightsquigarrow} r_2 :: rs_4$
|
|
359 |
\end{lemma}
|
|
360 |
\begin{proof}
|
|
361 |
By using \ref{singleton}.
|
|
362 |
\end{proof}
|
|
363 |
Now we get to the "meaty" part of the proof, which relates the relations $\stackrel{s*}{\rightsquigarrow}$ and
|
|
364 |
$\stackrel{*}{\rightsquigarrow} $ with our simplification components such $\distinctBy$ and $\flts$.
|
|
365 |
The first lemma below says that for a list made of two parts $rs_1 @ rs_2$, one can throw away the duplicate
|
|
366 |
elements in $rs_2$, as well as those that have appeared in $rs_1$:
|
|
367 |
\begin{lemma}
|
|
368 |
$rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} (rs_1 @ (\distinctBy \; rs_2 \; \; \rerase{\_}\; \; (\map\;\; \rerase{\_}\; \; rs_1)))$
|
|
369 |
\end{lemma}
|
|
370 |
\begin{proof}
|
|
371 |
By induction on $rs_2$, where $rs_1$ is allowed to be arbitrary.
|
|
372 |
\end{proof}
|
|
373 |
The above h as the corollary that is suitable for the actual way $\distinctBy$ is called in $\bsimp$:
|
|
374 |
\begin{lemma}\label{dBPreserves}
|
|
375 |
$rs_ 1 \rightarrow \distinctBy \; rs_1 \; \phi$
|
|
376 |
\end{lemma}
|
538
|
377 |
|
|
378 |
|
532
|
379 |
|
543
|
380 |
The flatten function $\flts$ works within the $\rightsquigarrow$ relation:
|
|
381 |
\begin{lemma}\label{fltsPreserves}
|
|
382 |
$rs \stackrel{s*}{\rightsquigarrow} \flts \; rs$
|
|
383 |
\end{lemma}
|
|
384 |
|
|
385 |
The rewriting in many steps property is composible in terms of regular expression constructors:
|
|
386 |
\begin{lemma}
|
|
387 |
$r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies _{bs} r_1 \cdot r_3 \stackrel{*}{\rightsquigarrow} \; _{bs} r_2 \cdot r_3 \quad $ and
|
|
388 |
$r_3 \stackrel{*}{\rightsquigarrow} r_4 \implies _{bs} r_1 \cdot r_3 \stackrel{*}{\rightsquigarrow} _{bs} \; r_1 \cdot r_4$
|
|
389 |
\end{lemma}
|
532
|
390 |
|
543
|
391 |
The rewriting in many steps properties $\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ is preserved under the function $\fuse$:
|
|
392 |
\begin{lemma}
|
|
393 |
$r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies \fuse \; bs \; r_1 \stackrel{*}{\rightsquigarrow} \; \fuse \; bs \; r_2 \quad $ and
|
|
394 |
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \map \; (\fuse \; bs) \; rs_1 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs) \; rs_2$
|
|
395 |
\end{lemma}
|
|
396 |
\begin{proof}
|
|
397 |
By the properties $r_1 \rightsquigarrow r_2 \implies \fuse \; bs \; r_1 \implies \fuse \; bs \; r_2$ and
|
|
398 |
$rs_2 \stackrel{s}{\rightsquigarrow} rs_3 \implies \map \; (\fuse \; bs) rs_2 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs)\; rs_3$.
|
|
399 |
\end{proof}
|
|
400 |
\noindent
|
|
401 |
If we could rewrite a regular expression in many steps to $\ZERO$, then
|
|
402 |
we could also rewrite any sequence containing it to $\ZERO$:
|
|
403 |
\begin{lemma}
|
|
404 |
$r_1 \stackrel{*}{\rightsquigarrow} \ZERO \implies _{bs}r_1\cdot r_2 \stackrel{*}{\rightsquigarrow} \ZERO$
|
|
405 |
\end{lemma}
|
|
406 |
\begin{lemma}
|
|
407 |
$\bmkeps \; (r \backslash s) = \bmkeps \; \bderssimp{r}{s}$
|
|
408 |
\end{lemma}
|
|
409 |
\noindent
|
|
410 |
The function $\bsimpalts$ preserves rewritability:
|
|
411 |
\begin{lemma}\label{bsimpaltsPreserves}
|
|
412 |
$_{bs} \sum rs \stackrel{*}{\rightsquigarrow} \bsimpalts \; _{bs} \; rs$
|
|
413 |
\end{lemma}
|
|
414 |
\noindent
|
|
415 |
Before we give out the next lemmas, we define a predicate for a list of regular expressions
|
|
416 |
having at least one nullable regular expressions:
|
|
417 |
\begin{definition}
|
|
418 |
$\textit{bnullables} \; rs \dn \exists r \in rs. \bnullable r$
|
|
419 |
\end{definition}
|
|
420 |
The rewriting relation $\rightsquigarrow$ preserves nullability:
|
|
421 |
\begin{lemma}
|
|
422 |
$r_1 \rightsquigarrow r_2 \implies \bnullable \; r_1 = \bnullable \; r_2$ and
|
|
423 |
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \textit{bnullables} \; rs_1 = \textit{bnullables} \; rs_2$
|
|
424 |
\end{lemma}
|
|
425 |
\begin{proof}
|
|
426 |
By rule induction of $\rightarrow$ and $\stackrel{s}{\rightsquigarrow}$.
|
|
427 |
\end{proof}
|
|
428 |
So does the many steps rewriting:
|
|
429 |
\begin{lemma}\label{rewritesBnullable}
|
|
430 |
$r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies \bnullable \; r_1 = \bnullable \; r_2$
|
|
431 |
\end{lemma}
|
|
432 |
\begin{proof}
|
|
433 |
By rule induction of $\stackrel{*}{\rightsquigarrow} $.
|
|
434 |
\end{proof}
|
|
435 |
\noindent
|
|
436 |
And if both regular expressions in a rewriting relation are nullable, then they
|
|
437 |
produce the same bit-codes:
|
532
|
438 |
|
543
|
439 |
\begin{lemma}\label{rewriteBmkepsAux}
|
|
440 |
$r_1 \rightsquigarrow r_2 \implies (\bnullable \; r_1 \land \bnullable \; r_2 \implies \bmkeps \; r_1 =
|
|
441 |
\bmkeps \; r_2)$ and
|
|
442 |
$rs_ 1 \stackrel{s}{\rightsquigarrow} rs_2 \implies (\bnullables \; rs_1 \land \bnullables \; rs_2 \implies
|
|
443 |
\bmkepss \; rs_1 = \bmkepss \; rs2)$
|
|
444 |
\end{lemma}
|
|
445 |
\noindent
|
|
446 |
The definition of $\bmkepss$ on list $rs$ is just to extract the bit-codes on the first element in $rs$ that
|
|
447 |
is $bnullable$:
|
532
|
448 |
\begin{center}
|
543
|
449 |
\begin{tabular}{lcl}
|
|
450 |
$\bmkepss \; [] $ & $\dn$ & $[]$\\
|
|
451 |
$\bmkepss \; r :: rs$ & $\dn$ & $\textit{if} \; \bnullable \; r then \; (\bmkeps \; r) \; \textit{else} \; \bmkepss \; rs$
|
|
452 |
\end{tabular}
|
532
|
453 |
\end{center}
|
543
|
454 |
\noindent
|
|
455 |
And now we are ready to prove the key property that if you
|
|
456 |
have two regular expressions, one rewritable in many steps to the other,
|
|
457 |
and one of them is $\bnullable$, then they will both yield the same bits under $\bmkeps$:
|
|
458 |
\begin{lemma}
|
|
459 |
$(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r_1) \implies \bmkeps \; r = \bmkeps \; r'$
|
|
460 |
\end{lemma}
|
|
461 |
\begin{proof}
|
|
462 |
By rule induction of $\stackrel{*}{\rightsquigarrow} $, using \ref{rewriteBmkepsAux} and $\ref{rewritesBnullable}$.
|
|
463 |
\end{proof}
|
|
464 |
\noindent
|
|
465 |
the other property is also ready:
|
|
466 |
\begin{lemma}
|
|
467 |
$r \stackrel{*}{\rightsquigarrow} \bsimp{r}$
|
|
468 |
\end{lemma}
|
|
469 |
\begin{proof}
|
|
470 |
By an induction on $r$.
|
|
471 |
The most difficult case would be the alternative case, where we using properties such as \ref{bsimpaltsPreserves} and \ref{fltsPreserves} and \ref{dBPreserves}, we could continuously rewrite a list like:\\
|
|
472 |
$rs \stackrel{s*}{\rightsquigarrow} \map \; \textit{bsimp} \; rs$\\
|
|
473 |
$\ldots \stackrel{s*}{\rightsquigarrow} \flts \; (\map \; \textit{bsimp} \; rs)$\\
|
|
474 |
$\ldots \;\stackrel{s*}{\rightsquigarrow} \distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase \; \phi$\\
|
|
475 |
Then we could do the following regular expresssion many steps rewrite:\\
|
|
476 |
$ _{bs} \sum \distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase \; \phi \stackrel{*}{\rightsquigarrow} \bsimpalts \; bs \; (\distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase \; \phi)$
|
|
477 |
\\
|
|
478 |
|
|
479 |
\end{proof}
|
|
480 |
\section{Proof for the Property: $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$}
|
|
481 |
The first thing we prove is that if we could rewrite in one step, then after derivative
|
|
482 |
we could rewrite in many steps:
|
|
483 |
\begin{lemma}\label{rewriteBder}
|
|
484 |
$r_1 \rightsquigarrow r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$ and
|
|
485 |
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \map \; (\_\backslash c) \; rs_1 \stackrel{s*}{\rightsquigarrow} \map \; (\_ \backslash c) \; rs_2$
|
|
486 |
\end{lemma}
|
|
487 |
\begin{proof}
|
|
488 |
By induction on $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$, using a number of the previous lemmas.
|
|
489 |
\end{proof}
|
|
490 |
Now we can prove that once we could rewrite from one expression to another in many steps,
|
|
491 |
then after a derivative on both sides we could still rewrite one to another in many steps:
|
|
492 |
\begin{lemma}
|
|
493 |
$r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$
|
|
494 |
\end{lemma}
|
|
495 |
\begin{proof}
|
|
496 |
By rule induction of $\stackrel{*}{\rightsquigarrow} $ and using the previous lemma :\ref{rewriteBder}.
|
|
497 |
\end{proof}
|
|
498 |
\noindent
|
|
499 |
This can be extended and combined with the previous two important properties
|
|
500 |
so that a regular expression's successivve derivatives can be rewritten in many steps
|
|
501 |
to its simplified counterpart:
|
|
502 |
\begin{lemma}\label{bderBderssimp}
|
|
503 |
$a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $
|
|
504 |
\end{lemma}
|
|
505 |
\subsection{Main Theorem}
|
|
506 |
Now with \ref{bdersBderssimp} we are ready for the main theorem.
|
|
507 |
To link $\blexersimp$ and $\blexer$,
|
|
508 |
we first say that they give out the same bits, if the lexing result is a match:
|
|
509 |
\begin{lemma}
|
|
510 |
$\bnullable \; (a \backslash s) \implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$
|
|
511 |
\end{lemma}
|
|
512 |
\noindent
|
|
513 |
Now that they give out the same bits, we know that they give the same value after decoding,
|
|
514 |
which we know is correct value as $\blexer$ is correct:
|
|
515 |
\begin{theorem}
|
|
516 |
$\blexer \; r \; s = \blexersimp{r}{s}$
|
|
517 |
\end{theorem}
|
|
518 |
\noindent
|
532
|
519 |
|
543
|
520 |
\subsection{Comments on the Proof Techniques Used}
|
532
|
521 |
The non-trivial part of proving the correctness of the algorithm with simplification
|
|
522 |
compared with not having simplification is that we can no longer use the argument
|
|
523 |
in \cref{flex_retrieve}.
|
543
|
524 |
The function \retrieve needs the cumbersome structure of the (umsimplified)
|
|
525 |
annotated regular expression to
|
532
|
526 |
agree with the structure of the value, but simplification will always mess with the
|
543
|
527 |
structure.
|
|
528 |
|
|
529 |
We also tried to prove $\bsimp{\bderssimp{a}{s}} = \bsimp{a\backslash s}$,
|
|
530 |
but this turns out to be not true, A counterexample of this being
|
|
531 |
\[ r = [(1+c)\cdot [aa \cdot (1+c)]] \land s = aa
|
|
532 |
\]
|
|
533 |
|
|
534 |
Then we would have $\bsimp{a \backslash s}$ being
|
|
535 |
$_{[]}(_{ZZ}\ONE + _{ZS}c ) $
|
|
536 |
whereas $\bsimp{\bderssimp{a}{s}}$ would be
|
|
537 |
$_{Z}(_{Z} \ONE + _{S} c)$.
|
|
538 |
Unfortunately if we apply $\textit{bsimp}$ at different
|
|
539 |
stages we will always have this discrepancy, due to
|
|
540 |
whether the $\map \; (\fuse\; bs) \; as$ operation in $\textit{bsimp}$
|
|
541 |
is taken at some points will be entirely dependant on when the simplification
|
|
542 |
take place whether there is a larger alternative structure surrounding the
|
|
543 |
alternative being simplified.
|
|
544 |
The good thing about $\stackrel{*}{\rightsquigarrow} $ is that it allows
|
|
545 |
us not specify how exactly the "atomic" simplification steps $\rightsquigarrow$
|
|
546 |
are taken, but simply say that they can be taken to make two similar
|
|
547 |
regular expressions equal, and can be done after interleaving derivatives
|
|
548 |
and simplifications.
|
|
549 |
|
|
550 |
|
|
551 |
Having correctness property is good. But we would also like the lexer to be efficient in
|
|
552 |
some sense, for exampe, not grinding to a halt at certain cases.
|
|
553 |
In the next chapter we shall prove that for a given $r$, the internal derivative size is always
|
|
554 |
finitely bounded by a constant.
|