author | Chengsong |
Fri, 26 May 2023 08:09:30 +0100 | |
changeset 646 | 56057198e4f5 |
parent 640 | bd1354127574 |
child 649 | ef2b8abcbc55 |
permissions | -rwxr-xr-x |
532 | 1 |
% Chapter Template |
2 |
||
3 |
% Main chapter title |
|
4 |
\chapter{Correctness of Bit-coded Algorithm with Simplification} |
|
5 |
||
6 |
\label{Bitcoded2} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX} |
|
7 |
%Then we illustrate how the algorithm without bitcodes falls short for such aggressive |
|
8 |
%simplifications and therefore introduce our version of the bitcoded algorithm and |
|
9 |
%its correctness proof in |
|
10 |
%Chapter 3\ref{Chapter3}. |
|
11 |
||
12 |
||
13 |
||
600 | 14 |
In this chapter we introduce simplifications |
624 | 15 |
for annotated regular expressions that can be applied to |
583 | 16 |
each intermediate derivative result. This allows |
17 |
us to make $\blexer$ much more efficient. |
|
624 | 18 |
Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions, |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
19 |
but their simplification functions could have been more efficient and in some cases needed fixing. |
624 | 20 |
%We contrast our simplification function |
21 |
%with Sulzmann and Lu's, indicating the simplicity of our algorithm. |
|
22 |
%This is another case for the usefulness |
|
23 |
%and reliability of formal proofs on algorithms. |
|
24 |
%These ``aggressive'' simplifications would not be possible in the injection-based |
|
25 |
%lexing we introduced in chapter \ref{Inj}. |
|
26 |
%We then prove the correctness with the improved version of |
|
27 |
%$\blexer$, called $\blexersimp$, by establishing |
|
28 |
%$\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system. |
|
29 |
% |
|
585 | 30 |
\section{Simplifications by Sulzmann and Lu} |
624 | 31 |
Consider the derivatives of the following example $(a^*a^*)^*$: |
32 |
%and $(a^* + (aa)^*)^*$: |
|
543 | 33 |
\begin{center} |
624 | 34 |
\begin{tabular}{lcl} |
35 |
$(a^*a^*)^*$ & $ \stackrel{\backslash a}{\longrightarrow}$ & |
|
36 |
$ (a^*a^* + a^*)\cdot(a^*a^*)^*$\\ |
|
37 |
& |
|
38 |
$ \stackrel{\backslash a}{\longrightarrow} $ & |
|
39 |
$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$\\ |
|
40 |
& $\stackrel{\backslash a}{ |
|
41 |
\longrightarrow} $ & $\ldots$\\ |
|
42 |
\end{tabular} |
|
543 | 43 |
\end{center} |
44 |
\noindent |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
45 |
As can be seen, there are several duplications. |
600 | 46 |
A simple-minded simplification function cannot simplify |
585 | 47 |
the third regular expression in the above chain of derivative |
600 | 48 |
regular expressions, namely |
583 | 49 |
\begin{center} |
579 | 50 |
$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$ |
583 | 51 |
\end{center} |
600 | 52 |
because the duplicates are |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
53 |
not next to each other, and therefore the rule |
639 | 54 |
$r+ r \rightarrow r$ from $\textit{simp}$ does not fire. |
600 | 55 |
One would expect a better simplification function to work in the |
579 | 56 |
following way: |
57 |
\begin{gather*} |
|
58 |
((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + |
|
59 |
\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\ |
|
624 | 60 |
\bigg\downarrow (1) \\ |
579 | 61 |
(a^*a^* + a^* |
62 |
\color{gray} + a^* \color{black})\cdot(a^*a^*)^* + |
|
63 |
\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this} \\ |
|
624 | 64 |
\bigg\downarrow (2) \\ |
579 | 65 |
(a^*a^* + a^* |
66 |
)\cdot(a^*a^*)^* |
|
583 | 67 |
\color{gray} + (a^*a^* + a^*) \cdot(a^*a^*)^*\\ |
624 | 68 |
\bigg\downarrow (3) \\ |
583 | 69 |
(a^*a^* + a^* |
70 |
)\cdot(a^*a^*)^* |
|
579 | 71 |
\end{gather*} |
72 |
\noindent |
|
600 | 73 |
In the first step, the nested alternative regular expression |
74 |
$(a^*a^* + a^*) + a^*$ is flattened into $a^*a^* + a^* + a^*$. |
|
624 | 75 |
Now the third term $a^*$ can clearly be identified as a duplicate |
76 |
and therefore removed in the second step. |
|
77 |
This causes the two |
|
600 | 78 |
top-level terms to become the same and the second $(a^*a^*+a^*)\cdot(a^*a^*)^*$ |
624 | 79 |
removed in the final step. |
80 |
Sulzmann and Lu's simplification function (using our notations) can achieve this |
|
81 |
simplification: |
|
579 | 82 |
\begin{center} |
83 |
\begin{tabular}{lcl} |
|
624 | 84 |
$\textit{simp}\_{SL} \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ & |
579 | 85 |
$\textit{if} \; (\textit{zeroable} \; r)\; \textit{then} \;\; \ZERO$\\ |
86 |
& &$\textit{else}\;\; \fuse \; (bs@ bs') \; r$\\ |
|
624 | 87 |
$\textit{simp}\_{SL} \;(_{bs}r_1\cdot r_2)$ & $\dn$ & $\textit{if} |
579 | 88 |
\; (\textit{zeroable} \; r_1 \; \textit{or} \; \textit{zeroable}\; r_2)\; |
89 |
\textit{then} \;\; \ZERO$\\ |
|
624 | 90 |
& & $\textit{else}\;\;_{bs}((\textit{simp}\_{SL} \;r_1)\cdot |
91 |
(\textit{simp}\_{SL} \; r_2))$\\ |
|
92 |
$\textit{simp}\_{SL} \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\ |
|
93 |
$\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & |
|
579 | 94 |
$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ |
624 | 95 |
$\textit{simp}\_{SL} \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\textit{simp}\_{SL} \; r)$\\ |
96 |
$\textit{simp}\_{SL} \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum |
|
639 | 97 |
(\nub \; (\filter \; (\neg\zeroable)\;((\textit{simp}\_{SL} \; r) :: \map \; \textit{simp}\_{SL} \; rs)))$\\ |
579 | 98 |
|
99 |
\end{tabular} |
|
100 |
\end{center} |
|
101 |
\noindent |
|
624 | 102 |
The $\textit{zeroable}$ predicate |
600 | 103 |
tests whether the regular expression |
624 | 104 |
is equivalent to $\ZERO$, and |
600 | 105 |
can be defined as: |
579 | 106 |
\begin{center} |
107 |
\begin{tabular}{lcl} |
|
108 |
$\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\; |
|
109 |
\zeroable \;_{[]}\sum\;rs $\\ |
|
110 |
$\zeroable\;_{bs}(r_1 \cdot r_2)$ & $\dn$ & $\zeroable\; r_1 \;\; \lor \;\; \zeroable \; r_2$\\ |
|
111 |
$\zeroable\;_{bs}r^*$ & $\dn$ & $\textit{false}$ \\ |
|
112 |
$\zeroable\;_{bs}c$ & $\dn$ & $\textit{false}$\\ |
|
113 |
$\zeroable\;_{bs}\ONE$ & $\dn$ & $\textit{false}$\\ |
|
114 |
$\zeroable\;_{bs}\ZERO$ & $\dn$ & $\textit{true}$ |
|
115 |
\end{tabular} |
|
116 |
\end{center} |
|
117 |
\noindent |
|
624 | 118 |
The |
585 | 119 |
\begin{center} |
120 |
\begin{tabular}{lcl} |
|
624 | 121 |
$\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & |
122 |
$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ |
|
123 |
\end{tabular} |
|
124 |
\end{center} |
|
125 |
\noindent |
|
126 |
clause does flatten the alternative as required in step (1), |
|
127 |
but $\textit{simp}\_{SL}$ is insufficient if we want to do steps (2) and (3), |
|
128 |
as these ``identical'' terms have different bit-annotations. |
|
129 |
They also suggested that the $\textit{simp}\_{SL} $ function should be |
|
130 |
applied repeatedly until a fixpoint is reached. |
|
131 |
We call this construction $\textit{SLSimp}$: |
|
132 |
\begin{center} |
|
133 |
\begin{tabular}{lcl} |
|
134 |
$\textit{SLSimp} \; r$ & $\dn$ & |
|
135 |
$\textit{while}((\textit{simp}\_{SL} \; r)\; \cancel{=} \; r)$ \\ |
|
136 |
& & $\quad r := \textit{simp}\_{SL} \; r$\\ |
|
585 | 137 |
& & $\textit{return} \; r$ |
138 |
\end{tabular} |
|
139 |
\end{center} |
|
140 |
We call the operation of alternatingly |
|
141 |
applying derivatives and simplifications |
|
142 |
(until the string is exhausted) Sulz-simp-derivative, |
|
624 | 143 |
written $\backslash_{SLSimp}$: |
585 | 144 |
\begin{center} |
145 |
\begin{tabular}{lcl} |
|
624 | 146 |
$r \backslash_{SLSimp} (c\!::\!s) $ & $\dn$ & $(\textit{SLSimp} \; (r \backslash c)) \backslash_{SLSimp}\, s$ \\ |
147 |
$r \backslash_{SLSimp} [\,] $ & $\dn$ & $r$ |
|
585 | 148 |
\end{tabular} |
149 |
\end{center} |
|
150 |
\noindent |
|
151 |
After the derivatives have been taken, the bitcodes |
|
152 |
are extracted and decoded in the same manner |
|
153 |
as $\blexer$: |
|
154 |
\begin{center} |
|
155 |
\begin{tabular}{lcl} |
|
624 | 156 |
$\textit{blexer\_SLSimp}\;r\,s$ & $\dn$ & |
157 |
$\textit{let}\;a = (r^\uparrow)\backslash_{SLSimp}\, s\;\textit{in}$\\ |
|
585 | 158 |
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
159 |
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
|
160 |
& & $\;\;\textit{else}\;\textit{None}$ |
|
161 |
\end{tabular} |
|
162 |
\end{center} |
|
163 |
\noindent |
|
164 |
We implemented this lexing algorithm in Scala, |
|
165 |
and found that the final derivative regular expression |
|
639 | 166 |
size still grows exponentially (note the logarithmic scale): |
584 | 167 |
\begin{figure}[H] |
585 | 168 |
\centering |
584 | 169 |
\begin{tikzpicture} |
170 |
\begin{axis}[ |
|
171 |
xlabel={$n$}, |
|
585 | 172 |
ylabel={size}, |
584 | 173 |
ymode = log, |
585 | 174 |
legend entries={Final Derivative Size}, |
584 | 175 |
legend pos=north west, |
176 |
legend cell align=left] |
|
177 |
\addplot[red,mark=*, mark options={fill=white}] table {SulzmannLuLexer.data}; |
|
178 |
\end{axis} |
|
179 |
\end{tikzpicture} |
|
585 | 180 |
\caption{Lexing the regular expression $(a^*a^*)^*$ against strings of the form |
584 | 181 |
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s} |
585 | 182 |
$ using Sulzmann and Lu's lexer}\label{SulzmannLuLexer} |
584 | 183 |
\end{figure} |
184 |
\noindent |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
185 |
At $n= 20$ we already get an out-of-memory error with Scala's normal |
585 | 186 |
JVM heap size settings. |
624 | 187 |
In fact their simplification does not improve much over |
585 | 188 |
the simple-minded simplifications we have shown in \ref{fig:BetterWaterloo}. |
189 |
The time required also grows exponentially: |
|
190 |
\begin{figure}[H] |
|
191 |
\centering |
|
192 |
\begin{tikzpicture} |
|
193 |
\begin{axis}[ |
|
194 |
xlabel={$n$}, |
|
195 |
ylabel={time}, |
|
601 | 196 |
%ymode = log, |
585 | 197 |
legend entries={time in secs}, |
198 |
legend pos=north west, |
|
199 |
legend cell align=left] |
|
200 |
\addplot[red,mark=*, mark options={fill=white}] table {SulzmannLuLexerTime.data}; |
|
201 |
\end{axis} |
|
202 |
\end{tikzpicture} |
|
203 |
\caption{Lexing the regular expression $(a^*a^*)^*$ against strings of the form |
|
204 |
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s} |
|
205 |
$ using Sulzmann and Lu's lexer}\label{SulzmannLuLexerTime} |
|
206 |
\end{figure} |
|
207 |
\noindent |
|
208 |
which seems like a counterexample for |
|
639 | 209 |
Sulzmann and Lu's linear complexity claim |
210 |
in their paper \cite{Sulzmann2014}: |
|
584 | 211 |
\begin{quote}\it |
639 | 212 |
``Linear-Time Complexity Claim \\It is easy to see that each call of one of the functions/operations: |
213 |
simp, fuse, mkEpsBC and isPhi leads to subcalls whose number is bound by the size of the regular expression involved. We claim that thanks to aggressively applying simp this size remains finite. Hence, we can argue that the above mentioned functions/operations have constant time complexity which implies that we can incrementally compute bit-coded parse trees in linear time in the size of the input.'' |
|
585 | 214 |
\end{quote} |
215 |
\noindent |
|
216 |
The assumption that the size of the regular expressions |
|
217 |
in the algorithm |
|
639 | 218 |
would stay below a finite constant is not true, at least not in the |
624 | 219 |
examples we considered. |
639 | 220 |
The main reason behind this is that (i) Haskell's $\textit{nub}$ |
600 | 221 |
function requires identical annotations between two |
222 |
annotated regular expressions to qualify as duplicates, |
|
624 | 223 |
and therefore cannot simplify cases like $_{SZZ}a^*+_{SZS}a^*$ |
224 |
even if both $a^*$ denote the same language, and |
|
225 |
(ii) the ``flattening'' only applies to the head of the list |
|
600 | 226 |
in the |
227 |
\begin{center} |
|
228 |
\begin{tabular}{lcl} |
|
229 |
||
624 | 230 |
$\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & |
600 | 231 |
$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ |
232 |
\end{tabular} |
|
233 |
\end{center} |
|
234 |
\noindent |
|
624 | 235 |
clause, and therefore is not strong enough to simplify all |
639 | 236 |
needed parts of the regular expression. Moreover, |
237 |
the $\textit{simp}\_{SL}$ function is applied repeatedly |
|
238 |
in each derivative step until a fixed point is reached, |
|
239 |
which makes the algorithm even more |
|
240 |
unpredictable and inefficient. |
|
624 | 241 |
%To not get ``caught off guard'' by |
242 |
%these counterexamples, |
|
243 |
%one needs to be more careful when designing the |
|
244 |
%simplification function and making claims about them. |
|
584 | 245 |
|
585 | 246 |
\section{Our $\textit{Simp}$ Function} |
639 | 247 |
We will now introduce our own simplification function. |
624 | 248 |
%by making a contrast with $\textit{simp}\_{SL}$. |
249 |
We also describe |
|
250 |
the ideas behind Sulzmann and Lu's $\textit{simp}\_{SL}$ |
|
251 |
algorithm |
|
639 | 252 |
and why it fails to achieve the desired effect of keeping the sizes of derivatives finitely bounded. |
253 |
In addition, our simplification function will come with a formal |
|
624 | 254 |
correctness proof. |
585 | 255 |
\subsection{Flattening Nested Alternatives} |
600 | 256 |
The idea behind the clause |
585 | 257 |
\begin{center} |
624 | 258 |
$\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad |
585 | 259 |
_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$ |
260 |
\end{center} |
|
600 | 261 |
is that it allows |
585 | 262 |
duplicate removal of regular expressions at different |
600 | 263 |
``levels'' of alternatives. |
585 | 264 |
For example, this would help with the |
265 |
following simplification: |
|
538 | 266 |
|
585 | 267 |
\begin{center} |
268 |
$(a+r)+r \longrightarrow a+r$ |
|
269 |
\end{center} |
|
624 | 270 |
The problem is that only the head element |
271 |
is ``spilled out''. |
|
272 |
It is more desirable |
|
273 |
to flatten |
|
274 |
an entire list to open up possibilities for further simplifications |
|
275 |
with later regular expressions. |
|
585 | 276 |
Not flattening the rest of the elements also means that |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
277 |
the later de-duplication process |
600 | 278 |
does not fully remove further duplicates. |
585 | 279 |
For example, |
624 | 280 |
using $\textit{simp}\_{SL}$ we cannot |
585 | 281 |
simplify |
282 |
\begin{center} |
|
600 | 283 |
$((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*+ |
585 | 284 |
((a^*a^*)+a^*)\cdot (a^*a^*)^*$ |
285 |
\end{center} |
|
639 | 286 |
due to the underlined part not being the head |
624 | 287 |
of the alternative. |
288 |
||
289 |
We define our flatten operation so that it flattens |
|
290 |
the entire list: |
|
543 | 291 |
\begin{center} |
292 |
\begin{tabular}{@{}lcl@{}} |
|
293 |
$\textit{flts} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; |
|
294 |
(\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flts} \; as' $ \\ |
|
295 |
$\textit{flts} \; \ZERO :: as'$ & $\dn$ & $ \textit{flts} \; \textit{as'} $ \\ |
|
296 |
$\textit{flts} \; a :: as'$ & $\dn$ & $a :: \textit{flts} \; \textit{as'}$ \quad(otherwise) |
|
297 |
\end{tabular} |
|
298 |
\end{center} |
|
299 |
\noindent |
|
585 | 300 |
Our $\flts$ operation |
301 |
also throws away $\ZERO$s |
|
302 |
as they do not contribute to a lexing result. |
|
303 |
\subsection{Duplicate Removal} |
|
624 | 304 |
After flattening is done, we can deduplicate. |
585 | 305 |
The de-duplicate function is called $\distinctBy$, |
306 |
and that is where we make our second improvement over |
|
624 | 307 |
Sulzmann and Lu's simplification method. |
585 | 308 |
The process goes as follows: |
309 |
\begin{center} |
|
310 |
$rs \stackrel{\textit{flts}}{\longrightarrow} |
|
311 |
rs_{flat} |
|
312 |
\xrightarrow{\distinctBy \; |
|
313 |
rs_{flat} \; \rerases\; \varnothing} rs_{distinct}$ |
|
314 |
%\stackrel{\distinctBy \; |
|
315 |
%rs_{flat} \; \erase\; \varnothing}{\longrightarrow} \; rs_{distinct}$ |
|
316 |
\end{center} |
|
317 |
where the $\distinctBy$ function is defined as: |
|
543 | 318 |
\begin{center} |
319 |
\begin{tabular}{@{}lcl@{}} |
|
320 |
$\distinctBy \; [] \; f\; acc $ & $ =$ & $ []$\\ |
|
585 | 321 |
$\distinctBy \; (x :: xs) \; f \; acc$ & $=$ & $\quad \textit{if} (f \; x \in acc)\;\; \textit{then} \;\; \distinctBy \; xs \; f \; acc$\\ |
322 |
& & $\quad \textit{else}\;\; x :: (\distinctBy \; xs \; f \; (\{f \; x\} \cup acc))$ |
|
543 | 323 |
\end{tabular} |
324 |
\end{center} |
|
325 |
\noindent |
|
326 |
The reason we define a distinct function under a mapping $f$ is because |
|
585 | 327 |
we want to eliminate regular expressions that are syntactically the same, |
624 | 328 |
but have different bit-codes. |
585 | 329 |
For example, we can remove the second $a^*a^*$ from |
330 |
$_{ZSZ}a^*a^* + _{SZZ}a^*a^*$, because it |
|
331 |
represents a match with shorter initial sub-match |
|
332 |
(and therefore is definitely not POSIX), |
|
333 |
and will be discarded by $\bmkeps$ later. |
|
334 |
\begin{center} |
|
335 |
$_{ZSZ}\underbrace{a^*}_{ZS:\; match \; 1\; times\quad}\underbrace{a^*}_{Z: \;match\; 1 \;times} + |
|
336 |
_{SZZ}\underbrace{a^*}_{S: \; match \; 0 \; times\quad}\underbrace{a^*}_{ZZ: \; match \; 2 \; times} |
|
337 |
$ |
|
338 |
\end{center} |
|
339 |
%$_{bs1} r_1 + _{bs2} r_2 \text{where} (r_1)_{\downarrow} = (r_2)_{\downarrow}$ |
|
340 |
Due to the way our algorithm works, |
|
341 |
the matches that conform to the POSIX standard |
|
342 |
will always be placed further to the left. When we |
|
343 |
traverse the list from left to right, |
|
344 |
regular expressions we have already seen |
|
345 |
will definitely not contribute to a POSIX value, |
|
346 |
even if they are attached with different bitcodes. |
|
347 |
These duplicates therefore need to be removed. |
|
348 |
To achieve this, we call $\rerases$ as the function $f$ during the distinction |
|
624 | 349 |
operation. The function |
585 | 350 |
$\rerases$ is very similar to $\erase$, except that it preserves the structure |
351 |
when erasing an alternative regular expression. |
|
352 |
The reason why we use $\rerases$ instead of $\erase$ is that |
|
353 |
it keeps the structures of alternative |
|
354 |
annotated regular expressions |
|
624 | 355 |
whereas $\erase$ would turn it back into a binary tree structure. |
585 | 356 |
Not having to mess with the structure |
590 | 357 |
greatly simplifies the finiteness proof in chapter |
624 | 358 |
\ref{Finite}. |
585 | 359 |
We give the definitions of $\rerases$ here together with |
360 |
the new datatype used by $\rerases$ (as our plain |
|
590 | 361 |
regular expression datatype does not allow non-binary alternatives). |
624 | 362 |
For now we can think of |
363 |
$\rerases$ as the function $(\_)_\downarrow$ defined in chapter \ref{Bitcoded1} |
|
364 |
and $\rrexp$ as plain regular expressions, but having a general list constructor |
|
365 |
for alternatives: |
|
590 | 366 |
\begin{figure}[H] |
367 |
\begin{center} |
|
368 |
$\rrexp ::= \RZERO \mid \RONE |
|
585 | 369 |
\mid \RCHAR{c} |
370 |
\mid \RSEQ{r_1}{r_2} |
|
371 |
\mid \RALTS{rs} |
|
590 | 372 |
\mid \RSTAR{r} $ |
373 |
\end{center} |
|
374 |
\caption{$\rrexp$: plain regular expressions, but with $\sum$ alternative |
|
375 |
constructor}\label{rrexpDef} |
|
376 |
\end{figure} |
|
624 | 377 |
The function $\rerases$ we define as follows: |
585 | 378 |
\begin{center} |
379 |
\begin{tabular}{lcl} |
|
380 |
$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\ |
|
381 |
$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\ |
|
382 |
$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\ |
|
383 |
$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\ |
|
384 |
$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\ |
|
385 |
$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$ |
|
386 |
\end{tabular} |
|
387 |
\end{center} |
|
388 |
||
389 |
\subsection{Putting Things Together} |
|
624 | 390 |
We can now give the definition of our simplification function: |
585 | 391 |
%that looks somewhat similar to our Scala code is |
538 | 392 |
\begin{center} |
393 |
\begin{tabular}{@{}lcl@{}} |
|
394 |
||
543 | 395 |
$\textit{bsimp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ \textit{bsimp}_{ASEQ} \; bs \;(\textit{bsimp} \; a_1) \; (\textit{bsimp} \; a_2) $ \\ |
585 | 396 |
$\textit{bsimp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{bsimp}_{ALTS} \; \textit{bs} \; (\textit{distinctBy} \; ( \textit{flatten} ( \textit{map} \; bsimp \; as)) \; \rerases \; \varnothing) $ \\ |
543 | 397 |
$\textit{bsimp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ |
538 | 398 |
\end{tabular} |
399 |
\end{center} |
|
400 |
||
401 |
\noindent |
|
585 | 402 |
The simplification (named $\textit{bsimp}$ for \emph{b}it-coded) |
403 |
does a pattern matching on the regular expression. |
|
639 | 404 |
When it detects that the regular expression is an alternative or |
543 | 405 |
sequence, it will try to simplify its children regular expressions |
538 | 406 |
recursively and then see if one of the children turns into $\ZERO$ or |
407 |
$\ONE$, which might trigger further simplification at the current level. |
|
543 | 408 |
Current level simplifications are handled by the function $\textit{bsimp}_{ASEQ}$, |
409 |
using rules such as $\ZERO \cdot r \rightarrow \ZERO$ and $\ONE \cdot r \rightarrow r$. |
|
410 |
\begin{center} |
|
411 |
\begin{tabular}{@{}lcl@{}} |
|
412 |
$\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\ |
|
413 |
&&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ |
|
414 |
&&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ |
|
415 |
&&$\quad\textit{case} \; (_{bs1}\ONE, a_2') \Rightarrow \textit{fuse} \; (bs@bs_1) \; a_2'$ \\ |
|
416 |
&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ |
|
417 |
\end{tabular} |
|
418 |
\end{center} |
|
538 | 419 |
\noindent |
543 | 420 |
The most involved part is the $\sum$ clause, where we first call $\flts$ on |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
421 |
the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$, |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
422 |
and then call $\distinctBy$ on that list. The predicate used in $\distinctBy$ for determining whether two |
585 | 423 |
elements are the same is $\rerases \; r_1 = \rerases\; r_2$. |
543 | 424 |
Finally, depending on whether the regular expression list $as'$ has turned into a |
624 | 425 |
singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{ALTS}$ |
543 | 426 |
decides whether to keep the current level constructor $\sum$ as it is, and |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
427 |
removes it when there are fewer than two elements: |
543 | 428 |
\begin{center} |
429 |
\begin{tabular}{lcl} |
|
624 | 430 |
$\textit{bsimp}_{ALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\ |
543 | 431 |
&&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ |
432 |
&&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
|
433 |
&&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ |
|
434 |
\end{tabular} |
|
435 |
||
436 |
\end{center} |
|
624 | 437 |
Having defined the $\textit{bsimp}$ function, |
438 |
we add it as a phase after a derivative is taken. |
|
543 | 439 |
\begin{center} |
440 |
\begin{tabular}{lcl} |
|
441 |
$r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$ |
|
442 |
\end{tabular} |
|
443 |
\end{center} |
|
585 | 444 |
%Following previous notations |
445 |
%when extending from derivatives w.r.t.~character to derivative |
|
446 |
%w.r.t.~string, we define the derivative that nests simplifications |
|
447 |
%with derivatives:%\comment{simp in the [] case?} |
|
624 | 448 |
We extend this from characters to strings: |
538 | 449 |
\begin{center} |
450 |
\begin{tabular}{lcl} |
|
543 | 451 |
$r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\ |
452 |
$r \backslash_{bsimps} [\,] $ & $\dn$ & $r$ |
|
538 | 453 |
\end{tabular} |
454 |
\end{center} |
|
455 |
||
456 |
\noindent |
|
585 | 457 |
The lexer that extracts bitcodes from the |
458 |
derivatives with simplifications from our $\simp$ function |
|
459 |
is called $\blexersimp$: |
|
460 |
\begin{center} |
|
538 | 461 |
\begin{tabular}{lcl} |
462 |
$\textit{blexer\_simp}\;r\,s$ & $\dn$ & |
|
639 | 463 |
$\textit{let}\;a = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\ |
538 | 464 |
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
465 |
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
|
466 |
& & $\;\;\textit{else}\;\textit{None}$ |
|
467 |
\end{tabular} |
|
468 |
\end{center} |
|
469 |
\noindent |
|
639 | 470 |
This algorithm keeps the regular expression size small, |
471 |
as we shall demonstrate with some examples in the next section. |
|
538 | 472 |
|
473 |
||
600 | 474 |
\subsection{Examples $(a+aa)^*$ and $(a^*\cdot a^*)^*$ |
475 |
After Simplification} |
|
476 |
Recall the |
|
585 | 477 |
previous $(a^*a^*)^*$ example |
624 | 478 |
where $\textit{simp}\_{SL}$ could not |
600 | 479 |
prevent the fast growth (over |
585 | 480 |
3 million nodes just below $20$ input length) |
600 | 481 |
will be reduced to just 15 and stays constant no matter how long the |
585 | 482 |
input string is. |
600 | 483 |
This is shown in the graphs below. |
585 | 484 |
\begin{figure}[H] |
543 | 485 |
\begin{center} |
486 |
\begin{tabular}{ll} |
|
487 |
\begin{tikzpicture} |
|
488 |
\begin{axis}[ |
|
489 |
xlabel={$n$}, |
|
490 |
ylabel={derivative size}, |
|
491 |
width=7cm, |
|
492 |
height=4cm, |
|
493 |
legend entries={Lexer with $\textit{bsimp}$}, |
|
539 | 494 |
legend pos= south east, |
495 |
legend cell align=left] |
|
496 |
\addplot[red,mark=*, mark options={fill=white}] table {BitcodedLexer.data}; |
|
497 |
\end{axis} |
|
498 |
\end{tikzpicture} %\label{fig:BitcodedLexer} |
|
499 |
& |
|
500 |
\begin{tikzpicture} |
|
501 |
\begin{axis}[ |
|
502 |
xlabel={$n$}, |
|
503 |
ylabel={derivative size}, |
|
504 |
width = 7cm, |
|
505 |
height = 4cm, |
|
624 | 506 |
legend entries={Lexer with $\textit{simp}\_{SL}$}, |
539 | 507 |
legend pos= north west, |
508 |
legend cell align=left] |
|
509 |
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data}; |
|
510 |
\end{axis} |
|
511 |
\end{tikzpicture} |
|
512 |
\end{tabular} |
|
543 | 513 |
\end{center} |
639 | 514 |
\caption{Our Improvement over Sulzmann and Lu's in terms of size of the derivatives.} |
585 | 515 |
\end{figure} |
516 |
\noindent |
|
517 |
Given the size difference, it is not |
|
518 |
surprising that our $\blexersimp$ significantly outperforms |
|
639 | 519 |
$\textit{blexer\_SLSimp}$ by Sulzmann and Lu. |
624 | 520 |
In the next section we are going to establish that our |
521 |
simplification preserves the correctness of the algorithm. |
|
543 | 522 |
%---------------------------------------------------------------------------------------- |
523 |
% SECTION rewrite relation |
|
524 |
%---------------------------------------------------------------------------------------- |
|
585 | 525 |
\section{Correctness of $\blexersimp$} |
526 |
We first introduce the rewriting relation \emph{rrewrite} |
|
527 |
($\rrewrite$) between two regular expressions, |
|
624 | 528 |
which stands for an atomic |
600 | 529 |
simplification. |
585 | 530 |
We then prove properties about |
531 |
this rewriting relation and its reflexive transitive closure. |
|
532 |
Finally we leverage these properties to show |
|
624 | 533 |
an equivalence between the results generated by |
585 | 534 |
$\blexer$ and $\blexersimp$. |
535 |
||
536 |
\subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)} |
|
576 | 537 |
In the $\blexer$'s correctness proof, we |
600 | 538 |
did not directly derive the fact that $\blexer$ generates the POSIX value, |
624 | 539 |
but first proved that $\blexer$ generates the same result as $\lexer$. |
576 | 540 |
Then we re-use |
585 | 541 |
the correctness of $\lexer$ |
624 | 542 |
to obtain |
585 | 543 |
\begin{center} |
624 | 544 |
$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$\\ |
545 |
$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\; |
|
546 |
r\;s = \None$. |
|
585 | 547 |
\end{center} |
624 | 548 |
%\begin{center} |
549 |
% $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$. |
|
550 |
%\end{center} |
|
585 | 551 |
Here we apply this |
552 |
modularised technique again |
|
553 |
by first proving that |
|
576 | 554 |
$\blexersimp \; r \; s $ |
555 |
produces the same output as $\blexer \; r\; s$, |
|
585 | 556 |
and then piecing it together with |
557 |
$\blexer$'s correctness to achieve our main |
|
624 | 558 |
theorem: |
576 | 559 |
\begin{center} |
624 | 560 |
$(r, s) \rightarrow v \; \; \textit{iff} \;\; \blexersimp \; r \; s = \Some \;v$ |
561 |
\\ |
|
562 |
$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\; |
|
563 |
r\;s = \None$ |
|
576 | 564 |
\end{center} |
565 |
\noindent |
|
585 | 566 |
The overall idea for the proof |
567 |
of $\blexer \;r \;s = \blexersimp \; r \;s$ |
|
568 |
is that the transition from $r$ to $\textit{bsimp}\; r$ can be |
|
624 | 569 |
broken down into smaller rewrite steps of the form: |
585 | 570 |
\begin{center} |
571 |
$r \rightsquigarrow^* \textit{bsimp} \; r$ |
|
572 |
\end{center} |
|
573 |
where each rewrite step, written $\rightsquigarrow$, |
|
574 |
is an ``atomic'' simplification that |
|
624 | 575 |
is similar to a small-step reduction in operational semantics ( |
576 |
see figure \ref{rrewriteRules} for the rules): |
|
585 | 577 |
\begin{figure}[H] |
578 |
\begin{mathpar} |
|
579 |
\inferrule * [Right = $S\ZERO_l$]{\vspace{0em}}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\} |
|
538 | 580 |
|
585 | 581 |
\inferrule * [Right = $S\ZERO_r$]{\vspace{0em}}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\} |
543 | 582 |
|
585 | 583 |
\inferrule * [Right = $S_1$]{\vspace{0em}}{_{bs1} ((_{bs2} \ONE) \cdot r) \rightsquigarrow \fuse \; (bs_1 @ bs_2) \; r\\}\\ |
543 | 584 |
|
585 |
||
586 |
||
585 | 587 |
\inferrule * [Right = $SL$] {\\ r_1 \rightsquigarrow r_2}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_2 \cdot r_3\\} |
543 | 588 |
|
585 | 589 |
\inferrule * [Right = $SR$] {\\ r_3 \rightsquigarrow r_4}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_1 \cdot r_4\\}\\ |
543 | 590 |
|
585 | 591 |
\inferrule * [Right = $A0$] {\vspace{0em}}{ _{bs}\sum [] \rightsquigarrow \ZERO} |
543 | 592 |
|
585 | 593 |
\inferrule * [Right = $A1$] {\vspace{0em}}{ _{bs}\sum [a] \rightsquigarrow \fuse \; bs \; a} |
543 | 594 |
|
585 | 595 |
\inferrule * [Right = $AL$] {\\ rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{_{bs}\sum rs_1 \rightsquigarrow rs_2} |
543 | 596 |
|
585 | 597 |
\inferrule * [Right = $LE$] {\vspace{0em}}{ [] \stackrel{s}{\rightsquigarrow} []} |
543 | 598 |
|
585 | 599 |
\inferrule * [Right = $LT$] {rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{ r :: rs_1 \stackrel{s}{\rightsquigarrow} r :: rs_2 } |
543 | 600 |
|
585 | 601 |
\inferrule * [Right = $LH$] {r_1 \rightsquigarrow r_2}{ r_1 :: rs \stackrel{s}{\rightsquigarrow} r_2 :: rs} |
543 | 602 |
|
585 | 603 |
\inferrule * [Right = $L\ZERO$] {\vspace{0em}}{\ZERO :: rs \stackrel{s}{\rightsquigarrow} rs} |
543 | 604 |
|
585 | 605 |
\inferrule * [Right = $LS$] {\vspace{0em}}{_{bs} \sum (rs_1 :: rs_b) \stackrel{s}{\rightsquigarrow} ((\map \; (\fuse \; bs_1) \; rs_1) @ rsb) } |
543 | 606 |
|
591 | 607 |
\inferrule * [Right = $LD$] {\\ \rerase{a_1} = \rerase{a_2}}{rs_a @ [a_1] @ rs_b @ [a_2] @ rs_c \stackrel{s}{\rightsquigarrow} rs_a @ [a_1] @ rs_b @ rs_c} |
543 | 608 |
|
609 |
\end{mathpar} |
|
585 | 610 |
\caption{ |
611 |
The rewrite rules that generate simplified regular expressions |
|
612 |
in small steps: $r_1 \rightsquigarrow r_2$ is for bitcoded regular expressions |
|
613 |
and $rs_1 \stackrel{s}{\rightsquigarrow} rs_2$ for |
|
614 |
lists of bitcoded regular expressions. |
|
615 |
Interesting is the LD rule that allows copies of regular expressions |
|
616 |
to be removed provided a regular expression |
|
617 |
earlier in the list can match the same strings. |
|
618 |
}\label{rrewriteRules} |
|
619 |
\end{figure} |
|
620 |
\noindent |
|
624 | 621 |
The rules $LT$ and $LH$ are for rewriting two regular expression lists |
585 | 622 |
such that one regular expression |
623 |
in the left-hand-side list is rewritable in one step |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
624 |
to the right-hand side's regular expression at the same position. |
639 | 625 |
This helps with defining the ``context rule'' $AL$. |
626 |
||
585 | 627 |
The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$ |
628 |
are defined in the usual way: |
|
629 |
\begin{figure}[H] |
|
630 |
\centering |
|
631 |
\begin{mathpar} |
|
632 |
\inferrule{\vspace{0em}}{ r \rightsquigarrow^* r \\} |
|
543 | 633 |
|
585 | 634 |
\inferrule{\vspace{0em}}{rs \stackrel{s*}{\rightsquigarrow} rs \\} |
543 | 635 |
|
585 | 636 |
\inferrule{r_1 \rightsquigarrow^* r_2 \land \; r_2 \rightsquigarrow^* r_3}{r_1 \rightsquigarrow^* r_3\\} |
637 |
||
638 |
\inferrule{rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \land \; rs_2 \stackrel{s*}{\rightsquigarrow} rs_3}{rs_1 \stackrel{s*}{\rightsquigarrow} rs_3} |
|
639 |
\end{mathpar} |
|
640 |
\caption{The Reflexive Transitive Closure of |
|
641 |
$\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$}\label{transClosure} |
|
642 |
\end{figure} |
|
600 | 643 |
%Two rewritable terms will remain rewritable to each other |
644 |
%even after a derivative is taken: |
|
624 | 645 |
The main point of our rewriting relation |
646 |
is that it is preserved under derivatives, |
|
600 | 647 |
namely |
543 | 648 |
\begin{center} |
585 | 649 |
$r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$ |
543 | 650 |
\end{center} |
624 | 651 |
And also, if two terms are rewritable to each other, |
585 | 652 |
then they produce the same bitcodes: |
653 |
\begin{center} |
|
654 |
$r \rightsquigarrow^* r' \;\; \textit{then} \; \; \bmkeps \; r = \bmkeps \; r'$ |
|
655 |
\end{center} |
|
656 |
The decoding phase of both $\blexer$ and $\blexersimp$ |
|
624 | 657 |
are the same, which means that if they receive the same |
585 | 658 |
bitcodes before the decoding phase, |
624 | 659 |
they generate the same value after decoding is done. |
585 | 660 |
We will prove the three properties |
661 |
we mentioned above in the next sub-section. |
|
662 |
\subsection{Important Properties of $\rightsquigarrow$} |
|
663 |
First we prove some basic facts |
|
664 |
about $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$, |
|
665 |
$\rightsquigarrow^*$ and $\stackrel{s*}{\rightsquigarrow}$, |
|
666 |
which will be needed later.\\ |
|
667 |
The inference rules (\ref{rrewriteRules}) we |
|
668 |
gave in the previous section |
|
669 |
have their ``many-steps version'': |
|
543 | 670 |
|
586 | 671 |
\begin{lemma}\label{squig1} |
585 | 672 |
\hspace{0em} |
673 |
\begin{itemize} |
|
674 |
\item |
|
675 |
$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies _{bs} \sum rs_1 \stackrel{*}{\rightsquigarrow} _{bs} \sum rs_2$ |
|
676 |
\item |
|
586 | 677 |
$r \rightsquigarrow^* r' \implies _{bs} \sum (r :: rs)\; \rightsquigarrow^*\; _{bs} \sum (r' :: rs)$ |
678 |
||
679 |
\item |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
680 |
The rewriting in many steps property is composable |
586 | 681 |
in terms of the sequence constructor:\\ |
682 |
$r_1 \rightsquigarrow^* r_2 |
|
683 |
\implies _{bs} r_1 \cdot r_3 \rightsquigarrow^* \; |
|
684 |
_{bs} r_2 \cdot r_3 \quad $ |
|
685 |
and |
|
686 |
$\quad r_3 \rightsquigarrow^* r_4 |
|
687 |
\implies _{bs} r_1 \cdot r_3 \rightsquigarrow^* _{bs} \; r_1 \cdot r_4$ |
|
688 |
\item |
|
689 |
The rewriting in many steps properties |
|
690 |
$\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ |
|
691 |
is preserved under the function $\fuse$:\\ |
|
692 |
$r_1 \rightsquigarrow^* r_2 |
|
693 |
\implies \fuse \; bs \; r_1 \rightsquigarrow^* \; |
|
694 |
\fuse \; bs \; r_2 \quad $ and |
|
695 |
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 |
|
696 |
\implies \map \; (\fuse \; bs) \; rs_1 |
|
697 |
\stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs) \; rs_2$ |
|
585 | 698 |
\end{itemize} |
543 | 699 |
\end{lemma} |
700 |
\begin{proof} |
|
585 | 701 |
By an induction on |
702 |
the inductive cases of $\stackrel{s*}{\rightsquigarrow}$ and $\rightsquigarrow^*$ respectively. |
|
586 | 703 |
The third and fourth points are |
704 |
by the properties $r_1 \rightsquigarrow r_2 \implies \fuse \; bs \; r_1 \implies \fuse \; bs \; r_2$ and |
|
705 |
$rs_2 \stackrel{s}{\rightsquigarrow} rs_3 |
|
706 |
\implies \map \; (\fuse \; bs) rs_2 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs)\; rs_3$, |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
707 |
which can be inductively proven by the inductive cases of $\rightsquigarrow$ and |
586 | 708 |
$\stackrel{s}{\rightsquigarrow}$. |
543 | 709 |
\end{proof} |
710 |
\noindent |
|
585 | 711 |
The inference rules of $\stackrel{s}{\rightsquigarrow}$ |
624 | 712 |
are defined in terms of the list cons operation, where |
585 | 713 |
we establish that the |
714 |
$\stackrel{s}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ |
|
586 | 715 |
relation is also preserved w.r.t appending and prepending of a list. |
716 |
In addition, we |
|
717 |
also prove some relations |
|
718 |
between $\rightsquigarrow^*$ and $\stackrel{s*}{\rightsquigarrow}$. |
|
585 | 719 |
\begin{lemma}\label{ssgqTossgs} |
720 |
\hspace{0em} |
|
721 |
\begin{itemize} |
|
722 |
\item |
|
723 |
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s}{\rightsquigarrow} rs @ rs_2$ |
|
724 |
||
725 |
\item |
|
726 |
$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies |
|
586 | 727 |
rs @ rs_1 \stackrel{s*}{\rightsquigarrow} rs @ rs_2 \; \; |
728 |
\textit{and} \; \; |
|
729 |
rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$ |
|
730 |
||
585 | 731 |
\item |
732 |
The $\stackrel{s}{\rightsquigarrow} $ relation after appending |
|
733 |
a list becomes $\stackrel{s*}{\rightsquigarrow}$:\\ |
|
586 | 734 |
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 |
735 |
\implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$ |
|
736 |
\item |
|
737 |
||
738 |
$r_1 \rightsquigarrow^* r_2 \implies [r_1] \stackrel{s*}{\rightsquigarrow} [r_2]$ |
|
739 |
\item |
|
740 |
||
741 |
$rs_3 \stackrel{s*}{\rightsquigarrow} rs_4 \land r_1 \rightsquigarrow^* r_2 \implies |
|
742 |
r_2 :: rs_3 \stackrel{s*}{\rightsquigarrow} r_2 :: rs_4$ |
|
743 |
\item |
|
624 | 744 |
If we can rewrite a regular expression |
586 | 745 |
in many steps to $\ZERO$, then |
624 | 746 |
we can also rewrite any sequence containing it to $\ZERO$:\\ |
586 | 747 |
$r_1 \rightsquigarrow^* \ZERO |
748 |
\implies _{bs}r_1\cdot r_2 \rightsquigarrow^* \ZERO$ |
|
585 | 749 |
\end{itemize} |
543 | 750 |
\end{lemma} |
751 |
\begin{proof} |
|
585 | 752 |
The first part is by induction on the list $rs$. |
753 |
The second part is by induction on the inductive cases of $\stackrel{s*}{\rightsquigarrow}$. |
|
754 |
The third part is |
|
755 |
by rule induction of $\stackrel{s}{\rightsquigarrow}$. |
|
586 | 756 |
The fourth sub-lemma is |
757 |
by rule induction of |
|
758 |
$\stackrel{s*}{\rightsquigarrow}$ and using part one to three. |
|
759 |
The fifth part is a corollary of part four. |
|
760 |
The last part is proven by rule induction again on $\rightsquigarrow^*$. |
|
543 | 761 |
\end{proof} |
586 | 762 |
\noindent |
624 | 763 |
Now we are ready to give the proofs of the following properties: |
585 | 764 |
\begin{itemize} |
765 |
\item |
|
624 | 766 |
$r \rightsquigarrow^* r'\land \bnullable \; r_1 |
585 | 767 |
\implies \bmkeps \; r = \bmkeps \; r'$. \\ |
768 |
\item |
|
769 |
$r \rightsquigarrow^* \textit{bsimp} \;r$.\\ |
|
770 |
\item |
|
771 |
$r \rightsquigarrow r' \implies r \backslash c \rightsquigarrow^* r'\backslash c$.\\ |
|
772 |
\end{itemize} |
|
624 | 773 |
|
774 |
\subsubsection{Property 1: $r \rightsquigarrow^* r'\land \bnullable \; r_1 |
|
586 | 775 |
\implies \bmkeps \; r = \bmkeps \; r'$} |
776 |
Intuitively, this property says we can |
|
777 |
extract the same bitcodes using $\bmkeps$ from the nullable |
|
778 |
components of two regular expressions $r$ and $r'$, |
|
779 |
if we can rewrite from one to the other in finitely |
|
639 | 780 |
many steps. |
781 |
||
586 | 782 |
For convenience, |
783 |
we define a predicate for a list of regular expressions |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
784 |
having at least one nullable regular expression: |
586 | 785 |
\begin{center} |
786 |
$\textit{bnullables} \; rs \quad \dn \quad \exists r \in rs. \;\; \bnullable \; r$ |
|
787 |
\end{center} |
|
788 |
\noindent |
|
624 | 789 |
The rewriting relation $\rightsquigarrow$ preserves (b)nullability: |
586 | 790 |
\begin{lemma}\label{rewritesBnullable} |
791 |
\hspace{0em} |
|
792 |
\begin{itemize} |
|
793 |
\item |
|
794 |
$\text{If} \; r_1 \rightsquigarrow r_2, \; |
|
795 |
\text{then} \; \bnullable \; r_1 = \bnullable \; r_2$ |
|
796 |
\item |
|
797 |
$\text{If} \; rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \; |
|
798 |
\text{then} \; \textit{bnullables} \; rs_1 = \textit{bnullables} \; rs_2$ |
|
799 |
\item |
|
800 |
$r_1 \rightsquigarrow^* r_2 |
|
801 |
\implies \bnullable \; r_1 = \bnullable \; r_2$ |
|
802 |
\end{itemize} |
|
803 |
\end{lemma} |
|
804 |
\begin{proof} |
|
805 |
By rule induction of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$. |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
806 |
The third point is a result of the second. |
586 | 807 |
\end{proof} |
808 |
\noindent |
|
809 |
For convenience again, |
|
810 |
we define $\bmkepss$ on a list $rs$, |
|
811 |
which extracts the bit-codes on the first $\bnullable$ element in $rs$: |
|
812 |
\begin{center} |
|
813 |
\begin{tabular}{lcl} |
|
814 |
$\bmkepss \; [] $ & $\dn$ & $[]$\\ |
|
815 |
$\bmkepss \; r :: rs$ & $\dn$ & $\textit{if} \;(\bnullable \; r) \;\; |
|
816 |
\textit{then} \;\; \bmkeps \; r \; \textit{else} \;\; \bmkepss \; rs$ |
|
817 |
\end{tabular} |
|
818 |
\end{center} |
|
819 |
\noindent |
|
820 |
If both regular expressions in a rewriting relation are nullable, then they |
|
821 |
produce the same bitcodes: |
|
822 |
\begin{lemma}\label{rewriteBmkepsAux} |
|
823 |
\hspace{0em} |
|
824 |
\begin{itemize} |
|
825 |
\item |
|
826 |
$r_1 \rightsquigarrow r_2 \implies |
|
827 |
(\bnullable \; r_1 \land \bnullable \; r_2 \implies \bmkeps \; r_1 = |
|
828 |
\bmkeps \; r_2)$ |
|
829 |
\item |
|
830 |
and |
|
831 |
$rs_ 1 \stackrel{s}{\rightsquigarrow} rs_2 |
|
832 |
\implies (\bnullables \; rs_1 \land \bnullables \; rs_2 \implies |
|
833 |
\bmkepss \; rs_1 = \bmkepss \; rs2)$ |
|
834 |
\end{itemize} |
|
835 |
\end{lemma} |
|
836 |
\begin{proof} |
|
837 |
By rule induction over the cases that lead to $r_1 \rightsquigarrow r_2$. |
|
838 |
\end{proof} |
|
839 |
\noindent |
|
624 | 840 |
With lemma \ref{rewriteBmkepsAux} in place we are ready to prove its |
586 | 841 |
many-step version: |
842 |
\begin{lemma} |
|
843 |
$\text{If} \;\; r \stackrel{*}{\rightsquigarrow} r' \;\; \text{and} \;\; \bnullable \; r, \;\;\; \text{then} \;\; \bmkeps \; r = \bmkeps \; r'$ |
|
844 |
\end{lemma} |
|
845 |
\begin{proof} |
|
624 | 846 |
By rule induction of $\stackrel{*}{\rightsquigarrow} $. Lemma |
847 |
$\ref{rewritesBnullable}$ gives us both $r$ and $r'$ are nullable. |
|
848 |
The lemma \ref{rewriteBmkepsAux} solves the inductive case. |
|
586 | 849 |
\end{proof} |
585 | 850 |
|
624 | 851 |
\subsubsection{Property 2: $r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$} |
852 |
Now we get to the key part of the proof, |
|
586 | 853 |
which says that our simplification's helper functions |
624 | 854 |
such as $\distinctBy$ and $\flts$ describe |
855 |
reducts of $\stackrel{s*}{\rightsquigarrow}$ and |
|
639 | 856 |
$\rightsquigarrow^* $. |
857 |
||
586 | 858 |
The first lemma to prove is a more general version of |
859 |
$rs_ 1 \rightsquigarrow^* \distinctBy \; rs_1 \; \phi$: |
|
543 | 860 |
\begin{lemma} |
586 | 861 |
$rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} |
862 |
(rs_1 @ (\distinctBy \; rs_2 \; \; \rerases \;\; (\map\;\; \rerases \; \; rs_1)))$ |
|
543 | 863 |
\end{lemma} |
586 | 864 |
\noindent |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
865 |
It says that for a list made of two parts $rs_1 @ rs_2$, |
586 | 866 |
one can throw away the duplicate |
867 |
elements in $rs_2$, as well as those that have appeared in $rs_1$. |
|
543 | 868 |
\begin{proof} |
869 |
By induction on $rs_2$, where $rs_1$ is allowed to be arbitrary. |
|
870 |
\end{proof} |
|
586 | 871 |
\noindent |
872 |
Setting $rs_2$ to be empty, |
|
873 |
we get the corollary |
|
874 |
\begin{corollary}\label{dBPreserves} |
|
875 |
$rs_1 \stackrel{s*}{\rightsquigarrow} \distinctBy \; rs_1 \; \phi$. |
|
876 |
\end{corollary} |
|
877 |
\noindent |
|
624 | 878 |
Similarly the flatten function $\flts$ describes a reduct of |
586 | 879 |
$\stackrel{s*}{\rightsquigarrow}$ as well: |
538 | 880 |
|
543 | 881 |
\begin{lemma}\label{fltsPreserves} |
882 |
$rs \stackrel{s*}{\rightsquigarrow} \flts \; rs$ |
|
883 |
\end{lemma} |
|
884 |
\begin{proof} |
|
586 | 885 |
By an induction on $rs$. |
543 | 886 |
\end{proof} |
887 |
\noindent |
|
888 |
The function $\bsimpalts$ preserves rewritability: |
|
889 |
\begin{lemma}\label{bsimpaltsPreserves} |
|
890 |
$_{bs} \sum rs \stackrel{*}{\rightsquigarrow} \bsimpalts \; _{bs} \; rs$ |
|
891 |
\end{lemma} |
|
892 |
\noindent |
|
586 | 893 |
The simplification function |
624 | 894 |
$\textit{bsimp}$ only transforms the regular expression using steps specified by |
895 |
$\rightsquigarrow^*$ and nothing else: |
|
543 | 896 |
\begin{lemma} |
624 | 897 |
$r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$ |
543 | 898 |
\end{lemma} |
899 |
\begin{proof} |
|
900 |
By an induction on $r$. |
|
624 | 901 |
The most involved case is the alternative, |
586 | 902 |
where we use lemmas \ref{bsimpaltsPreserves}, |
903 |
\ref{fltsPreserves} and \ref{dBPreserves} to do a series of rewriting:\\ |
|
904 |
\begin{center} |
|
905 |
\begin{tabular}{lcl} |
|
906 |
$rs$ & $\stackrel{s*}{\rightsquigarrow}$ & $ \map \; \textit{bsimp} \; rs$\\ |
|
907 |
& $\stackrel{s*}{\rightsquigarrow}$ & $ \flts \; (\map \; \textit{bsimp} \; rs)$\\ |
|
908 |
& $\stackrel{s*}{\rightsquigarrow}$ & $ \distinctBy \; |
|
909 |
(\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerases \; \phi$\\ |
|
910 |
\end{tabular} |
|
911 |
\end{center} |
|
624 | 912 |
Using this we can derive the following rewrite sequence:\\ |
586 | 913 |
\begin{center} |
914 |
\begin{tabular}{lcl} |
|
915 |
$r$ & $=$ & $_{bs}\sum rs$\\[1.5ex] |
|
916 |
& $\rightsquigarrow^*$ & $\bsimpalts \; bs \; rs$ \\[1.5ex] |
|
917 |
& $\rightsquigarrow^*$ & $\ldots$ \\ [1.5ex] |
|
918 |
& $\rightsquigarrow^*$ & $\bsimpalts \; bs \; |
|
919 |
(\distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) |
|
920 |
\; \rerases \; \phi)$\\[1.5ex] |
|
921 |
%& $\rightsquigarrow^*$ & $ _{bs} \sum (\distinctBy \; |
|
922 |
%(\flts \; (\map \; \textit{bsimp}\; rs)) \; \; |
|
923 |
%\rerases \; \;\phi) $\\[1.5ex] |
|
924 |
& $\rightsquigarrow^*$ & $\textit{bsimp} \; r$\\[1.5ex] |
|
925 |
\end{tabular} |
|
926 |
\end{center} |
|
543 | 927 |
\end{proof} |
585 | 928 |
\subsubsection{Property 3: $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$} |
624 | 929 |
The rewrite relation |
930 |
$\rightsquigarrow$ changes into $\stackrel{*}{\rightsquigarrow}$ |
|
931 |
after derivatives are taken on both sides: |
|
543 | 932 |
\begin{lemma}\label{rewriteBder} |
588 | 933 |
\hspace{0em} |
934 |
\begin{itemize} |
|
935 |
\item |
|
936 |
If $r_1 \rightsquigarrow r_2$, then $r_1 \backslash c |
|
937 |
\rightsquigarrow^* r_2 \backslash c$ |
|
938 |
\item |
|
939 |
If $rs_1 \stackrel{s}{\rightsquigarrow} rs_2$, then $ |
|
940 |
\map \; (\_\backslash c) \; rs_1 |
|
941 |
\stackrel{s*}{\rightsquigarrow} \map \; (\_ \backslash c) \; rs_2$ |
|
942 |
\end{itemize} |
|
543 | 943 |
\end{lemma} |
944 |
\begin{proof} |
|
588 | 945 |
By induction on $\rightsquigarrow$ |
946 |
and $\stackrel{s}{\rightsquigarrow}$, using a number of the previous lemmas. |
|
947 |
\end{proof} |
|
948 |
\noindent |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
949 |
Now we can prove property 3 as an immediate corollary: |
588 | 950 |
\begin{corollary}\label{rewritesBder} |
951 |
$r_1 \rightsquigarrow^* r_2 \implies r_1 \backslash c \rightsquigarrow^* |
|
952 |
r_2 \backslash c$ |
|
953 |
\end{corollary} |
|
954 |
\begin{proof} |
|
624 | 955 |
By rule induction of $\stackrel{*}{\rightsquigarrow} $ and lemma \ref{rewriteBder}. |
543 | 956 |
\end{proof} |
957 |
\noindent |
|
588 | 958 |
This can be extended and combined with $r \rightsquigarrow^* \textit{bsimp} \; r$ |
624 | 959 |
to obtain the correspondence between |
588 | 960 |
$\blexer$ and $\blexersimp$'s intermediate |
961 |
derivative regular expressions |
|
543 | 962 |
\begin{lemma}\label{bderBderssimp} |
588 | 963 |
$a \backslash s \rightsquigarrow^* \bderssimp{a}{s} $ |
543 | 964 |
\end{lemma} |
588 | 965 |
\begin{proof} |
966 |
By an induction on $s$. |
|
967 |
\end{proof} |
|
543 | 968 |
\subsection{Main Theorem} |
624 | 969 |
Now with \ref{bderBderssimp} in place we are ready for the main theorem. |
543 | 970 |
\begin{theorem} |
971 |
$\blexer \; r \; s = \blexersimp{r}{s}$ |
|
972 |
\end{theorem} |
|
973 |
\noindent |
|
576 | 974 |
\begin{proof} |
624 | 975 |
We can rewrite in many steps from the original lexer's |
588 | 976 |
derivative regular expressions to the |
977 |
lexer with simplification applied (by lemma \ref{bderBderssimp}): |
|
978 |
\begin{center} |
|
639 | 979 |
$a \backslash s \rightsquigarrow^* \bderssimp{a}{s} $. |
588 | 980 |
\end{center} |
624 | 981 |
We know that they generate the same bits, if the lexing result is a match: |
588 | 982 |
\begin{center} |
983 |
$\bnullable \; (a \backslash s) |
|
984 |
\implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$ |
|
985 |
\end{center} |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
986 |
Now that they generate the same bits, we know they also give the same value after decoding. |
588 | 987 |
\begin{center} |
988 |
$\bnullable \; (a \backslash s) |
|
989 |
\implies \decode \; r \; (\bmkeps \; (a \backslash s)) = |
|
990 |
\decode \; r \; (\bmkeps \; (\bderssimp{a}{s}))$ |
|
991 |
\end{center} |
|
624 | 992 |
Which is required by our proof goal: |
588 | 993 |
\begin{center} |
994 |
$\blexer \; r \; s = \blexersimp \; r \; s$. |
|
995 |
\end{center} |
|
576 | 996 |
\end{proof} |
997 |
\noindent |
|
998 |
As a corollary, |
|
624 | 999 |
we can link this result with the lemma we proved earlier that |
576 | 1000 |
\begin{center} |
624 | 1001 |
$(r, s) \rightarrow v \;\; \textit{iff}\;\; \blexer \; r \; s = \Some \;v$\\ |
1002 |
$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\; |
|
1003 |
r\;s = \None$. |
|
576 | 1004 |
\end{center} |
639 | 1005 |
and obtain the property that the bit-coded lexer with simplification is |
1006 |
indeed correctly generating a POSIX lexing result, if such a result exists. |
|
576 | 1007 |
\begin{corollary} |
624 | 1008 |
$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s = \Some \; v$\\ |
1009 |
$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\; |
|
1010 |
r\;s = \None$. |
|
576 | 1011 |
\end{corollary} |
532 | 1012 |
|
543 | 1013 |
\subsection{Comments on the Proof Techniques Used} |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
1014 |
Straightforward as the proof may seem, |
624 | 1015 |
the efforts we spent obtaining it were far from trivial. |
589 | 1016 |
We initially attempted to re-use the argument |
1017 |
in \cref{flex_retrieve}. |
|
624 | 1018 |
The problem is that both functions $\inj$ and $\retrieve$ require |
589 | 1019 |
that the annotated regular expressions stay unsimplified, |
1020 |
so that one can |
|
1021 |
correctly compare $v_{i+1}$ and $r_i$ and $v_i$ |
|
639 | 1022 |
in diagram \ref{graph:inj}. |
543 | 1023 |
|
589 | 1024 |
We also tried to prove |
1025 |
\begin{center} |
|
1026 |
$\textit{bsimp} \;\; (\bderssimp{a}{s}) = |
|
1027 |
\textit{bsimp} \;\; (a\backslash s)$, |
|
1028 |
\end{center} |
|
1029 |
but this turns out to be not true. |
|
639 | 1030 |
A counterexample is |
589 | 1031 |
\[ a = [(_{Z}1+_{S}c)\cdot [bb \cdot (_{Z}1+_{S}c)]] \;\; |
1032 |
\text{and} \;\; s = bb. |
|
1033 |
\] |
|
1034 |
\noindent |
|
1035 |
Then we would have |
|
1036 |
\begin{center} |
|
1037 |
$\textit{bsimp}\;\; ( a \backslash s )$ = |
|
1038 |
$_{[]}(_{ZZ}\ONE + _{ZS}c ) $ |
|
1039 |
\end{center} |
|
1040 |
\noindent |
|
1041 |
whereas |
|
1042 |
\begin{center} |
|
1043 |
$\textit{bsimp} \;\;( \bderssimp{a}{s} )$ = |
|
1044 |
$_{Z}(_{Z} \ONE + _{S} c)$. |
|
1045 |
\end{center} |
|
1046 |
Unfortunately, |
|
1047 |
if we apply $\textit{bsimp}$ differently |
|
1048 |
we will always have this discrepancy. |
|
1049 |
This is due to |
|
1050 |
the $\map \; (\fuse\; bs) \; as$ operation |
|
639 | 1051 |
happening at different locations in the regular expression. |
1052 |
||
589 | 1053 |
The rewriting relation |
1054 |
$\rightsquigarrow^*$ |
|
1055 |
allows us to ignore this discrepancy |
|
1056 |
and view the expressions |
|
1057 |
\begin{center} |
|
1058 |
$_{[]}(_{ZZ}\ONE + _{ZS}c ) $\\ |
|
1059 |
and\\ |
|
1060 |
$_{Z}(_{Z} \ONE + _{S} c)$ |
|
543 | 1061 |
|
589 | 1062 |
\end{center} |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
1063 |
as equal because they were both re-written |
639 | 1064 |
from the same expression. |
1065 |
||
600 | 1066 |
The simplification rewriting rules |
1067 |
given in \ref{rrewriteRules} are by no means |
|
1068 |
final, |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
1069 |
one could come up with new rules |
600 | 1070 |
such as |
1071 |
$\SEQ r_1 \cdot (\SEQ r_1 \cdot r_3) \rightarrow |
|
1072 |
\SEQs [r_1, r_2, r_3]$. |
|
624 | 1073 |
However this does not fit with the proof technique |
600 | 1074 |
of our main theorem, but seem to not violate the POSIX |
624 | 1075 |
property. |
1076 |
||
1077 |
Having established the correctness of our |
|
1078 |
$\blexersimp$, in the next chapter we shall prove that with our $\simp$ function, |
|
1079 |
for a given $r$, the derivative size is always |
|
543 | 1080 |
finitely bounded by a constant. |