532
+ − 1
% Chapter Template
+ − 2
+ − 3
% Main chapter title
+ − 4
\chapter{Correctness of Bit-coded Algorithm with Simplification}
+ − 5
+ − 6
\label{Bitcoded2} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
+ − 7
%Then we illustrate how the algorithm without bitcodes falls short for such aggressive
+ − 8
%simplifications and therefore introduce our version of the bitcoded algorithm and
+ − 9
%its correctness proof in
+ − 10
%Chapter 3\ref{Chapter3}.
+ − 11
+ − 12
+ − 13
538
+ − 14
Now we introduce the simplifications, which is why we introduce the
+ − 15
bitcodes in the first place.
+ − 16
543
+ − 17
\section{Simplification for Annotated Regular Expressions}
+ − 18
The first thing we notice in the fast growth of examples such as $(a^*a^*)^*$'s
+ − 19
and $(a^* + (aa)^*)^*$'s derivatives is that a lot of duplicated sub-patterns
579
+ − 20
are scattered around different levels, and therefore requires
+ − 21
de-duplication at different levels:
543
+ − 22
\begin{center}
579
+ − 23
$(a^*a^*)^* \rightarrow (a^*a^* + a^*)\cdot(a^*a^*)^* \rightarrow $\\
543
+ − 24
$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
+ − 25
\end{center}
+ − 26
\noindent
579
+ − 27
As we have already mentioned in \ref{eqn:growth2},
+ − 28
a simple-minded simplification function cannot simplify
+ − 29
$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
+ − 30
any further.
+ − 31
we would expect a better simplification function to work in the
+ − 32
following way:
+ − 33
\begin{gather*}
+ − 34
((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* +
+ − 35
\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\
+ − 36
\bigg\downarrow \\
+ − 37
(a^*a^* + a^*
+ − 38
\color{gray} + a^* \color{black})\cdot(a^*a^*)^* +
+ − 39
\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this} \\
+ − 40
\bigg\downarrow \\
+ − 41
(a^*a^* + a^*
+ − 42
)\cdot(a^*a^*)^*
+ − 43
\color{gray} + (a^*a^* + a^*) \cdot(a^*a^*)^*
+ − 44
\end{gather*}
+ − 45
\noindent
+ − 46
This motivating example came from testing Sulzmann and Lu's
+ − 47
algorithm: their simplification does
+ − 48
not work!
+ − 49
We quote their $\textit{simp}$ function verbatim here:
+ − 50
\begin{center}
+ − 51
\begin{tabular}{lcl}
+ − 52
$\simp\; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ &
+ − 53
$\textit{if} \; (\textit{zeroable} \; r)\; \textit{then} \;\; \ZERO$\\
+ − 54
& &$\textit{else}\;\; \fuse \; (bs@ bs') \; r$\\
+ − 55
$\simp\;(_{bs}r_1\cdot r_2)$ & $\dn$ & $\textit{if}
+ − 56
\; (\textit{zeroable} \; r_1 \; \textit{or} \; \textit{zeroable}\; r_2)\;
+ − 57
\textit{then} \;\; \ZERO$\\
+ − 58
& & $\textit{else}\;\;_{bs}((\simp\;r_1)\cdot
+ − 59
(\simp\; r_2))$\\
+ − 60
$\simp \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\
+ − 61
$\simp \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
+ − 62
$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
+ − 63
$\simp \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\simp \; r)$\\
+ − 64
$\simp \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum
+ − 65
(\nub \; (\filter \; (\not \circ \zeroable)\;((\simp \; r) :: \map \; \simp \; rs)))$\\
+ − 66
+ − 67
\end{tabular}
+ − 68
\end{center}
+ − 69
\noindent
+ − 70
the $\textit{zeroable}$ predicate
+ − 71
which tests whether the regular expression
+ − 72
is equivalent to $\ZERO$,
+ − 73
is defined as:
+ − 74
\begin{center}
+ − 75
\begin{tabular}{lcl}
+ − 76
$\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\;
+ − 77
\zeroable \;_{[]}\sum\;rs $\\
+ − 78
$\zeroable\;_{bs}(r_1 \cdot r_2)$ & $\dn$ & $\zeroable\; r_1 \;\; \lor \;\; \zeroable \; r_2$\\
+ − 79
$\zeroable\;_{bs}r^*$ & $\dn$ & $\textit{false}$ \\
+ − 80
$\zeroable\;_{bs}c$ & $\dn$ & $\textit{false}$\\
+ − 81
$\zeroable\;_{bs}\ONE$ & $\dn$ & $\textit{false}$\\
+ − 82
$\zeroable\;_{bs}\ZERO$ & $\dn$ & $\textit{true}$
+ − 83
\end{tabular}
+ − 84
\end{center}
+ − 85
%\[
+ − 86
%\begin{aligned}
+ − 87
%&\textit{isPhi} \left(b s @ r i^{*}\right)=\text { False } \\
+ − 88
%&\textit{isPhi}\left(b s @ r i_{1} r i_{2}\right)=\textit{isPhi} r i_{1} \vee \textit{isPhi} r i_{2} \\
+ − 89
%&\textit{isPhi} \left(b s @ r i_{1} \oplus r i_{2}\right)=\textit{isPhi} r i_{1} \wedge \textit{isPhi} r i_{2} \\
+ − 90
%&\textit{isPhi}(b s @ l)= False \\
+ − 91
%&\textit{isPhi}(b s @ \epsilon)= False \\
+ − 92
%&\textit{isPhi} \; \ZERO = True
+ − 93
%\end{aligned}
+ − 94
%\]
+ − 95
\noindent
543
+ − 96
Despite that we have already implemented the simple-minded simplifications
+ − 97
such as throwing away useless $\ONE$s and $\ZERO$s.
+ − 98
The simplification rule $r + r \rightarrow $ cannot make a difference either
+ − 99
since it only removes duplicates on the same level, not something like
+ − 100
$(r+a)+r \rightarrow r+a$.
+ − 101
This requires us to break up nested alternatives into lists, for example
+ − 102
using the flatten operation similar to the one defined for any function language's
+ − 103
list datatype:
538
+ − 104
543
+ − 105
\begin{center}
+ − 106
\begin{tabular}{@{}lcl@{}}
+ − 107
$\textit{flts} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
+ − 108
(\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flts} \; as' $ \\
+ − 109
$\textit{flts} \; \ZERO :: as'$ & $\dn$ & $ \textit{flts} \; \textit{as'} $ \\
+ − 110
$\textit{flts} \; a :: as'$ & $\dn$ & $a :: \textit{flts} \; \textit{as'}$ \quad(otherwise)
+ − 111
\end{tabular}
+ − 112
\end{center}
+ − 113
+ − 114
\noindent
+ − 115
There is a minor difference though, in that our $\flts$ operation defined by us
+ − 116
also throws away $\ZERO$s.
+ − 117
For a flattened list of regular expressions, a de-duplication can be done efficiently:
+ − 118
+ − 119
+ − 120
\begin{center}
+ − 121
\begin{tabular}{@{}lcl@{}}
+ − 122
$\distinctBy \; [] \; f\; acc $ & $ =$ & $ []$\\
+ − 123
$\distinctBy \; (x :: xs) \; f \; acc$ & $=$ & \\
+ − 124
& & $\quad \textit{if} (f \; x) \in acc \textit{then} \distinctBy \; xs \; f \; acc$\\
+ − 125
& & $\quad \textit{else} x :: (\distinctBy \; xs \; f \; (\{f \; x\} \cup acc))$
+ − 126
\end{tabular}
+ − 127
\end{center}
+ − 128
\noindent
+ − 129
The reason we define a distinct function under a mapping $f$ is because
+ − 130
we want to eliminate regular expressions that are the same syntactically,
+ − 131
but having different bit-codes (more on the reason why we can do this later).
+ − 132
To achieve this, we call $\erase$ as the function $f$ during the distinction
+ − 133
operation.
+ − 134
A recursive definition of our simplification function
538
+ − 135
that looks somewhat similar to our Scala code is given below:
+ − 136
+ − 137
\begin{center}
+ − 138
\begin{tabular}{@{}lcl@{}}
+ − 139
543
+ − 140
$\textit{bsimp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ \textit{bsimp}_{ASEQ} \; bs \;(\textit{bsimp} \; a_1) \; (\textit{bsimp} \; a_2) $ \\
+ − 141
$\textit{bsimp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{bsimp}_{ALTS} \; \textit{bs} \; (\textit{distinctBy} \; ( \textit{flatten} ( \textit{map} \; bsimp \; as)) \; \erase \; \phi) $ \\
+ − 142
$\textit{bsimp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
538
+ − 143
\end{tabular}
+ − 144
\end{center}
+ − 145
+ − 146
\noindent
543
+ − 147
The simplification (named $\bsimp$ for \emph{b}it-coded) does a pattern matching on the regular expression.
538
+ − 148
When it detected that the regular expression is an alternative or
543
+ − 149
sequence, it will try to simplify its children regular expressions
538
+ − 150
recursively and then see if one of the children turns into $\ZERO$ or
+ − 151
$\ONE$, which might trigger further simplification at the current level.
543
+ − 152
Current level simplifications are handled by the function $\textit{bsimp}_{ASEQ}$,
+ − 153
using rules such as $\ZERO \cdot r \rightarrow \ZERO$ and $\ONE \cdot r \rightarrow r$.
+ − 154
\begin{center}
+ − 155
\begin{tabular}{@{}lcl@{}}
+ − 156
$\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\
+ − 157
&&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
+ − 158
&&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
+ − 159
&&$\quad\textit{case} \; (_{bs1}\ONE, a_2') \Rightarrow \textit{fuse} \; (bs@bs_1) \; a_2'$ \\
+ − 160
&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$
+ − 161
\end{tabular}
+ − 162
\end{center}
538
+ − 163
\noindent
543
+ − 164
The most involved part is the $\sum$ clause, where we first call $\flts$ on
+ − 165
the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$.
+ − 166
and then call $\distinctBy$ on that list, the predicate determining whether two
+ − 167
elements are the same is $\erase \; r_1 = \erase\; r_2$.
+ − 168
Finally, depending on whether the regular expression list $as'$ has turned into a
+ − 169
singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{AALTS}$
+ − 170
decides whether to keep the current level constructor $\sum$ as it is, and
+ − 171
removes it when there are less than two elements:
+ − 172
\begin{center}
+ − 173
\begin{tabular}{lcl}
+ − 174
$\textit{bsimp}_{AALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\
+ − 175
&&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
+ − 176
&&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\
+ − 177
&&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\
+ − 178
\end{tabular}
+ − 179
+ − 180
\end{center}
+ − 181
Having defined the $\bsimp$ function,
+ − 182
we add it as a phase after a derivative is taken,
+ − 183
so it stays small:
+ − 184
\begin{center}
+ − 185
\begin{tabular}{lcl}
+ − 186
$r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$
+ − 187
\end{tabular}
+ − 188
\end{center}
+ − 189
Following previous notation of natural
538
+ − 190
extension from derivative w.r.t.~character to derivative
543
+ − 191
w.r.t.~string, we define the derivative that nests simplifications with derivatives:%\comment{simp in the [] case?}
538
+ − 192
+ − 193
\begin{center}
+ − 194
\begin{tabular}{lcl}
543
+ − 195
$r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\
+ − 196
$r \backslash_{bsimps} [\,] $ & $\dn$ & $r$
538
+ − 197
\end{tabular}
+ − 198
\end{center}
+ − 199
+ − 200
\noindent
543
+ − 201
Extracting bit-codes from the derivatives that had been simplified repeatedly after
+ − 202
each derivative run, the simplified $\blexer$, called $\blexersimp$,
+ − 203
is defined as
538
+ − 204
\begin{center}
+ − 205
\begin{tabular}{lcl}
+ − 206
$\textit{blexer\_simp}\;r\,s$ & $\dn$ &
+ − 207
$\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
+ − 208
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
+ − 209
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+ − 210
& & $\;\;\textit{else}\;\textit{None}$
+ − 211
\end{tabular}
+ − 212
\end{center}
+ − 213
+ − 214
\noindent
+ − 215
This algorithm keeps the regular expression size small, for example,
543
+ − 216
with this simplification our previous $(a + aa)^*$ example's fast growth (over
+ − 217
$10^5$ nodes at around $20$ input length)
538
+ − 218
will be reduced to just 17 and stays constant, no matter how long the
+ − 219
input string is.
543
+ − 220
We show some graphs to better demonstrate this imrpovement.
538
+ − 221
+ − 222
543
+ − 223
\section{$(a+aa)^*$ and $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification}
+ − 224
For $(a+aa)^*$, it used to grow to over $9000$ nodes with simple-minded simplification
+ − 225
at only around $15$ input characters:
+ − 226
\begin{center}
539
+ − 227
\begin{tabular}{ll}
+ − 228
\begin{tikzpicture}
+ − 229
\begin{axis}[
+ − 230
xlabel={$n$},
+ − 231
ylabel={derivative size},
+ − 232
width=7cm,
+ − 233
height=4cm,
543
+ − 234
legend entries={Lexer with $\textit{bsimp}$},
+ − 235
legend pos= south east,
+ − 236
legend cell align=left]
+ − 237
\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star_bsimp.data};
+ − 238
\end{axis}
+ − 239
\end{tikzpicture} %\label{fig:BitcodedLexer}
+ − 240
&
+ − 241
\begin{tikzpicture}
+ − 242
\begin{axis}[
+ − 243
xlabel={$n$},
+ − 244
ylabel={derivative size},
+ − 245
width = 7cm,
+ − 246
height = 4cm,
+ − 247
legend entries={Lexer without $\textit{bsimp}$},
+ − 248
legend pos= north west,
+ − 249
legend cell align=left]
+ − 250
\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star_easysimp.data};
+ − 251
\end{axis}
+ − 252
\end{tikzpicture}
+ − 253
\end{tabular}
+ − 254
\end{center}
+ − 255
And for $(a^*\cdot a^*)^*$, unlike in \ref{fig:BetterWaterloo}, the size with simplification now stay nicely constant with current
+ − 256
simplification rules (we put the graphs together to show the contrast)
+ − 257
\begin{center}
+ − 258
\begin{tabular}{ll}
+ − 259
\begin{tikzpicture}
+ − 260
\begin{axis}[
+ − 261
xlabel={$n$},
+ − 262
ylabel={derivative size},
+ − 263
width=7cm,
+ − 264
height=4cm,
+ − 265
legend entries={Lexer with $\textit{bsimp}$},
539
+ − 266
legend pos= south east,
+ − 267
legend cell align=left]
+ − 268
\addplot[red,mark=*, mark options={fill=white}] table {BitcodedLexer.data};
+ − 269
\end{axis}
+ − 270
\end{tikzpicture} %\label{fig:BitcodedLexer}
+ − 271
&
+ − 272
\begin{tikzpicture}
+ − 273
\begin{axis}[
+ − 274
xlabel={$n$},
+ − 275
ylabel={derivative size},
+ − 276
width = 7cm,
+ − 277
height = 4cm,
543
+ − 278
legend entries={Lexer without $\textit{bsimp}$},
539
+ − 279
legend pos= north west,
+ − 280
legend cell align=left]
+ − 281
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
+ − 282
\end{axis}
+ − 283
\end{tikzpicture}
+ − 284
\end{tabular}
543
+ − 285
\end{center}
538
+ − 286
543
+ − 287
%----------------------------------------------------------------------------------------
+ − 288
% SECTION rewrite relation
+ − 289
%----------------------------------------------------------------------------------------
+ − 290
\section{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)}
576
+ − 291
In the $\blexer$'s correctness proof, we
+ − 292
did not directly derive the fact that $\blexer$ gives out the POSIX value,
+ − 293
but first prove that $\blexer$ is linked with $\lexer$.
+ − 294
Then we re-use
+ − 295
the correctness of $\lexer$.
+ − 296
Here we follow suit by first proving that
+ − 297
$\blexersimp \; r \; s $
+ − 298
produces the same output as $\blexer \; r\; s$,
+ − 299
and then putting it together with
+ − 300
$\blexer$'s correctness result
+ − 301
($(r, s) \rightarrow v \implies \blexer \; r \;s = v$)
+ − 302
to obtain half of the correctness property (the other half
+ − 303
being when $s$ is not $L \; r$ which is routine to establish)
+ − 304
\begin{center}
+ − 305
$(r, s) \rightarrow v \implies \blexersimp \; r \; s = v$
+ − 306
\end{center}
+ − 307
\noindent
+ − 308
because it makes the proof simpler
+ − 309
The overall idea for the correctness of our simplified bitcoded lexer
+ − 310
\noindent
543
+ − 311
is that the $\textit{bsimp}$ will not change the regular expressions so much that
+ − 312
it becomes impossible to extract the $\POSIX$ values.
+ − 313
To capture this "similarity" between unsimplified regular expressions and simplified ones,
+ − 314
we devise the rewriting relation $\rrewrite$, written infix as $\rightsquigarrow$:
538
+ − 315
543
+ − 316
\begin{center}
+ − 317
\begin{mathpar}
+ − 318
\inferrule{}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\}
+ − 319
+ − 320
\inferrule{}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\}
+ − 321
+ − 322
\inferrule{}{_{bs1} ((_{bs2} \ONE) \cdot r) \rightsquigarrow \fuse \; (bs_1 @ bs_2) \; r\\}\\
+ − 323
+ − 324
+ − 325
+ − 326
\inferrule{\\ r_1 \rightsquigarrow r_2}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_2 \cdot r_3\\}
+ − 327
+ − 328
\inferrule{\\ r_3 \rightsquigarrow r_4}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_1 \cdot r_4\\}\\
+ − 329
+ − 330
\inferrule{}{ _{bs}\sum [] \rightsquigarrow \ZERO\\}
+ − 331
+ − 332
\inferrule{}{ _{bs}\sum [a] \rightsquigarrow \fuse \; bs \; a\\}
+ − 333
+ − 334
\inferrule{\\ rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{_{bs}\sum rs_1 \rightsquigarrow rs_2\\}
+ − 335
+ − 336
\inferrule{}{\\ [] \stackrel{s}{\rightsquigarrow} []}
+ − 337
+ − 338
\inferrule{rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{\\ r :: rs_1 \rightsquigarrow r :: rs_2 \rightsquigarrow}
+ − 339
+ − 340
\inferrule{r_1 \rightsquigarrow r_2}{ r_1 :: rs \stackrel{s}{\rightsquigarrow} r_2 :: rs\\}
+ − 341
+ − 342
\inferrule{}{\ZERO :: rs \stackrel{s}{\rightsquigarrow} rs}
+ − 343
+ − 344
\inferrule{}{_{bs} \sum (rs_1 :: rs_b) \stackrel{s}{\rightsquigarrow} ((\map \; (\fuse \; bs_1) \; rs_1) @ rsb) }
+ − 345
+ − 346
\inferrule{\\ \rerase{a_1} = \rerase{a_2}}{rs_a @ [a_1] @ rs_b @ [a_2] @ rsc \stackrel{s}{\rightsquigarrow} rs_a @ [a_1] @ rs_b @ rs_c}
+ − 347
+ − 348
\end{mathpar}
+ − 349
\end{center}
+ − 350
These "rewrite" steps define the atomic simplifications we could impose on regular expressions
+ − 351
under our simplification algorithm.
+ − 352
For convenience, we define a relation $\stackrel{s}{\rightsquigarrow}$ for rewriting
+ − 353
a list of regular exression to another list.
+ − 354
The $\rerase{}$ function is used instead of $_\downarrow$ for the finiteness bound proof of next chapter,
+ − 355
which we will discuss later. For the moment the reader can assume they basically do the same thing as $\erase$.
+ − 356
+ − 357
Usually more than one steps are taking place during the simplification of a regular expression,
+ − 358
therefore we define the reflexive transitive closure of the $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$
+ − 359
relations:
+ − 360
+ − 361
\begin{center}
+ − 362
\begin{mathpar}
+ − 363
\inferrule{}{ r \stackrel{*}{\rightsquigarrow} r \\}
+ − 364
\inferrule{}{\\ rs \stackrel{s*}{\rightsquigarrow} rs \\}
+ − 365
\inferrule{\\r_1 \stackrel{*}{\rightsquigarrow} r_2 \land \; r_2 \stackrel{*}{\rightsquigarrow} r_3}{r_1 \stackrel{*}{\rightsquigarrow} r_3\\}
+ − 366
\inferrule{\\rs_1 \stackrel{*}{\rightsquigarrow} rs_2 \land \; rs_2 \stackrel{*}{\rightsquigarrow} rs_3}{rs_1 \stackrel{*}{\rightsquigarrow} rs_3}
+ − 367
\end{mathpar}
+ − 368
\end{center}
+ − 369
Now that we have modelled the rewriting behaviour of our simplifier $\bsimp$, we prove mainly
+ − 370
three properties about how these relations connect to $\blexersimp$:
+ − 371
\begin{itemize}
+ − 372
\item
+ − 373
$r \stackrel{*}{\rightsquigarrow} \bsimp{r}$
+ − 374
The algorithm $\bsimp$ only transforms the regex $r$ using steps specified by
+ − 375
$\stackrel{*}{\rightsquigarrow}$ and nothing else.
+ − 376
\item
+ − 377
$r \rightsquigarrow r' \implies r \backslash c \rightsquigarrow r'\backslash c$.
+ − 378
The relation $\stackrel{*}{rightsquigarrow}$ is preserved under derivative.
+ − 379
\item
+ − 380
$(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r_1) \implies \bmkeps \; r = \bmkeps \; r'$. If we reach another
+ − 381
expression in finitely many atomic simplification steps, then these two regular expressions
+ − 382
will produce the same bit-codes under the bit collection function $\bmkeps$ used by our $\blexer$.
+ − 383
+ − 384
\end{itemize}
+ − 385
\section{Three Important properties}
+ − 386
These properties would work together towards the correctness theorem.
+ − 387
We start proving each of these lemmas below.
+ − 388
\subsection{$(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r_1) \implies \bmkeps \; r = \bmkeps \; r'$ and $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$}
+ − 389
The first few properties we establish is that the inference rules we gave for $\rightsquigarrow$
+ − 390
and $\stackrel{s}{\rightsquigarrow}$ also hold as implications for $\stackrel{*}{\rightsquigarrow}$ and
+ − 391
$\stackrel{s*}{\rightsquigarrow}$.
+ − 392
\begin{lemma}
+ − 393
$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies _{bs} \sum rs_1 \stackrel{*}{\rightsquigarrow} _{bs} \sum rs_2$
+ − 394
\end{lemma}
+ − 395
\begin{proof}
+ − 396
By rule induction of $\stackrel{s*}{\rightsquigarrow}$.
+ − 397
\end{proof}
+ − 398
\begin{lemma}
+ − 399
$r \stackrel{*}{\rightsquigarrow} r' \implies _{bs} \sum r :: rs \stackrel{*}{\rightsquigarrow} _{bs} \sum r' :: rs$
+ − 400
\end{lemma}
+ − 401
\begin{proof}
+ − 402
By rule induction of $\stackrel{*}{\rightsquigarrow} $.
+ − 403
\end{proof}
+ − 404
\noindent
+ − 405
Then we establish that the $\stackrel{s}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ relation is preserved w.r.t appending and prepending of a list:
+ − 406
\begin{lemma}
+ − 407
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s}{\rightsquigarrow} rs @ rs_2$
+ − 408
\end{lemma}
+ − 409
\begin{proof}
+ − 410
By induction on the list $rs$.
+ − 411
\end{proof}
+ − 412
+ − 413
\begin{lemma}
+ − 414
$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s*}{\rightsquigarrow} rs @ rs_2$
+ − 415
\end{lemma}
+ − 416
\begin{proof}
+ − 417
By rule induction of $\stackrel{s*}{\rightsquigarrow}$.
+ − 418
\end{proof}
+ − 419
+ − 420
\noindent
+ − 421
The $\stackrel{s}{\rightsquigarrow} $ relation after appending a list becomes $\stackrel{s*}{\rightsquigarrow}$:
+ − 422
\begin{lemma}\label{ssgqTossgs}
+ − 423
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$
+ − 424
\end{lemma}
+ − 425
\begin{proof}
+ − 426
By rule induction of $\stackrel{s}{\rightsquigarrow}$.
+ − 427
\end{proof}
+ − 428
\begin{lemma}
+ − 429
$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$
+ − 430
\end{lemma}
+ − 431
\begin{proof}
+ − 432
By rule induction of $\stackrel{s*}{\rightsquigarrow}$ and using \ref{ssgqTossgs}.
+ − 433
\end{proof}
+ − 434
Here are two lemmas relating $\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$:
+ − 435
\begin{lemma}\label{singleton}
+ − 436
$r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies [r_1] \stackrel{s*}{\rightsquigarrow} [r_2]$
+ − 437
\end{lemma}
+ − 438
\begin{proof}
+ − 439
By rule induction of $ \stackrel{*}{\rightsquigarrow} $.
+ − 440
\end{proof}
+ − 441
\begin{lemma}
+ − 442
$rs_3 \stackrel{s*}{\rightsquigarrow} rs_4 \land r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies
+ − 443
r_2 :: rs_3 \stackrel{s*}{\rightsquigarrow} r_2 :: rs_4$
+ − 444
\end{lemma}
+ − 445
\begin{proof}
+ − 446
By using \ref{singleton}.
+ − 447
\end{proof}
+ − 448
Now we get to the "meaty" part of the proof, which relates the relations $\stackrel{s*}{\rightsquigarrow}$ and
+ − 449
$\stackrel{*}{\rightsquigarrow} $ with our simplification components such $\distinctBy$ and $\flts$.
+ − 450
The first lemma below says that for a list made of two parts $rs_1 @ rs_2$, one can throw away the duplicate
+ − 451
elements in $rs_2$, as well as those that have appeared in $rs_1$:
+ − 452
\begin{lemma}
+ − 453
$rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} (rs_1 @ (\distinctBy \; rs_2 \; \; \rerase{\_}\; \; (\map\;\; \rerase{\_}\; \; rs_1)))$
+ − 454
\end{lemma}
+ − 455
\begin{proof}
+ − 456
By induction on $rs_2$, where $rs_1$ is allowed to be arbitrary.
+ − 457
\end{proof}
+ − 458
The above h as the corollary that is suitable for the actual way $\distinctBy$ is called in $\bsimp$:
+ − 459
\begin{lemma}\label{dBPreserves}
+ − 460
$rs_ 1 \rightarrow \distinctBy \; rs_1 \; \phi$
+ − 461
\end{lemma}
538
+ − 462
+ − 463
532
+ − 464
543
+ − 465
The flatten function $\flts$ works within the $\rightsquigarrow$ relation:
+ − 466
\begin{lemma}\label{fltsPreserves}
+ − 467
$rs \stackrel{s*}{\rightsquigarrow} \flts \; rs$
+ − 468
\end{lemma}
+ − 469
+ − 470
The rewriting in many steps property is composible in terms of regular expression constructors:
+ − 471
\begin{lemma}
+ − 472
$r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies _{bs} r_1 \cdot r_3 \stackrel{*}{\rightsquigarrow} \; _{bs} r_2 \cdot r_3 \quad $ and
+ − 473
$r_3 \stackrel{*}{\rightsquigarrow} r_4 \implies _{bs} r_1 \cdot r_3 \stackrel{*}{\rightsquigarrow} _{bs} \; r_1 \cdot r_4$
+ − 474
\end{lemma}
532
+ − 475
543
+ − 476
The rewriting in many steps properties $\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ is preserved under the function $\fuse$:
+ − 477
\begin{lemma}
+ − 478
$r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies \fuse \; bs \; r_1 \stackrel{*}{\rightsquigarrow} \; \fuse \; bs \; r_2 \quad $ and
+ − 479
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \map \; (\fuse \; bs) \; rs_1 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs) \; rs_2$
+ − 480
\end{lemma}
+ − 481
\begin{proof}
+ − 482
By the properties $r_1 \rightsquigarrow r_2 \implies \fuse \; bs \; r_1 \implies \fuse \; bs \; r_2$ and
+ − 483
$rs_2 \stackrel{s}{\rightsquigarrow} rs_3 \implies \map \; (\fuse \; bs) rs_2 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs)\; rs_3$.
+ − 484
\end{proof}
+ − 485
\noindent
+ − 486
If we could rewrite a regular expression in many steps to $\ZERO$, then
+ − 487
we could also rewrite any sequence containing it to $\ZERO$:
+ − 488
\begin{lemma}
+ − 489
$r_1 \stackrel{*}{\rightsquigarrow} \ZERO \implies _{bs}r_1\cdot r_2 \stackrel{*}{\rightsquigarrow} \ZERO$
+ − 490
\end{lemma}
+ − 491
\begin{lemma}
+ − 492
$\bmkeps \; (r \backslash s) = \bmkeps \; \bderssimp{r}{s}$
+ − 493
\end{lemma}
+ − 494
\noindent
+ − 495
The function $\bsimpalts$ preserves rewritability:
+ − 496
\begin{lemma}\label{bsimpaltsPreserves}
+ − 497
$_{bs} \sum rs \stackrel{*}{\rightsquigarrow} \bsimpalts \; _{bs} \; rs$
+ − 498
\end{lemma}
+ − 499
\noindent
+ − 500
Before we give out the next lemmas, we define a predicate for a list of regular expressions
+ − 501
having at least one nullable regular expressions:
+ − 502
\begin{definition}
+ − 503
$\textit{bnullables} \; rs \dn \exists r \in rs. \bnullable r$
+ − 504
\end{definition}
+ − 505
The rewriting relation $\rightsquigarrow$ preserves nullability:
+ − 506
\begin{lemma}
+ − 507
$r_1 \rightsquigarrow r_2 \implies \bnullable \; r_1 = \bnullable \; r_2$ and
+ − 508
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \textit{bnullables} \; rs_1 = \textit{bnullables} \; rs_2$
+ − 509
\end{lemma}
+ − 510
\begin{proof}
+ − 511
By rule induction of $\rightarrow$ and $\stackrel{s}{\rightsquigarrow}$.
+ − 512
\end{proof}
+ − 513
So does the many steps rewriting:
+ − 514
\begin{lemma}\label{rewritesBnullable}
+ − 515
$r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies \bnullable \; r_1 = \bnullable \; r_2$
+ − 516
\end{lemma}
+ − 517
\begin{proof}
+ − 518
By rule induction of $\stackrel{*}{\rightsquigarrow} $.
+ − 519
\end{proof}
+ − 520
\noindent
+ − 521
And if both regular expressions in a rewriting relation are nullable, then they
+ − 522
produce the same bit-codes:
532
+ − 523
543
+ − 524
\begin{lemma}\label{rewriteBmkepsAux}
+ − 525
$r_1 \rightsquigarrow r_2 \implies (\bnullable \; r_1 \land \bnullable \; r_2 \implies \bmkeps \; r_1 =
+ − 526
\bmkeps \; r_2)$ and
+ − 527
$rs_ 1 \stackrel{s}{\rightsquigarrow} rs_2 \implies (\bnullables \; rs_1 \land \bnullables \; rs_2 \implies
+ − 528
\bmkepss \; rs_1 = \bmkepss \; rs2)$
+ − 529
\end{lemma}
+ − 530
\noindent
+ − 531
The definition of $\bmkepss$ on list $rs$ is just to extract the bit-codes on the first element in $rs$ that
+ − 532
is $bnullable$:
532
+ − 533
\begin{center}
543
+ − 534
\begin{tabular}{lcl}
+ − 535
$\bmkepss \; [] $ & $\dn$ & $[]$\\
+ − 536
$\bmkepss \; r :: rs$ & $\dn$ & $\textit{if} \; \bnullable \; r then \; (\bmkeps \; r) \; \textit{else} \; \bmkepss \; rs$
+ − 537
\end{tabular}
532
+ − 538
\end{center}
543
+ − 539
\noindent
+ − 540
And now we are ready to prove the key property that if you
+ − 541
have two regular expressions, one rewritable in many steps to the other,
+ − 542
and one of them is $\bnullable$, then they will both yield the same bits under $\bmkeps$:
+ − 543
\begin{lemma}
576
+ − 544
$\text{If} \;\; r \stackrel{*}{\rightsquigarrow} r' \;\; \text{and} \;\; \bnullable \; r_1 \;\; \text{then} \;\; \bmkeps \; r = \bmkeps \; r'$
543
+ − 545
\end{lemma}
+ − 546
\begin{proof}
+ − 547
By rule induction of $\stackrel{*}{\rightsquigarrow} $, using \ref{rewriteBmkepsAux} and $\ref{rewritesBnullable}$.
+ − 548
\end{proof}
+ − 549
\noindent
+ − 550
the other property is also ready:
+ − 551
\begin{lemma}
+ − 552
$r \stackrel{*}{\rightsquigarrow} \bsimp{r}$
+ − 553
\end{lemma}
+ − 554
\begin{proof}
+ − 555
By an induction on $r$.
+ − 556
The most difficult case would be the alternative case, where we using properties such as \ref{bsimpaltsPreserves} and \ref{fltsPreserves} and \ref{dBPreserves}, we could continuously rewrite a list like:\\
+ − 557
$rs \stackrel{s*}{\rightsquigarrow} \map \; \textit{bsimp} \; rs$\\
+ − 558
$\ldots \stackrel{s*}{\rightsquigarrow} \flts \; (\map \; \textit{bsimp} \; rs)$\\
+ − 559
$\ldots \;\stackrel{s*}{\rightsquigarrow} \distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase \; \phi$\\
+ − 560
Then we could do the following regular expresssion many steps rewrite:\\
+ − 561
$ _{bs} \sum \distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase \; \phi \stackrel{*}{\rightsquigarrow} \bsimpalts \; bs \; (\distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase \; \phi)$
+ − 562
\\
+ − 563
+ − 564
\end{proof}
+ − 565
\section{Proof for the Property: $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$}
+ − 566
The first thing we prove is that if we could rewrite in one step, then after derivative
+ − 567
we could rewrite in many steps:
+ − 568
\begin{lemma}\label{rewriteBder}
+ − 569
$r_1 \rightsquigarrow r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$ and
+ − 570
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \map \; (\_\backslash c) \; rs_1 \stackrel{s*}{\rightsquigarrow} \map \; (\_ \backslash c) \; rs_2$
+ − 571
\end{lemma}
+ − 572
\begin{proof}
+ − 573
By induction on $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$, using a number of the previous lemmas.
+ − 574
\end{proof}
+ − 575
Now we can prove that once we could rewrite from one expression to another in many steps,
+ − 576
then after a derivative on both sides we could still rewrite one to another in many steps:
+ − 577
\begin{lemma}
+ − 578
$r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$
+ − 579
\end{lemma}
+ − 580
\begin{proof}
+ − 581
By rule induction of $\stackrel{*}{\rightsquigarrow} $ and using the previous lemma :\ref{rewriteBder}.
+ − 582
\end{proof}
+ − 583
\noindent
+ − 584
This can be extended and combined with the previous two important properties
+ − 585
so that a regular expression's successivve derivatives can be rewritten in many steps
+ − 586
to its simplified counterpart:
+ − 587
\begin{lemma}\label{bderBderssimp}
+ − 588
$a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $
+ − 589
\end{lemma}
+ − 590
\subsection{Main Theorem}
+ − 591
Now with \ref{bdersBderssimp} we are ready for the main theorem.
+ − 592
To link $\blexersimp$ and $\blexer$,
+ − 593
we first say that they give out the same bits, if the lexing result is a match:
+ − 594
\begin{lemma}
+ − 595
$\bnullable \; (a \backslash s) \implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$
+ − 596
\end{lemma}
+ − 597
\noindent
+ − 598
Now that they give out the same bits, we know that they give the same value after decoding,
+ − 599
which we know is correct value as $\blexer$ is correct:
+ − 600
\begin{theorem}
+ − 601
$\blexer \; r \; s = \blexersimp{r}{s}$
+ − 602
\end{theorem}
+ − 603
\noindent
576
+ − 604
\begin{proof}
+ − 605
One can rewrite in many steps from the original lexer's derivative regular expressions to the
+ − 606
lexer with simplification applied:
+ − 607
$a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $.
+ − 608
If two regular expressions are rewritable, then they produce the same bits.
+ − 609
Under the condition that $ r_1$ is nullable, we have
+ − 610
$\text{If} \;\; r \stackrel{*}{\rightsquigarrow} r', \;\;\text{then} \;\; \bmkeps \; r = \bmkeps \; r'$.
+ − 611
This proves the \emph{if} (interesting) branch of
+ − 612
$\blexer \; r \; s = \blexersimp{r}{s}$.
+ − 613
+ − 614
\end{proof}
+ − 615
\noindent
+ − 616
As a corollary,
+ − 617
we link this result with the lemma we proved earlier that
+ − 618
\begin{center}
+ − 619
$(r, s) \rightarrow v \implies \blexer \; r \; s = v$
+ − 620
\end{center}
+ − 621
and obtain the corollary that the bit-coded lexer with simplification is
+ − 622
indeed correctly outputting POSIX lexing result, if such a result exists.
+ − 623
\begin{corollary}
+ − 624
$(r, s) \rightarrow v \implies \blexersimp{r}{s}$
+ − 625
\end{corollary}
532
+ − 626
543
+ − 627
\subsection{Comments on the Proof Techniques Used}
532
+ − 628
The non-trivial part of proving the correctness of the algorithm with simplification
+ − 629
compared with not having simplification is that we can no longer use the argument
+ − 630
in \cref{flex_retrieve}.
543
+ − 631
The function \retrieve needs the cumbersome structure of the (umsimplified)
+ − 632
annotated regular expression to
532
+ − 633
agree with the structure of the value, but simplification will always mess with the
543
+ − 634
structure.
+ − 635
+ − 636
We also tried to prove $\bsimp{\bderssimp{a}{s}} = \bsimp{a\backslash s}$,
+ − 637
but this turns out to be not true, A counterexample of this being
+ − 638
\[ r = [(1+c)\cdot [aa \cdot (1+c)]] \land s = aa
+ − 639
\]
+ − 640
+ − 641
Then we would have $\bsimp{a \backslash s}$ being
+ − 642
$_{[]}(_{ZZ}\ONE + _{ZS}c ) $
+ − 643
whereas $\bsimp{\bderssimp{a}{s}}$ would be
+ − 644
$_{Z}(_{Z} \ONE + _{S} c)$.
+ − 645
Unfortunately if we apply $\textit{bsimp}$ at different
+ − 646
stages we will always have this discrepancy, due to
+ − 647
whether the $\map \; (\fuse\; bs) \; as$ operation in $\textit{bsimp}$
+ − 648
is taken at some points will be entirely dependant on when the simplification
+ − 649
take place whether there is a larger alternative structure surrounding the
+ − 650
alternative being simplified.
+ − 651
The good thing about $\stackrel{*}{\rightsquigarrow} $ is that it allows
+ − 652
us not specify how exactly the "atomic" simplification steps $\rightsquigarrow$
+ − 653
are taken, but simply say that they can be taken to make two similar
+ − 654
regular expressions equal, and can be done after interleaving derivatives
+ − 655
and simplifications.
+ − 656
+ − 657
+ − 658
Having correctness property is good. But we would also like the lexer to be efficient in
+ − 659
some sense, for exampe, not grinding to a halt at certain cases.
+ − 660
In the next chapter we shall prove that for a given $r$, the internal derivative size is always
+ − 661
finitely bounded by a constant.