532
+ − 1
% Chapter Template
+ − 2
+ − 3
% Main chapter title
+ − 4
\chapter{Correctness of Bit-coded Algorithm with Simplification}
+ − 5
+ − 6
\label{Bitcoded2} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
+ − 7
%Then we illustrate how the algorithm without bitcodes falls short for such aggressive
+ − 8
%simplifications and therefore introduce our version of the bitcoded algorithm and
+ − 9
%its correctness proof in
+ − 10
%Chapter 3\ref{Chapter3}.
+ − 11
+ − 12
+ − 13
583
+ − 14
In this chapter we introduce the simplifications
+ − 15
on annotated regular expressions that can be applied to
+ − 16
each intermediate derivative result. This allows
+ − 17
us to make $\blexer$ much more efficient.
+ − 18
We contrast this simplification function
+ − 19
with Sulzmann and Lu's original
+ − 20
simplifications, indicating the simplicity of our algorithm and
+ − 21
improvements we made, demostrating
+ − 22
the usefulness and reliability of formal proofs on algorithms.
+ − 23
These ``aggressive'' simplifications would not be possible in the injection-based
+ − 24
lexing we introduced in chapter \ref{Inj}.
+ − 25
We then go on to prove the correctness with the improved version of
+ − 26
$\blexer$, called $\blexersimp$, by establishing
+ − 27
$\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system.
538
+ − 28
585
+ − 29
\section{Simplifications by Sulzmann and Lu}
543
+ − 30
The first thing we notice in the fast growth of examples such as $(a^*a^*)^*$'s
+ − 31
and $(a^* + (aa)^*)^*$'s derivatives is that a lot of duplicated sub-patterns
579
+ − 32
are scattered around different levels, and therefore requires
+ − 33
de-duplication at different levels:
543
+ − 34
\begin{center}
585
+ − 35
$(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} $\\
+ − 36
$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} \ldots$
543
+ − 37
\end{center}
+ − 38
\noindent
579
+ − 39
As we have already mentioned in \ref{eqn:growth2},
+ − 40
a simple-minded simplification function cannot simplify
585
+ − 41
the third regular expression in the above chain of derivative
+ − 42
regular expressions:
583
+ − 43
\begin{center}
579
+ − 44
$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
583
+ − 45
\end{center}
585
+ − 46
one would expect a better simplification function to work in the
579
+ − 47
following way:
+ − 48
\begin{gather*}
+ − 49
((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* +
+ − 50
\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\
+ − 51
\bigg\downarrow \\
+ − 52
(a^*a^* + a^*
+ − 53
\color{gray} + a^* \color{black})\cdot(a^*a^*)^* +
+ − 54
\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this} \\
+ − 55
\bigg\downarrow \\
+ − 56
(a^*a^* + a^*
+ − 57
)\cdot(a^*a^*)^*
583
+ − 58
\color{gray} + (a^*a^* + a^*) \cdot(a^*a^*)^*\\
+ − 59
\bigg\downarrow \\
+ − 60
(a^*a^* + a^*
+ − 61
)\cdot(a^*a^*)^*
579
+ − 62
\end{gather*}
+ − 63
\noindent
+ − 64
This motivating example came from testing Sulzmann and Lu's
+ − 65
algorithm: their simplification does
+ − 66
not work!
+ − 67
We quote their $\textit{simp}$ function verbatim here:
+ − 68
\begin{center}
+ − 69
\begin{tabular}{lcl}
585
+ − 70
$\simpsulz \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ &
579
+ − 71
$\textit{if} \; (\textit{zeroable} \; r)\; \textit{then} \;\; \ZERO$\\
+ − 72
& &$\textit{else}\;\; \fuse \; (bs@ bs') \; r$\\
585
+ − 73
$\simpsulz \;(_{bs}r_1\cdot r_2)$ & $\dn$ & $\textit{if}
579
+ − 74
\; (\textit{zeroable} \; r_1 \; \textit{or} \; \textit{zeroable}\; r_2)\;
+ − 75
\textit{then} \;\; \ZERO$\\
585
+ − 76
& & $\textit{else}\;\;_{bs}((\simpsulz \;r_1)\cdot
+ − 77
(\simpsulz \; r_2))$\\
+ − 78
$\simpsulz \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\
+ − 79
$\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
579
+ − 80
$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
585
+ − 81
$\simpsulz \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\simpsulz \; r)$\\
+ − 82
$\simpsulz \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum
+ − 83
(\nub \; (\filter \; (\not \circ \zeroable)\;((\simpsulz \; r) :: \map \; \simpsulz \; rs)))$\\
579
+ − 84
+ − 85
\end{tabular}
+ − 86
\end{center}
+ − 87
\noindent
+ − 88
the $\textit{zeroable}$ predicate
+ − 89
which tests whether the regular expression
+ − 90
is equivalent to $\ZERO$,
+ − 91
is defined as:
+ − 92
\begin{center}
+ − 93
\begin{tabular}{lcl}
+ − 94
$\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\;
+ − 95
\zeroable \;_{[]}\sum\;rs $\\
+ − 96
$\zeroable\;_{bs}(r_1 \cdot r_2)$ & $\dn$ & $\zeroable\; r_1 \;\; \lor \;\; \zeroable \; r_2$\\
+ − 97
$\zeroable\;_{bs}r^*$ & $\dn$ & $\textit{false}$ \\
+ − 98
$\zeroable\;_{bs}c$ & $\dn$ & $\textit{false}$\\
+ − 99
$\zeroable\;_{bs}\ONE$ & $\dn$ & $\textit{false}$\\
+ − 100
$\zeroable\;_{bs}\ZERO$ & $\dn$ & $\textit{true}$
+ − 101
\end{tabular}
+ − 102
\end{center}
+ − 103
\noindent
585
+ − 104
They suggested that the $\simpsulz $ function should be
584
+ − 105
applied repeatedly until a fixpoint is reached.
585
+ − 106
We call this construction $\textit{sulzSimp}$:
+ − 107
\begin{center}
+ − 108
\begin{tabular}{lcl}
+ − 109
$\textit{sulzSimp} \; r$ & $\dn$ &
+ − 110
$\textit{while}((\simpsulz \; r)\; \cancel{=} \; r)$ \\
+ − 111
& & $\quad r := \simpsulz \; r$\\
+ − 112
& & $\textit{return} \; r$
+ − 113
\end{tabular}
+ − 114
\end{center}
+ − 115
We call the operation of alternatingly
+ − 116
applying derivatives and simplifications
+ − 117
(until the string is exhausted) Sulz-simp-derivative,
+ − 118
written $\backslash_{sulzSimp}$:
+ − 119
\begin{center}
+ − 120
\begin{tabular}{lcl}
+ − 121
$r \backslash_{sulzSimp} (c\!::\!s) $ & $\dn$ & $(\textit{sulzSimp} \; (r \backslash c)) \backslash_{sulzSimp}\, s$ \\
+ − 122
$r \backslash_{sulzSimp} [\,] $ & $\dn$ & $r$
+ − 123
\end{tabular}
+ − 124
\end{center}
+ − 125
\noindent
+ − 126
After the derivatives have been taken, the bitcodes
+ − 127
are extracted and decoded in the same manner
+ − 128
as $\blexer$:
+ − 129
\begin{center}
+ − 130
\begin{tabular}{lcl}
+ − 131
$\textit{blexer\_sulzSimp}\;r\,s$ & $\dn$ &
+ − 132
$\textit{let}\;a = (r^\uparrow)\backslash_{sulzSimp}\, s\;\textit{in}$\\
+ − 133
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
+ − 134
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+ − 135
& & $\;\;\textit{else}\;\textit{None}$
+ − 136
\end{tabular}
+ − 137
\end{center}
+ − 138
\noindent
+ − 139
We implemented this lexing algorithm in Scala,
+ − 140
and found that the final derivative regular expression
584
+ − 141
size grows exponentially fast:
+ − 142
\begin{figure}[H]
585
+ − 143
\centering
584
+ − 144
\begin{tikzpicture}
+ − 145
\begin{axis}[
+ − 146
xlabel={$n$},
585
+ − 147
ylabel={size},
584
+ − 148
ymode = log,
585
+ − 149
legend entries={Final Derivative Size},
584
+ − 150
legend pos=north west,
+ − 151
legend cell align=left]
+ − 152
\addplot[red,mark=*, mark options={fill=white}] table {SulzmannLuLexer.data};
+ − 153
\end{axis}
+ − 154
\end{tikzpicture}
585
+ − 155
\caption{Lexing the regular expression $(a^*a^*)^*$ against strings of the form
584
+ − 156
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
585
+ − 157
$ using Sulzmann and Lu's lexer}\label{SulzmannLuLexer}
584
+ − 158
\end{figure}
+ − 159
\noindent
+ − 160
At $n= 20$ we already get an out of memory error with Scala's normal
585
+ − 161
JVM heap size settings.
+ − 162
In fact their simplification does not improve over
+ − 163
the simple-minded simplifications we have shown in \ref{fig:BetterWaterloo}.
+ − 164
The time required also grows exponentially:
+ − 165
\begin{figure}[H]
+ − 166
\centering
+ − 167
\begin{tikzpicture}
+ − 168
\begin{axis}[
+ − 169
xlabel={$n$},
+ − 170
ylabel={time},
+ − 171
ymode = log,
+ − 172
legend entries={time in secs},
+ − 173
legend pos=north west,
+ − 174
legend cell align=left]
+ − 175
\addplot[red,mark=*, mark options={fill=white}] table {SulzmannLuLexerTime.data};
+ − 176
\end{axis}
+ − 177
\end{tikzpicture}
+ − 178
\caption{Lexing the regular expression $(a^*a^*)^*$ against strings of the form
+ − 179
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
+ − 180
$ using Sulzmann and Lu's lexer}\label{SulzmannLuLexerTime}
+ − 181
\end{figure}
+ − 182
\noindent
+ − 183
which seems like a counterexample for
+ − 184
their linear complexity claim:
584
+ − 185
\begin{quote}\it
585
+ − 186
Linear-Time Complexity Claim \\It is easy to see that each call of one of the functions/operations:
+ − 187
simp, fuse, mkEpsBC and isPhi leads to subcalls whose number is bound by the size of the regular expression involved. We claim that thanks to aggressively applying simp this size remains finite. Hence, we can argue that the above mentioned functions/operations have constant time complexity which implies that we can incrementally compute bit-coded parse trees in linear time in the size of the input.
+ − 188
\end{quote}
+ − 189
\noindent
+ − 190
The assumption that the size of the regular expressions
+ − 191
in the algorithm
+ − 192
would stay below a finite constant is not ture.
+ − 193
In addition to that, even if the regular expressions size
+ − 194
do stay finite, one has to take into account that
+ − 195
the $\simpsulz$ function is applied many times
+ − 196
in each derivative step, and that number is not necessarily
+ − 197
a constant with respect to the size of the regular expression.
+ − 198
To not get ``caught off guard'' by
+ − 199
these counterexamples,
+ − 200
one needs to be more careful when designing the
+ − 201
simplification function and making claims about them.
584
+ − 202
585
+ − 203
\section{Our $\textit{Simp}$ Function}
+ − 204
We will now introduce our simplification function,
+ − 205
by making a contrast with $\simpsulz$.
+ − 206
We describe
+ − 207
the ideas behind components in their algorithm
+ − 208
and why they fail to achieve the desired effect, followed
+ − 209
by our solution. These solutions come with correctness
+ − 210
statements that are backed up by formal proofs.
+ − 211
\subsection{Flattening Nested Alternatives}
+ − 212
The idea behind the
+ − 213
\begin{center}
+ − 214
$\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad
+ − 215
_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$
+ − 216
\end{center}
+ − 217
clause is that it allows
+ − 218
duplicate removal of regular expressions at different
+ − 219
levels.
+ − 220
For example, this would help with the
+ − 221
following simplification:
538
+ − 222
585
+ − 223
\begin{center}
+ − 224
$(a+r)+r \longrightarrow a+r$
+ − 225
\end{center}
+ − 226
The problem here is that only the head element
+ − 227
is ``spilled out'',
+ − 228
whereas we would want to flatten
+ − 229
an entire list to open up possibilities for further simplifications.\\
+ − 230
Not flattening the rest of the elements also means that
+ − 231
the later de-duplication processs
+ − 232
does not fully remove apparent duplicates.
+ − 233
For example,
+ − 234
using $\simpsulz$ we could not
+ − 235
simplify
+ − 236
\begin{center}
+ − 237
$((a^* a^*)+ (a^* + a^*))\cdot (a^*a^*)^*+
+ − 238
((a^*a^*)+a^*)\cdot (a^*a^*)^*$
+ − 239
\end{center}
+ − 240
due to the underlined part not in the first element
+ − 241
of the alternative.\\
+ − 242
We define a flatten operation that flattens not only
+ − 243
the first regular expression of an alternative,
+ − 244
but the entire list:
543
+ − 245
\begin{center}
+ − 246
\begin{tabular}{@{}lcl@{}}
+ − 247
$\textit{flts} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
+ − 248
(\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flts} \; as' $ \\
+ − 249
$\textit{flts} \; \ZERO :: as'$ & $\dn$ & $ \textit{flts} \; \textit{as'} $ \\
+ − 250
$\textit{flts} \; a :: as'$ & $\dn$ & $a :: \textit{flts} \; \textit{as'}$ \quad(otherwise)
+ − 251
\end{tabular}
+ − 252
\end{center}
+ − 253
\noindent
585
+ − 254
Our $\flts$ operation
+ − 255
also throws away $\ZERO$s
+ − 256
as they do not contribute to a lexing result.
+ − 257
\subsection{Duplicate Removal}
+ − 258
After flattening is done, we are ready to deduplicate.
+ − 259
The de-duplicate function is called $\distinctBy$,
+ − 260
and that is where we make our second improvement over
+ − 261
Sulzmann and Lu's.
+ − 262
The process goes as follows:
+ − 263
\begin{center}
+ − 264
$rs \stackrel{\textit{flts}}{\longrightarrow}
+ − 265
rs_{flat}
+ − 266
\xrightarrow{\distinctBy \;
+ − 267
rs_{flat} \; \rerases\; \varnothing} rs_{distinct}$
+ − 268
%\stackrel{\distinctBy \;
+ − 269
%rs_{flat} \; \erase\; \varnothing}{\longrightarrow} \; rs_{distinct}$
+ − 270
\end{center}
+ − 271
where the $\distinctBy$ function is defined as:
543
+ − 272
\begin{center}
+ − 273
\begin{tabular}{@{}lcl@{}}
+ − 274
$\distinctBy \; [] \; f\; acc $ & $ =$ & $ []$\\
585
+ − 275
$\distinctBy \; (x :: xs) \; f \; acc$ & $=$ & $\quad \textit{if} (f \; x \in acc)\;\; \textit{then} \;\; \distinctBy \; xs \; f \; acc$\\
+ − 276
& & $\quad \textit{else}\;\; x :: (\distinctBy \; xs \; f \; (\{f \; x\} \cup acc))$
543
+ − 277
\end{tabular}
+ − 278
\end{center}
+ − 279
\noindent
+ − 280
The reason we define a distinct function under a mapping $f$ is because
585
+ − 281
we want to eliminate regular expressions that are syntactically the same,
+ − 282
but with different bit-codes.
+ − 283
For example, we can remove the second $a^*a^*$ from
+ − 284
$_{ZSZ}a^*a^* + _{SZZ}a^*a^*$, because it
+ − 285
represents a match with shorter initial sub-match
+ − 286
(and therefore is definitely not POSIX),
+ − 287
and will be discarded by $\bmkeps$ later.
+ − 288
\begin{center}
+ − 289
$_{ZSZ}\underbrace{a^*}_{ZS:\; match \; 1\; times\quad}\underbrace{a^*}_{Z: \;match\; 1 \;times} +
+ − 290
_{SZZ}\underbrace{a^*}_{S: \; match \; 0 \; times\quad}\underbrace{a^*}_{ZZ: \; match \; 2 \; times}
+ − 291
$
+ − 292
\end{center}
+ − 293
%$_{bs1} r_1 + _{bs2} r_2 \text{where} (r_1)_{\downarrow} = (r_2)_{\downarrow}$
+ − 294
Due to the way our algorithm works,
+ − 295
the matches that conform to the POSIX standard
+ − 296
will always be placed further to the left. When we
+ − 297
traverse the list from left to right,
+ − 298
regular expressions we have already seen
+ − 299
will definitely not contribute to a POSIX value,
+ − 300
even if they are attached with different bitcodes.
+ − 301
These duplicates therefore need to be removed.
+ − 302
To achieve this, we call $\rerases$ as the function $f$ during the distinction
+ − 303
operation.\\
+ − 304
$\rerases$ is very similar to $\erase$, except that it preserves the structure
+ − 305
when erasing an alternative regular expression.
+ − 306
The reason why we use $\rerases$ instead of $\erase$ is that
+ − 307
it keeps the structures of alternative
+ − 308
annotated regular expressions
+ − 309
whereas $\erase$ would turn it back into a binary structure.
+ − 310
Not having to mess with the structure
590
+ − 311
greatly simplifies the finiteness proof in chapter
+ − 312
\ref{Finite} (we will follow up with more details there).
585
+ − 313
We give the definitions of $\rerases$ here together with
+ − 314
the new datatype used by $\rerases$ (as our plain
590
+ − 315
regular expression datatype does not allow non-binary alternatives).
585
+ − 316
For the moment the reader can just think of
+ − 317
$\rerases$ as $\erase$ and $\rrexp$ as plain regular expressions.
590
+ − 318
\begin{figure}[H]
+ − 319
\begin{center}
+ − 320
$\rrexp ::= \RZERO \mid \RONE
585
+ − 321
\mid \RCHAR{c}
+ − 322
\mid \RSEQ{r_1}{r_2}
+ − 323
\mid \RALTS{rs}
590
+ − 324
\mid \RSTAR{r} $
+ − 325
\end{center}
+ − 326
\caption{$\rrexp$: plain regular expressions, but with $\sum$ alternative
+ − 327
constructor}\label{rrexpDef}
+ − 328
\end{figure}
585
+ − 329
The notation of $\rerases$ also follows that of $\erase$,
+ − 330
which is a postfix operator written as a subscript,
+ − 331
except that it has an \emph{r} attached to it to distinguish against $\erase$:
+ − 332
\begin{center}
+ − 333
\begin{tabular}{lcl}
+ − 334
$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
+ − 335
$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
+ − 336
$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
+ − 337
$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
+ − 338
$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
+ − 339
$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$
+ − 340
\end{tabular}
+ − 341
\end{center}
+ − 342
+ − 343
\subsection{Putting Things Together}
543
+ − 344
A recursive definition of our simplification function
585
+ − 345
is given below:
+ − 346
%that looks somewhat similar to our Scala code is
538
+ − 347
\begin{center}
+ − 348
\begin{tabular}{@{}lcl@{}}
+ − 349
543
+ − 350
$\textit{bsimp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ \textit{bsimp}_{ASEQ} \; bs \;(\textit{bsimp} \; a_1) \; (\textit{bsimp} \; a_2) $ \\
585
+ − 351
$\textit{bsimp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{bsimp}_{ALTS} \; \textit{bs} \; (\textit{distinctBy} \; ( \textit{flatten} ( \textit{map} \; bsimp \; as)) \; \rerases \; \varnothing) $ \\
543
+ − 352
$\textit{bsimp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
538
+ − 353
\end{tabular}
+ − 354
\end{center}
+ − 355
+ − 356
\noindent
585
+ − 357
The simplification (named $\textit{bsimp}$ for \emph{b}it-coded)
+ − 358
does a pattern matching on the regular expression.
538
+ − 359
When it detected that the regular expression is an alternative or
543
+ − 360
sequence, it will try to simplify its children regular expressions
538
+ − 361
recursively and then see if one of the children turns into $\ZERO$ or
+ − 362
$\ONE$, which might trigger further simplification at the current level.
543
+ − 363
Current level simplifications are handled by the function $\textit{bsimp}_{ASEQ}$,
+ − 364
using rules such as $\ZERO \cdot r \rightarrow \ZERO$ and $\ONE \cdot r \rightarrow r$.
+ − 365
\begin{center}
+ − 366
\begin{tabular}{@{}lcl@{}}
+ − 367
$\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\
+ − 368
&&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
+ − 369
&&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
+ − 370
&&$\quad\textit{case} \; (_{bs1}\ONE, a_2') \Rightarrow \textit{fuse} \; (bs@bs_1) \; a_2'$ \\
+ − 371
&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$
+ − 372
\end{tabular}
+ − 373
\end{center}
538
+ − 374
\noindent
543
+ − 375
The most involved part is the $\sum$ clause, where we first call $\flts$ on
+ − 376
the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$.
+ − 377
and then call $\distinctBy$ on that list, the predicate determining whether two
585
+ − 378
elements are the same is $\rerases \; r_1 = \rerases\; r_2$.
543
+ − 379
Finally, depending on whether the regular expression list $as'$ has turned into a
+ − 380
singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{AALTS}$
+ − 381
decides whether to keep the current level constructor $\sum$ as it is, and
+ − 382
removes it when there are less than two elements:
+ − 383
\begin{center}
+ − 384
\begin{tabular}{lcl}
+ − 385
$\textit{bsimp}_{AALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\
+ − 386
&&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
+ − 387
&&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\
+ − 388
&&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\
+ − 389
\end{tabular}
+ − 390
+ − 391
\end{center}
+ − 392
Having defined the $\bsimp$ function,
+ − 393
we add it as a phase after a derivative is taken,
+ − 394
so it stays small:
+ − 395
\begin{center}
+ − 396
\begin{tabular}{lcl}
+ − 397
$r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$
+ − 398
\end{tabular}
+ − 399
\end{center}
585
+ − 400
%Following previous notations
+ − 401
%when extending from derivatives w.r.t.~character to derivative
+ − 402
%w.r.t.~string, we define the derivative that nests simplifications
+ − 403
%with derivatives:%\comment{simp in the [] case?}
+ − 404
We extend this from character to string:
538
+ − 405
\begin{center}
+ − 406
\begin{tabular}{lcl}
543
+ − 407
$r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\
+ − 408
$r \backslash_{bsimps} [\,] $ & $\dn$ & $r$
538
+ − 409
\end{tabular}
+ − 410
\end{center}
+ − 411
+ − 412
\noindent
585
+ − 413
The lexer that extracts bitcodes from the
+ − 414
derivatives with simplifications from our $\simp$ function
+ − 415
is called $\blexersimp$:
+ − 416
\begin{center}
538
+ − 417
\begin{tabular}{lcl}
+ − 418
$\textit{blexer\_simp}\;r\,s$ & $\dn$ &
+ − 419
$\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
+ − 420
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
+ − 421
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+ − 422
& & $\;\;\textit{else}\;\textit{None}$
+ − 423
\end{tabular}
+ − 424
\end{center}
+ − 425
+ − 426
\noindent
585
+ − 427
This algorithm keeps the regular expression size small.
538
+ − 428
+ − 429
585
+ − 430
\subsection{$(a+aa)^*$ and $(a^*\cdot a^*)^*$ against
+ − 431
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification}
+ − 432
For example,
+ − 433
with our simplification the
+ − 434
previous $(a^*a^*)^*$ example
+ − 435
where $\simpsulz$ could not
+ − 436
stop the fast growth (over
+ − 437
3 million nodes just below $20$ input length)
+ − 438
will be reduced to just 15 and stays constant, no matter how long the
+ − 439
input string is.
+ − 440
This is demonstrated in the graphs below.
+ − 441
\begin{figure}[H]
543
+ − 442
\begin{center}
+ − 443
\begin{tabular}{ll}
+ − 444
\begin{tikzpicture}
+ − 445
\begin{axis}[
+ − 446
xlabel={$n$},
+ − 447
ylabel={derivative size},
+ − 448
width=7cm,
+ − 449
height=4cm,
+ − 450
legend entries={Lexer with $\textit{bsimp}$},
539
+ − 451
legend pos= south east,
+ − 452
legend cell align=left]
+ − 453
\addplot[red,mark=*, mark options={fill=white}] table {BitcodedLexer.data};
+ − 454
\end{axis}
+ − 455
\end{tikzpicture} %\label{fig:BitcodedLexer}
+ − 456
&
+ − 457
\begin{tikzpicture}
+ − 458
\begin{axis}[
+ − 459
xlabel={$n$},
+ − 460
ylabel={derivative size},
+ − 461
width = 7cm,
+ − 462
height = 4cm,
585
+ − 463
legend entries={Lexer with $\simpsulz$},
539
+ − 464
legend pos= north west,
+ − 465
legend cell align=left]
+ − 466
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
+ − 467
\end{axis}
+ − 468
\end{tikzpicture}
+ − 469
\end{tabular}
543
+ − 470
\end{center}
585
+ − 471
\caption{Our Improvement over Sulzmann and Lu's in terms of size}
+ − 472
\end{figure}
+ − 473
\noindent
+ − 474
Given the size difference, it is not
+ − 475
surprising that our $\blexersimp$ significantly outperforms
+ − 476
$\textit{blexer\_sulzSimp}$.
+ − 477
In the next section we are going to establish the
+ − 478
first important property of our lexer--the correctness.
543
+ − 479
%----------------------------------------------------------------------------------------
+ − 480
% SECTION rewrite relation
+ − 481
%----------------------------------------------------------------------------------------
585
+ − 482
\section{Correctness of $\blexersimp$}
+ − 483
In this section we give details
+ − 484
of the correctness proof of $\blexersimp$,
+ − 485
an important contribution of this thesis.\\
+ − 486
We first introduce the rewriting relation \emph{rrewrite}
+ − 487
($\rrewrite$) between two regular expressions,
+ − 488
which expresses an atomic
+ − 489
simplification step from the left-hand-side
+ − 490
to the right-hand-side.
+ − 491
We then prove properties about
+ − 492
this rewriting relation and its reflexive transitive closure.
+ − 493
Finally we leverage these properties to show
+ − 494
an equivalence between the internal data structures of
+ − 495
$\blexer$ and $\blexersimp$.
+ − 496
+ − 497
\subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)}
576
+ − 498
In the $\blexer$'s correctness proof, we
+ − 499
did not directly derive the fact that $\blexer$ gives out the POSIX value,
585
+ − 500
but first proved that $\blexer$ is linked with $\lexer$.
576
+ − 501
Then we re-use
585
+ − 502
the correctness of $\lexer$
+ − 503
to obtain
+ − 504
\begin{center}
+ − 505
$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$.
+ − 506
\end{center}
+ − 507
Here we apply this
+ − 508
modularised technique again
+ − 509
by first proving that
576
+ − 510
$\blexersimp \; r \; s $
+ − 511
produces the same output as $\blexer \; r\; s$,
585
+ − 512
and then piecing it together with
+ − 513
$\blexer$'s correctness to achieve our main
+ − 514
theorem:\footnote{ the case when
+ − 515
$s$ is not in $L \; r$, is routine to establish }
576
+ − 516
\begin{center}
585
+ − 517
$(r, s) \rightarrow v \; \; \textit{iff} \;\; \blexersimp \; r \; s = v$
576
+ − 518
\end{center}
+ − 519
\noindent
585
+ − 520
The overall idea for the proof
+ − 521
of $\blexer \;r \;s = \blexersimp \; r \;s$
+ − 522
is that the transition from $r$ to $\textit{bsimp}\; r$ can be
+ − 523
broken down into finitely many rewrite steps:
+ − 524
\begin{center}
+ − 525
$r \rightsquigarrow^* \textit{bsimp} \; r$
+ − 526
\end{center}
+ − 527
where each rewrite step, written $\rightsquigarrow$,
+ − 528
is an ``atomic'' simplification that
+ − 529
cannot be broken down any further:
+ − 530
\begin{figure}[H]
+ − 531
\begin{mathpar}
+ − 532
\inferrule * [Right = $S\ZERO_l$]{\vspace{0em}}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\}
538
+ − 533
585
+ − 534
\inferrule * [Right = $S\ZERO_r$]{\vspace{0em}}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\}
543
+ − 535
585
+ − 536
\inferrule * [Right = $S_1$]{\vspace{0em}}{_{bs1} ((_{bs2} \ONE) \cdot r) \rightsquigarrow \fuse \; (bs_1 @ bs_2) \; r\\}\\
543
+ − 537
+ − 538
+ − 539
585
+ − 540
\inferrule * [Right = $SL$] {\\ r_1 \rightsquigarrow r_2}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_2 \cdot r_3\\}
543
+ − 541
585
+ − 542
\inferrule * [Right = $SR$] {\\ r_3 \rightsquigarrow r_4}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_1 \cdot r_4\\}\\
543
+ − 543
585
+ − 544
\inferrule * [Right = $A0$] {\vspace{0em}}{ _{bs}\sum [] \rightsquigarrow \ZERO}
543
+ − 545
585
+ − 546
\inferrule * [Right = $A1$] {\vspace{0em}}{ _{bs}\sum [a] \rightsquigarrow \fuse \; bs \; a}
543
+ − 547
585
+ − 548
\inferrule * [Right = $AL$] {\\ rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{_{bs}\sum rs_1 \rightsquigarrow rs_2}
543
+ − 549
585
+ − 550
\inferrule * [Right = $LE$] {\vspace{0em}}{ [] \stackrel{s}{\rightsquigarrow} []}
543
+ − 551
585
+ − 552
\inferrule * [Right = $LT$] {rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{ r :: rs_1 \stackrel{s}{\rightsquigarrow} r :: rs_2 }
543
+ − 553
585
+ − 554
\inferrule * [Right = $LH$] {r_1 \rightsquigarrow r_2}{ r_1 :: rs \stackrel{s}{\rightsquigarrow} r_2 :: rs}
543
+ − 555
585
+ − 556
\inferrule * [Right = $L\ZERO$] {\vspace{0em}}{\ZERO :: rs \stackrel{s}{\rightsquigarrow} rs}
543
+ − 557
585
+ − 558
\inferrule * [Right = $LS$] {\vspace{0em}}{_{bs} \sum (rs_1 :: rs_b) \stackrel{s}{\rightsquigarrow} ((\map \; (\fuse \; bs_1) \; rs_1) @ rsb) }
543
+ − 559
585
+ − 560
\inferrule * [Right = $LD$] {\\ \rerase{a_1} = \rerase{a_2}}{rs_a @ [a_1] @ rs_b @ [a_2] @ rsc \stackrel{s}{\rightsquigarrow} rs_a @ [a_1] @ rs_b @ rs_c}
543
+ − 561
+ − 562
\end{mathpar}
585
+ − 563
\caption{
+ − 564
The rewrite rules that generate simplified regular expressions
+ − 565
in small steps: $r_1 \rightsquigarrow r_2$ is for bitcoded regular expressions
+ − 566
and $rs_1 \stackrel{s}{\rightsquigarrow} rs_2$ for
+ − 567
lists of bitcoded regular expressions.
+ − 568
Interesting is the LD rule that allows copies of regular expressions
+ − 569
to be removed provided a regular expression
+ − 570
earlier in the list can match the same strings.
+ − 571
}\label{rrewriteRules}
+ − 572
\end{figure}
+ − 573
\noindent
+ − 574
The rules such as $LT$ and $LH$ are for rewriting between two regular expression lists
+ − 575
such that one regular expression
+ − 576
in the left-hand-side list is rewritable in one step
+ − 577
to the right-hand-side's regular expression at the same position.
+ − 578
This helps with defining the ``context rules'' such as $AL$.\\
+ − 579
The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$
+ − 580
are defined in the usual way:
+ − 581
\begin{figure}[H]
+ − 582
\centering
+ − 583
\begin{mathpar}
+ − 584
\inferrule{\vspace{0em}}{ r \rightsquigarrow^* r \\}
543
+ − 585
585
+ − 586
\inferrule{\vspace{0em}}{rs \stackrel{s*}{\rightsquigarrow} rs \\}
543
+ − 587
585
+ − 588
\inferrule{r_1 \rightsquigarrow^* r_2 \land \; r_2 \rightsquigarrow^* r_3}{r_1 \rightsquigarrow^* r_3\\}
+ − 589
+ − 590
\inferrule{rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \land \; rs_2 \stackrel{s*}{\rightsquigarrow} rs_3}{rs_1 \stackrel{s*}{\rightsquigarrow} rs_3}
+ − 591
\end{mathpar}
+ − 592
\caption{The Reflexive Transitive Closure of
+ − 593
$\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$}\label{transClosure}
+ − 594
\end{figure}
+ − 595
Two rewritable terms will remain rewritable to each other
+ − 596
even after a derivative is taken:
543
+ − 597
\begin{center}
585
+ − 598
$r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$
543
+ − 599
\end{center}
585
+ − 600
And finally, if two terms are rewritable to each other,
+ − 601
then they produce the same bitcodes:
+ − 602
\begin{center}
+ − 603
$r \rightsquigarrow^* r' \;\; \textit{then} \; \; \bmkeps \; r = \bmkeps \; r'$
+ − 604
\end{center}
+ − 605
The decoding phase of both $\blexer$ and $\blexersimp$
+ − 606
are the same, which means that if they get the same
+ − 607
bitcodes before the decoding phase,
+ − 608
they get the same value after decoding is done.
+ − 609
We will prove the three properties
+ − 610
we mentioned above in the next sub-section.
+ − 611
\subsection{Important Properties of $\rightsquigarrow$}
+ − 612
First we prove some basic facts
+ − 613
about $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$,
+ − 614
$\rightsquigarrow^*$ and $\stackrel{s*}{\rightsquigarrow}$,
+ − 615
which will be needed later.\\
+ − 616
The inference rules (\ref{rrewriteRules}) we
+ − 617
gave in the previous section
+ − 618
have their ``many-steps version'':
543
+ − 619
586
+ − 620
\begin{lemma}\label{squig1}
585
+ − 621
\hspace{0em}
+ − 622
\begin{itemize}
+ − 623
\item
+ − 624
$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies _{bs} \sum rs_1 \stackrel{*}{\rightsquigarrow} _{bs} \sum rs_2$
+ − 625
\item
586
+ − 626
$r \rightsquigarrow^* r' \implies _{bs} \sum (r :: rs)\; \rightsquigarrow^*\; _{bs} \sum (r' :: rs)$
+ − 627
+ − 628
\item
+ − 629
The rewriting in many steps property is composible
+ − 630
in terms of the sequence constructor:\\
+ − 631
$r_1 \rightsquigarrow^* r_2
+ − 632
\implies _{bs} r_1 \cdot r_3 \rightsquigarrow^* \;
+ − 633
_{bs} r_2 \cdot r_3 \quad $
+ − 634
and
+ − 635
$\quad r_3 \rightsquigarrow^* r_4
+ − 636
\implies _{bs} r_1 \cdot r_3 \rightsquigarrow^* _{bs} \; r_1 \cdot r_4$
+ − 637
\item
+ − 638
The rewriting in many steps properties
+ − 639
$\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$
+ − 640
is preserved under the function $\fuse$:\\
+ − 641
$r_1 \rightsquigarrow^* r_2
+ − 642
\implies \fuse \; bs \; r_1 \rightsquigarrow^* \;
+ − 643
\fuse \; bs \; r_2 \quad $ and
+ − 644
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2
+ − 645
\implies \map \; (\fuse \; bs) \; rs_1
+ − 646
\stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs) \; rs_2$
585
+ − 647
\end{itemize}
543
+ − 648
\end{lemma}
+ − 649
\begin{proof}
585
+ − 650
By an induction on
+ − 651
the inductive cases of $\stackrel{s*}{\rightsquigarrow}$ and $\rightsquigarrow^*$ respectively.
586
+ − 652
The third and fourth points are
+ − 653
by the properties $r_1 \rightsquigarrow r_2 \implies \fuse \; bs \; r_1 \implies \fuse \; bs \; r_2$ and
+ − 654
$rs_2 \stackrel{s}{\rightsquigarrow} rs_3
+ − 655
\implies \map \; (\fuse \; bs) rs_2 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs)\; rs_3$,
+ − 656
which can be indutively proven by the inductive cases of $\rightsquigarrow$ and
+ − 657
$\stackrel{s}{\rightsquigarrow}$.
543
+ − 658
\end{proof}
+ − 659
\noindent
585
+ − 660
The inference rules of $\stackrel{s}{\rightsquigarrow}$
+ − 661
are defined in terms of list cons operation, here
+ − 662
we establish that the
+ − 663
$\stackrel{s}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$
586
+ − 664
relation is also preserved w.r.t appending and prepending of a list.
+ − 665
In addition, we
+ − 666
also prove some relations
+ − 667
between $\rightsquigarrow^*$ and $\stackrel{s*}{\rightsquigarrow}$.
585
+ − 668
\begin{lemma}\label{ssgqTossgs}
+ − 669
\hspace{0em}
+ − 670
\begin{itemize}
+ − 671
\item
+ − 672
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s}{\rightsquigarrow} rs @ rs_2$
+ − 673
+ − 674
\item
+ − 675
$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies
586
+ − 676
rs @ rs_1 \stackrel{s*}{\rightsquigarrow} rs @ rs_2 \; \;
+ − 677
\textit{and} \; \;
+ − 678
rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$
+ − 679
585
+ − 680
\item
+ − 681
The $\stackrel{s}{\rightsquigarrow} $ relation after appending
+ − 682
a list becomes $\stackrel{s*}{\rightsquigarrow}$:\\
586
+ − 683
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2
+ − 684
\implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$
+ − 685
\item
+ − 686
+ − 687
$r_1 \rightsquigarrow^* r_2 \implies [r_1] \stackrel{s*}{\rightsquigarrow} [r_2]$
+ − 688
\item
+ − 689
+ − 690
$rs_3 \stackrel{s*}{\rightsquigarrow} rs_4 \land r_1 \rightsquigarrow^* r_2 \implies
+ − 691
r_2 :: rs_3 \stackrel{s*}{\rightsquigarrow} r_2 :: rs_4$
+ − 692
\item
+ − 693
If we could rewrite a regular expression
+ − 694
in many steps to $\ZERO$, then
+ − 695
we could also rewrite any sequence containing it to $\ZERO$:\\
+ − 696
$r_1 \rightsquigarrow^* \ZERO
+ − 697
\implies _{bs}r_1\cdot r_2 \rightsquigarrow^* \ZERO$
585
+ − 698
\end{itemize}
543
+ − 699
\end{lemma}
+ − 700
\begin{proof}
585
+ − 701
The first part is by induction on the list $rs$.
+ − 702
The second part is by induction on the inductive cases of $\stackrel{s*}{\rightsquigarrow}$.
+ − 703
The third part is
+ − 704
by rule induction of $\stackrel{s}{\rightsquigarrow}$.
586
+ − 705
The fourth sub-lemma is
+ − 706
by rule induction of
+ − 707
$\stackrel{s*}{\rightsquigarrow}$ and using part one to three.
+ − 708
The fifth part is a corollary of part four.
+ − 709
The last part is proven by rule induction again on $\rightsquigarrow^*$.
543
+ − 710
\end{proof}
586
+ − 711
\noindent
+ − 712
Now we are ready to give the proofs of the below properties:
585
+ − 713
\begin{itemize}
+ − 714
\item
586
+ − 715
$(r \rightsquigarrow^* r'\land \bnullable \; r_1)
585
+ − 716
\implies \bmkeps \; r = \bmkeps \; r'$. \\
+ − 717
\item
+ − 718
$r \rightsquigarrow^* \textit{bsimp} \;r$.\\
+ − 719
\item
+ − 720
$r \rightsquigarrow r' \implies r \backslash c \rightsquigarrow^* r'\backslash c$.\\
+ − 721
\end{itemize}
+ − 722
These properties would work together towards the correctness theorem.
586
+ − 723
\subsubsection{Property 1: $(r \rightsquigarrow^* r'\land \bnullable \; r_1)
+ − 724
\implies \bmkeps \; r = \bmkeps \; r'$}
+ − 725
Intuitively, this property says we can
+ − 726
extract the same bitcodes using $\bmkeps$ from the nullable
+ − 727
components of two regular expressions $r$ and $r'$,
+ − 728
if we can rewrite from one to the other in finitely
+ − 729
many steps.\\
+ − 730
For convenience,
+ − 731
we define a predicate for a list of regular expressions
+ − 732
having at least one nullable regular expressions:
+ − 733
\begin{center}
+ − 734
$\textit{bnullables} \; rs \quad \dn \quad \exists r \in rs. \;\; \bnullable \; r$
+ − 735
\end{center}
+ − 736
\noindent
+ − 737
The rewriting relation $\rightsquigarrow$ preserves nullability:
+ − 738
\begin{lemma}\label{rewritesBnullable}
+ − 739
\hspace{0em}
+ − 740
\begin{itemize}
+ − 741
\item
+ − 742
$\text{If} \; r_1 \rightsquigarrow r_2, \;
+ − 743
\text{then} \; \bnullable \; r_1 = \bnullable \; r_2$
+ − 744
\item
+ − 745
$\text{If} \; rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \;
+ − 746
\text{then} \; \textit{bnullables} \; rs_1 = \textit{bnullables} \; rs_2$
+ − 747
\item
+ − 748
$r_1 \rightsquigarrow^* r_2
+ − 749
\implies \bnullable \; r_1 = \bnullable \; r_2$
+ − 750
\end{itemize}
+ − 751
\end{lemma}
+ − 752
\begin{proof}
+ − 753
By rule induction of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$.
+ − 754
The third point is a corollary of the second.
+ − 755
\end{proof}
+ − 756
\noindent
+ − 757
For convenience again,
+ − 758
we define $\bmkepss$ on a list $rs$,
+ − 759
which extracts the bit-codes on the first $\bnullable$ element in $rs$:
+ − 760
\begin{center}
+ − 761
\begin{tabular}{lcl}
+ − 762
$\bmkepss \; [] $ & $\dn$ & $[]$\\
+ − 763
$\bmkepss \; r :: rs$ & $\dn$ & $\textit{if} \;(\bnullable \; r) \;\;
+ − 764
\textit{then} \;\; \bmkeps \; r \; \textit{else} \;\; \bmkepss \; rs$
+ − 765
\end{tabular}
+ − 766
\end{center}
+ − 767
\noindent
+ − 768
If both regular expressions in a rewriting relation are nullable, then they
+ − 769
produce the same bitcodes:
+ − 770
\begin{lemma}\label{rewriteBmkepsAux}
+ − 771
\hspace{0em}
+ − 772
\begin{itemize}
+ − 773
\item
+ − 774
$r_1 \rightsquigarrow r_2 \implies
+ − 775
(\bnullable \; r_1 \land \bnullable \; r_2 \implies \bmkeps \; r_1 =
+ − 776
\bmkeps \; r_2)$
+ − 777
\item
+ − 778
and
+ − 779
$rs_ 1 \stackrel{s}{\rightsquigarrow} rs_2
+ − 780
\implies (\bnullables \; rs_1 \land \bnullables \; rs_2 \implies
+ − 781
\bmkepss \; rs_1 = \bmkepss \; rs2)$
+ − 782
\end{itemize}
+ − 783
\end{lemma}
+ − 784
\begin{proof}
+ − 785
By rule induction over the cases that lead to $r_1 \rightsquigarrow r_2$.
+ − 786
\end{proof}
+ − 787
\noindent
+ − 788
With lemma \ref{rewriteBmkepsAux} we are ready to prove its
+ − 789
many-step version:
+ − 790
\begin{lemma}
+ − 791
$\text{If} \;\; r \stackrel{*}{\rightsquigarrow} r' \;\; \text{and} \;\; \bnullable \; r, \;\;\; \text{then} \;\; \bmkeps \; r = \bmkeps \; r'$
+ − 792
\end{lemma}
+ − 793
\begin{proof}
+ − 794
By rule induction of $\stackrel{*}{\rightsquigarrow} $.
+ − 795
$\ref{rewritesBnullable}$ tells us both $r$ and $r'$ are nullable.
+ − 796
\ref{rewriteBmkepsAux} solves the inductive case.
+ − 797
\end{proof}
585
+ − 798
586
+ − 799
\subsubsection{Property 2: $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$}
+ − 800
Now we get to the ``meaty'' part of the proof,
+ − 801
which says that our simplification's helper functions
+ − 802
such as $\distinctBy$ and $\flts$ conform to
+ − 803
the $\stackrel{s*}{\rightsquigarrow}$ and
+ − 804
$\rightsquigarrow^* $ rewriting relations.\\
+ − 805
The first lemma to prove is a more general version of
+ − 806
$rs_ 1 \rightsquigarrow^* \distinctBy \; rs_1 \; \phi$:
543
+ − 807
\begin{lemma}
586
+ − 808
$rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow}
+ − 809
(rs_1 @ (\distinctBy \; rs_2 \; \; \rerases \;\; (\map\;\; \rerases \; \; rs_1)))$
543
+ − 810
\end{lemma}
586
+ − 811
\noindent
+ − 812
It says that that for a list made of two parts $rs_1 @ rs_2$,
+ − 813
one can throw away the duplicate
+ − 814
elements in $rs_2$, as well as those that have appeared in $rs_1$.
543
+ − 815
\begin{proof}
+ − 816
By induction on $rs_2$, where $rs_1$ is allowed to be arbitrary.
+ − 817
\end{proof}
586
+ − 818
\noindent
+ − 819
Setting $rs_2$ to be empty,
+ − 820
we get the corollary
+ − 821
\begin{corollary}\label{dBPreserves}
+ − 822
$rs_1 \stackrel{s*}{\rightsquigarrow} \distinctBy \; rs_1 \; \phi$.
+ − 823
\end{corollary}
+ − 824
\noindent
+ − 825
The flatten function $\flts$ conforms to
+ − 826
$\stackrel{s*}{\rightsquigarrow}$ as well:
538
+ − 827
543
+ − 828
\begin{lemma}\label{fltsPreserves}
+ − 829
$rs \stackrel{s*}{\rightsquigarrow} \flts \; rs$
+ − 830
\end{lemma}
+ − 831
\begin{proof}
586
+ − 832
By an induction on $rs$.
543
+ − 833
\end{proof}
+ − 834
\noindent
+ − 835
The function $\bsimpalts$ preserves rewritability:
+ − 836
\begin{lemma}\label{bsimpaltsPreserves}
+ − 837
$_{bs} \sum rs \stackrel{*}{\rightsquigarrow} \bsimpalts \; _{bs} \; rs$
+ − 838
\end{lemma}
+ − 839
\noindent
586
+ − 840
The simplification function
+ − 841
$\textit{bsimp}$ only transforms the regex $r$ using steps specified by
+ − 842
$\rightsquigarrow^*$ and nothing else.
543
+ − 843
\begin{lemma}
+ − 844
$r \stackrel{*}{\rightsquigarrow} \bsimp{r}$
+ − 845
\end{lemma}
+ − 846
\begin{proof}
+ − 847
By an induction on $r$.
586
+ − 848
The most involved case would be the alternative,
+ − 849
where we use lemmas \ref{bsimpaltsPreserves},
+ − 850
\ref{fltsPreserves} and \ref{dBPreserves} to do a series of rewriting:\\
+ − 851
\begin{center}
+ − 852
\begin{tabular}{lcl}
+ − 853
$rs$ & $\stackrel{s*}{\rightsquigarrow}$ & $ \map \; \textit{bsimp} \; rs$\\
+ − 854
& $\stackrel{s*}{\rightsquigarrow}$ & $ \flts \; (\map \; \textit{bsimp} \; rs)$\\
+ − 855
& $\stackrel{s*}{\rightsquigarrow}$ & $ \distinctBy \;
+ − 856
(\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerases \; \phi$\\
+ − 857
\end{tabular}
+ − 858
\end{center}
+ − 859
Using this we derive the following rewrite relation:\\
+ − 860
\begin{center}
+ − 861
\begin{tabular}{lcl}
+ − 862
$r$ & $=$ & $_{bs}\sum rs$\\[1.5ex]
+ − 863
& $\rightsquigarrow^*$ & $\bsimpalts \; bs \; rs$ \\[1.5ex]
+ − 864
& $\rightsquigarrow^*$ & $\ldots$ \\ [1.5ex]
+ − 865
& $\rightsquigarrow^*$ & $\bsimpalts \; bs \;
+ − 866
(\distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs))
+ − 867
\; \rerases \; \phi)$\\[1.5ex]
+ − 868
%& $\rightsquigarrow^*$ & $ _{bs} \sum (\distinctBy \;
+ − 869
%(\flts \; (\map \; \textit{bsimp}\; rs)) \; \;
+ − 870
%\rerases \; \;\phi) $\\[1.5ex]
+ − 871
& $\rightsquigarrow^*$ & $\textit{bsimp} \; r$\\[1.5ex]
+ − 872
\end{tabular}
+ − 873
\end{center}
543
+ − 874
\end{proof}
585
+ − 875
\subsubsection{Property 3: $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$}
586
+ − 876
The rewritability relation
+ − 877
$\rightsquigarrow$ is preserved under derivatives--
+ − 878
it is just that we might need multiple steps
588
+ − 879
where originally only one step was needed:
543
+ − 880
\begin{lemma}\label{rewriteBder}
588
+ − 881
\hspace{0em}
+ − 882
\begin{itemize}
+ − 883
\item
+ − 884
If $r_1 \rightsquigarrow r_2$, then $r_1 \backslash c
+ − 885
\rightsquigarrow^* r_2 \backslash c$
+ − 886
\item
+ − 887
If $rs_1 \stackrel{s}{\rightsquigarrow} rs_2$, then $
+ − 888
\map \; (\_\backslash c) \; rs_1
+ − 889
\stackrel{s*}{\rightsquigarrow} \map \; (\_ \backslash c) \; rs_2$
+ − 890
\end{itemize}
543
+ − 891
\end{lemma}
+ − 892
\begin{proof}
588
+ − 893
By induction on $\rightsquigarrow$
+ − 894
and $\stackrel{s}{\rightsquigarrow}$, using a number of the previous lemmas.
+ − 895
\end{proof}
+ − 896
\noindent
+ − 897
Now we can prove property 3, as an immediate corollary:
+ − 898
\begin{corollary}\label{rewritesBder}
+ − 899
$r_1 \rightsquigarrow^* r_2 \implies r_1 \backslash c \rightsquigarrow^*
+ − 900
r_2 \backslash c$
+ − 901
\end{corollary}
+ − 902
\begin{proof}
+ − 903
By rule induction of $\stackrel{*}{\rightsquigarrow} $ and using the previous lemma \ref{rewriteBder}.
543
+ − 904
\end{proof}
+ − 905
\noindent
588
+ − 906
This can be extended and combined with $r \rightsquigarrow^* \textit{bsimp} \; r$
+ − 907
to obtain the rewritability between
+ − 908
$\blexer$ and $\blexersimp$'s intermediate
+ − 909
derivative regular expressions
543
+ − 910
\begin{lemma}\label{bderBderssimp}
588
+ − 911
$a \backslash s \rightsquigarrow^* \bderssimp{a}{s} $
543
+ − 912
\end{lemma}
588
+ − 913
\begin{proof}
+ − 914
By an induction on $s$.
+ − 915
\end{proof}
543
+ − 916
\subsection{Main Theorem}
588
+ − 917
Now with \ref{bderBderssimp} we are ready for the main theorem.
543
+ − 918
\begin{theorem}
+ − 919
$\blexer \; r \; s = \blexersimp{r}{s}$
+ − 920
\end{theorem}
+ − 921
\noindent
576
+ − 922
\begin{proof}
588
+ − 923
One can rewrite in many steps from the original lexer's
+ − 924
derivative regular expressions to the
+ − 925
lexer with simplification applied (by lemma \ref{bderBderssimp}):
+ − 926
\begin{center}
+ − 927
$a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $.
+ − 928
\end{center}
+ − 929
we know that they give out the same bits, if the lexing result is a match:
+ − 930
\begin{center}
+ − 931
$\bnullable \; (a \backslash s)
+ − 932
\implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$
+ − 933
\end{center}
+ − 934
Now that they give out the same bits, we know that they give the same value after decoding.
+ − 935
\begin{center}
+ − 936
$\bnullable \; (a \backslash s)
+ − 937
\implies \decode \; r \; (\bmkeps \; (a \backslash s)) =
+ − 938
\decode \; r \; (\bmkeps \; (\bderssimp{a}{s}))$
+ − 939
\end{center}
+ − 940
Which is equivalent to our proof goal:
+ − 941
\begin{center}
+ − 942
$\blexer \; r \; s = \blexersimp \; r \; s$.
+ − 943
\end{center}
576
+ − 944
\end{proof}
+ − 945
\noindent
+ − 946
As a corollary,
+ − 947
we link this result with the lemma we proved earlier that
+ − 948
\begin{center}
589
+ − 949
$(r, s) \rightarrow v \;\; \textit{iff}\;\; \blexer \; r \; s = v$
576
+ − 950
\end{center}
+ − 951
and obtain the corollary that the bit-coded lexer with simplification is
+ − 952
indeed correctly outputting POSIX lexing result, if such a result exists.
+ − 953
\begin{corollary}
589
+ − 954
$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s $
576
+ − 955
\end{corollary}
532
+ − 956
543
+ − 957
\subsection{Comments on the Proof Techniques Used}
589
+ − 958
Straightforward and simple as the proof may seem,
+ − 959
the efforts we spent obtaining it was far from trivial.\\
+ − 960
We initially attempted to re-use the argument
+ − 961
in \cref{flex_retrieve}.
+ − 962
The problem was that both functions $\inj$ and $\retrieve$ require
+ − 963
that the annotated regular expressions stay unsimplified,
+ − 964
so that one can
+ − 965
correctly compare $v_{i+1}$ and $r_i$ and $v_i$
+ − 966
in diagram \ref{graph:inj} and
+ − 967
``fit the key into the lock hole''.
543
+ − 968
589
+ − 969
\noindent
+ − 970
We also tried to prove
+ − 971
\begin{center}
+ − 972
$\textit{bsimp} \;\; (\bderssimp{a}{s}) =
+ − 973
\textit{bsimp} \;\; (a\backslash s)$,
+ − 974
\end{center}
+ − 975
but this turns out to be not true.
+ − 976
A counterexample would be
+ − 977
\[ a = [(_{Z}1+_{S}c)\cdot [bb \cdot (_{Z}1+_{S}c)]] \;\;
+ − 978
\text{and} \;\; s = bb.
+ − 979
\]
+ − 980
\noindent
+ − 981
Then we would have
+ − 982
\begin{center}
+ − 983
$\textit{bsimp}\;\; ( a \backslash s )$ =
+ − 984
$_{[]}(_{ZZ}\ONE + _{ZS}c ) $
+ − 985
\end{center}
+ − 986
\noindent
+ − 987
whereas
+ − 988
\begin{center}
+ − 989
$\textit{bsimp} \;\;( \bderssimp{a}{s} )$ =
+ − 990
$_{Z}(_{Z} \ONE + _{S} c)$.
+ − 991
\end{center}
+ − 992
Unfortunately,
+ − 993
if we apply $\textit{bsimp}$ differently
+ − 994
we will always have this discrepancy.
+ − 995
This is due to
+ − 996
the $\map \; (\fuse\; bs) \; as$ operation
+ − 997
happening at different locations in the regular expression.\\
+ − 998
The rewriting relation
+ − 999
$\rightsquigarrow^*$
+ − 1000
allows us to ignore this discrepancy
+ − 1001
and view the expressions
+ − 1002
\begin{center}
+ − 1003
$_{[]}(_{ZZ}\ONE + _{ZS}c ) $\\
+ − 1004
and\\
+ − 1005
$_{Z}(_{Z} \ONE + _{S} c)$
543
+ − 1006
589
+ − 1007
\end{center}
+ − 1008
as equal, because they were both re-written
+ − 1009
from the same expression.\\
+ − 1010
Having correctness property is good.
+ − 1011
But we would also a guarantee that the lexer is not slow in
+ − 1012
some sense, for exampe, not grinding to a halt regardless of the input.
+ − 1013
As we have already seen, Sulzmann and Lu's simplification function
+ − 1014
$\simpsulz$ cannot achieve this, because their claim that
+ − 1015
the regular expression size does not grow arbitrary large
+ − 1016
was not true.
+ − 1017
In the next chapter we shall prove that with our $\simp$,
+ − 1018
for a given $r$, the internal derivative size is always
543
+ − 1019
finitely bounded by a constant.