532
+ − 1
% Chapter Template
+ − 2
+ − 3
% Main chapter title
+ − 4
\chapter{Correctness of Bit-coded Algorithm with Simplification}
+ − 5
+ − 6
\label{Bitcoded2} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
+ − 7
%Then we illustrate how the algorithm without bitcodes falls short for such aggressive
+ − 8
%simplifications and therefore introduce our version of the bitcoded algorithm and
+ − 9
%its correctness proof in
+ − 10
%Chapter 3\ref{Chapter3}.
+ − 11
+ − 12
+ − 13
600
+ − 14
In this chapter we introduce simplifications
583
+ − 15
on annotated regular expressions that can be applied to
+ − 16
each intermediate derivative result. This allows
+ − 17
us to make $\blexer$ much more efficient.
600
+ − 18
Sulzmann and Lu already had some bit-coded simplifications,
+ − 19
but their simplification functions were inefficient.
+ − 20
We contrast our simplification function
+ − 21
with Sulzmann and Lu's, indicating the simplicity of our algorithm.
+ − 22
This is another case for the usefulness
+ − 23
and reliability of formal proofs on algorithms.
583
+ − 24
These ``aggressive'' simplifications would not be possible in the injection-based
+ − 25
lexing we introduced in chapter \ref{Inj}.
600
+ − 26
We then prove the correctness with the improved version of
583
+ − 27
$\blexer$, called $\blexersimp$, by establishing
+ − 28
$\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system.
538
+ − 29
585
+ − 30
\section{Simplifications by Sulzmann and Lu}
600
+ − 31
Consider the derivatives of examples such as $(a^*a^*)^*$
+ − 32
and $(a^* + (aa)^*)^*$:
543
+ − 33
\begin{center}
585
+ − 34
$(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} $\\
+ − 35
$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} \ldots$
543
+ − 36
\end{center}
+ − 37
\noindent
600
+ − 38
As can be seen, there is a lot of duplication
+ − 39
in the example we have already mentioned in
+ − 40
\ref{eqn:growth2}.
+ − 41
A simple-minded simplification function cannot simplify
585
+ − 42
the third regular expression in the above chain of derivative
600
+ − 43
regular expressions, namely
583
+ − 44
\begin{center}
579
+ − 45
$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
583
+ − 46
\end{center}
600
+ − 47
because the duplicates are
+ − 48
not next to each other and therefore the rule
+ − 49
$r+ r \rightarrow r$ does not fire.
+ − 50
One would expect a better simplification function to work in the
579
+ − 51
following way:
+ − 52
\begin{gather*}
+ − 53
((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* +
+ − 54
\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\
+ − 55
\bigg\downarrow \\
+ − 56
(a^*a^* + a^*
+ − 57
\color{gray} + a^* \color{black})\cdot(a^*a^*)^* +
+ − 58
\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this} \\
+ − 59
\bigg\downarrow \\
+ − 60
(a^*a^* + a^*
+ − 61
)\cdot(a^*a^*)^*
583
+ − 62
\color{gray} + (a^*a^* + a^*) \cdot(a^*a^*)^*\\
+ − 63
\bigg\downarrow \\
+ − 64
(a^*a^* + a^*
+ − 65
)\cdot(a^*a^*)^*
579
+ − 66
\end{gather*}
+ − 67
\noindent
600
+ − 68
In the first step, the nested alternative regular expression
+ − 69
$(a^*a^* + a^*) + a^*$ is flattened into $a^*a^* + a^* + a^*$.
+ − 70
Now the third term $a^*$ is clearly identified as a duplicate
+ − 71
and therefore removed in the second step. This causes the two
+ − 72
top-level terms to become the same and the second $(a^*a^*+a^*)\cdot(a^*a^*)^*$
+ − 73
removed in the final step.\\
+ − 74
This motivating example is from testing Sulzmann and Lu's
579
+ − 75
algorithm: their simplification does
+ − 76
not work!
600
+ − 77
Consider their simplification (using our notations):
579
+ − 78
\begin{center}
+ − 79
\begin{tabular}{lcl}
585
+ − 80
$\simpsulz \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ &
579
+ − 81
$\textit{if} \; (\textit{zeroable} \; r)\; \textit{then} \;\; \ZERO$\\
+ − 82
& &$\textit{else}\;\; \fuse \; (bs@ bs') \; r$\\
585
+ − 83
$\simpsulz \;(_{bs}r_1\cdot r_2)$ & $\dn$ & $\textit{if}
579
+ − 84
\; (\textit{zeroable} \; r_1 \; \textit{or} \; \textit{zeroable}\; r_2)\;
+ − 85
\textit{then} \;\; \ZERO$\\
585
+ − 86
& & $\textit{else}\;\;_{bs}((\simpsulz \;r_1)\cdot
+ − 87
(\simpsulz \; r_2))$\\
+ − 88
$\simpsulz \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\
+ − 89
$\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
579
+ − 90
$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
585
+ − 91
$\simpsulz \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\simpsulz \; r)$\\
+ − 92
$\simpsulz \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum
+ − 93
(\nub \; (\filter \; (\not \circ \zeroable)\;((\simpsulz \; r) :: \map \; \simpsulz \; rs)))$\\
579
+ − 94
+ − 95
\end{tabular}
+ − 96
\end{center}
+ − 97
\noindent
600
+ − 98
where the $\textit{zeroable}$ predicate
+ − 99
tests whether the regular expression
579
+ − 100
is equivalent to $\ZERO$,
600
+ − 101
can be defined as:
579
+ − 102
\begin{center}
+ − 103
\begin{tabular}{lcl}
+ − 104
$\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\;
+ − 105
\zeroable \;_{[]}\sum\;rs $\\
+ − 106
$\zeroable\;_{bs}(r_1 \cdot r_2)$ & $\dn$ & $\zeroable\; r_1 \;\; \lor \;\; \zeroable \; r_2$\\
+ − 107
$\zeroable\;_{bs}r^*$ & $\dn$ & $\textit{false}$ \\
+ − 108
$\zeroable\;_{bs}c$ & $\dn$ & $\textit{false}$\\
+ − 109
$\zeroable\;_{bs}\ONE$ & $\dn$ & $\textit{false}$\\
+ − 110
$\zeroable\;_{bs}\ZERO$ & $\dn$ & $\textit{true}$
+ − 111
\end{tabular}
+ − 112
\end{center}
+ − 113
\noindent
585
+ − 114
They suggested that the $\simpsulz $ function should be
584
+ − 115
applied repeatedly until a fixpoint is reached.
585
+ − 116
We call this construction $\textit{sulzSimp}$:
+ − 117
\begin{center}
+ − 118
\begin{tabular}{lcl}
+ − 119
$\textit{sulzSimp} \; r$ & $\dn$ &
+ − 120
$\textit{while}((\simpsulz \; r)\; \cancel{=} \; r)$ \\
+ − 121
& & $\quad r := \simpsulz \; r$\\
+ − 122
& & $\textit{return} \; r$
+ − 123
\end{tabular}
+ − 124
\end{center}
+ − 125
We call the operation of alternatingly
+ − 126
applying derivatives and simplifications
+ − 127
(until the string is exhausted) Sulz-simp-derivative,
+ − 128
written $\backslash_{sulzSimp}$:
+ − 129
\begin{center}
+ − 130
\begin{tabular}{lcl}
+ − 131
$r \backslash_{sulzSimp} (c\!::\!s) $ & $\dn$ & $(\textit{sulzSimp} \; (r \backslash c)) \backslash_{sulzSimp}\, s$ \\
+ − 132
$r \backslash_{sulzSimp} [\,] $ & $\dn$ & $r$
+ − 133
\end{tabular}
+ − 134
\end{center}
+ − 135
\noindent
+ − 136
After the derivatives have been taken, the bitcodes
+ − 137
are extracted and decoded in the same manner
+ − 138
as $\blexer$:
+ − 139
\begin{center}
+ − 140
\begin{tabular}{lcl}
+ − 141
$\textit{blexer\_sulzSimp}\;r\,s$ & $\dn$ &
+ − 142
$\textit{let}\;a = (r^\uparrow)\backslash_{sulzSimp}\, s\;\textit{in}$\\
+ − 143
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
+ − 144
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+ − 145
& & $\;\;\textit{else}\;\textit{None}$
+ − 146
\end{tabular}
+ − 147
\end{center}
+ − 148
\noindent
+ − 149
We implemented this lexing algorithm in Scala,
+ − 150
and found that the final derivative regular expression
584
+ − 151
size grows exponentially fast:
+ − 152
\begin{figure}[H]
585
+ − 153
\centering
584
+ − 154
\begin{tikzpicture}
+ − 155
\begin{axis}[
+ − 156
xlabel={$n$},
585
+ − 157
ylabel={size},
584
+ − 158
ymode = log,
585
+ − 159
legend entries={Final Derivative Size},
584
+ − 160
legend pos=north west,
+ − 161
legend cell align=left]
+ − 162
\addplot[red,mark=*, mark options={fill=white}] table {SulzmannLuLexer.data};
+ − 163
\end{axis}
+ − 164
\end{tikzpicture}
585
+ − 165
\caption{Lexing the regular expression $(a^*a^*)^*$ against strings of the form
584
+ − 166
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
585
+ − 167
$ using Sulzmann and Lu's lexer}\label{SulzmannLuLexer}
584
+ − 168
\end{figure}
+ − 169
\noindent
+ − 170
At $n= 20$ we already get an out of memory error with Scala's normal
585
+ − 171
JVM heap size settings.
+ − 172
In fact their simplification does not improve over
+ − 173
the simple-minded simplifications we have shown in \ref{fig:BetterWaterloo}.
+ − 174
The time required also grows exponentially:
+ − 175
\begin{figure}[H]
+ − 176
\centering
+ − 177
\begin{tikzpicture}
+ − 178
\begin{axis}[
+ − 179
xlabel={$n$},
+ − 180
ylabel={time},
+ − 181
ymode = log,
+ − 182
legend entries={time in secs},
+ − 183
legend pos=north west,
+ − 184
legend cell align=left]
+ − 185
\addplot[red,mark=*, mark options={fill=white}] table {SulzmannLuLexerTime.data};
+ − 186
\end{axis}
+ − 187
\end{tikzpicture}
+ − 188
\caption{Lexing the regular expression $(a^*a^*)^*$ against strings of the form
+ − 189
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
+ − 190
$ using Sulzmann and Lu's lexer}\label{SulzmannLuLexerTime}
+ − 191
\end{figure}
+ − 192
\noindent
+ − 193
which seems like a counterexample for
+ − 194
their linear complexity claim:
584
+ − 195
\begin{quote}\it
585
+ − 196
Linear-Time Complexity Claim \\It is easy to see that each call of one of the functions/operations:
+ − 197
simp, fuse, mkEpsBC and isPhi leads to subcalls whose number is bound by the size of the regular expression involved. We claim that thanks to aggressively applying simp this size remains finite. Hence, we can argue that the above mentioned functions/operations have constant time complexity which implies that we can incrementally compute bit-coded parse trees in linear time in the size of the input.
+ − 198
\end{quote}
+ − 199
\noindent
+ − 200
The assumption that the size of the regular expressions
+ − 201
in the algorithm
+ − 202
would stay below a finite constant is not ture.
600
+ − 203
The main reason behind this is that (i) The $\textit{nub}$
+ − 204
function requires identical annotations between two
+ − 205
annotated regular expressions to qualify as duplicates,
+ − 206
and cannot simplify the cases like $_{SZZ}a^*+_{SZS}a^*$
+ − 207
even if both $a^*$ denote the same language.
+ − 208
(ii) The ``flattening'' only applies to the head of the list
+ − 209
in the
+ − 210
\begin{center}
+ − 211
\begin{tabular}{lcl}
+ − 212
+ − 213
$\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
+ − 214
$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
+ − 215
\end{tabular}
+ − 216
\end{center}
+ − 217
\noindent
+ − 218
clause, and therefore is not thorough enough to simplify all
+ − 219
needed parts of the regular expression.\\
585
+ − 220
In addition to that, even if the regular expressions size
+ − 221
do stay finite, one has to take into account that
+ − 222
the $\simpsulz$ function is applied many times
+ − 223
in each derivative step, and that number is not necessarily
+ − 224
a constant with respect to the size of the regular expression.
+ − 225
To not get ``caught off guard'' by
+ − 226
these counterexamples,
+ − 227
one needs to be more careful when designing the
+ − 228
simplification function and making claims about them.
584
+ − 229
585
+ − 230
\section{Our $\textit{Simp}$ Function}
+ − 231
We will now introduce our simplification function,
+ − 232
by making a contrast with $\simpsulz$.
+ − 233
We describe
+ − 234
the ideas behind components in their algorithm
+ − 235
and why they fail to achieve the desired effect, followed
+ − 236
by our solution. These solutions come with correctness
+ − 237
statements that are backed up by formal proofs.
+ − 238
\subsection{Flattening Nested Alternatives}
600
+ − 239
The idea behind the clause
585
+ − 240
\begin{center}
+ − 241
$\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad
+ − 242
_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$
+ − 243
\end{center}
600
+ − 244
is that it allows
585
+ − 245
duplicate removal of regular expressions at different
600
+ − 246
``levels'' of alternatives.
585
+ − 247
For example, this would help with the
+ − 248
following simplification:
538
+ − 249
585
+ − 250
\begin{center}
+ − 251
$(a+r)+r \longrightarrow a+r$
+ − 252
\end{center}
+ − 253
The problem here is that only the head element
+ − 254
is ``spilled out'',
+ − 255
whereas we would want to flatten
600
+ − 256
an entire list to open up possibilities for further simplifications.
585
+ − 257
Not flattening the rest of the elements also means that
+ − 258
the later de-duplication processs
600
+ − 259
does not fully remove further duplicates.
585
+ − 260
For example,
+ − 261
using $\simpsulz$ we could not
+ − 262
simplify
+ − 263
\begin{center}
600
+ − 264
$((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*+
585
+ − 265
((a^*a^*)+a^*)\cdot (a^*a^*)^*$
+ − 266
\end{center}
+ − 267
due to the underlined part not in the first element
+ − 268
of the alternative.\\
+ − 269
We define a flatten operation that flattens not only
+ − 270
the first regular expression of an alternative,
+ − 271
but the entire list:
543
+ − 272
\begin{center}
+ − 273
\begin{tabular}{@{}lcl@{}}
+ − 274
$\textit{flts} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
+ − 275
(\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flts} \; as' $ \\
+ − 276
$\textit{flts} \; \ZERO :: as'$ & $\dn$ & $ \textit{flts} \; \textit{as'} $ \\
+ − 277
$\textit{flts} \; a :: as'$ & $\dn$ & $a :: \textit{flts} \; \textit{as'}$ \quad(otherwise)
+ − 278
\end{tabular}
+ − 279
\end{center}
+ − 280
\noindent
585
+ − 281
Our $\flts$ operation
+ − 282
also throws away $\ZERO$s
+ − 283
as they do not contribute to a lexing result.
+ − 284
\subsection{Duplicate Removal}
+ − 285
After flattening is done, we are ready to deduplicate.
+ − 286
The de-duplicate function is called $\distinctBy$,
+ − 287
and that is where we make our second improvement over
+ − 288
Sulzmann and Lu's.
+ − 289
The process goes as follows:
+ − 290
\begin{center}
+ − 291
$rs \stackrel{\textit{flts}}{\longrightarrow}
+ − 292
rs_{flat}
+ − 293
\xrightarrow{\distinctBy \;
+ − 294
rs_{flat} \; \rerases\; \varnothing} rs_{distinct}$
+ − 295
%\stackrel{\distinctBy \;
+ − 296
%rs_{flat} \; \erase\; \varnothing}{\longrightarrow} \; rs_{distinct}$
+ − 297
\end{center}
+ − 298
where the $\distinctBy$ function is defined as:
543
+ − 299
\begin{center}
+ − 300
\begin{tabular}{@{}lcl@{}}
+ − 301
$\distinctBy \; [] \; f\; acc $ & $ =$ & $ []$\\
585
+ − 302
$\distinctBy \; (x :: xs) \; f \; acc$ & $=$ & $\quad \textit{if} (f \; x \in acc)\;\; \textit{then} \;\; \distinctBy \; xs \; f \; acc$\\
+ − 303
& & $\quad \textit{else}\;\; x :: (\distinctBy \; xs \; f \; (\{f \; x\} \cup acc))$
543
+ − 304
\end{tabular}
+ − 305
\end{center}
+ − 306
\noindent
+ − 307
The reason we define a distinct function under a mapping $f$ is because
585
+ − 308
we want to eliminate regular expressions that are syntactically the same,
+ − 309
but with different bit-codes.
+ − 310
For example, we can remove the second $a^*a^*$ from
+ − 311
$_{ZSZ}a^*a^* + _{SZZ}a^*a^*$, because it
+ − 312
represents a match with shorter initial sub-match
+ − 313
(and therefore is definitely not POSIX),
+ − 314
and will be discarded by $\bmkeps$ later.
+ − 315
\begin{center}
+ − 316
$_{ZSZ}\underbrace{a^*}_{ZS:\; match \; 1\; times\quad}\underbrace{a^*}_{Z: \;match\; 1 \;times} +
+ − 317
_{SZZ}\underbrace{a^*}_{S: \; match \; 0 \; times\quad}\underbrace{a^*}_{ZZ: \; match \; 2 \; times}
+ − 318
$
+ − 319
\end{center}
+ − 320
%$_{bs1} r_1 + _{bs2} r_2 \text{where} (r_1)_{\downarrow} = (r_2)_{\downarrow}$
+ − 321
Due to the way our algorithm works,
+ − 322
the matches that conform to the POSIX standard
+ − 323
will always be placed further to the left. When we
+ − 324
traverse the list from left to right,
+ − 325
regular expressions we have already seen
+ − 326
will definitely not contribute to a POSIX value,
+ − 327
even if they are attached with different bitcodes.
+ − 328
These duplicates therefore need to be removed.
+ − 329
To achieve this, we call $\rerases$ as the function $f$ during the distinction
+ − 330
operation.\\
+ − 331
$\rerases$ is very similar to $\erase$, except that it preserves the structure
+ − 332
when erasing an alternative regular expression.
+ − 333
The reason why we use $\rerases$ instead of $\erase$ is that
+ − 334
it keeps the structures of alternative
+ − 335
annotated regular expressions
+ − 336
whereas $\erase$ would turn it back into a binary structure.
+ − 337
Not having to mess with the structure
590
+ − 338
greatly simplifies the finiteness proof in chapter
+ − 339
\ref{Finite} (we will follow up with more details there).
585
+ − 340
We give the definitions of $\rerases$ here together with
+ − 341
the new datatype used by $\rerases$ (as our plain
590
+ − 342
regular expression datatype does not allow non-binary alternatives).
585
+ − 343
For the moment the reader can just think of
+ − 344
$\rerases$ as $\erase$ and $\rrexp$ as plain regular expressions.
590
+ − 345
\begin{figure}[H]
+ − 346
\begin{center}
+ − 347
$\rrexp ::= \RZERO \mid \RONE
585
+ − 348
\mid \RCHAR{c}
+ − 349
\mid \RSEQ{r_1}{r_2}
+ − 350
\mid \RALTS{rs}
590
+ − 351
\mid \RSTAR{r} $
+ − 352
\end{center}
+ − 353
\caption{$\rrexp$: plain regular expressions, but with $\sum$ alternative
+ − 354
constructor}\label{rrexpDef}
+ − 355
\end{figure}
585
+ − 356
The notation of $\rerases$ also follows that of $\erase$,
+ − 357
which is a postfix operator written as a subscript,
+ − 358
except that it has an \emph{r} attached to it to distinguish against $\erase$:
+ − 359
\begin{center}
+ − 360
\begin{tabular}{lcl}
+ − 361
$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
+ − 362
$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
+ − 363
$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
+ − 364
$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
+ − 365
$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
+ − 366
$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$
+ − 367
\end{tabular}
+ − 368
\end{center}
+ − 369
+ − 370
\subsection{Putting Things Together}
543
+ − 371
A recursive definition of our simplification function
585
+ − 372
is given below:
+ − 373
%that looks somewhat similar to our Scala code is
538
+ − 374
\begin{center}
+ − 375
\begin{tabular}{@{}lcl@{}}
+ − 376
543
+ − 377
$\textit{bsimp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ \textit{bsimp}_{ASEQ} \; bs \;(\textit{bsimp} \; a_1) \; (\textit{bsimp} \; a_2) $ \\
585
+ − 378
$\textit{bsimp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{bsimp}_{ALTS} \; \textit{bs} \; (\textit{distinctBy} \; ( \textit{flatten} ( \textit{map} \; bsimp \; as)) \; \rerases \; \varnothing) $ \\
543
+ − 379
$\textit{bsimp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
538
+ − 380
\end{tabular}
+ − 381
\end{center}
+ − 382
+ − 383
\noindent
585
+ − 384
The simplification (named $\textit{bsimp}$ for \emph{b}it-coded)
+ − 385
does a pattern matching on the regular expression.
538
+ − 386
When it detected that the regular expression is an alternative or
543
+ − 387
sequence, it will try to simplify its children regular expressions
538
+ − 388
recursively and then see if one of the children turns into $\ZERO$ or
+ − 389
$\ONE$, which might trigger further simplification at the current level.
543
+ − 390
Current level simplifications are handled by the function $\textit{bsimp}_{ASEQ}$,
+ − 391
using rules such as $\ZERO \cdot r \rightarrow \ZERO$ and $\ONE \cdot r \rightarrow r$.
+ − 392
\begin{center}
+ − 393
\begin{tabular}{@{}lcl@{}}
+ − 394
$\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\
+ − 395
&&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
+ − 396
&&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
+ − 397
&&$\quad\textit{case} \; (_{bs1}\ONE, a_2') \Rightarrow \textit{fuse} \; (bs@bs_1) \; a_2'$ \\
+ − 398
&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$
+ − 399
\end{tabular}
+ − 400
\end{center}
538
+ − 401
\noindent
543
+ − 402
The most involved part is the $\sum$ clause, where we first call $\flts$ on
+ − 403
the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$.
+ − 404
and then call $\distinctBy$ on that list, the predicate determining whether two
585
+ − 405
elements are the same is $\rerases \; r_1 = \rerases\; r_2$.
543
+ − 406
Finally, depending on whether the regular expression list $as'$ has turned into a
+ − 407
singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{AALTS}$
+ − 408
decides whether to keep the current level constructor $\sum$ as it is, and
+ − 409
removes it when there are less than two elements:
+ − 410
\begin{center}
+ − 411
\begin{tabular}{lcl}
+ − 412
$\textit{bsimp}_{AALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\
+ − 413
&&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
+ − 414
&&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\
+ − 415
&&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\
+ − 416
\end{tabular}
+ − 417
+ − 418
\end{center}
+ − 419
Having defined the $\bsimp$ function,
+ − 420
we add it as a phase after a derivative is taken,
+ − 421
so it stays small:
+ − 422
\begin{center}
+ − 423
\begin{tabular}{lcl}
+ − 424
$r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$
+ − 425
\end{tabular}
+ − 426
\end{center}
585
+ − 427
%Following previous notations
+ − 428
%when extending from derivatives w.r.t.~character to derivative
+ − 429
%w.r.t.~string, we define the derivative that nests simplifications
+ − 430
%with derivatives:%\comment{simp in the [] case?}
+ − 431
We extend this from character to string:
538
+ − 432
\begin{center}
+ − 433
\begin{tabular}{lcl}
543
+ − 434
$r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\
+ − 435
$r \backslash_{bsimps} [\,] $ & $\dn$ & $r$
538
+ − 436
\end{tabular}
+ − 437
\end{center}
+ − 438
+ − 439
\noindent
585
+ − 440
The lexer that extracts bitcodes from the
+ − 441
derivatives with simplifications from our $\simp$ function
+ − 442
is called $\blexersimp$:
+ − 443
\begin{center}
538
+ − 444
\begin{tabular}{lcl}
+ − 445
$\textit{blexer\_simp}\;r\,s$ & $\dn$ &
+ − 446
$\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
+ − 447
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
+ − 448
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+ − 449
& & $\;\;\textit{else}\;\textit{None}$
+ − 450
\end{tabular}
+ − 451
\end{center}
+ − 452
+ − 453
\noindent
585
+ − 454
This algorithm keeps the regular expression size small.
538
+ − 455
+ − 456
600
+ − 457
\subsection{Examples $(a+aa)^*$ and $(a^*\cdot a^*)^*$
+ − 458
After Simplification}
+ − 459
Recall the
585
+ − 460
previous $(a^*a^*)^*$ example
+ − 461
where $\simpsulz$ could not
600
+ − 462
prevent the fast growth (over
585
+ − 463
3 million nodes just below $20$ input length)
600
+ − 464
will be reduced to just 15 and stays constant no matter how long the
585
+ − 465
input string is.
600
+ − 466
This is shown in the graphs below.
585
+ − 467
\begin{figure}[H]
543
+ − 468
\begin{center}
+ − 469
\begin{tabular}{ll}
+ − 470
\begin{tikzpicture}
+ − 471
\begin{axis}[
+ − 472
xlabel={$n$},
+ − 473
ylabel={derivative size},
+ − 474
width=7cm,
+ − 475
height=4cm,
+ − 476
legend entries={Lexer with $\textit{bsimp}$},
539
+ − 477
legend pos= south east,
+ − 478
legend cell align=left]
+ − 479
\addplot[red,mark=*, mark options={fill=white}] table {BitcodedLexer.data};
+ − 480
\end{axis}
+ − 481
\end{tikzpicture} %\label{fig:BitcodedLexer}
+ − 482
&
+ − 483
\begin{tikzpicture}
+ − 484
\begin{axis}[
+ − 485
xlabel={$n$},
+ − 486
ylabel={derivative size},
+ − 487
width = 7cm,
+ − 488
height = 4cm,
585
+ − 489
legend entries={Lexer with $\simpsulz$},
539
+ − 490
legend pos= north west,
+ − 491
legend cell align=left]
+ − 492
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
+ − 493
\end{axis}
+ − 494
\end{tikzpicture}
+ − 495
\end{tabular}
543
+ − 496
\end{center}
585
+ − 497
\caption{Our Improvement over Sulzmann and Lu's in terms of size}
+ − 498
\end{figure}
+ − 499
\noindent
+ − 500
Given the size difference, it is not
+ − 501
surprising that our $\blexersimp$ significantly outperforms
+ − 502
$\textit{blexer\_sulzSimp}$.
+ − 503
In the next section we are going to establish the
+ − 504
first important property of our lexer--the correctness.
543
+ − 505
%----------------------------------------------------------------------------------------
+ − 506
% SECTION rewrite relation
+ − 507
%----------------------------------------------------------------------------------------
585
+ − 508
\section{Correctness of $\blexersimp$}
+ − 509
In this section we give details
+ − 510
of the correctness proof of $\blexersimp$,
600
+ − 511
one of the contributions of this thesis.\\
585
+ − 512
We first introduce the rewriting relation \emph{rrewrite}
+ − 513
($\rrewrite$) between two regular expressions,
+ − 514
which expresses an atomic
600
+ − 515
simplification.
585
+ − 516
We then prove properties about
+ − 517
this rewriting relation and its reflexive transitive closure.
+ − 518
Finally we leverage these properties to show
+ − 519
an equivalence between the internal data structures of
+ − 520
$\blexer$ and $\blexersimp$.
+ − 521
+ − 522
\subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)}
576
+ − 523
In the $\blexer$'s correctness proof, we
600
+ − 524
did not directly derive the fact that $\blexer$ generates the POSIX value,
585
+ − 525
but first proved that $\blexer$ is linked with $\lexer$.
576
+ − 526
Then we re-use
585
+ − 527
the correctness of $\lexer$
+ − 528
to obtain
+ − 529
\begin{center}
+ − 530
$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$.
+ − 531
\end{center}
+ − 532
Here we apply this
+ − 533
modularised technique again
+ − 534
by first proving that
576
+ − 535
$\blexersimp \; r \; s $
+ − 536
produces the same output as $\blexer \; r\; s$,
585
+ − 537
and then piecing it together with
+ − 538
$\blexer$'s correctness to achieve our main
600
+ − 539
theorem:\footnote{ The case when
+ − 540
$s$ is not in $L \; r$, is routine to establish.}
576
+ − 541
\begin{center}
585
+ − 542
$(r, s) \rightarrow v \; \; \textit{iff} \;\; \blexersimp \; r \; s = v$
576
+ − 543
\end{center}
+ − 544
\noindent
585
+ − 545
The overall idea for the proof
+ − 546
of $\blexer \;r \;s = \blexersimp \; r \;s$
+ − 547
is that the transition from $r$ to $\textit{bsimp}\; r$ can be
+ − 548
broken down into finitely many rewrite steps:
+ − 549
\begin{center}
+ − 550
$r \rightsquigarrow^* \textit{bsimp} \; r$
+ − 551
\end{center}
+ − 552
where each rewrite step, written $\rightsquigarrow$,
+ − 553
is an ``atomic'' simplification that
600
+ − 554
is similar to a small-step reduction in operational semantics:
585
+ − 555
\begin{figure}[H]
+ − 556
\begin{mathpar}
+ − 557
\inferrule * [Right = $S\ZERO_l$]{\vspace{0em}}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\}
538
+ − 558
585
+ − 559
\inferrule * [Right = $S\ZERO_r$]{\vspace{0em}}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\}
543
+ − 560
585
+ − 561
\inferrule * [Right = $S_1$]{\vspace{0em}}{_{bs1} ((_{bs2} \ONE) \cdot r) \rightsquigarrow \fuse \; (bs_1 @ bs_2) \; r\\}\\
543
+ − 562
+ − 563
+ − 564
585
+ − 565
\inferrule * [Right = $SL$] {\\ r_1 \rightsquigarrow r_2}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_2 \cdot r_3\\}
543
+ − 566
585
+ − 567
\inferrule * [Right = $SR$] {\\ r_3 \rightsquigarrow r_4}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_1 \cdot r_4\\}\\
543
+ − 568
585
+ − 569
\inferrule * [Right = $A0$] {\vspace{0em}}{ _{bs}\sum [] \rightsquigarrow \ZERO}
543
+ − 570
585
+ − 571
\inferrule * [Right = $A1$] {\vspace{0em}}{ _{bs}\sum [a] \rightsquigarrow \fuse \; bs \; a}
543
+ − 572
585
+ − 573
\inferrule * [Right = $AL$] {\\ rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{_{bs}\sum rs_1 \rightsquigarrow rs_2}
543
+ − 574
585
+ − 575
\inferrule * [Right = $LE$] {\vspace{0em}}{ [] \stackrel{s}{\rightsquigarrow} []}
543
+ − 576
585
+ − 577
\inferrule * [Right = $LT$] {rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{ r :: rs_1 \stackrel{s}{\rightsquigarrow} r :: rs_2 }
543
+ − 578
585
+ − 579
\inferrule * [Right = $LH$] {r_1 \rightsquigarrow r_2}{ r_1 :: rs \stackrel{s}{\rightsquigarrow} r_2 :: rs}
543
+ − 580
585
+ − 581
\inferrule * [Right = $L\ZERO$] {\vspace{0em}}{\ZERO :: rs \stackrel{s}{\rightsquigarrow} rs}
543
+ − 582
585
+ − 583
\inferrule * [Right = $LS$] {\vspace{0em}}{_{bs} \sum (rs_1 :: rs_b) \stackrel{s}{\rightsquigarrow} ((\map \; (\fuse \; bs_1) \; rs_1) @ rsb) }
543
+ − 584
591
+ − 585
\inferrule * [Right = $LD$] {\\ \rerase{a_1} = \rerase{a_2}}{rs_a @ [a_1] @ rs_b @ [a_2] @ rs_c \stackrel{s}{\rightsquigarrow} rs_a @ [a_1] @ rs_b @ rs_c}
543
+ − 586
+ − 587
\end{mathpar}
585
+ − 588
\caption{
+ − 589
The rewrite rules that generate simplified regular expressions
+ − 590
in small steps: $r_1 \rightsquigarrow r_2$ is for bitcoded regular expressions
+ − 591
and $rs_1 \stackrel{s}{\rightsquigarrow} rs_2$ for
+ − 592
lists of bitcoded regular expressions.
+ − 593
Interesting is the LD rule that allows copies of regular expressions
+ − 594
to be removed provided a regular expression
+ − 595
earlier in the list can match the same strings.
+ − 596
}\label{rrewriteRules}
+ − 597
\end{figure}
+ − 598
\noindent
+ − 599
The rules such as $LT$ and $LH$ are for rewriting between two regular expression lists
+ − 600
such that one regular expression
+ − 601
in the left-hand-side list is rewritable in one step
+ − 602
to the right-hand-side's regular expression at the same position.
+ − 603
This helps with defining the ``context rules'' such as $AL$.\\
+ − 604
The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$
+ − 605
are defined in the usual way:
+ − 606
\begin{figure}[H]
+ − 607
\centering
+ − 608
\begin{mathpar}
+ − 609
\inferrule{\vspace{0em}}{ r \rightsquigarrow^* r \\}
543
+ − 610
585
+ − 611
\inferrule{\vspace{0em}}{rs \stackrel{s*}{\rightsquigarrow} rs \\}
543
+ − 612
585
+ − 613
\inferrule{r_1 \rightsquigarrow^* r_2 \land \; r_2 \rightsquigarrow^* r_3}{r_1 \rightsquigarrow^* r_3\\}
+ − 614
+ − 615
\inferrule{rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \land \; rs_2 \stackrel{s*}{\rightsquigarrow} rs_3}{rs_1 \stackrel{s*}{\rightsquigarrow} rs_3}
+ − 616
\end{mathpar}
+ − 617
\caption{The Reflexive Transitive Closure of
+ − 618
$\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$}\label{transClosure}
+ − 619
\end{figure}
600
+ − 620
%Two rewritable terms will remain rewritable to each other
+ − 621
%even after a derivative is taken:
+ − 622
Rewriting is preserved under derivatives,
+ − 623
namely
543
+ − 624
\begin{center}
585
+ − 625
$r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$
543
+ − 626
\end{center}
585
+ − 627
And finally, if two terms are rewritable to each other,
+ − 628
then they produce the same bitcodes:
+ − 629
\begin{center}
+ − 630
$r \rightsquigarrow^* r' \;\; \textit{then} \; \; \bmkeps \; r = \bmkeps \; r'$
+ − 631
\end{center}
+ − 632
The decoding phase of both $\blexer$ and $\blexersimp$
+ − 633
are the same, which means that if they get the same
+ − 634
bitcodes before the decoding phase,
+ − 635
they get the same value after decoding is done.
+ − 636
We will prove the three properties
+ − 637
we mentioned above in the next sub-section.
+ − 638
\subsection{Important Properties of $\rightsquigarrow$}
+ − 639
First we prove some basic facts
+ − 640
about $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$,
+ − 641
$\rightsquigarrow^*$ and $\stackrel{s*}{\rightsquigarrow}$,
+ − 642
which will be needed later.\\
+ − 643
The inference rules (\ref{rrewriteRules}) we
+ − 644
gave in the previous section
+ − 645
have their ``many-steps version'':
543
+ − 646
586
+ − 647
\begin{lemma}\label{squig1}
585
+ − 648
\hspace{0em}
+ − 649
\begin{itemize}
+ − 650
\item
+ − 651
$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies _{bs} \sum rs_1 \stackrel{*}{\rightsquigarrow} _{bs} \sum rs_2$
+ − 652
\item
586
+ − 653
$r \rightsquigarrow^* r' \implies _{bs} \sum (r :: rs)\; \rightsquigarrow^*\; _{bs} \sum (r' :: rs)$
+ − 654
+ − 655
\item
+ − 656
The rewriting in many steps property is composible
+ − 657
in terms of the sequence constructor:\\
+ − 658
$r_1 \rightsquigarrow^* r_2
+ − 659
\implies _{bs} r_1 \cdot r_3 \rightsquigarrow^* \;
+ − 660
_{bs} r_2 \cdot r_3 \quad $
+ − 661
and
+ − 662
$\quad r_3 \rightsquigarrow^* r_4
+ − 663
\implies _{bs} r_1 \cdot r_3 \rightsquigarrow^* _{bs} \; r_1 \cdot r_4$
+ − 664
\item
+ − 665
The rewriting in many steps properties
+ − 666
$\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$
+ − 667
is preserved under the function $\fuse$:\\
+ − 668
$r_1 \rightsquigarrow^* r_2
+ − 669
\implies \fuse \; bs \; r_1 \rightsquigarrow^* \;
+ − 670
\fuse \; bs \; r_2 \quad $ and
+ − 671
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2
+ − 672
\implies \map \; (\fuse \; bs) \; rs_1
+ − 673
\stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs) \; rs_2$
585
+ − 674
\end{itemize}
543
+ − 675
\end{lemma}
+ − 676
\begin{proof}
585
+ − 677
By an induction on
+ − 678
the inductive cases of $\stackrel{s*}{\rightsquigarrow}$ and $\rightsquigarrow^*$ respectively.
586
+ − 679
The third and fourth points are
+ − 680
by the properties $r_1 \rightsquigarrow r_2 \implies \fuse \; bs \; r_1 \implies \fuse \; bs \; r_2$ and
+ − 681
$rs_2 \stackrel{s}{\rightsquigarrow} rs_3
+ − 682
\implies \map \; (\fuse \; bs) rs_2 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs)\; rs_3$,
+ − 683
which can be indutively proven by the inductive cases of $\rightsquigarrow$ and
+ − 684
$\stackrel{s}{\rightsquigarrow}$.
543
+ − 685
\end{proof}
+ − 686
\noindent
585
+ − 687
The inference rules of $\stackrel{s}{\rightsquigarrow}$
+ − 688
are defined in terms of list cons operation, here
+ − 689
we establish that the
+ − 690
$\stackrel{s}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$
586
+ − 691
relation is also preserved w.r.t appending and prepending of a list.
+ − 692
In addition, we
+ − 693
also prove some relations
+ − 694
between $\rightsquigarrow^*$ and $\stackrel{s*}{\rightsquigarrow}$.
585
+ − 695
\begin{lemma}\label{ssgqTossgs}
+ − 696
\hspace{0em}
+ − 697
\begin{itemize}
+ − 698
\item
+ − 699
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s}{\rightsquigarrow} rs @ rs_2$
+ − 700
+ − 701
\item
+ − 702
$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies
586
+ − 703
rs @ rs_1 \stackrel{s*}{\rightsquigarrow} rs @ rs_2 \; \;
+ − 704
\textit{and} \; \;
+ − 705
rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$
+ − 706
585
+ − 707
\item
+ − 708
The $\stackrel{s}{\rightsquigarrow} $ relation after appending
+ − 709
a list becomes $\stackrel{s*}{\rightsquigarrow}$:\\
586
+ − 710
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2
+ − 711
\implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$
+ − 712
\item
+ − 713
+ − 714
$r_1 \rightsquigarrow^* r_2 \implies [r_1] \stackrel{s*}{\rightsquigarrow} [r_2]$
+ − 715
\item
+ − 716
+ − 717
$rs_3 \stackrel{s*}{\rightsquigarrow} rs_4 \land r_1 \rightsquigarrow^* r_2 \implies
+ − 718
r_2 :: rs_3 \stackrel{s*}{\rightsquigarrow} r_2 :: rs_4$
+ − 719
\item
+ − 720
If we could rewrite a regular expression
+ − 721
in many steps to $\ZERO$, then
+ − 722
we could also rewrite any sequence containing it to $\ZERO$:\\
+ − 723
$r_1 \rightsquigarrow^* \ZERO
+ − 724
\implies _{bs}r_1\cdot r_2 \rightsquigarrow^* \ZERO$
585
+ − 725
\end{itemize}
543
+ − 726
\end{lemma}
+ − 727
\begin{proof}
585
+ − 728
The first part is by induction on the list $rs$.
+ − 729
The second part is by induction on the inductive cases of $\stackrel{s*}{\rightsquigarrow}$.
+ − 730
The third part is
+ − 731
by rule induction of $\stackrel{s}{\rightsquigarrow}$.
586
+ − 732
The fourth sub-lemma is
+ − 733
by rule induction of
+ − 734
$\stackrel{s*}{\rightsquigarrow}$ and using part one to three.
+ − 735
The fifth part is a corollary of part four.
+ − 736
The last part is proven by rule induction again on $\rightsquigarrow^*$.
543
+ − 737
\end{proof}
586
+ − 738
\noindent
+ − 739
Now we are ready to give the proofs of the below properties:
585
+ − 740
\begin{itemize}
+ − 741
\item
586
+ − 742
$(r \rightsquigarrow^* r'\land \bnullable \; r_1)
585
+ − 743
\implies \bmkeps \; r = \bmkeps \; r'$. \\
+ − 744
\item
+ − 745
$r \rightsquigarrow^* \textit{bsimp} \;r$.\\
+ − 746
\item
+ − 747
$r \rightsquigarrow r' \implies r \backslash c \rightsquigarrow^* r'\backslash c$.\\
+ − 748
\end{itemize}
+ − 749
These properties would work together towards the correctness theorem.
586
+ − 750
\subsubsection{Property 1: $(r \rightsquigarrow^* r'\land \bnullable \; r_1)
+ − 751
\implies \bmkeps \; r = \bmkeps \; r'$}
+ − 752
Intuitively, this property says we can
+ − 753
extract the same bitcodes using $\bmkeps$ from the nullable
+ − 754
components of two regular expressions $r$ and $r'$,
+ − 755
if we can rewrite from one to the other in finitely
+ − 756
many steps.\\
+ − 757
For convenience,
+ − 758
we define a predicate for a list of regular expressions
+ − 759
having at least one nullable regular expressions:
+ − 760
\begin{center}
+ − 761
$\textit{bnullables} \; rs \quad \dn \quad \exists r \in rs. \;\; \bnullable \; r$
+ − 762
\end{center}
+ − 763
\noindent
+ − 764
The rewriting relation $\rightsquigarrow$ preserves nullability:
+ − 765
\begin{lemma}\label{rewritesBnullable}
+ − 766
\hspace{0em}
+ − 767
\begin{itemize}
+ − 768
\item
+ − 769
$\text{If} \; r_1 \rightsquigarrow r_2, \;
+ − 770
\text{then} \; \bnullable \; r_1 = \bnullable \; r_2$
+ − 771
\item
+ − 772
$\text{If} \; rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \;
+ − 773
\text{then} \; \textit{bnullables} \; rs_1 = \textit{bnullables} \; rs_2$
+ − 774
\item
+ − 775
$r_1 \rightsquigarrow^* r_2
+ − 776
\implies \bnullable \; r_1 = \bnullable \; r_2$
+ − 777
\end{itemize}
+ − 778
\end{lemma}
+ − 779
\begin{proof}
+ − 780
By rule induction of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$.
+ − 781
The third point is a corollary of the second.
+ − 782
\end{proof}
+ − 783
\noindent
+ − 784
For convenience again,
+ − 785
we define $\bmkepss$ on a list $rs$,
+ − 786
which extracts the bit-codes on the first $\bnullable$ element in $rs$:
+ − 787
\begin{center}
+ − 788
\begin{tabular}{lcl}
+ − 789
$\bmkepss \; [] $ & $\dn$ & $[]$\\
+ − 790
$\bmkepss \; r :: rs$ & $\dn$ & $\textit{if} \;(\bnullable \; r) \;\;
+ − 791
\textit{then} \;\; \bmkeps \; r \; \textit{else} \;\; \bmkepss \; rs$
+ − 792
\end{tabular}
+ − 793
\end{center}
+ − 794
\noindent
+ − 795
If both regular expressions in a rewriting relation are nullable, then they
+ − 796
produce the same bitcodes:
+ − 797
\begin{lemma}\label{rewriteBmkepsAux}
+ − 798
\hspace{0em}
+ − 799
\begin{itemize}
+ − 800
\item
+ − 801
$r_1 \rightsquigarrow r_2 \implies
+ − 802
(\bnullable \; r_1 \land \bnullable \; r_2 \implies \bmkeps \; r_1 =
+ − 803
\bmkeps \; r_2)$
+ − 804
\item
+ − 805
and
+ − 806
$rs_ 1 \stackrel{s}{\rightsquigarrow} rs_2
+ − 807
\implies (\bnullables \; rs_1 \land \bnullables \; rs_2 \implies
+ − 808
\bmkepss \; rs_1 = \bmkepss \; rs2)$
+ − 809
\end{itemize}
+ − 810
\end{lemma}
+ − 811
\begin{proof}
+ − 812
By rule induction over the cases that lead to $r_1 \rightsquigarrow r_2$.
+ − 813
\end{proof}
+ − 814
\noindent
+ − 815
With lemma \ref{rewriteBmkepsAux} we are ready to prove its
+ − 816
many-step version:
+ − 817
\begin{lemma}
+ − 818
$\text{If} \;\; r \stackrel{*}{\rightsquigarrow} r' \;\; \text{and} \;\; \bnullable \; r, \;\;\; \text{then} \;\; \bmkeps \; r = \bmkeps \; r'$
+ − 819
\end{lemma}
+ − 820
\begin{proof}
+ − 821
By rule induction of $\stackrel{*}{\rightsquigarrow} $.
+ − 822
$\ref{rewritesBnullable}$ tells us both $r$ and $r'$ are nullable.
+ − 823
\ref{rewriteBmkepsAux} solves the inductive case.
+ − 824
\end{proof}
585
+ − 825
586
+ − 826
\subsubsection{Property 2: $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$}
+ − 827
Now we get to the ``meaty'' part of the proof,
+ − 828
which says that our simplification's helper functions
+ − 829
such as $\distinctBy$ and $\flts$ conform to
+ − 830
the $\stackrel{s*}{\rightsquigarrow}$ and
+ − 831
$\rightsquigarrow^* $ rewriting relations.\\
+ − 832
The first lemma to prove is a more general version of
+ − 833
$rs_ 1 \rightsquigarrow^* \distinctBy \; rs_1 \; \phi$:
543
+ − 834
\begin{lemma}
586
+ − 835
$rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow}
+ − 836
(rs_1 @ (\distinctBy \; rs_2 \; \; \rerases \;\; (\map\;\; \rerases \; \; rs_1)))$
543
+ − 837
\end{lemma}
586
+ − 838
\noindent
+ − 839
It says that that for a list made of two parts $rs_1 @ rs_2$,
+ − 840
one can throw away the duplicate
+ − 841
elements in $rs_2$, as well as those that have appeared in $rs_1$.
543
+ − 842
\begin{proof}
+ − 843
By induction on $rs_2$, where $rs_1$ is allowed to be arbitrary.
+ − 844
\end{proof}
586
+ − 845
\noindent
+ − 846
Setting $rs_2$ to be empty,
+ − 847
we get the corollary
+ − 848
\begin{corollary}\label{dBPreserves}
+ − 849
$rs_1 \stackrel{s*}{\rightsquigarrow} \distinctBy \; rs_1 \; \phi$.
+ − 850
\end{corollary}
+ − 851
\noindent
+ − 852
The flatten function $\flts$ conforms to
+ − 853
$\stackrel{s*}{\rightsquigarrow}$ as well:
538
+ − 854
543
+ − 855
\begin{lemma}\label{fltsPreserves}
+ − 856
$rs \stackrel{s*}{\rightsquigarrow} \flts \; rs$
+ − 857
\end{lemma}
+ − 858
\begin{proof}
586
+ − 859
By an induction on $rs$.
543
+ − 860
\end{proof}
+ − 861
\noindent
+ − 862
The function $\bsimpalts$ preserves rewritability:
+ − 863
\begin{lemma}\label{bsimpaltsPreserves}
+ − 864
$_{bs} \sum rs \stackrel{*}{\rightsquigarrow} \bsimpalts \; _{bs} \; rs$
+ − 865
\end{lemma}
+ − 866
\noindent
586
+ − 867
The simplification function
+ − 868
$\textit{bsimp}$ only transforms the regex $r$ using steps specified by
+ − 869
$\rightsquigarrow^*$ and nothing else.
543
+ − 870
\begin{lemma}
+ − 871
$r \stackrel{*}{\rightsquigarrow} \bsimp{r}$
+ − 872
\end{lemma}
+ − 873
\begin{proof}
+ − 874
By an induction on $r$.
586
+ − 875
The most involved case would be the alternative,
+ − 876
where we use lemmas \ref{bsimpaltsPreserves},
+ − 877
\ref{fltsPreserves} and \ref{dBPreserves} to do a series of rewriting:\\
+ − 878
\begin{center}
+ − 879
\begin{tabular}{lcl}
+ − 880
$rs$ & $\stackrel{s*}{\rightsquigarrow}$ & $ \map \; \textit{bsimp} \; rs$\\
+ − 881
& $\stackrel{s*}{\rightsquigarrow}$ & $ \flts \; (\map \; \textit{bsimp} \; rs)$\\
+ − 882
& $\stackrel{s*}{\rightsquigarrow}$ & $ \distinctBy \;
+ − 883
(\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerases \; \phi$\\
+ − 884
\end{tabular}
+ − 885
\end{center}
+ − 886
Using this we derive the following rewrite relation:\\
+ − 887
\begin{center}
+ − 888
\begin{tabular}{lcl}
+ − 889
$r$ & $=$ & $_{bs}\sum rs$\\[1.5ex]
+ − 890
& $\rightsquigarrow^*$ & $\bsimpalts \; bs \; rs$ \\[1.5ex]
+ − 891
& $\rightsquigarrow^*$ & $\ldots$ \\ [1.5ex]
+ − 892
& $\rightsquigarrow^*$ & $\bsimpalts \; bs \;
+ − 893
(\distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs))
+ − 894
\; \rerases \; \phi)$\\[1.5ex]
+ − 895
%& $\rightsquigarrow^*$ & $ _{bs} \sum (\distinctBy \;
+ − 896
%(\flts \; (\map \; \textit{bsimp}\; rs)) \; \;
+ − 897
%\rerases \; \;\phi) $\\[1.5ex]
+ − 898
& $\rightsquigarrow^*$ & $\textit{bsimp} \; r$\\[1.5ex]
+ − 899
\end{tabular}
+ − 900
\end{center}
543
+ − 901
\end{proof}
585
+ − 902
\subsubsection{Property 3: $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$}
586
+ − 903
The rewritability relation
+ − 904
$\rightsquigarrow$ is preserved under derivatives--
+ − 905
it is just that we might need multiple steps
588
+ − 906
where originally only one step was needed:
543
+ − 907
\begin{lemma}\label{rewriteBder}
588
+ − 908
\hspace{0em}
+ − 909
\begin{itemize}
+ − 910
\item
+ − 911
If $r_1 \rightsquigarrow r_2$, then $r_1 \backslash c
+ − 912
\rightsquigarrow^* r_2 \backslash c$
+ − 913
\item
+ − 914
If $rs_1 \stackrel{s}{\rightsquigarrow} rs_2$, then $
+ − 915
\map \; (\_\backslash c) \; rs_1
+ − 916
\stackrel{s*}{\rightsquigarrow} \map \; (\_ \backslash c) \; rs_2$
+ − 917
\end{itemize}
543
+ − 918
\end{lemma}
+ − 919
\begin{proof}
588
+ − 920
By induction on $\rightsquigarrow$
+ − 921
and $\stackrel{s}{\rightsquigarrow}$, using a number of the previous lemmas.
+ − 922
\end{proof}
+ − 923
\noindent
+ − 924
Now we can prove property 3, as an immediate corollary:
+ − 925
\begin{corollary}\label{rewritesBder}
+ − 926
$r_1 \rightsquigarrow^* r_2 \implies r_1 \backslash c \rightsquigarrow^*
+ − 927
r_2 \backslash c$
+ − 928
\end{corollary}
+ − 929
\begin{proof}
+ − 930
By rule induction of $\stackrel{*}{\rightsquigarrow} $ and using the previous lemma \ref{rewriteBder}.
543
+ − 931
\end{proof}
+ − 932
\noindent
588
+ − 933
This can be extended and combined with $r \rightsquigarrow^* \textit{bsimp} \; r$
+ − 934
to obtain the rewritability between
+ − 935
$\blexer$ and $\blexersimp$'s intermediate
+ − 936
derivative regular expressions
543
+ − 937
\begin{lemma}\label{bderBderssimp}
588
+ − 938
$a \backslash s \rightsquigarrow^* \bderssimp{a}{s} $
543
+ − 939
\end{lemma}
588
+ − 940
\begin{proof}
+ − 941
By an induction on $s$.
+ − 942
\end{proof}
543
+ − 943
\subsection{Main Theorem}
588
+ − 944
Now with \ref{bderBderssimp} we are ready for the main theorem.
543
+ − 945
\begin{theorem}
+ − 946
$\blexer \; r \; s = \blexersimp{r}{s}$
+ − 947
\end{theorem}
+ − 948
\noindent
576
+ − 949
\begin{proof}
588
+ − 950
One can rewrite in many steps from the original lexer's
+ − 951
derivative regular expressions to the
+ − 952
lexer with simplification applied (by lemma \ref{bderBderssimp}):
+ − 953
\begin{center}
+ − 954
$a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $.
+ − 955
\end{center}
+ − 956
we know that they give out the same bits, if the lexing result is a match:
+ − 957
\begin{center}
+ − 958
$\bnullable \; (a \backslash s)
+ − 959
\implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$
+ − 960
\end{center}
+ − 961
Now that they give out the same bits, we know that they give the same value after decoding.
+ − 962
\begin{center}
+ − 963
$\bnullable \; (a \backslash s)
+ − 964
\implies \decode \; r \; (\bmkeps \; (a \backslash s)) =
+ − 965
\decode \; r \; (\bmkeps \; (\bderssimp{a}{s}))$
+ − 966
\end{center}
+ − 967
Which is equivalent to our proof goal:
+ − 968
\begin{center}
+ − 969
$\blexer \; r \; s = \blexersimp \; r \; s$.
+ − 970
\end{center}
576
+ − 971
\end{proof}
+ − 972
\noindent
+ − 973
As a corollary,
+ − 974
we link this result with the lemma we proved earlier that
+ − 975
\begin{center}
589
+ − 976
$(r, s) \rightarrow v \;\; \textit{iff}\;\; \blexer \; r \; s = v$
576
+ − 977
\end{center}
+ − 978
and obtain the corollary that the bit-coded lexer with simplification is
+ − 979
indeed correctly outputting POSIX lexing result, if such a result exists.
+ − 980
\begin{corollary}
589
+ − 981
$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s $
576
+ − 982
\end{corollary}
532
+ − 983
543
+ − 984
\subsection{Comments on the Proof Techniques Used}
589
+ − 985
Straightforward and simple as the proof may seem,
600
+ − 986
the efforts we spent obtaining it were far from trivial.\\
589
+ − 987
We initially attempted to re-use the argument
+ − 988
in \cref{flex_retrieve}.
+ − 989
The problem was that both functions $\inj$ and $\retrieve$ require
+ − 990
that the annotated regular expressions stay unsimplified,
+ − 991
so that one can
+ − 992
correctly compare $v_{i+1}$ and $r_i$ and $v_i$
+ − 993
in diagram \ref{graph:inj} and
+ − 994
``fit the key into the lock hole''.
543
+ − 995
589
+ − 996
\noindent
+ − 997
We also tried to prove
+ − 998
\begin{center}
+ − 999
$\textit{bsimp} \;\; (\bderssimp{a}{s}) =
+ − 1000
\textit{bsimp} \;\; (a\backslash s)$,
+ − 1001
\end{center}
+ − 1002
but this turns out to be not true.
+ − 1003
A counterexample would be
+ − 1004
\[ a = [(_{Z}1+_{S}c)\cdot [bb \cdot (_{Z}1+_{S}c)]] \;\;
+ − 1005
\text{and} \;\; s = bb.
+ − 1006
\]
+ − 1007
\noindent
+ − 1008
Then we would have
+ − 1009
\begin{center}
+ − 1010
$\textit{bsimp}\;\; ( a \backslash s )$ =
+ − 1011
$_{[]}(_{ZZ}\ONE + _{ZS}c ) $
+ − 1012
\end{center}
+ − 1013
\noindent
+ − 1014
whereas
+ − 1015
\begin{center}
+ − 1016
$\textit{bsimp} \;\;( \bderssimp{a}{s} )$ =
+ − 1017
$_{Z}(_{Z} \ONE + _{S} c)$.
+ − 1018
\end{center}
+ − 1019
Unfortunately,
+ − 1020
if we apply $\textit{bsimp}$ differently
+ − 1021
we will always have this discrepancy.
+ − 1022
This is due to
+ − 1023
the $\map \; (\fuse\; bs) \; as$ operation
+ − 1024
happening at different locations in the regular expression.\\
+ − 1025
The rewriting relation
+ − 1026
$\rightsquigarrow^*$
+ − 1027
allows us to ignore this discrepancy
+ − 1028
and view the expressions
+ − 1029
\begin{center}
+ − 1030
$_{[]}(_{ZZ}\ONE + _{ZS}c ) $\\
+ − 1031
and\\
+ − 1032
$_{Z}(_{Z} \ONE + _{S} c)$
543
+ − 1033
589
+ − 1034
\end{center}
+ − 1035
as equal, because they were both re-written
+ − 1036
from the same expression.\\
600
+ − 1037
The simplification rewriting rules
+ − 1038
given in \ref{rrewriteRules} are by no means
+ − 1039
final,
+ − 1040
one could come up new rules
+ − 1041
such as
+ − 1042
$\SEQ r_1 \cdot (\SEQ r_1 \cdot r_3) \rightarrow
+ − 1043
\SEQs [r_1, r_2, r_3]$.
+ − 1044
This does not fit with the proof technique
+ − 1045
of our main theorem, but seem to not violate the POSIX
+ − 1046
property.\\
589
+ − 1047
Having correctness property is good.
+ − 1048
But we would also a guarantee that the lexer is not slow in
+ − 1049
some sense, for exampe, not grinding to a halt regardless of the input.
+ − 1050
As we have already seen, Sulzmann and Lu's simplification function
+ − 1051
$\simpsulz$ cannot achieve this, because their claim that
+ − 1052
the regular expression size does not grow arbitrary large
+ − 1053
was not true.
+ − 1054
In the next chapter we shall prove that with our $\simp$,
+ − 1055
for a given $r$, the internal derivative size is always
543
+ − 1056
finitely bounded by a constant.