% Chapter Template+ −
+ −
% Main chapter title+ −
\chapter{Correctness of Bit-coded Algorithm with Simplification}+ −
+ −
\label{Bitcoded2} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}+ −
%Then we illustrate how the algorithm without bitcodes falls short for such aggressive + −
%simplifications and therefore introduce our version of the bitcoded algorithm and + −
%its correctness proof in + −
%Chapter 3\ref{Chapter3}. + −
%\section{Overview}+ −
\marginpar{\em Added a completely new \\overview section, \\highlighting\\ contributions.}+ −
+ −
This chapter+ −
is the point from which novel contributions of this PhD project are introduced+ −
in detail. + −
The material in the+ −
previous+ −
chapters is necessary for this thesis,+ −
because it provides the context for why we need a new framework for+ −
the proof of $\blexersimp$.+ −
+ −
We will first introduce why aggressive simplifications are needed, after which we+ −
provide our algorithm, contrasting with Sulzmann and Lu's simplifications.+ −
We then explain how our simplifications make+ −
reusing $\blexer$'s correctness proof impossible.+ −
%with some minor modifications+ −
We discuss possible fixes such as rectification functions and then introduce our proof, + −
which involves a weaker inductive+ −
invariant than that used in the correctness proof of $\blexer$.+ −
+ −
\marginpar{Shortened overview.}+ −
%material for setting the scene of the formal proof we+ −
%are about to describe.+ −
\section{Simplifications by Sulzmann and Lu}+ −
\marginpar{moved \\simplification \\section to front \\to make coherent\\ sense.}+ −
The algorithms $\lexer$ and $\blexer$ work beautifully as functional + −
programs, but not as practical code. One main reason for the slowness is due+ −
to the size of intermediate representations--the derivative regular expressions+ −
tend to grow unbounded if the matching involved a large number of possible matches.+ −
Consider the derivatives of the following example $(a^*a^*)^*$:+ −
%and $(a^* + (aa)^*)^*$:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$(a^*a^*)^*$ & $ \stackrel{\backslash a}{\longrightarrow}$ & + −
$ (a^*a^* + a^*)\cdot(a^*a^*)^*$\\+ −
& + −
$ \stackrel{\backslash a}{\longrightarrow} $ & + −
$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$\\+ −
& $\stackrel{\backslash a}{+ −
\longrightarrow} $ & $\ldots$\\+ −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
From the second derivative several duplicate sub-expressions + −
already needs to be eliminated (possible+ −
bitcodes are omitted to make the presentation more concise+ −
because they are not the key part of the simplifications).+ −
A simple-minded simplification function cannot simplify+ −
the third regular expression in the above chain of derivative+ −
regular expressions, namely+ −
\begin{center}+ −
$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$+ −
\end{center}+ −
because the duplicates are+ −
not next to each other, and therefore the rule+ −
$r+ r \rightarrow r$ from $\textit{simp}$ does not fire.+ −
One would expect a better simplification function to work in the + −
following way:+ −
\begin{gather*}+ −
((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + + −
\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\+ −
\bigg\downarrow (1) \\+ −
(a^*a^* + a^* + −
\color{gray} + a^* \color{black})\cdot(a^*a^*)^* + + −
\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this} \\+ −
\bigg\downarrow (2) \\+ −
(a^*a^* + a^* + −
)\cdot(a^*a^*)^* + −
\color{gray} + (a^*a^* + a^*) \cdot(a^*a^*)^*\\+ −
\bigg\downarrow (3) \\+ −
(a^*a^* + a^* + −
)\cdot(a^*a^*)^* + −
\end{gather*}+ −
\noindent+ −
In the first step, the nested alternative regular expression+ −
$(a^*a^* + a^*) + a^*$ is flattened into $a^*a^* + a^* + a^*$.+ −
Now the third term $a^*$ can clearly be identified as a duplicate+ −
and therefore removed in the second step. + −
This causes the two+ −
top-level terms to become the same and the second $(a^*a^*+a^*)\cdot(a^*a^*)^*$ + −
removed in the final step.+ −
Sulzmann and Lu's simplification function (using our notations) can achieve this+ −
simplification:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{simp}\_{SL} \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ & + −
$\textit{if} \; (\textit{zeroable} \; r)\; \textit{then} \;\; \ZERO$\\+ −
& &$\textit{else}\;\; \fuse \; (bs@ bs') \; r$\\+ −
$\textit{simp}\_{SL} \;(_{bs}r_1\cdot r_2)$ & $\dn$ & $\textit{if} + −
\; (\textit{zeroable} \; r_1 \; \textit{or} \; \textit{zeroable}\; r_2)\;+ −
\textit{then} \;\; \ZERO$\\+ −
& & $\textit{else}\;\;_{bs}((\textit{simp}\_{SL} \;r_1)\cdot+ −
(\textit{simp}\_{SL} \; r_2))$\\+ −
$\textit{simp}\_{SL} \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\+ −
$\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &+ −
$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\+ −
$\textit{simp}\_{SL} \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\textit{simp}\_{SL} \; r)$\\+ −
$\textit{simp}\_{SL} \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum + −
(\nub \; (\filter \; (\neg\zeroable)\;((\textit{simp}\_{SL} \; r) :: \map \; \textit{simp}\_{SL} \; rs)))$\\ + −
+ −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
The $\textit{zeroable}$ predicate + −
tests whether the regular expression+ −
is equivalent to $\ZERO$, and+ −
can be defined as:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\;+ −
\zeroable \;_{[]}\sum\;rs $\\+ −
$\zeroable\;_{bs}(r_1 \cdot r_2)$ & $\dn$ & $\zeroable\; r_1 \;\; \lor \;\; \zeroable \; r_2$\\+ −
$\zeroable\;_{bs}r^*$ & $\dn$ & $\textit{false}$ \\+ −
$\zeroable\;_{bs}c$ & $\dn$ & $\textit{false}$\\+ −
$\zeroable\;_{bs}\ONE$ & $\dn$ & $\textit{false}$\\+ −
$\zeroable\;_{bs}\ZERO$ & $\dn$ & $\textit{true}$+ −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
The + −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &+ −
$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\+ −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
clause does flatten the alternative as required in step (1),+ −
but $\textit{simp}\_{SL}$ is insufficient if we want to do steps (2) and (3),+ −
as these ``identical'' terms have different bit-annotations.+ −
They also suggested that the $\textit{simp}\_{SL} $ function should be+ −
applied repeatedly until a fixpoint is reached.+ −
We call this construction $\textit{SLSimp}$:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{SLSimp} \; r$ & $\dn$ & + −
$\textit{while}((\textit{simp}\_{SL} \; r)\; \cancel{=} \; r)$ \\+ −
& & $\quad r := \textit{simp}\_{SL} \; r$\\+ −
& & $\textit{return} \; r$+ −
\end{tabular}+ −
\end{center}+ −
We call the operation of alternatingly + −
applying derivatives and simplifications+ −
(until the string is exhausted) Sulz-simp-derivative,+ −
written $\backslash_{SLSimp}$:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$r \backslash_{SLSimp} (c\!::\!s) $ & $\dn$ & $(\textit{SLSimp} \; (r \backslash c)) \backslash_{SLSimp}\, s$ \\+ −
$r \backslash_{SLSimp} [\,] $ & $\dn$ & $r$+ −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
After the derivatives have been taken, the bitcodes+ −
are extracted and decoded in the same manner+ −
as $\blexer$:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{blexer\_SLSimp}\;r\,s$ & $\dn$ &+ −
$\textit{let}\;a = (r^\uparrow)\backslash_{SLSimp}\, s\;\textit{in}$\\ + −
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\+ −
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\+ −
& & $\;\;\textit{else}\;\textit{None}$+ −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
We implemented this lexing algorithm in Scala, + −
and found that the final derivative regular expression+ −
size still grows exponentially (note the logarithmic scale):+ −
\begin{figure}[H]+ −
\centering+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
xlabel={$n$},+ −
ylabel={size},+ −
ymode = log,+ −
legend entries={Final Derivative Size}, + −
legend pos=north west,+ −
legend cell align=left]+ −
\addplot[red,mark=*, mark options={fill=white}] table {SulzmannLuLexer.data};+ −
\end{axis}+ −
\end{tikzpicture} + −
\caption{Lexing the regular expression $(a^*a^*)^*$ against strings of the form+ −
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}+ −
$ using Sulzmann and Lu's lexer}\label{SulzmannLuLexer}+ −
\end{figure}+ −
\noindent+ −
At $n= 20$ we already get an out-of-memory error with Scala's normal + −
JVM heap size settings.+ −
In fact their simplification does not improve much over+ −
the simple-minded simplifications we have shown in \ref{fig:BetterWaterloo}.+ −
The time required also grows exponentially:+ −
\begin{figure}[H]+ −
\centering+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
xlabel={$n$},+ −
ylabel={time},+ −
%ymode = log,+ −
legend entries={time in secs}, + −
legend pos=north west,+ −
legend cell align=left]+ −
\addplot[red,mark=*, mark options={fill=white}] table {SulzmannLuLexerTime.data};+ −
\end{axis}+ −
\end{tikzpicture} + −
\caption{Lexing the regular expression $(a^*a^*)^*$ against strings of the form+ −
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}+ −
$ using Sulzmann and Lu's lexer}\label{SulzmannLuLexerTime}+ −
\end{figure}+ −
\noindent+ −
which seems like a counterexample for + −
Sulzmann and Lu's linear complexity claim+ −
in their paper \cite{Sulzmann2014}:+ −
\begin{quote}\it+ −
``Linear-Time Complexity Claim \\It is easy to see that each call of one of the functions/operations:+ −
simp, fuse, mkEpsBC and isPhi leads to subcalls whose number is bound by the size of the regular expression involved. We claim that thanks to aggressively applying simp this size remains finite. Hence, we can argue that the above mentioned functions/operations have constant time complexity which implies that we can incrementally compute bit-coded parse trees in linear time in the size of the input.'' + −
\end{quote}+ −
\noindent+ −
The assumption that the size of the regular expressions+ −
in the algorithm+ −
would stay below a finite constant is not true, at least not in the+ −
examples we considered.+ −
The main reason behind this is that (i) Haskell's $\textit{nub}$+ −
function requires identical annotations between two + −
annotated regular expressions to qualify as duplicates,+ −
and therefore cannot simplify cases like $_{SZZ}a^*+_{SZS}a^*$+ −
even if both $a^*$ denote the same language, and+ −
(ii) the ``flattening'' only applies to the head of the list+ −
in the + −
\begin{center}+ −
\begin{tabular}{lcl}+ −
+ −
$\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &+ −
$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\+ −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
clause, and therefore is not strong enough to simplify all+ −
needed parts of the regular expression. Moreover,+ −
the $\textit{simp}\_{SL}$ function is applied repeatedly+ −
in each derivative step until a fixed point is reached, + −
which makes the algorithm even more+ −
unpredictable and inefficient.+ −
%To not get ``caught off guard'' by+ −
%these counterexamples,+ −
%one needs to be more careful when designing the+ −
%simplification function and making claims about them.+ −
+ −
+ −
+ −
\section{Our $\textit{Simp}$ Function}+ −
We will now introduce our own simplification function.+ −
%by making a contrast with $\textit{simp}\_{SL}$.+ −
We also describe+ −
the ideas behind Sulzmann and Lu's $\textit{simp}\_{SL}$+ −
algorithm + −
and why it fails to achieve the desired effect of keeping the sizes of derivatives finitely bounded. + −
In addition, our simplification function will come with a formal+ −
correctness proof.+ −
\subsection{Flattening Nested Alternatives}+ −
The idea behind the clause+ −
\begin{center}+ −
$\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad+ −
_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$+ −
\end{center}+ −
is that it allows+ −
duplicate removal of regular expressions at different+ −
``levels'' of alternatives.+ −
For example, this would help with the+ −
following simplification:+ −
+ −
\begin{center}+ −
$(a+r)+r \longrightarrow a+r$+ −
\end{center}+ −
The problem is that only the head element+ −
is ``spilled out''.+ −
It is more desirable+ −
to flatten+ −
an entire list to open up possibilities for further simplifications+ −
with later regular expressions.+ −
Not flattening the rest of the elements also means that+ −
the later de-duplication process + −
does not fully remove further duplicates.+ −
For example,+ −
using $\textit{simp}\_{SL}$ we cannot+ −
simplify+ −
\begin{center}+ −
$((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*++ −
((a^*a^*)+a^*)\cdot (a^*a^*)^*$+ −
\end{center}+ −
due to the underlined part not being the head + −
of the alternative.+ −
+ −
We define our flatten operation so that it flattens + −
the entire list: + −
\begin{center}+ −
\begin{tabular}{@{}lcl@{}}+ −
$\textit{flts} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;+ −
(\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flts} \; as' $ \\+ −
$\textit{flts} \; \ZERO :: as'$ & $\dn$ & $ \textit{flts} \; \textit{as'} $ \\+ −
$\textit{flts} \; a :: as'$ & $\dn$ & $a :: \textit{flts} \; \textit{as'}$ \quad(otherwise) + −
\end{tabular} + −
\end{center} + −
\noindent+ −
Our $\flts$ operation + −
also throws away $\ZERO$s+ −
as they do not contribute to a lexing result.+ −
\subsection{Duplicate Removal}+ −
After flattening is done, we can deduplicate.+ −
The de-duplicate function is called $\distinctBy$,+ −
and that is where we make our second improvement over+ −
Sulzmann and Lu's simplification method.+ −
The process goes as follows:+ −
\begin{center}+ −
$rs \stackrel{\textit{flts}}{\longrightarrow} + −
rs_{flat} + −
\xrightarrow{\distinctBy \; + −
rs_{flat} \; \rerases\; \varnothing} rs_{distinct}$+ −
%\stackrel{\distinctBy \; + −
%rs_{flat} \; \erase\; \varnothing}{\longrightarrow} \; rs_{distinct}$+ −
\end{center}+ −
where the $\distinctBy$ function is defined as:+ −
\begin{center}+ −
\begin{tabular}{@{}lcl@{}}+ −
$\distinctBy \; [] \; f\; acc $ & $ =$ & $ []$\\+ −
$\distinctBy \; (x :: xs) \; f \; acc$ & $=$ & $\quad \textit{if} (f \; x \in acc)\;\; \textit{then} \;\; \distinctBy \; xs \; f \; acc$\\+ −
& & $\quad \textit{else}\;\; x :: (\distinctBy \; xs \; f \; (\{f \; x\} \cup acc))$ + −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
The reason we define a distinct function under a mapping $f$ is because+ −
we want to eliminate regular expressions that are syntactically the same,+ −
but have different bit-codes.+ −
For example, we can remove the second $a^*a^*$ from+ −
$_{ZSZ}a^*a^* + _{SZZ}a^*a^*$, because it+ −
represents a match with shorter initial sub-match + −
(and therefore is definitely not POSIX),+ −
and will be discarded by $\bmkeps$ later.+ −
\begin{center}+ −
$_{ZSZ}\underbrace{a^*}_{ZS:\; match \; 1\; times\quad}\underbrace{a^*}_{Z: \;match\; 1 \;times} + + −
_{SZZ}\underbrace{a^*}_{S: \; match \; 0 \; times\quad}\underbrace{a^*}_{ZZ: \; match \; 2 \; times}+ −
$+ −
\end{center}+ −
%$_{bs1} r_1 + _{bs2} r_2 \text{where} (r_1)_{\downarrow} = (r_2)_{\downarrow}$+ −
Due to the way our algorithm works,+ −
the matches that conform to the POSIX standard + −
will always be placed further to the left. When we + −
traverse the list from left to right,+ −
regular expressions we have already seen+ −
will definitely not contribute to a POSIX value,+ −
even if they are attached with different bitcodes.+ −
These duplicates therefore need to be removed.+ −
+ −
+ −
To achieve this, we call $\rerases$ as the function $f$ during the distinction+ −
operation. The function+ −
$\rerases$ is very similar to $\erase$, except that it preserves the structure+ −
when erasing an alternative regular expression.+ −
The reason why we use $\rerases$ instead of $\erase$ is that+ −
it keeps the structures of alternative + −
annotated regular expressions+ −
whereas $\erase$ would turn it back into a binary tree structure.+ −
Not having to mess with the structure + −
greatly simplifies the finiteness proof in chapter + −
\ref{Finite}.+ −
We give the definitions of $\rerases$ here together with+ −
the new datatype used by $\rerases$ (as our plain+ −
regular expression datatype does not allow non-binary alternatives).+ −
For now we can think of + −
$\rerases$ as the function erase ($(\_)_\downarrow$) defined in chapter \ref{Bitcoded1}+ −
and $\rrexp$ as plain regular expressions, but having a general list constructor+ −
for alternatives:+ −
\begin{figure}[H]+ −
\begin{center} + −
$\rrexp ::= \RZERO \mid \RONE+ −
\mid \RCHAR{c} + −
\mid \RSEQ{r_1}{r_2}+ −
\mid \RALTS{rs}+ −
\mid \RSTAR{r} $+ −
\end{center}+ −
\caption{$\rrexp$: plain regular expressions, but with $\sum$ alternative + −
constructor}\label{rrexpDef}+ −
\end{figure}+ −
The function $\rerases$ we define as follows:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\+ −
$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\+ −
$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\+ −
$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\+ −
$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\+ −
$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$+ −
\end{tabular}+ −
\end{center}+ −
We will provide more details in \ref{whyRerase} for why a new+ −
erase function and new datatype is needed. But briefly speaking+ −
it is for backward-compatibility with $\blexer$'s correctness proof and + −
the path we (naturally) took during our proof engineering of the finiteness property.+ −
+ −
\subsection{Putting Things Together}+ −
We can now give the definition of our simplification function:+ −
%that looks somewhat similar to our Scala code is + −
\begin{center}+ −
\begin{tabular}{@{}lcl@{}}+ −
+ −
$\textit{bsimp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ \textit{bsimp}_{ASEQ} \; bs \;(\textit{bsimp} \; a_1) \; (\textit{bsimp} \; a_2) $ \\+ −
$\textit{bsimp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{bsimp}_{ALTS} \; \textit{bs} \; (\textit{distinctBy} \; ( \textit{flatten} ( \textit{map} \; bsimp \; as)) \; \rerases \; \varnothing) $ \\+ −
$\textit{bsimp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ + −
\end{tabular} + −
\end{center} + −
+ −
\noindent+ −
The simplification (named $\textit{bsimp}$ for \emph{b}it-coded) + −
does a pattern matching on the regular expression.+ −
When it detects that the regular expression is an alternative or+ −
sequence, it will try to simplify its children regular expressions+ −
recursively and then see if one of the children turns into $\ZERO$ or+ −
$\ONE$, which might trigger further simplification at the current level.+ −
Current level simplifications are handled by the function $\textit{bsimp}_{ASEQ}$,+ −
using rules such as $\ZERO \cdot r \rightarrow \ZERO$ and $\ONE \cdot r \rightarrow r$.+ −
\begin{center}+ −
\begin{tabular}{@{}lcl@{}}+ −
$\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\+ −
&&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\+ −
&&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\+ −
&&$\quad\textit{case} \; (_{bs1}\ONE, a_2') \Rightarrow \textit{fuse} \; (bs@bs_1) \; a_2'$ \\+ −
&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ + −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
The most involved part is the $\sum$ clause, where we first call $\flts$ on+ −
the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$,+ −
and then call $\distinctBy$ on that list. The predicate used in $\distinctBy$ for determining whether two + −
elements are the same is $\rerases \; r_1 = \rerases\; r_2$.+ −
Finally, depending on whether the regular expression list $as'$ has turned into a+ −
singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{ALTS}$+ −
decides whether to keep the current level constructor $\sum$ as it is, and + −
removes it when there are fewer than two elements:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{bsimp}_{ALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\ + −
&&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\+ −
&&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\+ −
&&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ + −
\end{tabular}+ −
+ −
\end{center}+ −
Having defined the $\textit{bsimp}$ function,+ −
we add it as a phase after a derivative is taken.+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$a \backslash_{bsimp} c$ & $\dn$ & $\textit{bsimp}(a \backslash c)$+ −
\end{tabular}+ −
\end{center}+ −
%Following previous notations+ −
%when extending from derivatives w.r.t.~character to derivative+ −
%w.r.t.~string, we define the derivative that nests simplifications + −
%with derivatives:%\comment{simp in the [] case?}+ −
We extend this from characters to strings:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$a \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(a \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\+ −
$a \backslash_{bsimps} [\,] $ & $\dn$ & $a$+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
The lexer that extracts bitcodes from the + −
derivatives with simplifications from our $\simp$ function+ −
is called $\blexersimp$:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{blexer\_simp}\;r\,s$ & $\dn$ &+ −
$\textit{let}\;a = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\ + −
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\+ −
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\+ −
& & $\;\;\textit{else}\;\textit{None}$+ −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
This algorithm keeps the regular expression size small, + −
as we shall demonstrate with some examples in the next section.+ −
+ −
+ −
\subsection{Examples $(a+aa)^*$ and $(a^*\cdot a^*)^*$+ −
After Simplification}+ −
Recall the+ −
previous $(a^*a^*)^*$ example+ −
where $\textit{simp}\_{SL}$ could not+ −
prevent the fast growth (over+ −
3 million nodes just below $20$ input length)+ −
will be reduced to just 15 and stays constant no matter how long the+ −
input string is.+ −
This is shown in the graphs below.+ −
\begin{figure}[H]+ −
\begin{center}+ −
\begin{tabular}{ll}+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
xlabel={$n$},+ −
ylabel={derivative size},+ −
width=7cm,+ −
height=4cm, + −
legend entries={Lexer with $\textit{bsimp}$}, + −
legend pos= south east,+ −
legend cell align=left]+ −
\addplot[red,mark=*, mark options={fill=white}] table {BitcodedLexer.data};+ −
\end{axis}+ −
\end{tikzpicture} %\label{fig:BitcodedLexer}+ −
&+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
xlabel={$n$},+ −
ylabel={derivative size},+ −
width = 7cm,+ −
height = 4cm,+ −
legend entries={Lexer with $\textit{simp}\_{SL}$}, + −
legend pos= north west,+ −
legend cell align=left]+ −
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};+ −
\end{axis}+ −
\end{tikzpicture} + −
\end{tabular}+ −
\end{center}+ −
\caption{Our Improvement over Sulzmann and Lu's in terms of size of the derivatives.}+ −
\end{figure}+ −
\noindent+ −
Given the size difference, it is not+ −
surprising that our $\blexersimp$ significantly outperforms+ −
$\textit{blexer\_SLSimp}$ by Sulzmann and Lu.+ −
Indeed $\blexersimp$ seems to be a correct algorithm that effectively+ −
bounds the size of intermediate representations.+ −
\marginpar{\em more connecting material to make narration more coherent.}+ −
As promised we will use formal proofs to show that our speculation+ −
based on these experimental results indeed hold.+ −
%intuitions indeed hold.+ −
In the next section we are going to establish that our+ −
simplification preserves the correctness of the algorithm.+ −
+ −
+ −
\section{Correctness of $\blexersimp$}+ −
A natural thought would be to directly re-use the formal+ −
proof of $\blexer$'s correctness, with some minor modifications+ −
but keeping the way the induction is done.+ −
However we were not able to find a simple way to re-factor+ −
proof of \ref{blexerCorrect} in chapter \ref{Bitcoded1}.+ −
+ −
\subsection{Why $\textit{Blexer}$'s Proof Does Not Work}+ −
The fundamental reason is %we cannot extend the correctness proof of theorem 4+ −
because lemma \ref{retrieveStepwise} does not hold + −
anymore when simplifications are involved.+ −
\marginpar{\em rephrased things \\so why new \\proof makes sense.}+ −
%The proof details are necessary materials for this thesis+ −
%because it provides necessary context to explain why we need a+ −
%new framework for the proof of $\blexersimp$, which involves+ −
%simplifications that cause structural changes to the regular expression.+ −
%A new formal proof of the correctness of $\blexersimp$, where the + −
%proof of $\blexer$+ −
%is not applicatble in the sense that we cannot straightforwardly extend the+ −
%proof of theorem \ref{blexerCorrect} because lemma \ref{retrieveStepwise} does+ −
%not hold anymore.+ −
%This is because the structural induction on the stepwise correctness+ −
%of $\inj$ breaks due to each pair of $r_i$ and $v_i$ described+ −
%in chapter \ref{Inj} and \ref{Bitcoded1} no longer correspond to+ −
%each other.+ −
%In this chapter we introduce simplifications+ −
%for annotated regular expressions that can be applied to + −
%each intermediate derivative result. This allows+ −
%us to make $\blexer$ much more efficient.+ −
%Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions,+ −
%but their simplification functions could have been more efficient and in some cases needed fixing.+ −
+ −
+ −
In particular, the correctness theorem + −
of the un-optimised bit-coded lexer $\blexer$ in + −
chapter \ref{Bitcoded1} formalised by Ausaf et al.+ −
relies crucially on lemma \ref{retrieveStepwise} that says+ −
any value can be retrieved in a stepwise manner, namely:+ −
\begin{equation}\label{eq:stepwise}%eqref: this proposition needs to be referred + −
\vdash v : ((a_\downarrow) \backslash c) \implies \retrieve \; (a \backslash c) \; v= \retrieve \; a \; (\inj \; (a_\downarrow) \; c\; v)+ −
\end{equation}+ −
%This no longer holds once we introduce simplifications.+ −
The regular expressions $a$ and $a\backslash c$ correspond to the intermediate+ −
result before and after the derivative with respect to $c$, and similarly+ −
$\inj\; a_\downarrow \; c \; v$ and $v$ correspond to the value before and after the derivative.+ −
They go in lockstep pairs+ −
\[+ −
(a, \; \inj\; a_\downarrow \; c \; v)\; \text{and} \; (a\backslash c,\; v)+ −
\]+ −
and the structure of annotated regular expression and + −
value within a pair always align with each other.+ −
+ −
As $\blexersimp$ integrates+ −
$\textit{bsimp}$ by applying it after each call to the derivatives function,+ −
%Simplifications are necessary to control the size of derivatives,+ −
%but they also destroy the structures of the regular expressions+ −
%such that \ref{eq:stepwise} does not hold.+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$a \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(\textit{bsimp} \; (a \backslash\, c)) \backslash_{bsimps}\, s$ \\+ −
%$r \backslash_{bsimps} [\,] $ & $\dn$ & $r$+ −
\end{tabular}+ −
%\begin{tabular}{lcl}+ −
% $\textit{blexer\_simp}\;r\,s$ & $\dn$ &+ −
% $\textit{let}\;a = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\ + −
% & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\+ −
% & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\+ −
% & & $\;\;\textit{else}\;\textit{None}$+ −
%\end{tabular}+ −
\end{center}+ −
\noindent+ −
it becomes a problem to maintain a similar property as \ref{retrieveStepwise}.+ −
Previously without $\textit{bsimp}$ the exact structure of each intermediate + −
regular expression is preserved, + −
%allowing pairs of inhabitation relations + −
%in the form $\vdash v : r \backslash c $ and+ −
%$\vdash \inj \; r\; c \; v : r $ to hold in \ref{eq:stepwise}.+ −
We can illustrate this using the diagram \ref{fig:inj} in chapter \ref{Inj},+ −
by zooming in to the middle bit involving $r_i$, $r_{i+1}$, $v_i$ and $v_{i+1}$,+ −
and adding the bottom row to show how bitcodes encoding the lexing information+ −
can be extracted from every pair $(r_i, \; v_i)$:+ −
\begin{center}\label{graph:injZoom}+ −
\begin{tikzpicture}[->, >=stealth', shorten >= 1pt, auto, thick]+ −
%\node [rectangle ] (1) at (-7, 2) {$\ldots$};+ −
%\node [rectangle, draw] (2) at (-4, 2) {$r_i = _{bs'}(_Za+_Saa)^*$};+ −
%\node [rectangle, draw] (3) at (4, 2) {$r_{i+1} = _{bs'}(_Z(_Z\ONE + _S(\ONE \cdot a)))\cdot(_Za+_Saa)^*$};+ −
%\node [rectangle] (4) at (9, 2) {$\ldots$};+ −
%\node [rectangle] (5) at (-7, -2) {$\ldots$};+ −
%\node [rectangle, draw] (6) at (-4, -2) {$v_i = \Stars \; [\Left (a)]$};+ −
%\node [rectangle, draw] (7) at ( 4, -2) {$v_{i+1} = \Seq (\Alt (\Left \; \Empty)) \; \Stars \, []$};+ −
%\node [rectangle] (8) at ( 9, -2) {$\ldots$};+ −
%\node [rectangle] (9) at (-7, -6) {$\ldots$};+ −
%\node [rectangle, draw] (10) at (-4, -6) {$\textit{bits}_{i} = bs' @ ZZS$};+ −
%\node [rectangle, draw] (11) at (4, -6) {$\textit{bits}_{i+1} = bs'@ ZZS$};+ −
%\node [rectangle] (12) at (9, -6) {$\ldots$};+ −
+ −
\node [rectangle ] (1) at (-8, 2) {$\ldots$};+ −
\node [rectangle, draw] (2) at (-5, 2) {$r_i = a_\downarrow$};+ −
\node [rectangle, draw] (3) at (3, 2) {$r_{i+1} = (a\backslash c)_\downarrow$};+ −
\node [rectangle] (4) at (8, 2) {$\ldots$};+ −
\node [rectangle] (5) at (-8, -2) {$\ldots$};+ −
\node [rectangle, draw] (6) at (-5, -2) {$v_i = \inj\; r \; c \; v$};+ −
\node [rectangle, draw] (7) at ( 3, -2) {$v_{i+1} = v$};+ −
\node [rectangle] (8) at ( 8, -2) {$\ldots$};+ −
\node [rectangle] (9) at (-8, -6) {$\ldots$};+ −
\node [rectangle, draw] (10) at (-5, -6) {$\textit{bits}_{i} $};+ −
\node [rectangle, draw] (11) at (3, -6) {$\textit{bits}_{i+1}$};+ −
\node [rectangle] (12) at (8, -6) {$\ldots$};+ −
+ −
+ −
+ −
\path (1) edge [] node {} (2);+ −
\path (6) edge [] node {} (5);+ −
\path (9) edge [] node {} (10);+ −
+ −
\path (11) edge [<-] node {} (12);+ −
\path (8) edge [] node {} (7);+ −
\path (3) edge [] node {} (4);+ −
+ −
+ −
\path (6) edge [dashed,bend right = 30] node {$\retrieve \; a_i \; v_i$} (10);+ −
\path (2) edge [dashed,bend left = 48] node {} (10);+ −
+ −
\path (7) edge [dashed,bend right = 30] node {$\retrieve \; a_{i+1} \; v_{i+1}$} (11);+ −
\path (3) edge [dashed,bend left = 45] node {} (11);+ −
+ −
\path (2) edge [] node {$\backslash c$} (3);+ −
\path (2) edge [dashed, <->] node {$\vdash v_i : r_i$} (6);+ −
\path (3) edge [dashed, <->] node {$\vdash v_{i+1} : r_{i+1}$} (7);+ −
%\path (6) edge [] node {$\vdash v_i : r_i$} (10);+ −
%\path (7) edge [dashed, <->] node {$\vdash v_i : r_i$} (11);+ −
+ −
\path (10) edge [dashed, <->] node {$=$} (11);+ −
+ −
\path (7) edge [] node {$\inj \; r_{i+1} \; c \; v_i$} (6);+ −
+ −
% \node [rectangle, draw] (r) at (-6, -1) {$(aa)^*(b+c)$};+ −
% \node [rectangle, draw] (a) at (-6, 4) {$(aa)^*(_{Z}b + _{S}c)$};+ −
% \path (r)+ −
% edge [] node {$\internalise$} (a);+ −
% \node [rectangle, draw] (a1) at (-3, 1) {$(_{Z}(\ONE \cdot a) \cdot (aa)^*) (_{Z}b + _Sc)$};+ −
% \path (a)+ −
% edge [] node {$\backslash a$} (a1);+ −
%+ −
% \node [rectangle, draw, three sided] (a21) at (-2.5, 4) {$(_{Z}\ONE \cdot (aa)^*)$};+ −
% \node [rectangle, draw, three sided1] (a22) at (-0.8, 4) {$(_{Z}b + _{S}c)$};+ −
% \path (a1)+ −
% edge [] node {$\backslash a$} (a21);+ −
% \node [rectangle, draw] (a3) at (0.5, 2) {$_{ZS}(_{Z}\ONE + \ZERO)$};+ −
% \path (a22)+ −
% edge [] node {$\backslash b$} (a3);+ −
% \path (a21)+ −
% edge [dashed, bend right] node {} (a3);+ −
% \node [rectangle, draw] (bs) at (2, 4) {$ZSZ$};+ −
% \path (a3)+ −
% edge [below] node {$\bmkeps$} (bs);+ −
% \node [rectangle, draw] (v) at (3, 0) {$\Seq \; (\Stars\; [\Seq \; a \; a]) \; (\Left \; b)$};+ −
% \path (bs)+ −
% edge [] node {$\decode$} (v);+ −
+ −
+ −
\end{tikzpicture}+ −
%\caption{$\blexer$ with the regular expression $(aa)^*(b+c)$ and $aab$}+ −
\end{center}+ −
+ −
But $\blexersimp$ introduces simplification after the derivative,+ −
making it difficult to align the structures of values and regular expressions.+ −
If we change the form of property \ref{eq:stepwise} to + −
adapt to the needs of $\blexersimp$ the precondition of becomes+ −
\[+ −
\vdash v : (\textit{bsimp} \; (r\backslash c))+ −
\]+ −
The inhabitation relation of the other pair no longer holds,+ −
because $\inj$ does not work on the simplified value $v$ + −
and the unsimplified regular expression $r$.+ −
The retrieve function will not work either.+ −
\[+ −
\vdash \inj \; r \; c \; v : r+ −
\]+ −
It seems unclear what procedures needs to be+ −
used to create a new value $v_?$ such that+ −
\[+ −
\vdash v_? : r \; \text{and} \; \retrieve \; r \; v_? = \retrieve \; (\textit{bsimp} \; (r\backslash c)) \; v+ −
\]+ −
hold.+ −
%It is clear that once we made + −
%$v$ to align with $\textit{bsimp} \; r_{c}$+ −
%in the inhabitation relation, something different than $v_{r}^{c}$ needs to be plugged+ −
%in for the above statement to hold.+ −
Ausaf et al. \cite{AusafDyckhoffUrban2016}+ −
used something they call rectification functions to restore the original value from the simplified value.+ −
The idea is that simplification functions not only returns a regular expression,+ −
but also a rectification function + −
\[+ −
\textit{simp}^{rect} : Regex \Rightarrow (Value \Rightarrow Value, Regex)+ −
%\textit{frect} : Value \Rightarrow Value+ −
\]+ −
that is recorded recursively,+ −
and then applied to the previous value + −
to obtain the correct value for $\inj$ to work on. + −
The recursive case of the lexer is defined as something like+ −
\[+ −
\textit{slexer} \; r \; (c\!::\!s) \dn let \;(\textit{frect}, r_c) = \textit{simp}^{rect} \;(r \backslash c) \;\; \textit{in}\;\;+ −
\inj \; r \; c \; (\textit{frect} \; (\textit{slexer} \; r_c\; s))+ −
%\textit{match} \; s \; \textit{case} [+ −
\]+ −
However this approach (including $\textit{slexer}$'s correctness proof) only + −
works without bitcodes, and it limits the kind of simplifications one can introduce.+ −
%and they have not yet extended their relatively simple simplifications+ −
%to more aggressive ones.+ −
See the thesis by Ausaf \cite{Ausaf}+ −
for details.+ −
+ −
%\begin{center}+ −
% $\vdash v: (r\backslash c) \implies \retrieve \; (\mathord{?}(\textit{bsimp} \; r_c)) \; v =\retrieve \; r \;(\mathord{?} v_{r}^{c}) $+ −
%\end{center}+ −
%\noindent+ −
+ −
We were not able to use their idea for our very strong simplification rules.+ −
Therefore we are taking another route that completely+ −
disposes of lemma \ref{retrieveStepwise},+ −
and prove a weakened inductive invariant instead.+ −
+ −
Let us first explain why lemma \ref{retrieveStepwise}'s requirement + −
is too strong, and suggest a few possible fixes, which leads to+ −
our proof which we believe was the most natural and effective method.+ −
+ −
+ −
+ −
\subsection{Why Lemma \ref{retrieveStepwise}'s Requirement is too Strong}+ −
+ −
%From this chapter we start with the main contribution of this thesis, which+ −
+ −
The $\blexer$ proof relies on $r_i, \; v_i$ to match each other in lockstep+ −
for each derivative step $i$, however only $v_0$ is needed and intermediate+ −
$v_i$s are purely proof scaffolding.+ −
Moreover property \ref{eq:stepwise}+ −
is stronger than needed for POSIX lexing: the precondition+ −
$\vdash v_{i+1}:r_{i+1}$ is too general in the sense that it allows arbitrary + −
values inhabiting in $r_i$ to retrieve bitcodes.+ −
%correspondence between the lexical value and the+ −
%regular expression in derivative and injection operations at the same step $i$.+ −
%If we revisit the diagram \ref{graph:injZoom} with an example+ −
Consider a concrete example where $a_i = (_{ZZ}x + _{ZS}y + _{S}x)$ and+ −
$a_{i+1} = (_{ZZ}\ONE + \ZERO + _{S}\ONE)$.+ −
What is required in the proof of $\blexer$ is that for the POSIX value+ −
$v_i = \Left \; (\Left \; Empty)$,+ −
the property+ −
\[+ −
%\vdash \Left \; (\Left \; Empty) : (\ONE+\ZERO+\ONE) \implies + −
\retrieve \; (_{ZZ}\ONE + \ZERO + _{S}\ONE) \; (\Left \; (\Left \; \Empty) ) =+ −
\retrieve \; (_{ZZ}x + _{ZS}y + _{S}x ) \; (\Left \; (\Left \; \Char\; x) )+ −
\]+ −
holds, and for $\blexersimp$ a property of similar shape to+ −
\[+ −
\retrieve \; _{ZZ}\ONE \; \; Empty =+ −
\retrieve \; _{ZZ}x \; (\Char\; x)+ −
\]+ −
needs to hold as well.+ −
However for the definitely non-POSIX value $v_i' = \Right \; \Empty$+ −
the precondition $\vdash \Right \; \Empty : x+y+x$ holds as well, and therefore+ −
the following equality+ −
\[+ −
\retrieve \; (_{ZZ}\ONE + \ZERO + _{S}\ONE) \; (\Right \; \Empty) =+ −
\retrieve \; (_{ZZ}x + _{ZS}y + _{S}x ) \; (\Right \; (\Char\; x))+ −
\]+ −
by lemma \ref{retrieveStepwise} holds for $\blexer$ as well.+ −
This cannot hold or be proven anymore with $\blexersimp$ as the corresponding+ −
sub-regular expressions and values have been eliminated during the + −
de-duplication procedure of our smplification.+ −
We are stuck with a property that holds in $\blexer$ but does not have a counterpart+ −
in $\blexersimp$.+ −
This needs not hold for the purpose of POSIX lexing though--we know the rightmost + −
subexpression $x$ is not POSIX by the left priority rule.+ −
The inductive invariant \ref{eq:stepwise} can be weakened by restricting the precondition+ −
$\vdash v_i:r_i$ to $\exists s_i. \; (s_i, r_i) \rightarrow v_i$. + −
We tried this route but it did not work well since we need to+ −
use a similar technique as the rectification functions by Ausaf et al, + −
and they can get very complicated with our simplifications.+ −
After some trial-and-error we found a property of the form+ −
\begin{property}+ −
If a POSIX value can be extracted from $a \backslash s$, then+ −
it can be extracted from $a \backslash_{bsimps} s$ as well.+ −
\end{property}+ −
\noindent+ −
most natural to work with, and we defined a binary relation to capture+ −
the connection between $a\backslash s$ and $a \backslash_{bsimps} s$.+ −
%and look specifically at+ −
%the pairs $v_i, r_i$ and $v_{i+1},\, r_{i+1}$, we get the diagram demonstrating+ −
%the invariant that the same bitcodes can be extracted from the pairs:+ −
%\tikzset{three sided/.style={+ −
% draw=none,+ −
% append after command={+ −
% [-,shorten <= -0.5\pgflinewidth]+ −
% ([shift={(-1.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north east)+ −
% edge([shift={( 0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north west) + −
% ([shift={( 0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north west)+ −
% edge([shift={( 0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south west) + −
% ([shift={( 0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south west)+ −
% edge([shift={(-1.0\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south east)+ −
% }+ −
% }+ −
%}+ −
%+ −
%\tikzset{three sided1/.style={+ −
% draw=none,+ −
% append after command={+ −
% [-,shorten <= -0.5\pgflinewidth]+ −
% ([shift={(1.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north west)+ −
% edge([shift={(-0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north east) + −
% ([shift={(-0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north east)+ −
% edge([shift={(-0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south east) + −
% ([shift={(-0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south east)+ −
% edge([shift={(1.0\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south west)+ −
% }+ −
% }+ −
%}+ −
%+ −
%\begin{center}+ −
% \begin{tikzpicture}[->, >=stealth', shorten >= 1pt, auto, thick]+ −
% %\node [rectangle ] (1) at (-7, 2) {$\ldots$};+ −
% %\node [rectangle, draw] (2) at (-4, 2) {$r_i = _{bs'}(_Za+_Saa)^*$};+ −
% %\node [rectangle, draw] (3) at (4, 2) {$r_{i+1} = _{bs'}(_Z(_Z\ONE + _S(\ONE \cdot a)))\cdot(_Za+_Saa)^*$};+ −
% %\node [rectangle] (4) at (9, 2) {$\ldots$};+ −
% %\node [rectangle] (5) at (-7, -2) {$\ldots$};+ −
% %\node [rectangle, draw] (6) at (-4, -2) {$v_i = \Stars \; [\Left (a)]$};+ −
% %\node [rectangle, draw] (7) at ( 4, -2) {$v_{i+1} = \Seq (\Alt (\Left \; \Empty)) \; \Stars \, []$};+ −
% %\node [rectangle] (8) at ( 9, -2) {$\ldots$};+ −
% %\node [rectangle] (9) at (-7, -6) {$\ldots$};+ −
% %\node [rectangle, draw] (10) at (-4, -6) {$\textit{bits}_{i} = bs' @ ZZS$};+ −
% %\node [rectangle, draw] (11) at (4, -6) {$\textit{bits}_{i+1} = bs'@ ZZS$};+ −
% %\node [rectangle] (12) at (9, -6) {$\ldots$};+ −
% + −
% \node [rectangle ] (1) at (-8, 2) {$\ldots$};+ −
% \node [rectangle, draw] (2) at (-5, 2) {$r_i = _{bs'}(_Za+_Saa)^*$};+ −
% \node [rectangle, draw] (3) at (3, 2) {$r_{i+1} = _{bs'}(_Z(_Z\ONE + _S(\ONE \cdot a)))\cdot(_Za+_Saa)^*$};+ −
% \node [rectangle] (4) at (8, 2) {$\ldots$};+ −
% \node [rectangle] (5) at (-8, -2) {$\ldots$};+ −
% \node [rectangle, draw] (6) at (-5, -2) {$v_i = \Stars \; [\Left (a)]$};+ −
% \node [rectangle, draw] (7) at ( 3, -2) {$v_{i+1} = \Seq (\Alt (\Left \; \Empty)) \; \Stars \, []$};+ −
% \node [rectangle] (8) at ( 8, -2) {$\ldots$};+ −
% \node [rectangle] (9) at (-8, -6) {$\ldots$};+ −
% \node [rectangle, draw] (10) at (-5, -6) {$\textit{bits}_{i} = bs' @ ZZS$};+ −
% \node [rectangle, draw] (11) at (3, -6) {$\textit{bits}_{i+1} = bs'@ ZZS$};+ −
% \node [rectangle] (12) at (8, -6) {$\ldots$};+ −
%+ −
%+ −
%+ −
% \path (1) edge [] node {} (2);+ −
% \path (5) edge [] node {} (6);+ −
% \path (9) edge [] node {} (10);+ −
%+ −
% \path (11) edge [] node {} (12);+ −
% \path (7) edge [] node {} (8);+ −
% \path (3) edge [] node {} (4);+ −
%+ −
%+ −
% \path (6) edge [dashed,bend right = 30] node {$\retrieve \; r_i \; v_i$} (10);+ −
% \path (2) edge [dashed,bend left = 48] node {} (10);+ −
%+ −
% \path (7) edge [dashed,bend right = 30] node {$\retrieve \; r_{i+1} \; v_{i+1}$} (11);+ −
% \path (3) edge [dashed,bend left = 45] node {} (11);+ −
% + −
% \path (2) edge [] node {$\backslash a$} (3);+ −
% \path (2) edge [dashed, <->] node {$\vdash v_i : r_i$} (6);+ −
% \path (3) edge [dashed, <->] node {$\vdash v_{i+1} : r_{i+1}$} (7);+ −
% %\path (6) edge [] node {$\vdash v_i : r_i$} (10);+ −
% %\path (7) edge [dashed, <->] node {$\vdash v_i : r_i$} (11);+ −
%+ −
% \path (10) edge [dashed, <->] node {$=$} (11);+ −
%+ −
% \path (7) edge [] node {$\inj \; r_{i+1} \; a \; v_i$} (6);+ −
%+ −
%% \node [rectangle, draw] (r) at (-6, -1) {$(aa)^*(b+c)$};+ −
%% \node [rectangle, draw] (a) at (-6, 4) {$(aa)^*(_{Z}b + _{S}c)$};+ −
%% \path (r)+ −
%% edge [] node {$\internalise$} (a);+ −
%% \node [rectangle, draw] (a1) at (-3, 1) {$(_{Z}(\ONE \cdot a) \cdot (aa)^*) (_{Z}b + _Sc)$};+ −
%% \path (a)+ −
%% edge [] node {$\backslash a$} (a1);+ −
%%+ −
%% \node [rectangle, draw, three sided] (a21) at (-2.5, 4) {$(_{Z}\ONE \cdot (aa)^*)$};+ −
%% \node [rectangle, draw, three sided1] (a22) at (-0.8, 4) {$(_{Z}b + _{S}c)$};+ −
%% \path (a1)+ −
%% edge [] node {$\backslash a$} (a21);+ −
%% \node [rectangle, draw] (a3) at (0.5, 2) {$_{ZS}(_{Z}\ONE + \ZERO)$};+ −
%% \path (a22)+ −
%% edge [] node {$\backslash b$} (a3);+ −
%% \path (a21)+ −
%% edge [dashed, bend right] node {} (a3);+ −
%% \node [rectangle, draw] (bs) at (2, 4) {$ZSZ$};+ −
%% \path (a3)+ −
%% edge [below] node {$\bmkeps$} (bs);+ −
%% \node [rectangle, draw] (v) at (3, 0) {$\Seq \; (\Stars\; [\Seq \; a \; a]) \; (\Left \; b)$};+ −
%% \path (bs)+ −
%% edge [] node {$\decode$} (v);+ −
%+ −
%+ −
% \end{tikzpicture}+ −
% %\caption{$\blexer$ with the regular expression $(aa)^*(b+c)$ and $aab$}+ −
%\end{center}+ −
+ −
%When simplifications are added, the inhabitation relation no longer holds,+ −
%causing the above diagram to break.+ −
%+ −
%Ausaf addressed this with an augmented lexer he called $\textit{slexer}$.+ −
%+ −
%+ −
%+ −
%we note that the invariant+ −
%$\vdash v_{i+1}: r_{i+1} \implies \retrieve \; r_{i+1} \; v_{i+1} $ is too strong+ −
%to maintain because the precondition $\vdash v_i : r_i$ is too weak.+ −
%It does not require $v_i$ to be a POSIX value + −
%+ −
%+ −
%which is essential for getting an understanding this thesis+ −
%in chapter \ref{Bitcoded1}, which is necessary for understanding why+ −
%the proof + −
%+ −
%In this chapter,+ −
+ −
%We contrast our simplification function + −
%with Sulzmann and Lu's, indicating the simplicity of our algorithm.+ −
%This is another case for the usefulness + −
%and reliability of formal proofs on algorithms.+ −
%These ``aggressive'' simplifications would not be possible in the injection-based + −
%lexing we introduced in chapter \ref{Inj}.+ −
%We then prove the correctness with the improved version of + −
%$\blexer$, called $\blexersimp$, by establishing + −
%$\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system.+ −
%+ −
%----------------------------------------------------------------------------------------+ −
% SECTION rewrite relation+ −
%----------------------------------------------------------------------------------------+ −
In the next section we first introduce the rewriting relation \emph{rrewrite}+ −
($\rrewrite$) between two regular expressions,+ −
which stands for an atomic+ −
simplification.+ −
We then prove properties about+ −
this rewriting relation and its reflexive transitive closure.+ −
Finally we leverage these properties to show+ −
an equivalence between the results generated by+ −
$\blexer$ and $\blexersimp$.+ −
+ −
\subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)}+ −
In the $\blexer$'s correctness proof, we+ −
did not directly derive the fact that $\blexer$ generates the POSIX value,+ −
but first proved that $\blexer$ generates the same result as $\lexer$.+ −
Then we re-use+ −
the correctness of $\lexer$+ −
to obtain + −
\begin{center}+ −
$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$\\+ −
$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\;+ −
r\;s = \None$.+ −
\end{center}+ −
%\begin{center}+ −
% $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$.+ −
%\end{center}+ −
Here we apply this+ −
modularised technique again+ −
by first proving that+ −
$\blexersimp \; r \; s $ + −
produces the same output as $\blexer \; r\; s$,+ −
and then piecing it together with + −
$\blexer$'s correctness to achieve our main+ −
theorem:+ −
\begin{center}+ −
$(r, s) \rightarrow v \; \; \textit{iff} \;\; \blexersimp \; r \; s = \Some \;v$+ −
\\+ −
$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\;+ −
r\;s = \None$+ −
\end{center}+ −
\noindent+ −
The overall idea for the proof+ −
of $\blexer \;r \;s = \blexersimp \; r \;s$ + −
is that the transition from $r$ to $\textit{bsimp}\; r$ can be+ −
broken down into smaller rewrite steps of the form:+ −
\begin{center}+ −
$r \rightsquigarrow^* \textit{bsimp} \; r$+ −
\end{center}+ −
where each rewrite step, written $\rightsquigarrow$,+ −
is an ``atomic'' simplification that+ −
is similar to a small-step reduction in operational semantics (+ −
see figure \ref{rrewriteRules} for the rules):+ −
\begin{figure}[H]+ −
\begin{mathpar}+ −
\inferrule * [Right = $S\ZERO_l$]{\vspace{0em}}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\}+ −
+ −
\inferrule * [Right = $S\ZERO_r$]{\vspace{0em}}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\}+ −
+ −
\inferrule * [Right = $S_1$]{\vspace{0em}}{_{bs1} ((_{bs2} \ONE) \cdot r) \rightsquigarrow \fuse \; (bs_1 @ bs_2) \; r\\}\\+ −
+ −
+ −
+ −
\inferrule * [Right = $SL$] {\\ r_1 \rightsquigarrow r_2}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_2 \cdot r_3\\}+ −
+ −
\inferrule * [Right = $SR$] {\\ r_3 \rightsquigarrow r_4}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_1 \cdot r_4\\}\\+ −
+ −
\inferrule * [Right = $A0$] {\vspace{0em}}{ _{bs}\sum [] \rightsquigarrow \ZERO}+ −
+ −
\inferrule * [Right = $A1$] {\vspace{0em}}{ _{bs}\sum [a] \rightsquigarrow \fuse \; bs \; a}+ −
+ −
\inferrule * [Right = $AL$] {\\ rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{_{bs}\sum rs_1 \rightsquigarrow rs_2}+ −
+ −
\inferrule * [Right = $LE$] {\vspace{0em}}{ [] \stackrel{s}{\rightsquigarrow} []}+ −
+ −
\inferrule * [Right = $LT$] {rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{ r :: rs_1 \stackrel{s}{\rightsquigarrow} r :: rs_2 }+ −
+ −
\inferrule * [Right = $LH$] {r_1 \rightsquigarrow r_2}{ r_1 :: rs \stackrel{s}{\rightsquigarrow} r_2 :: rs}+ −
+ −
\inferrule * [Right = $L\ZERO$] {\vspace{0em}}{\ZERO :: rs \stackrel{s}{\rightsquigarrow} rs}+ −
+ −
\inferrule * [Right = $LS$] {\vspace{0em}}{_{bs} \sum (rs_1 :: rs_b) \stackrel{s}{\rightsquigarrow} ((\map \; (\fuse \; bs_1) \; rs_1) @ rsb) }+ −
+ −
\inferrule * [Right = $LD$] {\\ \rerase{a_1} = \rerase{a_2}}{rs_a @ [a_1] @ rs_b @ [a_2] @ rs_c \stackrel{s}{\rightsquigarrow} rs_a @ [a_1] @ rs_b @ rs_c}+ −
+ −
\end{mathpar}+ −
\caption{+ −
The rewrite rules that generate simplified regular expressions + −
in small steps: $r_1 \rightsquigarrow r_2$ is for bitcoded regular expressions + −
and $rs_1 \stackrel{s}{\rightsquigarrow} rs_2$ for + −
lists of bitcoded regular expressions. + −
Interesting is the LD rule that allows copies of regular expressions + −
to be removed provided a regular expression + −
earlier in the list can match the same strings.+ −
}\label{rrewriteRules}+ −
\end{figure}+ −
\noindent+ −
The rules $LT$ and $LH$ are for rewriting two regular expression lists+ −
such that one regular expression+ −
in the left-hand-side list is rewritable in one step+ −
to the right-hand side's regular expression at the same position.+ −
This helps with defining the ``context rule'' $AL$.+ −
+ −
The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$+ −
are defined in the usual way:+ −
\begin{figure}[H]+ −
\centering+ −
\begin{mathpar}+ −
\inferrule{\vspace{0em}}{ r \rightsquigarrow^* r \\}+ −
+ −
\inferrule{\vspace{0em}}{rs \stackrel{s*}{\rightsquigarrow} rs \\}+ −
+ −
\inferrule{r_1 \rightsquigarrow^* r_2 \land \; r_2 \rightsquigarrow^* r_3}{r_1 \rightsquigarrow^* r_3\\}+ −
+ −
\inferrule{rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \land \; rs_2 \stackrel{s*}{\rightsquigarrow} rs_3}{rs_1 \stackrel{s*}{\rightsquigarrow} rs_3}+ −
\end{mathpar}+ −
\caption{The Reflexive Transitive Closure of + −
$\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$}\label{transClosure}+ −
\end{figure}+ −
%Two rewritable terms will remain rewritable to each other+ −
%even after a derivative is taken:+ −
The main point of our rewriting relation+ −
is that it is preserved under derivatives,+ −
namely+ −
\begin{center}+ −
$r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$+ −
\end{center}+ −
And also, if two terms are rewritable to each other,+ −
then they produce the same bitcodes:+ −
\begin{center}+ −
$r \rightsquigarrow^* r' \;\; \textit{then} \; \; \bmkeps \; r = \bmkeps \; r'$+ −
\end{center}+ −
The decoding phase of both $\blexer$ and $\blexersimp$+ −
are the same, which means that if they receive the same+ −
bitcodes before the decoding phase,+ −
they generate the same value after decoding is done.+ −
We will prove the three properties + −
we mentioned above in the next sub-section.+ −
\subsection{Important Properties of $\rightsquigarrow$}+ −
First we prove some basic facts + −
about $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$, + −
$\rightsquigarrow^*$ and $\stackrel{s*}{\rightsquigarrow}$,+ −
which will be needed later.\\+ −
The inference rules (\ref{rrewriteRules}) we + −
gave in the previous section + −
have their ``many-steps version'':+ −
+ −
\begin{lemma}\label{squig1}+ −
\hspace{0em}+ −
\begin{itemize}+ −
\item+ −
$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies _{bs} \sum rs_1 \stackrel{*}{\rightsquigarrow} _{bs} \sum rs_2$+ −
\item+ −
$r \rightsquigarrow^* r' \implies _{bs} \sum (r :: rs)\; \rightsquigarrow^*\; _{bs} \sum (r' :: rs)$+ −
+ −
\item+ −
The rewriting in many steps property is composable + −
in terms of the sequence constructor:\\+ −
$r_1 \rightsquigarrow^* r_2 + −
\implies _{bs} r_1 \cdot r_3 \rightsquigarrow^* \; + −
_{bs} r_2 \cdot r_3 \quad $ + −
and + −
$\quad r_3 \rightsquigarrow^* r_4 + −
\implies _{bs} r_1 \cdot r_3 \rightsquigarrow^* _{bs} \; r_1 \cdot r_4$+ −
\item+ −
The rewriting in many steps properties + −
$\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ + −
is preserved under the function $\fuse$:\\+ −
$r_1 \rightsquigarrow^* r_2 + −
\implies \fuse \; bs \; r_1 \rightsquigarrow^* \; + −
\fuse \; bs \; r_2 \quad $ and + −
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 + −
\implies \map \; (\fuse \; bs) \; rs_1 + −
\stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs) \; rs_2$+ −
\end{itemize}+ −
\end{lemma}+ −
\begin{proof}+ −
By an induction on + −
the inductive cases of $\stackrel{s*}{\rightsquigarrow}$ and $\rightsquigarrow^*$ respectively.+ −
The third and fourth points are + −
by the properties $r_1 \rightsquigarrow r_2 \implies \fuse \; bs \; r_1 \implies \fuse \; bs \; r_2$ and+ −
$rs_2 \stackrel{s}{\rightsquigarrow} rs_3 + −
\implies \map \; (\fuse \; bs) rs_2 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs)\; rs_3$,+ −
which can be inductively proven by the inductive cases of $\rightsquigarrow$ and + −
$\stackrel{s}{\rightsquigarrow}$.+ −
\end{proof}+ −
\noindent+ −
The inference rules of $\stackrel{s}{\rightsquigarrow}$+ −
are defined in terms of the list cons operation, where+ −
we establish that the + −
$\stackrel{s}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ + −
relation is also preserved w.r.t appending and prepending of a list.+ −
In addition, we+ −
also prove some relations + −
between $\rightsquigarrow^*$ and $\stackrel{s*}{\rightsquigarrow}$.+ −
\begin{lemma}\label{ssgqTossgs}+ −
\hspace{0em}+ −
\begin{itemize}+ −
\item+ −
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s}{\rightsquigarrow} rs @ rs_2$+ −
+ −
\item+ −
$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies + −
rs @ rs_1 \stackrel{s*}{\rightsquigarrow} rs @ rs_2 \; \;+ −
\textit{and} \; \;+ −
rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$+ −
+ −
\item+ −
The $\stackrel{s}{\rightsquigarrow} $ relation after appending + −
a list becomes $\stackrel{s*}{\rightsquigarrow}$:\\+ −
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 + −
\implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$+ −
\item+ −
+ −
$r_1 \rightsquigarrow^* r_2 \implies [r_1] \stackrel{s*}{\rightsquigarrow} [r_2]$+ −
\item+ −
+ −
$rs_3 \stackrel{s*}{\rightsquigarrow} rs_4 \land r_1 \rightsquigarrow^* r_2 \implies+ −
r_2 :: rs_3 \stackrel{s*}{\rightsquigarrow} r_2 :: rs_4$+ −
\item + −
If we can rewrite a regular expression + −
in many steps to $\ZERO$, then + −
we can also rewrite any sequence containing it to $\ZERO$:\\+ −
$r_1 \rightsquigarrow^* \ZERO + −
\implies _{bs}r_1\cdot r_2 \rightsquigarrow^* \ZERO$+ −
\end{itemize}+ −
\end{lemma}+ −
\begin{proof}+ −
The first part is by induction on the list $rs$.+ −
The second part is by induction on the inductive cases of $\stackrel{s*}{\rightsquigarrow}$.+ −
The third part is + −
by rule induction of $\stackrel{s}{\rightsquigarrow}$.+ −
The fourth sub-lemma is + −
by rule induction of + −
$\stackrel{s*}{\rightsquigarrow}$ and using part one to three. + −
The fifth part is a corollary of part four.+ −
The last part is proven by rule induction again on $\rightsquigarrow^*$.+ −
\end{proof}+ −
\noindent+ −
Now we are ready to give the proofs of the following properties:+ −
\begin{itemize}+ −
\item+ −
$r \rightsquigarrow^* r'\land \bnullable \; r_1 + −
\implies \bmkeps \; r = \bmkeps \; r'$. \\+ −
\item+ −
$r \rightsquigarrow^* \textit{bsimp} \;r$.\\+ −
\item+ −
$r \rightsquigarrow r' \implies r \backslash c \rightsquigarrow^* r'\backslash c$.\\+ −
\end{itemize}+ −
+ −
\subsubsection{Property 1: $r \rightsquigarrow^* r'\land \bnullable \; r_1 + −
\implies \bmkeps \; r = \bmkeps \; r'$}+ −
Intuitively, this property says we can + −
extract the same bitcodes using $\bmkeps$ from the nullable+ −
components of two regular expressions $r$ and $r'$,+ −
if we can rewrite from one to the other in finitely+ −
many steps.+ −
+ −
For convenience, + −
we define a predicate for a list of regular expressions+ −
having at least one nullable regular expression:+ −
\begin{center}+ −
$\textit{bnullables} \; rs \quad \dn \quad \exists r \in rs. \;\; \bnullable \; r$+ −
\end{center}+ −
\noindent+ −
The rewriting relation $\rightsquigarrow$ preserves (b)nullability:+ −
\begin{lemma}\label{rewritesBnullable}+ −
\hspace{0em}+ −
\begin{itemize}+ −
\item+ −
$\text{If} \; r_1 \rightsquigarrow r_2, \; + −
\text{then} \; \bnullable \; r_1 = \bnullable \; r_2$+ −
\item + −
$\text{If} \; rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \;+ −
\text{then} \; \textit{bnullables} \; rs_1 = \textit{bnullables} \; rs_2$+ −
\item+ −
$r_1 \rightsquigarrow^* r_2 + −
\implies \bnullable \; r_1 = \bnullable \; r_2$+ −
\end{itemize}+ −
\end{lemma}+ −
\begin{proof}+ −
By rule induction of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$.+ −
The third point is a result of the second.+ −
\end{proof}+ −
\noindent+ −
For convenience again,+ −
we define $\bmkepss$ on a list $rs$,+ −
which extracts the bit-codes on the first $\bnullable$ element in $rs$:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\bmkepss \; [] $ & $\dn$ & $[]$\\+ −
$\bmkepss \; r :: rs$ & $\dn$ & $\textit{if} \;(\bnullable \; r) \;\; + −
\textit{then} \;\; \bmkeps \; r \; \textit{else} \;\; \bmkepss \; rs$+ −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
If both regular expressions in a rewriting relation are nullable, then they + −
produce the same bitcodes:+ −
\begin{lemma}\label{rewriteBmkepsAux}+ −
\hspace{0em}+ −
\begin{itemize}+ −
\item+ −
$r_1 \rightsquigarrow r_2 \implies + −
(\bnullable \; r_1 \land \bnullable \; r_2 \implies \bmkeps \; r_1 = + −
\bmkeps \; r_2)$ + −
\item+ −
and+ −
$rs_ 1 \stackrel{s}{\rightsquigarrow} rs_2 + −
\implies (\bnullables \; rs_1 \land \bnullables \; rs_2 \implies + −
\bmkepss \; rs_1 = \bmkepss \; rs2)$+ −
\end{itemize}+ −
\end{lemma}+ −
\begin{proof}+ −
By rule induction over the cases that lead to $r_1 \rightsquigarrow r_2$.+ −
\end{proof}+ −
\noindent+ −
With lemma \ref{rewriteBmkepsAux} in place we are ready to prove its+ −
many-step version: + −
\begin{lemma}+ −
$\text{If} \;\; r \stackrel{*}{\rightsquigarrow} r' \;\; \text{and} \;\; \bnullable \; r, \;\;\; \text{then} \;\; \bmkeps \; r = \bmkeps \; r'$+ −
\end{lemma}+ −
\begin{proof}+ −
By rule induction of $\stackrel{*}{\rightsquigarrow} $. Lemma + −
$\ref{rewritesBnullable}$ gives us both $r$ and $r'$ are nullable.+ −
The lemma \ref{rewriteBmkepsAux} solves the inductive case.+ −
\end{proof}+ −
+ −
\subsubsection{Property 2: $r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$}+ −
Now we get to the key part of the proof, + −
which says that our simplification's helper functions + −
such as $\distinctBy$ and $\flts$ describe+ −
reducts of $\stackrel{s*}{\rightsquigarrow}$ and + −
$\rightsquigarrow^* $.+ −
+ −
The first lemma to prove is a more general version of + −
$rs_ 1 \rightsquigarrow^* \distinctBy \; rs_1 \; \phi$:+ −
\begin{lemma}+ −
$rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} + −
(rs_1 @ (\distinctBy \; rs_2 \; \; \rerases \;\; (\map\;\; \rerases \; \; rs_1)))$+ −
\end{lemma}+ −
\noindent+ −
It says that for a list made of two parts $rs_1 @ rs_2$, + −
one can throw away the duplicate+ −
elements in $rs_2$, as well as those that have appeared in $rs_1$.+ −
\begin{proof}+ −
By induction on $rs_2$, where $rs_1$ is allowed to be arbitrary.+ −
\end{proof}+ −
\noindent+ −
Setting $rs_2$ to be empty,+ −
we get the corollary+ −
\begin{corollary}\label{dBPreserves}+ −
$rs_1 \stackrel{s*}{\rightsquigarrow} \distinctBy \; rs_1 \; \phi$.+ −
\end{corollary}+ −
\noindent+ −
Similarly the flatten function $\flts$ describes a reduct of+ −
$\stackrel{s*}{\rightsquigarrow}$ as well:+ −
+ −
\begin{lemma}\label{fltsPreserves}+ −
$rs \stackrel{s*}{\rightsquigarrow} \flts \; rs$+ −
\end{lemma}+ −
\begin{proof}+ −
By an induction on $rs$.+ −
\end{proof}+ −
\noindent+ −
The function $\bsimpalts$ preserves rewritability:+ −
\begin{lemma}\label{bsimpaltsPreserves}+ −
$_{bs} \sum rs \stackrel{*}{\rightsquigarrow} \bsimpalts \; _{bs} \; rs$+ −
\end{lemma}+ −
\noindent+ −
The simplification function+ −
$\textit{bsimp}$ only transforms the regular expression using steps specified by + −
$\rightsquigarrow^*$ and nothing else:+ −
\begin{lemma}+ −
$r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$+ −
\end{lemma}+ −
\begin{proof}+ −
By an induction on $r$.+ −
The most involved case is the alternative, + −
where we use lemmas \ref{bsimpaltsPreserves},+ −
\ref{fltsPreserves} and \ref{dBPreserves} to do a series of rewriting:\\+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$rs$ & $\stackrel{s*}{\rightsquigarrow}$ & $ \map \; \textit{bsimp} \; rs$\\+ −
& $\stackrel{s*}{\rightsquigarrow}$ & $ \flts \; (\map \; \textit{bsimp} \; rs)$\\+ −
& $\stackrel{s*}{\rightsquigarrow}$ & $ \distinctBy \; + −
(\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerases \; \phi$\\+ −
\end{tabular}+ −
\end{center}+ −
Using this we can derive the following rewrite sequence:\\+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$r$ & $=$ & $_{bs}\sum rs$\\[1.5ex]+ −
& $\rightsquigarrow^*$ & $\bsimpalts \; bs \; rs$ \\[1.5ex]+ −
& $\rightsquigarrow^*$ & $\ldots$ \\ [1.5ex]+ −
& $\rightsquigarrow^*$ & $\bsimpalts \; bs \; + −
(\distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) + −
\; \rerases \; \phi)$\\[1.5ex]+ −
%& $\rightsquigarrow^*$ & $ _{bs} \sum (\distinctBy \; + −
%(\flts \; (\map \; \textit{bsimp}\; rs)) \; \;+ −
%\rerases \; \;\phi) $\\[1.5ex]+ −
& $\rightsquigarrow^*$ & $\textit{bsimp} \; r$\\[1.5ex]+ −
\end{tabular}+ −
\end{center} + −
\end{proof}+ −
\subsubsection{Property 3: $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$}+ −
The rewrite relation + −
$\rightsquigarrow$ changes into $\stackrel{*}{\rightsquigarrow}$+ −
after derivatives are taken on both sides:+ −
\begin{lemma}\label{rewriteBder}+ −
\hspace{0em}+ −
\begin{itemize}+ −
\item+ −
If $r_1 \rightsquigarrow r_2$, then $r_1 \backslash c + −
\rightsquigarrow^* r_2 \backslash c$ + −
\item + −
If $rs_1 \stackrel{s}{\rightsquigarrow} rs_2$, then $ + −
\map \; (\_\backslash c) \; rs_1 + −
\stackrel{s*}{\rightsquigarrow} \map \; (\_ \backslash c) \; rs_2$+ −
\end{itemize}+ −
\end{lemma}+ −
\begin{proof}+ −
By induction on $\rightsquigarrow$ + −
and $\stackrel{s}{\rightsquigarrow}$, using a number of the previous lemmas.+ −
\end{proof}+ −
\noindent+ −
Now we can prove property 3 as an immediate corollary:+ −
\begin{corollary}\label{rewritesBder}+ −
$r_1 \rightsquigarrow^* r_2 \implies r_1 \backslash c \rightsquigarrow^* + −
r_2 \backslash c$+ −
\end{corollary}+ −
\begin{proof}+ −
By rule induction of $\stackrel{*}{\rightsquigarrow} $ and lemma \ref{rewriteBder}.+ −
\end{proof}+ −
\noindent+ −
This can be extended and combined with $r \rightsquigarrow^* \textit{bsimp} \; r$+ −
to obtain the correspondence between+ −
$\blexer$ and $\blexersimp$'s intermediate+ −
derivative regular expressions + −
\begin{lemma}\label{bderBderssimp}+ −
$a \backslash s \rightsquigarrow^* \bderssimp{a}{s} $+ −
\end{lemma}+ −
\begin{proof}+ −
By an induction on $s$.+ −
\end{proof}+ −
\subsection{Main Theorem}+ −
Now with \ref{bderBderssimp} in place we are ready for the main theorem.+ −
\begin{theorem}+ −
$\blexer \; r \; s = \blexersimp{r}{s}$+ −
\end{theorem}+ −
\noindent+ −
\begin{proof}+ −
We can rewrite in many steps from the original lexer's + −
derivative regular expressions to the + −
lexer with simplification applied (by lemma \ref{bderBderssimp}):+ −
\begin{center}+ −
$a \backslash s \rightsquigarrow^* \bderssimp{a}{s} $.+ −
\end{center}+ −
We know that they generate the same bits, if the lexing result is a match:+ −
\begin{center}+ −
$\bnullable \; (a \backslash s) + −
\implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$+ −
\end{center}+ −
Now that they generate the same bits, we know they also give the same value after decoding.+ −
\begin{center}+ −
$\bnullable \; (a \backslash s) + −
\implies \decode \; r \; (\bmkeps \; (a \backslash s)) = + −
\decode \; r \; (\bmkeps \; (\bderssimp{a}{s}))$+ −
\end{center}+ −
Which is required by our proof goal:+ −
\begin{center}+ −
$\blexer \; r \; s = \blexersimp \; r \; s$.+ −
\end{center} + −
\end{proof}+ −
\noindent+ −
As a corollary,+ −
we can link this result with the lemma we proved earlier that + −
\begin{center}+ −
$(r, s) \rightarrow v \;\; \textit{iff}\;\; \blexer \; r \; s = \Some \;v$\\+ −
$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\;+ −
r\;s = \None$.+ −
\end{center}+ −
and obtain the property that the bit-coded lexer with simplification is+ −
indeed correctly generating a POSIX lexing result, if such a result exists.+ −
\begin{corollary}+ −
$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s = \Some \; v$\\+ −
$\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\;+ −
r\;s = \None$.+ −
\end{corollary}+ −
+ −
\subsection{Comments on the Proof Techniques Used}+ −
Straightforward as the proof may seem,+ −
the efforts we spent obtaining it were far from trivial.+ −
We initially attempted to re-use the argument + −
in \cref{flex_retrieve}. + −
The problem is that both functions $\inj$ and $\retrieve$ require + −
that the annotated regular expressions stay unsimplified, + −
so that one can + −
correctly compare $v_{i+1}$ and $r_i$ and $v_i$ + −
in diagram \ref{graph:inj}.+ −
+ −
We also tried to prove + −
\begin{center}+ −
$\textit{bsimp} \;\; (\bderssimp{a}{s}) = + −
\textit{bsimp} \;\; (a\backslash s)$,+ −
\end{center}+ −
but this turns out to be not true.+ −
A counterexample is+ −
\[ a = [(_{Z}1+_{S}c)\cdot [bb \cdot (_{Z}1+_{S}c)]] \;\; + −
\text{and} \;\; s = bb.+ −
\]+ −
\noindent+ −
Then we would have + −
\begin{center}+ −
$\textit{bsimp}\;\; ( a \backslash s )$ =+ −
$_{[]}(_{ZZ}\ONE + _{ZS}c ) $+ −
\end{center}+ −
\noindent+ −
whereas + −
\begin{center}+ −
$\textit{bsimp} \;\;( \bderssimp{a}{s} )$ = + −
$_{Z}(_{Z} \ONE + _{S} c)$.+ −
\end{center}+ −
Unfortunately, + −
if we apply $\textit{bsimp}$ differently+ −
we will always have this discrepancy. + −
This is due to + −
the $\map \; (\fuse\; bs) \; as$ operation + −
happening at different locations in the regular expression.+ −
+ −
The rewriting relation + −
$\rightsquigarrow^*$ + −
allows us to ignore this discrepancy+ −
and view the expressions + −
\begin{center}+ −
$_{[]}(_{ZZ}\ONE + _{ZS}c ) $\\+ −
and\\+ −
$_{Z}(_{Z} \ONE + _{S} c)$+ −
+ −
\end{center}+ −
as equal because they were both re-written+ −
from the same expression.+ −
+ −
The simplification rewriting rules+ −
given in \ref{rrewriteRules} are by no means+ −
final,+ −
one could come up with new rules+ −
such as + −
$\SEQ r_1 \cdot (\SEQ r_1 \cdot r_3) \rightarrow+ −
\SEQs [r_1, r_2, r_3]$.+ −
However this does not fit with the proof technique+ −
of our main theorem, but seem to not violate the POSIX+ −
property.+ −
+ −
Having established the correctness of our+ −
$\blexersimp$, in the next chapter we shall prove that with our $\simp$ function,+ −
for a given $r$, the derivative size is always+ −
finitely bounded by a constant.+ −