lexing: ChengsongTanPhdThesis/Chapters/Bitcoded2.tex@7cf9f17aa179


% Chapter Template

% Main chapter title
\chapter{Correctness of Bit-coded Algorithm with Simplification}

\label{Bitcoded2} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
%Then we illustrate how the algorithm without bitcodes falls short for such aggressive 
%simplifications and therefore introduce our version of the bitcoded algorithm and 
%its correctness proof in 
%Chapter 3\ref{Chapter3}. 



Now we introduce the simplifications, which is why we introduce the 
bitcodes in the first place.

\subsection*{Simplification Rules}

This section introduces aggressive (in terms of size) simplification rules
on annotated regular expressions
to keep derivatives small. Such simplifications are promising
as we have
generated test data that show
that a good tight bound can be achieved. We could only
partially cover the search space as there are infinitely many regular
expressions and strings. 

One modification we introduced is to allow a list of annotated regular
expressions in the $\sum$ constructor. This allows us to not just
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
also unnecessary ``copies'' of regular expressions (very similar to
simplifying $r + r$ to just $r$, but in a more general setting). Another
modification is that we use simplification rules inspired by Antimirov's
work on partial derivatives. They maintain the idea that only the first
``copy'' of a regular expression in an alternative contributes to the
calculation of a POSIX value. All subsequent copies can be pruned away from
the regular expression. A recursive definition of our  simplification function 
that looks somewhat similar to our Scala code is given below:
%\comment{Use $\ZERO$, $\ONE$ and so on. 
%Is it $ALTS$ or $ALTS$?}\\

\begin{center}
  \begin{tabular}{@{}lcl@{}}
   
  $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\

  $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; simp \; as)) \; \textit{match} $ \\
  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 

   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
\end{tabular}    
\end{center}    

\noindent
The simplification does a pattern matching on the regular expression.
When it detected that the regular expression is an alternative or
sequence, it will try to simplify its child regular expressions
recursively and then see if one of the children turns into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\sum$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
alternatives and reduce as many duplicates as possible. Function
$\textit{distinct}$  keeps the first occurring copy only and removes all later ones
when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
Its recursive definition is given below:

 \begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
  $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
    $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
\end{tabular}    
\end{center}  

\noindent
Here $\textit{flatten}$ behaves like the traditional functional programming flatten
function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.

Having defined the $\simp$ function,
we can use the previous notation of  natural
extension from derivative w.r.t.~character to derivative
w.r.t.~string:%\comment{simp in  the [] case?}

\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
to obtain an optimised version of the algorithm:

 \begin{center}
\begin{tabular}{lcl}
  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
This algorithm keeps the regular expression size small, for example,
with this simplification our previous $(a + aa)^*$ example's 8000 nodes
will be reduced to just 17 and stays constant, no matter how long the
input string is.



\section{ $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification}
We 
Unlike in \ref{fig:BetterWaterloo}, the size with simplification now stay nicely constant with current
simplification rules (we put the graphs together to show the contrast)
\begin{tabular}{ll}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={derivative size},
        width=7cm,
    height=4cm, 
    legend entries={Lexer with $\bsimp$},  
    legend pos=  south east,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BitcodedLexer.data};
\end{axis}
\end{tikzpicture} %\label{fig:BitcodedLexer}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={derivative size},
    width = 7cm,
    height = 4cm,
    legend entries={Lexer without $\bsimp$},  
    legend pos=  north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
\end{axis}
\end{tikzpicture} 
\end{tabular}





%----------------------------------------------------------------------------------------
%	SECTION common identities
%----------------------------------------------------------------------------------------
\section{Common Identities In Simplification-Related Functions} 
Need to prove these before starting on the big correctness proof.

%-----------------------------------
%	SUBSECTION 
%-----------------------------------
\subsection{Idempotency of $\simp$}

\begin{equation}
	\simp \;r = \simp\; \simp \; r 
\end{equation}
This property means we do not have to repeatedly
apply simplification in each step, which justifies
our definition of $\blexersimp$.
It will also be useful in future proofs where properties such as 
closed forms are needed.
The proof is by structural induction on $r$.

%-----------------------------------
%	SUBSECTION 
%-----------------------------------
\subsection{Syntactic Equivalence Under $\simp$}
We prove that minor differences can be annhilated
by $\simp$.
For example,
\begin{center}
$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
 \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
\end{center}







%----------------------------------------------------------------------------------------
%	SECTION corretness proof
%----------------------------------------------------------------------------------------
\section{Proof Technique of Correctness of Bit-coded Algorithm with Simplification}
The non-trivial part of proving the correctness of the algorithm with simplification
compared with not having simplification is that we can no longer use the argument 
in \cref{flex_retrieve}.
The function \retrieve needs the structure of the annotated regular expression to 
agree with the structure of the value, but simplification will always mess with the 
structure:
%TODO: after simp does not agree with each other: (a + 0) --> a v.s. Left(Char(a))
author	Chengsong
	Thu, 09 Jun 2022 22:07:44 +0100
changeset 539	7cf9f17aa179
parent 538	8016a2480704
child 543	b2bea5968b89
permissions	-rwxr-xr-x