| 532 |      1 | % Chapter Template
 | 
|  |      2 | 
 | 
|  |      3 | % Main chapter title
 | 
|  |      4 | \chapter{Correctness of Bit-coded Algorithm with Simplification}
 | 
|  |      5 | 
 | 
|  |      6 | \label{Bitcoded2} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
 | 
|  |      7 | %Then we illustrate how the algorithm without bitcodes falls short for such aggressive 
 | 
|  |      8 | %simplifications and therefore introduce our version of the bitcoded algorithm and 
 | 
|  |      9 | %its correctness proof in 
 | 
|  |     10 | %Chapter 3\ref{Chapter3}. 
 | 
|  |     11 | 
 | 
|  |     12 | 
 | 
|  |     13 | 
 | 
| 538 |     14 | Now we introduce the simplifications, which is why we introduce the 
 | 
|  |     15 | bitcodes in the first place.
 | 
|  |     16 | 
 | 
|  |     17 | \subsection*{Simplification Rules}
 | 
|  |     18 | 
 | 
|  |     19 | This section introduces aggressive (in terms of size) simplification rules
 | 
|  |     20 | on annotated regular expressions
 | 
|  |     21 | to keep derivatives small. Such simplifications are promising
 | 
|  |     22 | as we have
 | 
|  |     23 | generated test data that show
 | 
|  |     24 | that a good tight bound can be achieved. We could only
 | 
|  |     25 | partially cover the search space as there are infinitely many regular
 | 
|  |     26 | expressions and strings. 
 | 
|  |     27 | 
 | 
|  |     28 | One modification we introduced is to allow a list of annotated regular
 | 
|  |     29 | expressions in the $\sum$ constructor. This allows us to not just
 | 
|  |     30 | delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
 | 
|  |     31 | also unnecessary ``copies'' of regular expressions (very similar to
 | 
|  |     32 | simplifying $r + r$ to just $r$, but in a more general setting). Another
 | 
|  |     33 | modification is that we use simplification rules inspired by Antimirov's
 | 
|  |     34 | work on partial derivatives. They maintain the idea that only the first
 | 
|  |     35 | ``copy'' of a regular expression in an alternative contributes to the
 | 
|  |     36 | calculation of a POSIX value. All subsequent copies can be pruned away from
 | 
|  |     37 | the regular expression. A recursive definition of our  simplification function 
 | 
|  |     38 | that looks somewhat similar to our Scala code is given below:
 | 
|  |     39 | %\comment{Use $\ZERO$, $\ONE$ and so on. 
 | 
|  |     40 | %Is it $ALTS$ or $ALTS$?}\\
 | 
|  |     41 | 
 | 
|  |     42 | \begin{center}
 | 
|  |     43 |   \begin{tabular}{@{}lcl@{}}
 | 
|  |     44 |    
 | 
|  |     45 |   $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
 | 
|  |     46 |    &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
 | 
|  |     47 |    &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
 | 
|  |     48 |    &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
 | 
|  |     49 |    &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
 | 
|  |     50 |    &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
 | 
|  |     51 | 
 | 
|  |     52 |   $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; simp \; as)) \; \textit{match} $ \\
 | 
|  |     53 |   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
 | 
|  |     54 |    &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
 | 
|  |     55 |    &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
 | 
|  |     56 | 
 | 
|  |     57 |    $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
 | 
|  |     58 | \end{tabular}    
 | 
|  |     59 | \end{center}    
 | 
|  |     60 | 
 | 
|  |     61 | \noindent
 | 
|  |     62 | The simplification does a pattern matching on the regular expression.
 | 
|  |     63 | When it detected that the regular expression is an alternative or
 | 
|  |     64 | sequence, it will try to simplify its child regular expressions
 | 
|  |     65 | recursively and then see if one of the children turns into $\ZERO$ or
 | 
|  |     66 | $\ONE$, which might trigger further simplification at the current level.
 | 
|  |     67 | The most involved part is the $\sum$ clause, where we use two
 | 
|  |     68 | auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
 | 
|  |     69 | alternatives and reduce as many duplicates as possible. Function
 | 
|  |     70 | $\textit{distinct}$  keeps the first occurring copy only and removes all later ones
 | 
|  |     71 | when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
 | 
|  |     72 | Its recursive definition is given below:
 | 
|  |     73 | 
 | 
|  |     74 |  \begin{center}
 | 
|  |     75 |   \begin{tabular}{@{}lcl@{}}
 | 
|  |     76 |   $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
 | 
|  |     77 |      (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
 | 
|  |     78 |   $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
 | 
|  |     79 |     $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
 | 
|  |     80 | \end{tabular}    
 | 
|  |     81 | \end{center}  
 | 
|  |     82 | 
 | 
|  |     83 | \noindent
 | 
|  |     84 | Here $\textit{flatten}$ behaves like the traditional functional programming flatten
 | 
|  |     85 | function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
 | 
|  |     86 | removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
 | 
|  |     87 | 
 | 
|  |     88 | Having defined the $\simp$ function,
 | 
|  |     89 | we can use the previous notation of  natural
 | 
|  |     90 | extension from derivative w.r.t.~character to derivative
 | 
|  |     91 | w.r.t.~string:%\comment{simp in  the [] case?}
 | 
|  |     92 | 
 | 
|  |     93 | \begin{center}
 | 
|  |     94 | \begin{tabular}{lcl}
 | 
|  |     95 | $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
 | 
|  |     96 | $r \backslash_{simp} [\,] $ & $\dn$ & $r$
 | 
|  |     97 | \end{tabular}
 | 
|  |     98 | \end{center}
 | 
|  |     99 | 
 | 
|  |    100 | \noindent
 | 
|  |    101 | to obtain an optimised version of the algorithm:
 | 
|  |    102 | 
 | 
|  |    103 |  \begin{center}
 | 
|  |    104 | \begin{tabular}{lcl}
 | 
|  |    105 |   $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
 | 
|  |    106 |       $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
 | 
|  |    107 |   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
 | 
|  |    108 |   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
 | 
|  |    109 |   & & $\;\;\textit{else}\;\textit{None}$
 | 
|  |    110 | \end{tabular}
 | 
|  |    111 | \end{center}
 | 
|  |    112 | 
 | 
|  |    113 | \noindent
 | 
|  |    114 | This algorithm keeps the regular expression size small, for example,
 | 
|  |    115 | with this simplification our previous $(a + aa)^*$ example's 8000 nodes
 | 
|  |    116 | will be reduced to just 17 and stays constant, no matter how long the
 | 
|  |    117 | input string is.
 | 
|  |    118 | 
 | 
|  |    119 | 
 | 
|  |    120 | 
 | 
| 539 |    121 | \section{ $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification}
 | 
|  |    122 | We 
 | 
|  |    123 | Unlike in \ref{fig:BetterWaterloo}, the size with simplification now stay nicely constant with current
 | 
|  |    124 | simplification rules (we put the graphs together to show the contrast)
 | 
|  |    125 | \begin{tabular}{ll}
 | 
|  |    126 | \begin{tikzpicture}
 | 
|  |    127 | \begin{axis}[
 | 
|  |    128 |     xlabel={$n$},
 | 
|  |    129 |     ylabel={derivative size},
 | 
|  |    130 |         width=7cm,
 | 
|  |    131 |     height=4cm, 
 | 
|  |    132 |     legend entries={Lexer with $\bsimp$},  
 | 
|  |    133 |     legend pos=  south east,
 | 
|  |    134 |     legend cell align=left]
 | 
|  |    135 | \addplot[red,mark=*, mark options={fill=white}] table {BitcodedLexer.data};
 | 
|  |    136 | \end{axis}
 | 
|  |    137 | \end{tikzpicture} %\label{fig:BitcodedLexer}
 | 
|  |    138 | &
 | 
|  |    139 | \begin{tikzpicture}
 | 
|  |    140 | \begin{axis}[
 | 
|  |    141 |     xlabel={$n$},
 | 
|  |    142 |     ylabel={derivative size},
 | 
|  |    143 |     width = 7cm,
 | 
|  |    144 |     height = 4cm,
 | 
|  |    145 |     legend entries={Lexer without $\bsimp$},  
 | 
|  |    146 |     legend pos=  north west,
 | 
|  |    147 |     legend cell align=left]
 | 
|  |    148 | \addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
 | 
|  |    149 | \end{axis}
 | 
|  |    150 | \end{tikzpicture} 
 | 
|  |    151 | \end{tabular}
 | 
| 538 |    152 | 
 | 
|  |    153 | 
 | 
|  |    154 | 
 | 
|  |    155 | 
 | 
| 532 |    156 | 
 | 
|  |    157 | %----------------------------------------------------------------------------------------
 | 
|  |    158 | %	SECTION common identities
 | 
|  |    159 | %----------------------------------------------------------------------------------------
 | 
|  |    160 | \section{Common Identities In Simplification-Related Functions} 
 | 
|  |    161 | Need to prove these before starting on the big correctness proof.
 | 
|  |    162 | 
 | 
|  |    163 | %-----------------------------------
 | 
|  |    164 | %	SUBSECTION 
 | 
|  |    165 | %-----------------------------------
 | 
|  |    166 | \subsection{Idempotency of $\simp$}
 | 
|  |    167 | 
 | 
|  |    168 | \begin{equation}
 | 
|  |    169 | 	\simp \;r = \simp\; \simp \; r 
 | 
|  |    170 | \end{equation}
 | 
|  |    171 | This property means we do not have to repeatedly
 | 
|  |    172 | apply simplification in each step, which justifies
 | 
|  |    173 | our definition of $\blexersimp$.
 | 
|  |    174 | It will also be useful in future proofs where properties such as 
 | 
|  |    175 | closed forms are needed.
 | 
|  |    176 | The proof is by structural induction on $r$.
 | 
|  |    177 | 
 | 
|  |    178 | %-----------------------------------
 | 
|  |    179 | %	SUBSECTION 
 | 
|  |    180 | %-----------------------------------
 | 
|  |    181 | \subsection{Syntactic Equivalence Under $\simp$}
 | 
|  |    182 | We prove that minor differences can be annhilated
 | 
|  |    183 | by $\simp$.
 | 
|  |    184 | For example,
 | 
|  |    185 | \begin{center}
 | 
|  |    186 | $\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
 | 
|  |    187 |  \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
 | 
|  |    188 | \end{center}
 | 
|  |    189 | 
 | 
|  |    190 | 
 | 
|  |    191 | 
 | 
|  |    192 | 
 | 
|  |    193 | 
 | 
|  |    194 | 
 | 
|  |    195 | 
 | 
|  |    196 | %----------------------------------------------------------------------------------------
 | 
|  |    197 | %	SECTION corretness proof
 | 
|  |    198 | %----------------------------------------------------------------------------------------
 | 
|  |    199 | \section{Proof Technique of Correctness of Bit-coded Algorithm with Simplification}
 | 
|  |    200 | The non-trivial part of proving the correctness of the algorithm with simplification
 | 
|  |    201 | compared with not having simplification is that we can no longer use the argument 
 | 
|  |    202 | in \cref{flex_retrieve}.
 | 
|  |    203 | The function \retrieve needs the structure of the annotated regular expression to 
 | 
|  |    204 | agree with the structure of the value, but simplification will always mess with the 
 | 
|  |    205 | structure:
 | 
|  |    206 | %TODO: after simp does not agree with each other: (a + 0) --> a v.s. Left(Char(a)) |