ChengsongTanPhdThesis/Chapters/Bitcoded1.tex
changeset 538 8016a2480704
parent 537 50e590823220
child 542 a7344c9afbaf
--- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Mon Jun 06 23:17:45 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Thu Jun 09 12:57:53 2022 +0100
@@ -1,17 +1,17 @@
 % Chapter Template
 
 % Main chapter title
-\chapter{Correctness of Bit-coded Algorithm without Simplification}
+\chapter{Bit-coded Algorithm of Sulzmann and Lu}
 
 \label{Bitcoded1} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
 %Then we illustrate how the algorithm without bitcodes falls short for such aggressive 
 %simplifications and therefore introduce our version of the bitcoded algorithm and 
 %its correctness proof in 
 %Chapter 3\ref{Chapter3}. 
-
+In this chapter, we are going to introduce the bit-coded algorithm
+introduced by Sulzmann and Lu to address the problem of 
+under-simplified regular expressions. 
 \section{Bit-coded Algorithm}
-
-
 The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure},
 stores information of previous lexing steps
 on a stack, in the form of regular expressions
@@ -135,8 +135,9 @@
 \end{envForCaption}
 
 \noindent
-The $S$ and $Z$ are not in bold in order to avoid 
-confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
+Using $S$ and $Z$ rather than $1$ and $0$ is to avoid
+confusion with the regular expressions $\ZERO$ and $\ONE$.
+Bitcodes (or
 bit-lists) can be used to encode values (or potentially incomplete values) in a
 compact form. This can be straightforwardly seen in the following
 coding function from values to bitcodes: 
@@ -203,15 +204,31 @@
   & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
        \textit{else}\;\textit{None}$                       
 \end{tabular}    
-\end{center}
-\caption{Bit-decoding of values}    
+\end{center} 
 \end{envForCaption}
 %\end{definition}
 
-Sulzmann and Lu's integrated the bitcodes into regular expressions to
-create annotated regular expressions \cite{Sulzmann2014}.
-\emph{Annotated regular expressions} are defined by the following
-grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
+\noindent
+$\decode'$ does most of the job while $\decode$ throws
+away leftover bit-codes and returns the value only.
+$\decode$ is terminating as $\decode'$ is terminating.
+We have the property that $\decode$ and $\code$ are
+reverse operations of one another:
+\begin{lemma}
+\[\vdash v : r \implies \decode \; (\code \; v) \; r = \textit{Some}(v) \]
+\end{lemma}
+\begin{proof}
+By proving a more general version of the lemma, on $\decode'$:
+\[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \]
+Then setting $ds$ to be $[]$ and unfolding $\decode$ definition
+we get the lemma.
+\end{proof}
+With the $\code$ and $\decode$ functions in hand, we know how to 
+switch between bit-codes and value--the two different representations of 
+lexing information. 
+The next step is to integrate this information into the working regular expression.
+Attaching bits to the front of regular expressions is the solution Sulzamann and Lu
+gave for storing partial values on the fly:
 
 \begin{center}
 \begin{tabular}{lcl}
@@ -226,13 +243,130 @@
 %(in \textit{ALTS})
 
 \noindent
-where $bs$ stands for bitcodes, $a$  for $\mathbf{a}$nnotated regular
+We call these regular expressions carrying bit-codes \emph{Annotated regular expressions}.
+$bs$ stands for bit-codes, $a$  for $\mathbf{a}$nnotated regular
 expressions and $as$ for a list of annotated regular expressions.
-The alternative constructor($\sum$) has been generalized to 
+The alternative constructor ($\sum$) has been generalised to 
 accept a list of annotated regular expressions rather than just 2.
-We will show that these bitcodes encode information about
-the (POSIX) value that should be generated by the Sulzmann and Lu
-algorithm.
+%We will show that these bitcodes encode information about
+%the ($\POSIX$) value that should be generated by the Sulzmann and Lu
+%algorithm.
+The most central question is how these partial lexing information
+represented as bit-codes is augmented and carried around 
+during a derivative is taken.
+
+This is done by adding bitcodes to the 
+derivatives, for example when one more star iteratoin is taken (we
+call the operation of derivatives on annotated regular expressions $\bder$
+because it is derivatives on regexes with bitcodes):
+\begin{center}
+  \begin{tabular}{@{}lcl@{}}
+  $\bder \; c\; (_{bs}a^*) $ & $\dn$ &
+      $_{bs}(\textit{fuse}\, [Z] \; \bder \; c \; a)\cdot
+       (_{[]}a^*))$
+\end{tabular}    
+\end{center}    
+
+\noindent
+For most time we use the infix notation $\backslash$ to mean $\bder$ for brevity when
+there is no danger of confusion with derivatives on plain regular expressions, 
+for example, the above can be expressed as
+\begin{center}
+  \begin{tabular}{@{}lcl@{}}
+  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
+      $_{bs}(\textit{fuse}\, [Z] \; a\,\backslash c)\cdot
+       (_{[]}a^*))$
+\end{tabular}    
+\end{center}   
+
+
+Using the picture we used earlier to depict this, the transformation when 
+taking a derivative w.r.t a star is like below:
+\centering
+\begin{tabular}{@{}l@{\hspace{1mm}}l@{\hspace{0mm}}c@{}}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+        {$bs$
+         \nodepart{two} $a^*$ };
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture} 
+&
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+        {$v_{\text{previous iterations}}$
+         \nodepart{two} $a^*$};
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture}
+\\
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+        { $bs$ + [Z]
+         \nodepart{two}  $(a\backslash c )\cdot a^*$ };
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture}
+&
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+        {$v_{\text{previous iterations}}$ + 1 more iteration
+         \nodepart{two} $(a\backslash c )\cdot a^*$ };
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture}
+\end{tabular}    
+\noindent
+The operation $\fuse$ is just to attach bit-codes 
+to the front of an annotated regular expression:
+\begin{center}
+\begin{tabular}{lcl}
+  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
+  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
+     $_{bs @ bs'}\ONE$\\
+  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
+     $_{bs@bs'}{\bf c}$\\
+  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
+     $_{bs@bs'}\sum\textit{as}$\\
+  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
+     $_{bs@bs'}a_1 \cdot a_2$\\
+  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
+     $_{bs @ bs'}a^*$
+\end{tabular}    
+\end{center} 
+
+\noindent
+Another place in the $\bder$ function where it differs
+from normal derivatives on un-annotated regular expressions
+is the sequence case:
+\begin{center}
+  \begin{tabular}{@{}lcl@{}}
+
+  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
+     $\textit{if}\;\textit{bnullable}\,a_1$\\
+					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
+					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
+  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$
+\end{tabular}    
+\end{center}    
+Here 
+
+
+\begin{center}
+  \begin{tabular}{@{}lcl@{}}
+  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
+  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
+  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
+        $\textit{if}\;c=d\; \;\textit{then}\;
+         _{bs}\ONE\;\textit{else}\;\ZERO$\\  
+  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
+  $_{bs}\sum\;(\textit{map} (\_\backslash c) as )$\\
+  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
+     $\textit{if}\;\textit{bnullable}\,a_1$\\
+					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
+					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
+  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
+  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
+      $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot
+       (_{[]}r^*))$
+\end{tabular}    
+\end{center}    
 
 
 To do lexing using annotated regular expressions, we shall first
@@ -264,21 +398,7 @@
 attach bits to the front of an annotated regular expression. Its
 definition is as follows:
 
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
-  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
-     $_{bs @ bs'}\ONE$\\
-  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
-     $_{bs@bs'}{\bf c}$\\
-  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
-     $_{bs@bs'}\sum\textit{as}$\\
-  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
-     $_{bs@bs'}a_1 \cdot a_2$\\
-  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
-     $_{bs @ bs'}a^*$
-\end{tabular}    
-\end{center}  
+ 
 
 \noindent
 After internalising the regular expression, we perform successive
@@ -288,70 +408,8 @@
 the bitcodes:
 
 
-\iffalse
- %\begin{definition}{bder}
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
-        $\textit{if}\;c=d\; \;\textit{then}\;
-         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
-  $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
-  $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
-  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,a_1$\\
-					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
-					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
-  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
-  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
-      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
-       (\textit{STAR}\,[]\,r)$
-\end{tabular}    
-\end{center}    
-%\end{definition}
 
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
-        $\textit{if}\;c=d\; \;\textit{then}\;
-         _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\  
-  $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
-  $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
-  $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,a_1$\\
-					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
-					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
-  & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
-  $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
-      $_{bs}\textit{SEQ}\;(\textit{fuse}\, [Z] \; r\,\backslash c )\,
-       (_{bs}\textit{STAR}\,[]\,r)$
-\end{tabular}    
-\end{center}    
-%\end{definition}
-\fi
 
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
-  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
-  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
-        $\textit{if}\;c=d\; \;\textit{then}\;
-         _{bs}\ONE\;\textit{else}\;\ZERO$\\  
-  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
-  $_{bs}\sum\;(\textit{map} (\_\backslash c) as )$\\
-  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,a_1$\\
-					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
-					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
-  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
-  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
-      $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot
-       (_{[]}r^*))$
-\end{tabular}    
-\end{center}    
 
 %\end{definition}
 \noindent
@@ -417,119 +475,6 @@
 operation from characters to strings (just like the derivatives for un-annotated
 regular expressions).
 
-Now we introduce the simplifications, which is why we introduce the 
-bitcodes in the first place.
-
-\subsection*{Simplification Rules}
-
-This section introduces aggressive (in terms of size) simplification rules
-on annotated regular expressions
-to keep derivatives small. Such simplifications are promising
-as we have
-generated test data that show
-that a good tight bound can be achieved. We could only
-partially cover the search space as there are infinitely many regular
-expressions and strings. 
-
-One modification we introduced is to allow a list of annotated regular
-expressions in the $\sum$ constructor. This allows us to not just
-delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
-also unnecessary ``copies'' of regular expressions (very similar to
-simplifying $r + r$ to just $r$, but in a more general setting). Another
-modification is that we use simplification rules inspired by Antimirov's
-work on partial derivatives. They maintain the idea that only the first
-``copy'' of a regular expression in an alternative contributes to the
-calculation of a POSIX value. All subsequent copies can be pruned away from
-the regular expression. A recursive definition of our  simplification function 
-that looks somewhat similar to our Scala code is given below:
-%\comment{Use $\ZERO$, $\ONE$ and so on. 
-%Is it $ALTS$ or $ALTS$?}\\
-
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-   
-  $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
-   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
-   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
-   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
-   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
-   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
-
-  $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; simp \; as)) \; \textit{match} $ \\
-  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
-   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
-   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
-
-   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
-\end{tabular}    
-\end{center}    
-
-\noindent
-The simplification does a pattern matching on the regular expression.
-When it detected that the regular expression is an alternative or
-sequence, it will try to simplify its child regular expressions
-recursively and then see if one of the children turns into $\ZERO$ or
-$\ONE$, which might trigger further simplification at the current level.
-The most involved part is the $\sum$ clause, where we use two
-auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
-alternatives and reduce as many duplicates as possible. Function
-$\textit{distinct}$  keeps the first occurring copy only and removes all later ones
-when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
-Its recursive definition is given below:
-
- \begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
-     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
-  $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
-    $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
-\end{tabular}    
-\end{center}  
-
-\noindent
-Here $\textit{flatten}$ behaves like the traditional functional programming flatten
-function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
-removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
-
-Having defined the $\simp$ function,
-we can use the previous notation of  natural
-extension from derivative w.r.t.~character to derivative
-w.r.t.~string:%\comment{simp in  the [] case?}
-
-\begin{center}
-\begin{tabular}{lcl}
-$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
-$r \backslash_{simp} [\,] $ & $\dn$ & $r$
-\end{tabular}
-\end{center}
-
-\noindent
-to obtain an optimised version of the algorithm:
-
- \begin{center}
-\begin{tabular}{lcl}
-  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
-      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
-  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
-  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
-  & & $\;\;\textit{else}\;\textit{None}$
-\end{tabular}
-\end{center}
-
-\noindent
-This algorithm keeps the regular expression size small, for example,
-with this simplification our previous $(a + aa)^*$ example's 8000 nodes
-will be reduced to just 17 and stays constant, no matter how long the
-input string is.
-
-
-
-
-
-
-
-
-
 
 
 %-----------------------------------