ChengsongTanPhdThesis/Chapters/Bitcoded1.tex
changeset 537 50e590823220
parent 536 aff7bf93b9c7
child 538 8016a2480704
--- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Mon Jun 06 16:45:42 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Mon Jun 06 23:17:45 2022 +0100
@@ -9,67 +9,191 @@
 %its correctness proof in 
 %Chapter 3\ref{Chapter3}. 
 
-\section*{Bit-coded Algorithm}
-Bits and bitcodes (lists of bits) are defined as:
+\section{Bit-coded Algorithm}
+
+
+The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure},
+stores information of previous lexing steps
+on a stack, in the form of regular expressions
+and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
+\begin{envForCaption}
+\begin{ceqn}
+\begin{equation}%\label{graph:injLexer}
+\begin{tikzcd}
+r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
+v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
+\end{tikzcd}
+\end{equation}
+\end{ceqn}
+\caption{Injection-based Lexing from Chapter\ref{Inj}}\label{InjFigure}
+\end{envForCaption}
+\noindent
+This is both inefficient and prone to stack overflow.
+A natural question arises as to whether we can store lexing
+information on the fly, while still using regular expression 
+derivatives?
+
+In a lexing algorithm's run, split by the current input position,
+we have a sub-string that has been consumed,
+and the sub-string that has yet to come.
+We already know what was before, and this should be reflected in the value 
+and the regular expression at that step as well. But this is not the 
+case for injection-based regular expression derivatives.
+Take the regex $(aa)^* \cdot bc$ matching the string $aabc$
+as an example, if we have just read the two former characters $aa$:
 
 \begin{center}
-		$b ::=   1 \mid  0 \qquad
+\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+        {Consumed: $aa$
+         \nodepart{two} Not Yet Reached: $bc$ };
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture}
+\caption{Partially matched String} 
+\end{envForCaption}
+\end{center}
+%\caption{Input String}\label{StringPartial}
+%\end{figure}
+
+\noindent
+We have the value that has already been partially calculated,
+and the part that has yet to come:
+\begin{center}
+\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+        {$\Seq(\Stars[\Char(a), \Char(a)], ???)$
+         \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture}
+\caption{Partially constructed Value} 
+\end{envForCaption}
+\end{center}
+
+In the regex derivative part , (after simplification)
+all we have is just what is about to come:
+\begin{center}
+\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},]
+        {$???$
+         \nodepart{two} To Come: $b c$};
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture}
+\caption{Derivative} 
+\end{envForCaption}
+\end{center}
+\noindent
+The previous part is missing.
+How about keeping the partially constructed value 
+attached to the front of the regular expression?
+\begin{center}
+\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+        {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
+         \nodepart{two} To Come: $b c$};
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture}
+\caption{Derivative} 
+\end{envForCaption}
+\end{center}
+\noindent
+If we do this kind of "attachment"
+and each time augment the attached partially
+constructed value when taking off a 
+character:
+\begin{center}
+\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+        {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$
+         \nodepart{two} To Come: $c$};
+\end{tikzpicture}\\
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+        {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$
+         \nodepart{two} EOF};
+\end{tikzpicture}
+\caption{After $\backslash b$ and $\backslash c$} 
+\end{envForCaption}
+\end{center}
+\noindent
+In the end we could recover the value without a backward phase.
+But (partial) values are a bit clumsy to stick together with a regular expression, so 
+we instead use bit-codes to encode them.
+
+Bits and bitcodes (lists of bits) are defined as:
+\begin{envForCaption}
+\begin{center}
+		$b ::=   S \mid  Z \qquad
 bs ::= [] \mid b::bs    
 $
 \end{center}
+\caption{Bit-codes datatype}
+\end{envForCaption}
 
 \noindent
-The $1$ and $0$ are not in bold in order to avoid 
+The $S$ and $Z$ are not in bold in order to avoid 
 confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
 bit-lists) can be used to encode values (or potentially incomplete values) in a
 compact form. This can be straightforwardly seen in the following
 coding function from values to bitcodes: 
-
+\begin{envForCaption}
 \begin{center}
 \begin{tabular}{lcl}
   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
   $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
-  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
-  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
+  $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\
+  $\textit{code}(\Right\,v)$ & $\dn$ & $S :: code(v)$\\
   $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
-  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
-  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
+  $\textit{code}(\Stars\,[])$ & $\dn$ & $[Z]$\\
+  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $S :: code(v) \;@\;
                                                  code(\Stars\,vs)$
 \end{tabular}    
 \end{center} 
+\caption{Coding Function for Values}
+\end{envForCaption}
 
 \noindent
-Here $\textit{code}$ encodes a value into a bitcodes by converting
-$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty
-star iteration by $1$. The border where a local star terminates
-is marked by $0$. This coding is lossy, as it throws away the information about
+Here $\textit{code}$ encodes a value into a bit-code by converting
+$\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty
+star iteration by $S$. The border where a local star terminates
+is marked by $Z$. 
+This coding is lossy, as it throws away the information about
 characters, and also does not encode the ``boundary'' between two
 sequence values. Moreover, with only the bitcode we cannot even tell
-whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
+whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$. The
 reason for choosing this compact way of storing information is that the
 relatively small size of bits can be easily manipulated and ``moved
-around'' in a regular expression. In order to recover values, we will 
-need the corresponding regular expression as an extra information. This
-means the decoding function is defined as:
+around'' in a regular expression. 
+
+
+We define the reverse operation of $\code$, which is $\decode$.
+As expected, $\decode$ not only requires the bit-codes,
+but also a regular expression to guide the decoding and 
+fill the gaps of characters:
 
 
 %\begin{definition}[Bitdecoding of Values]\mbox{}
+\begin{envForCaption}
 \begin{center}
 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
   $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
   $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
-  $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
+  $\textit{decode}'\,(Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
        (\Left\,v, bs_1)$\\
-  $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
+  $\textit{decode}'\,(S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
        (\Right\,v, bs_1)$\\                           
   $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
         $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
   & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
   & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
-  $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
-  $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & 
+  $\textit{decode}'\,(Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
+  $\textit{decode}'\,(S\!::\!bs)\,(r^*)$ & $\dn$ & 
          $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
   & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
   & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
@@ -79,7 +203,9 @@
   & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
        \textit{else}\;\textit{None}$                       
 \end{tabular}    
-\end{center}    
+\end{center}
+\caption{Bit-decoding of values}    
+\end{envForCaption}
 %\end{definition}
 
 Sulzmann and Lu's integrated the bitcodes into regular expressions to
@@ -121,8 +247,8 @@
   $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
   $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
   $(r_1 + r_2)^\uparrow$ & $\dn$ &
-  $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
-  \textit{fuse}\,[1]\,r_2^\uparrow]$\\
+  $_{[]}\sum[\textit{fuse}\,[Z]\,r_1^\uparrow,\,
+  \textit{fuse}\,[S]\,r_2^\uparrow]$\\
   $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
          $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
   $(r^*)^\uparrow$ & $\dn$ &
@@ -200,7 +326,7 @@
 					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
   & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
   $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
-      $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
+      $_{bs}\textit{SEQ}\;(\textit{fuse}\, [Z] \; r\,\backslash c )\,
        (_{bs}\textit{STAR}\,[]\,r)$
 \end{tabular}    
 \end{center}    
@@ -222,7 +348,7 @@
 					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
   & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
   $(_{bs}a^*)\,\backslash c$ & $\dn$ &
-      $_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot
+      $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot
        (_{[]}r^*))$
 \end{tabular}    
 \end{center}    
@@ -231,7 +357,7 @@
 \noindent
 For instance, when we do derivative of  $_{bs}a^*$ with respect to c,
 we need to unfold it into a sequence,
-and attach an additional bit $0$ to the front of $r \backslash c$
+and attach an additional bit $Z$ to the front of $r \backslash c$
 to indicate one more star iteration. Also the sequence clause
 is more subtle---when $a_1$ is $\textit{bnullable}$ (here
 \textit{bnullable} is exactly the same as $\textit{nullable}$, except
@@ -263,7 +389,7 @@
   $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
      $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
   $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
-     $bs \,@\, [0]$
+     $bs \,@\, [Z]$
 \end{tabular}    
 \end{center}    
 %\end{definition}