ChengsongTanPhdThesis/Chapters/Bitcoded1.tex
changeset 564 3cbcd7cda0a9
parent 543 b2bea5968b89
child 575 3178f0e948ac
--- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Tue Jul 05 00:42:06 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Wed Jul 13 08:27:28 2022 +0100
@@ -8,96 +8,105 @@
 %simplifications and therefore introduce our version of the bitcoded algorithm and 
 %its correctness proof in 
 %Chapter 3\ref{Chapter3}. 
-In this chapter, we are going to introduce the bit-coded algorithm
-introduced by Sulzmann and Lu to address the problem of 
-under-simplified regular expressions. 
+In this chapter, we are going to describe the bit-coded algorithm
+introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of 
+regular expressions. 
 \section{Bit-coded Algorithm}
 The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure},
 stores information of previous lexing steps
 on a stack, in the form of regular expressions
 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
-\begin{envForCaption}
 \begin{ceqn}
 \begin{equation}%\label{graph:injLexer}
-\begin{tikzcd}
-r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
-v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
+	\begin{tikzcd}[ampersand replacement=\&, execute at end picture={
+			\begin{scope}[on background layer]
+				\node[rectangle, fill={red!30},
+					pattern=north east lines, pattern color=red,
+					fit={(-3,-1) (-3, 1) (1, -1) 
+						(1, 1)}
+				     ] 
+				     {}; ,
+				\node[rectangle, fill={blue!20},
+					pattern=north east lines, pattern color=blue,
+					fit= {(1, -1) (1, 1) (3, -1) (3, 1)}
+					]
+					{};
+				\end{scope}}
+					]
+r_0 \arrow[r, "\backslash c_0"]  \arrow[d] \& r_1 \arrow[r, "\backslash c_1"] \arrow[d] \& r_2 \arrow[r, dashed] \arrow[d] \& r_n \arrow[d, "mkeps" description] \\
+v_0           \& v_1 \arrow[l,"inj_{r_0} c_0"]                \& v_2 \arrow[l, "inj_{r_1} c_1"]              \& v_n \arrow[l, dashed]         \\
 \end{tikzcd}
 \end{equation}
 \end{ceqn}
-\caption{Injection-based Lexing from Chapter\ref{Inj}}\label{InjFigure}
-\end{envForCaption}
 \noindent
+The red part represents what we already know during the first
+derivative phase,
+and the blue part represents the unknown part of input.
+The red area expands as we move towards $r_n$, 
+indicating an increasing stack size during lexing.
+Despite having some partial lexing information during
+the forward derivative phase, we choose to store them
+temporarily, only to convert the information to lexical
+values at a later stage. In essence we are repeating work we
+have already done.
 This is both inefficient and prone to stack overflow.
 A natural question arises as to whether we can store lexing
 information on the fly, while still using regular expression 
-derivatives?
+derivatives.
 
-In a lexing algorithm's run, split by the current input position,
-we have a sub-string that has been consumed,
-and the sub-string that has yet to come.
-We already know what was before, and this should be reflected in the value 
-and the regular expression at that step as well. But this is not the 
-case for injection-based regular expression derivatives.
-Take the regex $(aa)^* \cdot bc$ matching the string $aabc$
-as an example, if we have just read the two former characters $aa$:
-
+If we remove the details of the individual 
+lexing steps, and use red and blue areas as before
+to indicate consumed (seen) input and constructed
+partial value (before recovering the rest of the stack),
+one could see that the seen part's lexical information
+is stored in the form of a regular expression.
+Consider the regular expression $(aa)^* \cdot bc$ matching the string $aabc$
+and assume we have just read the two characters $aa$:
 \begin{center}
-\begin{envForCaption}
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
-    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
-        {Consumed: $aa$
-         \nodepart{two} Not Yet Reached: $bc$ };
-%\caption{term 1 \ref{term:1}'s matching configuration}
-\end{tikzpicture}
-\caption{Partially matched String} 
-\end{envForCaption}
-\end{center}
-%\caption{Input String}\label{StringPartial}
-%\end{figure}
-
-\noindent
-We have the value that has already been partially calculated,
-and the part that has yet to come:
-\begin{center}
-\begin{envForCaption}
 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
-        {$\Seq(\Stars[\Char(a), \Char(a)], ???)$
+	    {Partial lexing info: $\ONE \cdot a \cdot (aa)^* \cdot bc$ etc.
          \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
-%\caption{term 1 \ref{term:1}'s matching configuration}
 \end{tikzpicture}
-\caption{Partially constructed Value} 
-\end{envForCaption}
 \end{center}
-
-In the regex derivative part , (after simplification)
-all we have is just what is about to come:
+\noindent
+In the injection-based lexing algorithm, we ``neglect" the red area
+by putting all the characters we have consumed and
+intermediate regular expressions on the stack when 
+we go from left to right in the derivative phase.
+The red area grows till the string is exhausted.
+During the injection phase, the value in the blue area
+is built up incrementally, while the red area shrinks.
+Before we have recovered all characters and intermediate
+derivative regular expressions from the stack,
+what values these characters and regular expressions correspond 
+to are unknown: 
 \begin{center}
-\begin{envForCaption}
 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},]
-        {$???$
-         \nodepart{two} To Come: $b c$};
+	    {$(\ONE \cdot \ONE) \cdot (aa)^* \cdot bc $ correspond to:$???$
+         \nodepart{two}  $b c$ corresponds to  $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
 %\caption{term 1 \ref{term:1}'s matching configuration}
 \end{tikzpicture}
-\caption{Derivative} 
-\end{envForCaption}
 \end{center}
 \noindent
-The previous part is missing.
-How about keeping the partially constructed value 
-attached to the front of the regular expression?
+However, they should be calculable,
+as characters and regular expression shapes
+after taking derivative w.r.t those characters
+have already been known, therefore in our example,
+we know that the value starts with two $a$s,
+and makes up to an iteration in a Kleene star:
+(We have put the injection-based lexing's partial 
+result in the right part of the split rectangle
+to contrast it with the partial valued produced
+in a forward manner)
 \begin{center}
-\begin{envForCaption}
 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
-        {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
-         \nodepart{two} To Come: $b c$};
+	    {$\stackrel{Bitcoded}{\longrightarrow} \Seq(\Stars[\Char(a), \Char(a)], ???)$
+	\nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$  $\stackrel{Inj}{\longleftarrow}$};
 %\caption{term 1 \ref{term:1}'s matching configuration}
 \end{tikzpicture}
-\caption{Derivative} 
-\end{envForCaption}
 \end{center}
 \noindent
 If we do this kind of "attachment"
@@ -105,19 +114,23 @@
 constructed value when taking off a 
 character:
 \begin{center}
-\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] (spPoint)
+        {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
+         \nodepart{two} Remaining: $b c$};
+\end{tikzpicture}\\
+$\downarrow$\\
 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
         {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$
-         \nodepart{two} To Come: $c$};
+         \nodepart{two} Remaining: $c$};
 \end{tikzpicture}\\
+$\downarrow$\\
 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
         {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$
          \nodepart{two} EOF};
 \end{tikzpicture}
-\caption{After $\backslash b$ and $\backslash c$} 
-\end{envForCaption}
 \end{center}
 \noindent
 In the end we could recover the value without a backward phase.
@@ -125,14 +138,11 @@
 we instead use bit-codes to encode them.
 
 Bits and bitcodes (lists of bits) are defined as:
-\begin{envForCaption}
 \begin{center}
 		$b ::=   S \mid  Z \qquad
 bs ::= [] \mid b::bs    
 $
 \end{center}
-\caption{Bit-codes datatype}
-\end{envForCaption}
 
 \noindent
 Using $S$ and $Z$ rather than $1$ and $0$ is to avoid
@@ -141,7 +151,6 @@
 bit-lists) can be used to encode values (or potentially incomplete values) in a
 compact form. This can be straightforwardly seen in the following
 coding function from values to bitcodes: 
-\begin{envForCaption}
 \begin{center}
 \begin{tabular}{lcl}
   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
@@ -154,9 +163,6 @@
                                                  code(\Stars\,vs)$
 \end{tabular}    
 \end{center} 
-\caption{Coding Function for Values}
-\end{envForCaption}
-
 \noindent
 Here $\textit{code}$ encodes a value into a bit-code by converting
 $\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty
@@ -168,9 +174,10 @@
 whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$. The
 reason for choosing this compact way of storing information is that the
 relatively small size of bits can be easily manipulated and ``moved
-around'' in a regular expression. 
+around" in a regular expression. 
 
-
+Because of the lossiness, the process of decoding a bitlist requires additionally 
+a regular expression. The function $\decode$ is defined as:
 We define the reverse operation of $\code$, which is $\decode$.
 As expected, $\decode$ not only requires the bit-codes,
 but also a regular expression to guide the decoding and 
@@ -178,7 +185,6 @@
 
 
 %\begin{definition}[Bitdecoding of Values]\mbox{}
-\begin{envForCaption}
 \begin{center}
 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
   $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
@@ -205,10 +211,10 @@
        \textit{else}\;\textit{None}$                       
 \end{tabular}    
 \end{center} 
-\end{envForCaption}
 %\end{definition}
 
 \noindent
+The function $\decode'$ returns a pair consisting of a partially decoded value and some leftover:
 $\decode'$ does most of the job while $\decode$ throws
 away leftover bit-codes and returns the value only.
 $\decode$ is terminating as $\decode'$ is terminating.
@@ -251,7 +257,7 @@
 
 
 The first thing we define related to bit-coded regular expressions
-is how we move bits, for instance pasting it at the front of an annotated regex.
+is how we move bits, for instance pasting it at the front of an annotated regular expression.
 The operation $\fuse$ is just to attach bit-codes 
 to the front of an annotated regular expression:
 \begin{center}
@@ -364,7 +370,7 @@
 This is done by adding bitcodes to the 
 derivatives, for example when one more star iteratoin is taken (we
 call the operation of derivatives on annotated regular expressions $\bder$
-because it is derivatives on regexes with \emph{b}itcodes),
+because it is derivatives on regular expressiones with \emph{b}itcodes),
 we need to unfold it into a sequence,
 and attach an additional bit $Z$ to the front of $r \backslash c$
 to indicate one more star iteration. 
@@ -652,7 +658,7 @@
 $\blexer \; r \; s = \decode  \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
 \end{lemma}
 \noindent
-$\retrieve$ allows free navigation on the diagram \ref{InjFigure} for annotated regexes of $\blexer$.
+$\retrieve$ allows free navigation on the diagram \ref{InjFigure} for annotated regular expressiones of $\blexer$.
 For plain regular expressions something similar is required as well.
 
 \subsection{$\flex$}