--- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Tue Jul 05 00:42:06 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Wed Jul 13 08:27:28 2022 +0100
@@ -8,96 +8,105 @@
%simplifications and therefore introduce our version of the bitcoded algorithm and
%its correctness proof in
%Chapter 3\ref{Chapter3}.
-In this chapter, we are going to introduce the bit-coded algorithm
-introduced by Sulzmann and Lu to address the problem of
-under-simplified regular expressions.
+In this chapter, we are going to describe the bit-coded algorithm
+introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of
+regular expressions.
\section{Bit-coded Algorithm}
The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure},
stores information of previous lexing steps
on a stack, in the form of regular expressions
and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
-\begin{envForCaption}
\begin{ceqn}
\begin{equation}%\label{graph:injLexer}
-\begin{tikzcd}
-r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
-v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
+ \begin{tikzcd}[ampersand replacement=\&, execute at end picture={
+ \begin{scope}[on background layer]
+ \node[rectangle, fill={red!30},
+ pattern=north east lines, pattern color=red,
+ fit={(-3,-1) (-3, 1) (1, -1)
+ (1, 1)}
+ ]
+ {}; ,
+ \node[rectangle, fill={blue!20},
+ pattern=north east lines, pattern color=blue,
+ fit= {(1, -1) (1, 1) (3, -1) (3, 1)}
+ ]
+ {};
+ \end{scope}}
+ ]
+r_0 \arrow[r, "\backslash c_0"] \arrow[d] \& r_1 \arrow[r, "\backslash c_1"] \arrow[d] \& r_2 \arrow[r, dashed] \arrow[d] \& r_n \arrow[d, "mkeps" description] \\
+v_0 \& v_1 \arrow[l,"inj_{r_0} c_0"] \& v_2 \arrow[l, "inj_{r_1} c_1"] \& v_n \arrow[l, dashed] \\
\end{tikzcd}
\end{equation}
\end{ceqn}
-\caption{Injection-based Lexing from Chapter\ref{Inj}}\label{InjFigure}
-\end{envForCaption}
\noindent
+The red part represents what we already know during the first
+derivative phase,
+and the blue part represents the unknown part of input.
+The red area expands as we move towards $r_n$,
+indicating an increasing stack size during lexing.
+Despite having some partial lexing information during
+the forward derivative phase, we choose to store them
+temporarily, only to convert the information to lexical
+values at a later stage. In essence we are repeating work we
+have already done.
This is both inefficient and prone to stack overflow.
A natural question arises as to whether we can store lexing
information on the fly, while still using regular expression
-derivatives?
+derivatives.
-In a lexing algorithm's run, split by the current input position,
-we have a sub-string that has been consumed,
-and the sub-string that has yet to come.
-We already know what was before, and this should be reflected in the value
-and the regular expression at that step as well. But this is not the
-case for injection-based regular expression derivatives.
-Take the regex $(aa)^* \cdot bc$ matching the string $aabc$
-as an example, if we have just read the two former characters $aa$:
-
+If we remove the details of the individual
+lexing steps, and use red and blue areas as before
+to indicate consumed (seen) input and constructed
+partial value (before recovering the rest of the stack),
+one could see that the seen part's lexical information
+is stored in the form of a regular expression.
+Consider the regular expression $(aa)^* \cdot bc$ matching the string $aabc$
+and assume we have just read the two characters $aa$:
\begin{center}
-\begin{envForCaption}
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
- \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
- {Consumed: $aa$
- \nodepart{two} Not Yet Reached: $bc$ };
-%\caption{term 1 \ref{term:1}'s matching configuration}
-\end{tikzpicture}
-\caption{Partially matched String}
-\end{envForCaption}
-\end{center}
-%\caption{Input String}\label{StringPartial}
-%\end{figure}
-
-\noindent
-We have the value that has already been partially calculated,
-and the part that has yet to come:
-\begin{center}
-\begin{envForCaption}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
- {$\Seq(\Stars[\Char(a), \Char(a)], ???)$
+ {Partial lexing info: $\ONE \cdot a \cdot (aa)^* \cdot bc$ etc.
\nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
-%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
-\caption{Partially constructed Value}
-\end{envForCaption}
\end{center}
-
-In the regex derivative part , (after simplification)
-all we have is just what is about to come:
+\noindent
+In the injection-based lexing algorithm, we ``neglect" the red area
+by putting all the characters we have consumed and
+intermediate regular expressions on the stack when
+we go from left to right in the derivative phase.
+The red area grows till the string is exhausted.
+During the injection phase, the value in the blue area
+is built up incrementally, while the red area shrinks.
+Before we have recovered all characters and intermediate
+derivative regular expressions from the stack,
+what values these characters and regular expressions correspond
+to are unknown:
\begin{center}
-\begin{envForCaption}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},]
- {$???$
- \nodepart{two} To Come: $b c$};
+ {$(\ONE \cdot \ONE) \cdot (aa)^* \cdot bc $ correspond to:$???$
+ \nodepart{two} $b c$ corresponds to $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
-\caption{Derivative}
-\end{envForCaption}
\end{center}
\noindent
-The previous part is missing.
-How about keeping the partially constructed value
-attached to the front of the regular expression?
+However, they should be calculable,
+as characters and regular expression shapes
+after taking derivative w.r.t those characters
+have already been known, therefore in our example,
+we know that the value starts with two $a$s,
+and makes up to an iteration in a Kleene star:
+(We have put the injection-based lexing's partial
+result in the right part of the split rectangle
+to contrast it with the partial valued produced
+in a forward manner)
\begin{center}
-\begin{envForCaption}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
- {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
- \nodepart{two} To Come: $b c$};
+ {$\stackrel{Bitcoded}{\longrightarrow} \Seq(\Stars[\Char(a), \Char(a)], ???)$
+ \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$ $\stackrel{Inj}{\longleftarrow}$};
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
-\caption{Derivative}
-\end{envForCaption}
\end{center}
\noindent
If we do this kind of "attachment"
@@ -105,19 +114,23 @@
constructed value when taking off a
character:
\begin{center}
-\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+ \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] (spPoint)
+ {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
+ \nodepart{two} Remaining: $b c$};
+\end{tikzpicture}\\
+$\downarrow$\\
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
{$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$
- \nodepart{two} To Come: $c$};
+ \nodepart{two} Remaining: $c$};
\end{tikzpicture}\\
+$\downarrow$\\
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
{$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$
\nodepart{two} EOF};
\end{tikzpicture}
-\caption{After $\backslash b$ and $\backslash c$}
-\end{envForCaption}
\end{center}
\noindent
In the end we could recover the value without a backward phase.
@@ -125,14 +138,11 @@
we instead use bit-codes to encode them.
Bits and bitcodes (lists of bits) are defined as:
-\begin{envForCaption}
\begin{center}
$b ::= S \mid Z \qquad
bs ::= [] \mid b::bs
$
\end{center}
-\caption{Bit-codes datatype}
-\end{envForCaption}
\noindent
Using $S$ and $Z$ rather than $1$ and $0$ is to avoid
@@ -141,7 +151,6 @@
bit-lists) can be used to encode values (or potentially incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes:
-\begin{envForCaption}
\begin{center}
\begin{tabular}{lcl}
$\textit{code}(\Empty)$ & $\dn$ & $[]$\\
@@ -154,9 +163,6 @@
code(\Stars\,vs)$
\end{tabular}
\end{center}
-\caption{Coding Function for Values}
-\end{envForCaption}
-
\noindent
Here $\textit{code}$ encodes a value into a bit-code by converting
$\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty
@@ -168,9 +174,10 @@
whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small size of bits can be easily manipulated and ``moved
-around'' in a regular expression.
+around" in a regular expression.
-
+Because of the lossiness, the process of decoding a bitlist requires additionally
+a regular expression. The function $\decode$ is defined as:
We define the reverse operation of $\code$, which is $\decode$.
As expected, $\decode$ not only requires the bit-codes,
but also a regular expression to guide the decoding and
@@ -178,7 +185,6 @@
%\begin{definition}[Bitdecoding of Values]\mbox{}
-\begin{envForCaption}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
$\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
@@ -205,10 +211,10 @@
\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
-\end{envForCaption}
%\end{definition}
\noindent
+The function $\decode'$ returns a pair consisting of a partially decoded value and some leftover:
$\decode'$ does most of the job while $\decode$ throws
away leftover bit-codes and returns the value only.
$\decode$ is terminating as $\decode'$ is terminating.
@@ -251,7 +257,7 @@
The first thing we define related to bit-coded regular expressions
-is how we move bits, for instance pasting it at the front of an annotated regex.
+is how we move bits, for instance pasting it at the front of an annotated regular expression.
The operation $\fuse$ is just to attach bit-codes
to the front of an annotated regular expression:
\begin{center}
@@ -364,7 +370,7 @@
This is done by adding bitcodes to the
derivatives, for example when one more star iteratoin is taken (we
call the operation of derivatives on annotated regular expressions $\bder$
-because it is derivatives on regexes with \emph{b}itcodes),
+because it is derivatives on regular expressiones with \emph{b}itcodes),
we need to unfold it into a sequence,
and attach an additional bit $Z$ to the front of $r \backslash c$
to indicate one more star iteration.
@@ -652,7 +658,7 @@
$\blexer \; r \; s = \decode \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
\end{lemma}
\noindent
-$\retrieve$ allows free navigation on the diagram \ref{InjFigure} for annotated regexes of $\blexer$.
+$\retrieve$ allows free navigation on the diagram \ref{InjFigure} for annotated regular expressiones of $\blexer$.
For plain regular expressions something similar is required as well.
\subsection{$\flex$}