--- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Mon Jun 06 16:45:42 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Mon Jun 06 23:17:45 2022 +0100
@@ -9,67 +9,191 @@
%its correctness proof in
%Chapter 3\ref{Chapter3}.
-\section*{Bit-coded Algorithm}
-Bits and bitcodes (lists of bits) are defined as:
+\section{Bit-coded Algorithm}
+
+
+The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure},
+stores information of previous lexing steps
+on a stack, in the form of regular expressions
+and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
+\begin{envForCaption}
+\begin{ceqn}
+\begin{equation}%\label{graph:injLexer}
+\begin{tikzcd}
+r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
+v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
+\end{tikzcd}
+\end{equation}
+\end{ceqn}
+\caption{Injection-based Lexing from Chapter\ref{Inj}}\label{InjFigure}
+\end{envForCaption}
+\noindent
+This is both inefficient and prone to stack overflow.
+A natural question arises as to whether we can store lexing
+information on the fly, while still using regular expression
+derivatives?
+
+In a lexing algorithm's run, split by the current input position,
+we have a sub-string that has been consumed,
+and the sub-string that has yet to come.
+We already know what was before, and this should be reflected in the value
+and the regular expression at that step as well. But this is not the
+case for injection-based regular expression derivatives.
+Take the regex $(aa)^* \cdot bc$ matching the string $aabc$
+as an example, if we have just read the two former characters $aa$:
\begin{center}
- $b ::= 1 \mid 0 \qquad
+\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+ \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+ {Consumed: $aa$
+ \nodepart{two} Not Yet Reached: $bc$ };
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture}
+\caption{Partially matched String}
+\end{envForCaption}
+\end{center}
+%\caption{Input String}\label{StringPartial}
+%\end{figure}
+
+\noindent
+We have the value that has already been partially calculated,
+and the part that has yet to come:
+\begin{center}
+\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+ \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+ {$\Seq(\Stars[\Char(a), \Char(a)], ???)$
+ \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture}
+\caption{Partially constructed Value}
+\end{envForCaption}
+\end{center}
+
+In the regex derivative part , (after simplification)
+all we have is just what is about to come:
+\begin{center}
+\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+ \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},]
+ {$???$
+ \nodepart{two} To Come: $b c$};
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture}
+\caption{Derivative}
+\end{envForCaption}
+\end{center}
+\noindent
+The previous part is missing.
+How about keeping the partially constructed value
+attached to the front of the regular expression?
+\begin{center}
+\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+ \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+ {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
+ \nodepart{two} To Come: $b c$};
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture}
+\caption{Derivative}
+\end{envForCaption}
+\end{center}
+\noindent
+If we do this kind of "attachment"
+and each time augment the attached partially
+constructed value when taking off a
+character:
+\begin{center}
+\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+ \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+ {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$
+ \nodepart{two} To Come: $c$};
+\end{tikzpicture}\\
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+ \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+ {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$
+ \nodepart{two} EOF};
+\end{tikzpicture}
+\caption{After $\backslash b$ and $\backslash c$}
+\end{envForCaption}
+\end{center}
+\noindent
+In the end we could recover the value without a backward phase.
+But (partial) values are a bit clumsy to stick together with a regular expression, so
+we instead use bit-codes to encode them.
+
+Bits and bitcodes (lists of bits) are defined as:
+\begin{envForCaption}
+\begin{center}
+ $b ::= S \mid Z \qquad
bs ::= [] \mid b::bs
$
\end{center}
+\caption{Bit-codes datatype}
+\end{envForCaption}
\noindent
-The $1$ and $0$ are not in bold in order to avoid
+The $S$ and $Z$ are not in bold in order to avoid
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or potentially incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes:
-
+\begin{envForCaption}
\begin{center}
\begin{tabular}{lcl}
$\textit{code}(\Empty)$ & $\dn$ & $[]$\\
$\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
- $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
- $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
+ $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\
+ $\textit{code}(\Right\,v)$ & $\dn$ & $S :: code(v)$\\
$\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
- $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
- $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
+ $\textit{code}(\Stars\,[])$ & $\dn$ & $[Z]$\\
+ $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $S :: code(v) \;@\;
code(\Stars\,vs)$
\end{tabular}
\end{center}
+\caption{Coding Function for Values}
+\end{envForCaption}
\noindent
-Here $\textit{code}$ encodes a value into a bitcodes by converting
-$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty
-star iteration by $1$. The border where a local star terminates
-is marked by $0$. This coding is lossy, as it throws away the information about
+Here $\textit{code}$ encodes a value into a bit-code by converting
+$\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty
+star iteration by $S$. The border where a local star terminates
+is marked by $Z$.
+This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
-whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
+whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small size of bits can be easily manipulated and ``moved
-around'' in a regular expression. In order to recover values, we will
-need the corresponding regular expression as an extra information. This
-means the decoding function is defined as:
+around'' in a regular expression.
+
+
+We define the reverse operation of $\code$, which is $\decode$.
+As expected, $\decode$ not only requires the bit-codes,
+but also a regular expression to guide the decoding and
+fill the gaps of characters:
%\begin{definition}[Bitdecoding of Values]\mbox{}
+\begin{envForCaption}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
$\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
$\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
- $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
+ $\textit{decode}'\,(Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
(\Left\,v, bs_1)$\\
- $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
+ $\textit{decode}'\,(S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
(\Right\,v, bs_1)$\\
$\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
$\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
& & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
& & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
- $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
- $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ &
+ $\textit{decode}'\,(Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
+ $\textit{decode}'\,(S\!::\!bs)\,(r^*)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
& & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
& & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
@@ -79,7 +203,9 @@
& & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
\textit{else}\;\textit{None}$
\end{tabular}
-\end{center}
+\end{center}
+\caption{Bit-decoding of values}
+\end{envForCaption}
%\end{definition}
Sulzmann and Lu's integrated the bitcodes into regular expressions to
@@ -121,8 +247,8 @@
$(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
$(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
$(r_1 + r_2)^\uparrow$ & $\dn$ &
- $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
- \textit{fuse}\,[1]\,r_2^\uparrow]$\\
+ $_{[]}\sum[\textit{fuse}\,[Z]\,r_1^\uparrow,\,
+ \textit{fuse}\,[S]\,r_2^\uparrow]$\\
$(r_1\cdot r_2)^\uparrow$ & $\dn$ &
$_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
$(r^*)^\uparrow$ & $\dn$ &
@@ -200,7 +326,7 @@
& &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
& &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
$(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
- $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
+ $_{bs}\textit{SEQ}\;(\textit{fuse}\, [Z] \; r\,\backslash c )\,
(_{bs}\textit{STAR}\,[]\,r)$
\end{tabular}
\end{center}
@@ -222,7 +348,7 @@
& &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
& &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
$(_{bs}a^*)\,\backslash c$ & $\dn$ &
- $_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot
+ $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot
(_{[]}r^*))$
\end{tabular}
\end{center}
@@ -231,7 +357,7 @@
\noindent
For instance, when we do derivative of $_{bs}a^*$ with respect to c,
we need to unfold it into a sequence,
-and attach an additional bit $0$ to the front of $r \backslash c$
+and attach an additional bit $Z$ to the front of $r \backslash c$
to indicate one more star iteration. Also the sequence clause
is more subtle---when $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
@@ -263,7 +389,7 @@
$\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
$bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
$\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
- $bs \,@\, [0]$
+ $bs \,@\, [Z]$
\end{tabular}
\end{center}
%\end{definition}