diff -r aff7bf93b9c7 -r 50e590823220 ChengsongTanPhdThesis/Chapters/Bitcoded1.tex --- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Mon Jun 06 16:45:42 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Mon Jun 06 23:17:45 2022 +0100 @@ -9,67 +9,191 @@ %its correctness proof in %Chapter 3\ref{Chapter3}. -\section*{Bit-coded Algorithm} -Bits and bitcodes (lists of bits) are defined as: +\section{Bit-coded Algorithm} + + +The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure}, +stores information of previous lexing steps +on a stack, in the form of regular expressions +and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc. +\begin{envForCaption} +\begin{ceqn} +\begin{equation}%\label{graph:injLexer} +\begin{tikzcd} +r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ +v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] +\end{tikzcd} +\end{equation} +\end{ceqn} +\caption{Injection-based Lexing from Chapter\ref{Inj}}\label{InjFigure} +\end{envForCaption} +\noindent +This is both inefficient and prone to stack overflow. +A natural question arises as to whether we can store lexing +information on the fly, while still using regular expression +derivatives? + +In a lexing algorithm's run, split by the current input position, +we have a sub-string that has been consumed, +and the sub-string that has yet to come. +We already know what was before, and this should be reflected in the value +and the regular expression at that step as well. But this is not the +case for injection-based regular expression derivatives. +Take the regex $(aa)^* \cdot bc$ matching the string $aabc$ +as an example, if we have just read the two former characters $aa$: \begin{center} - $b ::= 1 \mid 0 \qquad +\begin{envForCaption} +\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] + \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] + {Consumed: $aa$ + \nodepart{two} Not Yet Reached: $bc$ }; +%\caption{term 1 \ref{term:1}'s matching configuration} +\end{tikzpicture} +\caption{Partially matched String} +\end{envForCaption} +\end{center} +%\caption{Input String}\label{StringPartial} +%\end{figure} + +\noindent +We have the value that has already been partially calculated, +and the part that has yet to come: +\begin{center} +\begin{envForCaption} +\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] + \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] + {$\Seq(\Stars[\Char(a), \Char(a)], ???)$ + \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$}; +%\caption{term 1 \ref{term:1}'s matching configuration} +\end{tikzpicture} +\caption{Partially constructed Value} +\end{envForCaption} +\end{center} + +In the regex derivative part , (after simplification) +all we have is just what is about to come: +\begin{center} +\begin{envForCaption} +\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] + \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},] + {$???$ + \nodepart{two} To Come: $b c$}; +%\caption{term 1 \ref{term:1}'s matching configuration} +\end{tikzpicture} +\caption{Derivative} +\end{envForCaption} +\end{center} +\noindent +The previous part is missing. +How about keeping the partially constructed value +attached to the front of the regular expression? +\begin{center} +\begin{envForCaption} +\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] + \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] + {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$ + \nodepart{two} To Come: $b c$}; +%\caption{term 1 \ref{term:1}'s matching configuration} +\end{tikzpicture} +\caption{Derivative} +\end{envForCaption} +\end{center} +\noindent +If we do this kind of "attachment" +and each time augment the attached partially +constructed value when taking off a +character: +\begin{center} +\begin{envForCaption} +\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] + \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] + {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$ + \nodepart{two} To Come: $c$}; +\end{tikzpicture}\\ +\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] + \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] + {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$ + \nodepart{two} EOF}; +\end{tikzpicture} +\caption{After $\backslash b$ and $\backslash c$} +\end{envForCaption} +\end{center} +\noindent +In the end we could recover the value without a backward phase. +But (partial) values are a bit clumsy to stick together with a regular expression, so +we instead use bit-codes to encode them. + +Bits and bitcodes (lists of bits) are defined as: +\begin{envForCaption} +\begin{center} + $b ::= S \mid Z \qquad bs ::= [] \mid b::bs $ \end{center} +\caption{Bit-codes datatype} +\end{envForCaption} \noindent -The $1$ and $0$ are not in bold in order to avoid +The $S$ and $Z$ are not in bold in order to avoid confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or bit-lists) can be used to encode values (or potentially incomplete values) in a compact form. This can be straightforwardly seen in the following coding function from values to bitcodes: - +\begin{envForCaption} \begin{center} \begin{tabular}{lcl} $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ - $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\ - $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\ + $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\ + $\textit{code}(\Right\,v)$ & $\dn$ & $S :: code(v)$\\ $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\ - $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\ - $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\; + $\textit{code}(\Stars\,[])$ & $\dn$ & $[Z]$\\ + $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $S :: code(v) \;@\; code(\Stars\,vs)$ \end{tabular} \end{center} +\caption{Coding Function for Values} +\end{envForCaption} \noindent -Here $\textit{code}$ encodes a value into a bitcodes by converting -$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty -star iteration by $1$. The border where a local star terminates -is marked by $0$. This coding is lossy, as it throws away the information about +Here $\textit{code}$ encodes a value into a bit-code by converting +$\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty +star iteration by $S$. The border where a local star terminates +is marked by $Z$. +This coding is lossy, as it throws away the information about characters, and also does not encode the ``boundary'' between two sequence values. Moreover, with only the bitcode we cannot even tell -whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The +whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$. The reason for choosing this compact way of storing information is that the relatively small size of bits can be easily manipulated and ``moved -around'' in a regular expression. In order to recover values, we will -need the corresponding regular expression as an extra information. This -means the decoding function is defined as: +around'' in a regular expression. + + +We define the reverse operation of $\code$, which is $\decode$. +As expected, $\decode$ not only requires the bit-codes, +but also a regular expression to guide the decoding and +fill the gaps of characters: %\begin{definition}[Bitdecoding of Values]\mbox{} +\begin{envForCaption} \begin{center} \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ - $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & + $\textit{decode}'\,(Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\; (\Left\,v, bs_1)$\\ - $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & + $\textit{decode}'\,(S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\; (\Right\,v, bs_1)$\\ $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ & $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\ & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\ & & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\ - $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\ - $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & + $\textit{decode}'\,(Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\ + $\textit{decode}'\,(S\!::\!bs)\,(r^*)$ & $\dn$ & $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\ & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\ & & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\ @@ -79,7 +203,9 @@ & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; \textit{else}\;\textit{None}$ \end{tabular} -\end{center} +\end{center} +\caption{Bit-decoding of values} +\end{envForCaption} %\end{definition} Sulzmann and Lu's integrated the bitcodes into regular expressions to @@ -121,8 +247,8 @@ $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\ $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\ $(r_1 + r_2)^\uparrow$ & $\dn$ & - $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\, - \textit{fuse}\,[1]\,r_2^\uparrow]$\\ + $_{[]}\sum[\textit{fuse}\,[Z]\,r_1^\uparrow,\, + \textit{fuse}\,[S]\,r_2^\uparrow]$\\ $(r_1\cdot r_2)^\uparrow$ & $\dn$ & $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\ $(r^*)^\uparrow$ & $\dn$ & @@ -200,7 +326,7 @@ & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\ $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ & - $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\, + $_{bs}\textit{SEQ}\;(\textit{fuse}\, [Z] \; r\,\backslash c )\, (_{bs}\textit{STAR}\,[]\,r)$ \end{tabular} \end{center} @@ -222,7 +348,7 @@ & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\ $(_{bs}a^*)\,\backslash c$ & $\dn$ & - $_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot + $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot (_{[]}r^*))$ \end{tabular} \end{center} @@ -231,7 +357,7 @@ \noindent For instance, when we do derivative of $_{bs}a^*$ with respect to c, we need to unfold it into a sequence, -and attach an additional bit $0$ to the front of $r \backslash c$ +and attach an additional bit $Z$ to the front of $r \backslash c$ to indicate one more star iteration. Also the sequence clause is more subtle---when $a_1$ is $\textit{bnullable}$ (here \textit{bnullable} is exactly the same as $\textit{nullable}$, except @@ -263,7 +389,7 @@ $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ & $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\ $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ & - $bs \,@\, [0]$ + $bs \,@\, [Z]$ \end{tabular} \end{center} %\end{definition}