lexing: comparison ChengsongTanPhdThesis/Chapters/Bitcoded1.tex

equal deleted inserted replaced

-:aff7bf93b9c7
+:50e590823220
 %Then we illustrate how the algorithm without bitcodes falls short for such aggressive
 %simplifications and therefore introduce our version of the bitcoded algorithm and
 %its correctness proof in
 %Chapter 3\ref{Chapter3}.
-\section*{Bit-coded Algorithm}
+\section{Bit-coded Algorithm}
+The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure},
+stores information of previous lexing steps
+on a stack, in the form of regular expressions
+and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
+\begin{envForCaption}
+\begin{ceqn}
+\begin{equation}%\label{graph:injLexer}
+\begin{tikzcd}
+r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
+v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
+\end{tikzcd}
+\end{equation}
+\end{ceqn}
+\caption{Injection-based Lexing from Chapter\ref{Inj}}\label{InjFigure}
+\end{envForCaption}
+\noindent
+This is both inefficient and prone to stack overflow.
+A natural question arises as to whether we can store lexing
+information on the fly, while still using regular expression
+derivatives?
+In a lexing algorithm's run, split by the current input position,
+we have a sub-string that has been consumed,
+and the sub-string that has yet to come.
+We already know what was before, and this should be reflected in the value
+and the regular expression at that step as well. But this is not the
+case for injection-based regular expression derivatives.
+Take the regex $(aa)^* \cdot bc$ matching the string $aabc$
+as an example, if we have just read the two former characters $aa$:
+\begin{center}
+\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+{Consumed: $aa$
+\nodepart{two} Not Yet Reached: $bc$ };
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture}
+\caption{Partially matched String}
+\end{envForCaption}
+\end{center}
+%\caption{Input String}\label{StringPartial}
+%\end{figure}
+\noindent
+We have the value that has already been partially calculated,
+and the part that has yet to come:
+\begin{center}
+\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+{$\Seq(\Stars[\Char(a), \Char(a)], ???)$
+\nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture}
+\caption{Partially constructed Value}
+\end{envForCaption}
+\end{center}
+In the regex derivative part , (after simplification)
+all we have is just what is about to come:
+\begin{center}
+\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},]
+{$???$
+\nodepart{two} To Come: $b c$};
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture}
+\caption{Derivative}
+\end{envForCaption}
+\end{center}
+\noindent
+The previous part is missing.
+How about keeping the partially constructed value
+attached to the front of the regular expression?
+\begin{center}
+\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+{$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
+\nodepart{two} To Come: $b c$};
+%\caption{term 1 \ref{term:1}'s matching configuration}
+\end{tikzpicture}
+\caption{Derivative}
+\end{envForCaption}
+\end{center}
+\noindent
+If we do this kind of "attachment"
+and each time augment the attached partially
+constructed value when taking off a
+character:
+\begin{center}
+\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+{$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$
+\nodepart{two} To Come: $c$};
+\end{tikzpicture}\\
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+{$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$
+\nodepart{two} EOF};
+\end{tikzpicture}
+\caption{After $\backslash b$ and $\backslash c$}
+\end{envForCaption}
+\end{center}
+\noindent
+In the end we could recover the value without a backward phase.
+But (partial) values are a bit clumsy to stick together with a regular expression, so
+we instead use bit-codes to encode them.
 Bits and bitcodes (lists of bits) are defined as:
+\begin{envForCaption}
 \begin{center}
-		$b ::=   1 \mid  0 \qquad
+		$b ::=   S \mid  Z \qquad
 bs ::= [] \mid b::bs
 $
 \end{center}
+\caption{Bit-codes datatype}
-\noindent
+\end{envForCaption}
-The $1$ and $0$ are not in bold in order to avoid
+\noindent
+The $S$ and $Z$ are not in bold in order to avoid
 confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
 bit-lists) can be used to encode values (or potentially incomplete values) in a
 compact form. This can be straightforwardly seen in the following
 coding function from values to bitcodes:
+\begin{envForCaption}
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
-$\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
+$\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\
-$\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
+$\textit{code}(\Right\,v)$ & $\dn$ & $S :: code(v)$\\
 $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
-$\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
+$\textit{code}(\Stars\,[])$ & $\dn$ & $[Z]$\\
-$\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
+$\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $S :: code(v) \;@\;
 code(\Stars\,vs)$
 \end{tabular}
 \end{center}
+\caption{Coding Function for Values}
-\noindent
+\end{envForCaption}
-Here $\textit{code}$ encodes a value into a bitcodes by converting
-$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty
+\noindent
-star iteration by $1$. The border where a local star terminates
+Here $\textit{code}$ encodes a value into a bit-code by converting
-is marked by $0$. This coding is lossy, as it throws away the information about
+$\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty
+star iteration by $S$. The border where a local star terminates
+is marked by $Z$.
+This coding is lossy, as it throws away the information about
 characters, and also does not encode the ``boundary'' between two
 sequence values. Moreover, with only the bitcode we cannot even tell
-whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
+whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$. The
 reason for choosing this compact way of storing information is that the
 relatively small size of bits can be easily manipulated and ``moved
-around'' in a regular expression. In order to recover values, we will
+around'' in a regular expression.
-need the corresponding regular expression as an extra information. This
-means the decoding function is defined as:
+We define the reverse operation of $\code$, which is $\decode$.
+As expected, $\decode$ not only requires the bit-codes,
+but also a regular expression to guide the decoding and
+fill the gaps of characters:
 %\begin{definition}[Bitdecoding of Values]\mbox{}
+\begin{envForCaption}
 \begin{center}
 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
-$\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
+$\textit{decode}'\,(Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
 (\Left\,v, bs_1)$\\
-$\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
+$\textit{decode}'\,(S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
 (\Right\,v, bs_1)$\\
 $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
 $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
 & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
 & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
-$\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
+$\textit{decode}'\,(Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
-$\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ &
+$\textit{decode}'\,(S\!::\!bs)\,(r^*)$ & $\dn$ &
 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
 & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
 & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
 $\textit{decode}\,bs\,r$ & $\dn$ &
 $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
 & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
 \textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
+\caption{Bit-decoding of values}
+\end{envForCaption}
 %\end{definition}
 Sulzmann and Lu's integrated the bitcodes into regular expressions to
 create annotated regular expressions \cite{Sulzmann2014}.
 \emph{Annotated regular expressions} are defined by the following
 \begin{tabular}{lcl}
 $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
 $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
 $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
 $(r_1 + r_2)^\uparrow$ & $\dn$ &
-$_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
+$_{[]}\sum[\textit{fuse}\,[Z]\,r_1^\uparrow,\,
-\textit{fuse}\,[1]\,r_2^\uparrow]$\\
+\textit{fuse}\,[S]\,r_2^\uparrow]$\\
 $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
 $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
 $(r^*)^\uparrow$ & $\dn$ &
 $_{[]}(r^\uparrow)^*$\\
 \end{tabular}
 $\textit{if}\;\textit{bnullable}\,a_1$\\
 					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
 					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
 & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
 $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
-$_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
+$_{bs}\textit{SEQ}\;(\textit{fuse}\, [Z] \; r\,\backslash c )\,
 (_{bs}\textit{STAR}\,[]\,r)$
 \end{tabular}
 \end{center}
 %\end{definition}
 \fi
 $\textit{if}\;\textit{bnullable}\,a_1$\\
 					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
 					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
 & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
 $(_{bs}a^*)\,\backslash c$ & $\dn$ &
-$_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot
+$_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot
 (_{[]}r^*))$
 \end{tabular}
 \end{center}
 %\end{definition}
 \noindent
 For instance, when we do derivative of  $_{bs}a^*$ with respect to c,
 we need to unfold it into a sequence,
-and attach an additional bit $0$ to the front of $r \backslash c$
+and attach an additional bit $Z$ to the front of $r \backslash c$
 to indicate one more star iteration. Also the sequence clause
 is more subtle---when $a_1$ is $\textit{bnullable}$ (here
 \textit{bnullable} is exactly the same as $\textit{nullable}$, except
 that it is for annotated regular expressions, therefore we omit the
 definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
 & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
 & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
 $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
 $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
 $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
-$bs \,@\, [0]$
+$bs \,@\, [Z]$
 \end{tabular}
 \end{center}
 %\end{definition}
 \noindent

changeset 537	50e590823220
parent 536	aff7bf93b9c7
child 538	8016a2480704