lexing: comparison ChengsongTanPhdThesis/Chapters/Bitcoded1.tex


-:57e33978e55d
+:3cbcd7cda0a9
 \label{Bitcoded1} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
 %Then we illustrate how the algorithm without bitcodes falls short for such aggressive
 %simplifications and therefore introduce our version of the bitcoded algorithm and
 %its correctness proof in
 %Chapter 3\ref{Chapter3}.
-In this chapter, we are going to introduce the bit-coded algorithm
+In this chapter, we are going to describe the bit-coded algorithm
-introduced by Sulzmann and Lu to address the problem of
+introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of
-under-simplified regular expressions.
+regular expressions.
 \section{Bit-coded Algorithm}
 The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure},
 stores information of previous lexing steps
 on a stack, in the form of regular expressions
 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
-\begin{envForCaption}
 \begin{ceqn}
 \begin{equation}%\label{graph:injLexer}
-\begin{tikzcd}
+	\begin{tikzcd}[ampersand replacement=\&, execute at end picture={
-r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
+			\begin{scope}[on background layer]
-v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
+				\node[rectangle, fill={red!30},
+					pattern=north east lines, pattern color=red,
+					fit={(-3,-1) (-3, 1) (1, -1)
+						(1, 1)}
+				     ]
+				     {}; ,
+				\node[rectangle, fill={blue!20},
+					pattern=north east lines, pattern color=blue,
+					fit= {(1, -1) (1, 1) (3, -1) (3, 1)}
+					]
+					{};
+				\end{scope}}
+					]
+r_0 \arrow[r, "\backslash c_0"]  \arrow[d] \& r_1 \arrow[r, "\backslash c_1"] \arrow[d] \& r_2 \arrow[r, dashed] \arrow[d] \& r_n \arrow[d, "mkeps" description] \\
+v_0           \& v_1 \arrow[l,"inj_{r_0} c_0"]                \& v_2 \arrow[l, "inj_{r_1} c_1"]              \& v_n \arrow[l, dashed]         \\
 \end{tikzcd}
 \end{equation}
 \end{ceqn}
-\caption{Injection-based Lexing from Chapter\ref{Inj}}\label{InjFigure}
+\noindent
-\end{envForCaption}
+The red part represents what we already know during the first
-\noindent
+derivative phase,
+and the blue part represents the unknown part of input.
+The red area expands as we move towards $r_n$,
+indicating an increasing stack size during lexing.
+Despite having some partial lexing information during
+the forward derivative phase, we choose to store it
+temporarily, only to convert the information to lexical
+values at a later stage. In essence we are repeating work we
+have already done.
 This is both inefficient and prone to stack overflow.
 A natural question arises as to whether we can store lexing
 information on the fly, while still using regular expression
-derivatives?
+derivatives.
-In a lexing algorithm's run, split by the current input position,
+If we remove the details of the individual
-we have a sub-string that has been consumed,
+lexing steps, and use red and blue areas as before
-and the sub-string that has yet to come.
+to indicate consumed (seen) input and constructed
-We already know what was before, and this should be reflected in the value
+partial value (before recovering the rest of the stack),
-and the regular expression at that step as well. But this is not the
+we can see that the seen part's lexical information
-case for injection-based regular expression derivatives.
+is stored in the form of a regular expression.
-Take the regex $(aa)^* \cdot bc$ matching the string $aabc$
+Consider the regular expression $(aa)^* \cdot bc$ matching the string $aabc$
-as an example, if we have just read the two former characters $aa$:
+and assume we have just read the two characters $aa$:
+\begin{center}
-\begin{center}
-\begin{envForCaption}
 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
-{Consumed: $aa$
+	    {Partial lexing info: $\ONE \cdot a \cdot (aa)^* \cdot bc$ etc.
-\nodepart{two} Not Yet Reached: $bc$ };
+\nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
+\end{tikzpicture}
+\end{center}
+\noindent
+In the injection-based lexing algorithm, we ``neglect'' the red area
+by putting all the characters we have consumed and
+intermediate regular expressions on the stack when
+we go from left to right in the derivative phase.
+The red area grows till the string is exhausted.
+During the injection phase, the value in the blue area
+is built up incrementally, while the red area shrinks.
+Before we have recovered all characters and intermediate
+derivative regular expressions from the stack,
+it is unknown which values these characters and regular expressions
+correspond to:
+\begin{center}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},]
+	    {$(\ONE \cdot \ONE) \cdot (aa)^* \cdot bc $ corresponds to $???$
+\nodepart{two}  $b c$ corresponds to  $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
 %\caption{term 1 \ref{term:1}'s matching configuration}
 \end{tikzpicture}
-\caption{Partially matched String}
+\end{center}
-\end{envForCaption}
+\noindent
-\end{center}
+However, they should be calculable,
-%\caption{Input String}\label{StringPartial}
+as characters and regular expression shapes
-%\end{figure}
+after taking derivatives w.r.t.\ those characters
+are already known. In our example,
-\noindent
+we know that the value starts with two $a$s,
-We have the value that has already been partially calculated,
+and that they make up one iteration of a Kleene star:
-and the part that has yet to come:
+(We have put the injection-based lexing's partial
-\begin{center}
+result in the right part of the split rectangle
-\begin{envForCaption}
+to contrast it with the partial value produced
+in a forward manner.)
+\begin{center}
 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
-{$\Seq(\Stars[\Char(a), \Char(a)], ???)$
+	    {$\stackrel{Bitcoded}{\longrightarrow} \Seq(\Stars[\Char(a), \Char(a)], ???)$
-\nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
+	\nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$  $\stackrel{Inj}{\longleftarrow}$};
 %\caption{term 1 \ref{term:1}'s matching configuration}
 \end{tikzpicture}
-\caption{Partially constructed Value}
-\end{envForCaption}
-\end{center}
-In the regex derivative part , (after simplification)
-all we have is just what is about to come:
-\begin{center}
-\begin{envForCaption}
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
-\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},]
-{$???$
-\nodepart{two} To Come: $b c$};
-%\caption{term 1 \ref{term:1}'s matching configuration}
-\end{tikzpicture}
-\caption{Derivative}
-\end{envForCaption}
-\end{center}
-\noindent
-The previous part is missing.
-How about keeping the partially constructed value
-attached to the front of the regular expression?
-\begin{center}
-\begin{envForCaption}
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
-\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
-{$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
-\nodepart{two} To Come: $b c$};
-%\caption{term 1 \ref{term:1}'s matching configuration}
-\end{tikzpicture}
-\caption{Derivative}
-\end{envForCaption}
 \end{center}
 \noindent
 We can do this kind of ``attachment'',
 and each time we take off a character,
 augment the attached partially
 constructed value:
 \begin{center}
-\begin{envForCaption}
+\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] (spPoint)
+{$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
+\nodepart{two} Remaining: $b c$};
+\end{tikzpicture}\\
+$\downarrow$\\
 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
 {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$
-\nodepart{two} To Come: $c$};
+\nodepart{two} Remaining: $c$};
 \end{tikzpicture}\\
+$\downarrow$\\
 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
 \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
 {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$
 \nodepart{two} EOF};
 \end{tikzpicture}
-\caption{After $\backslash b$ and $\backslash c$}
-\end{envForCaption}
 \end{center}
 \noindent
 In the end we could recover the value without a backward phase.
 But (partial) values are a bit clumsy to stick together with a regular expression, so
 we instead use bit-codes to encode them.
 Bits and bitcodes (lists of bits) are defined as:
-\begin{envForCaption}
 \begin{center}
 		$b ::=   S \mid  Z \qquad
 bs ::= [] \mid b::bs
 $
 \end{center}
-\caption{Bit-codes datatype}
-\end{envForCaption}
 \noindent
 We use $S$ and $Z$ rather than $1$ and $0$ to avoid
 confusion with the regular expressions $\ONE$ and $\ZERO$.
 Bitcodes (or
 bit-lists) can be used to encode values (or potentially incomplete values) in a
 compact form. This can be straightforwardly seen in the following
 coding function from values to bitcodes:
-\begin{envForCaption}
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
 $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\
 $\textit{code}(\Stars\,[])$ & $\dn$ & $[Z]$\\
 $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $S :: code(v) \;@\;
 code(\Stars\,vs)$
 \end{tabular}
 \end{center}
-\caption{Coding Function for Values}
-\end{envForCaption}
 \noindent
 Here $\textit{code}$ encodes a value into a bit-code by converting
 $\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty
 star iteration by $S$. The border where a local star terminates
 is marked by $Z$.
 This coding is lossy, as it throws away the information about
 characters, and also does not encode the ``boundary'' between two
 sequence values. Moreover, with only the bitcode we cannot even tell
 whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$. The
 reason for choosing this compact way of storing information is that the
 relatively small size of bits can be easily manipulated and ``moved
-around'' in a regular expression.
+around'' in a regular expression.
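 \noindent
 As a small worked illustration of $\textit{code}$, consider the value
 $\Stars\,[\Left(\Char\,a), \Right(\Char\,b)]$, which is what one would expect for
 $(a + b)^*$ matching $ab$. The sketch below assumes the clause
 $\textit{code}(\Right\,v) \dn S :: \textit{code}(v)$, which is how we read the
 description of $\Right$ given above; the remaining steps only use the clauses
 for $\Char$, $\Left$ and $\Stars$:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{code}(\Stars\,[\Left(\Char\,a), \Right(\Char\,b)])$
 & $=$ & $S :: \textit{code}(\Left(\Char\,a)) \;@\; \textit{code}(\Stars\,[\Right(\Char\,b)])$\\
 & $=$ & $[S, Z] \;@\; \textit{code}(\Stars\,[\Right(\Char\,b)])$\\
 & $=$ & $[S, Z] \;@\; (S :: \textit{code}(\Right(\Char\,b)) \;@\; \textit{code}(\Stars\,[]))$\\
 & $=$ & $[S, Z] \;@\; [S, S] \;@\; [Z]$\\
 & $=$ & $[S, Z, S, S, Z]$
 \end{tabular}
 \end{center}
 \noindent
 Read in isolation, the result $[S, Z, S, S, Z]$ does not reveal which $S$ marks
 a star iteration and which $S$ stands for $\Right$, and the characters $a$ and $b$
 no longer appear at all.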
+Because of the lossiness, decoding a bitlist additionally requires
+a regular expression. The function $\decode$ is defined as:
 We define the reverse operation of $\code$, which is $\decode$.
 As expected, $\decode$ not only requires the bit-codes,
 but also a regular expression to guide the decoding and
 fill the gaps of characters:
 %\begin{definition}[Bitdecoding of Values]\mbox{}
-\begin{envForCaption}
 \begin{center}
 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
 $\textit{decode}'\,(Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
 $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
 & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
 \textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
-\end{envForCaption}
 %\end{definition}
 \noindent
+The function $\decode'$ returns a pair consisting of a partially decoded value and the leftover bits.
 $\decode'$ does most of the job while $\decode$ throws
 away leftover bit-codes and returns the value only.
 $\decode$ is terminating as $\decode'$ is terminating.
 We have the property that $\decode$ and $\code$ are
 reverse operations of one another:
 The alternative constructor ($\sum$) has been generalised to
 accept a list of annotated regular expressions rather than just two.
 The first thing we define related to bit-coded regular expressions
-is how we move bits, for instance pasting it at the front of an annotated regex.
+is how we move bits, for instance pasting them at the front of an annotated regular expression.
 The operation $\fuse$ simply attaches bit-codes
 to the front of an annotated regular expression:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
 represented as bit-codes is augmented and carried around
 while a derivative is taken.
 This is done by adding bitcodes to the
 derivatives, for example when one more star iteration is taken (we
 call the operation of derivatives on annotated regular expressions $\bder$
-because it is derivatives on regexes with \emph{b}itcodes),
+because it computes derivatives of regular expressions with \emph{b}itcodes),
 we need to unfold it into a sequence,
 and attach an additional bit $Z$ to the front of $r \backslash c$
 to indicate one more star iteration.
 \begin{center}
 \begin{tabular}{@{}lcl@{}}
 $\retrieve$ is connected to the $\blexer$ in the following way:
 \begin{lemma}\label{blexer_retrieve}
 $\blexer \; r \; s = \decode  \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
 \end{lemma}
 \noindent
-$\retrieve$ allows free navigation on the diagram \ref{InjFigure} for annotated regexes of $\blexer$.
+$\retrieve$ allows free navigation on the diagram \ref{InjFigure} for annotated regular expressions of $\blexer$.
 For plain regular expressions something similar is required as well.
 \subsection{$\flex$}
 Ausaf and Urban cleverly defined an auxiliary function called $\flex$ for $\lexer$,
 defined as

changeset 564	3cbcd7cda0a9
parent 543	b2bea5968b89
child 575	3178f0e948ac