diff -r 57e33978e55d -r 3cbcd7cda0a9 ChengsongTanPhdThesis/Chapters/Bitcoded1.tex --- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Tue Jul 05 00:42:06 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Wed Jul 13 08:27:28 2022 +0100 @@ -8,96 +8,105 @@ %simplifications and therefore introduce our version of the bitcoded algorithm and %its correctness proof in %Chapter 3\ref{Chapter3}. -In this chapter, we are going to introduce the bit-coded algorithm -introduced by Sulzmann and Lu to address the problem of -under-simplified regular expressions. +In this chapter, we are going to describe the bit-coded algorithm +introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of +regular expressions. \section{Bit-coded Algorithm} The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure}, stores information of previous lexing steps on a stack, in the form of regular expressions and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc. -\begin{envForCaption} \begin{ceqn} \begin{equation}%\label{graph:injLexer} -\begin{tikzcd} -r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ -v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] + \begin{tikzcd}[ampersand replacement=\&, execute at end picture={ + \begin{scope}[on background layer] + \node[rectangle, fill={red!30}, + pattern=north east lines, pattern color=red, + fit={(-3,-1) (-3, 1) (1, -1) + (1, 1)} + ] + {}; , + \node[rectangle, fill={blue!20}, + pattern=north east lines, pattern color=blue, + fit= {(1, -1) (1, 1) (3, -1) (3, 1)} + ] + {}; + \end{scope}} + ] +r_0 \arrow[r, "\backslash c_0"] \arrow[d] \& r_1 \arrow[r, "\backslash c_1"] \arrow[d] \& r_2 \arrow[r, dashed] \arrow[d] \& r_n \arrow[d, "mkeps" description] \\ +v_0 \& v_1 \arrow[l,"inj_{r_0} c_0"] \& v_2 \arrow[l, "inj_{r_1} c_1"] \& v_n \arrow[l, dashed] \\ \end{tikzcd} \end{equation} \end{ceqn} -\caption{Injection-based Lexing from Chapter\ref{Inj}}\label{InjFigure} -\end{envForCaption} \noindent +The red part represents what we already know during the first +derivative phase, +and the blue part represents the unknown part of input. +The red area expands as we move towards $r_n$, +indicating an increasing stack size during lexing. +Despite having some partial lexing information during +the forward derivative phase, we choose to store them +temporarily, only to convert the information to lexical +values at a later stage. In essence we are repeating work we +have already done. This is both inefficient and prone to stack overflow. A natural question arises as to whether we can store lexing information on the fly, while still using regular expression -derivatives? +derivatives. -In a lexing algorithm's run, split by the current input position, -we have a sub-string that has been consumed, -and the sub-string that has yet to come. -We already know what was before, and this should be reflected in the value -and the regular expression at that step as well. But this is not the -case for injection-based regular expression derivatives. -Take the regex $(aa)^* \cdot bc$ matching the string $aabc$ -as an example, if we have just read the two former characters $aa$: - +If we remove the details of the individual +lexing steps, and use red and blue areas as before +to indicate consumed (seen) input and constructed +partial value (before recovering the rest of the stack), +one could see that the seen part's lexical information +is stored in the form of a regular expression. +Consider the regular expression $(aa)^* \cdot bc$ matching the string $aabc$ +and assume we have just read the two characters $aa$: \begin{center} -\begin{envForCaption} -\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] - \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] - {Consumed: $aa$ - \nodepart{two} Not Yet Reached: $bc$ }; -%\caption{term 1 \ref{term:1}'s matching configuration} -\end{tikzpicture} -\caption{Partially matched String} -\end{envForCaption} -\end{center} -%\caption{Input String}\label{StringPartial} -%\end{figure} - -\noindent -We have the value that has already been partially calculated, -and the part that has yet to come: -\begin{center} -\begin{envForCaption} \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] - {$\Seq(\Stars[\Char(a), \Char(a)], ???)$ + {Partial lexing info: $\ONE \cdot a \cdot (aa)^* \cdot bc$ etc. \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$}; -%\caption{term 1 \ref{term:1}'s matching configuration} \end{tikzpicture} -\caption{Partially constructed Value} -\end{envForCaption} \end{center} - -In the regex derivative part , (after simplification) -all we have is just what is about to come: +\noindent +In the injection-based lexing algorithm, we ``neglect" the red area +by putting all the characters we have consumed and +intermediate regular expressions on the stack when +we go from left to right in the derivative phase. +The red area grows till the string is exhausted. +During the injection phase, the value in the blue area +is built up incrementally, while the red area shrinks. +Before we have recovered all characters and intermediate +derivative regular expressions from the stack, +what values these characters and regular expressions correspond +to are unknown: \begin{center} -\begin{envForCaption} \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},] - {$???$ - \nodepart{two} To Come: $b c$}; + {$(\ONE \cdot \ONE) \cdot (aa)^* \cdot bc $ correspond to:$???$ + \nodepart{two} $b c$ corresponds to $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$}; %\caption{term 1 \ref{term:1}'s matching configuration} \end{tikzpicture} -\caption{Derivative} -\end{envForCaption} \end{center} \noindent -The previous part is missing. -How about keeping the partially constructed value -attached to the front of the regular expression? +However, they should be calculable, +as characters and regular expression shapes +after taking derivative w.r.t those characters +have already been known, therefore in our example, +we know that the value starts with two $a$s, +and makes up to an iteration in a Kleene star: +(We have put the injection-based lexing's partial +result in the right part of the split rectangle +to contrast it with the partial valued produced +in a forward manner) \begin{center} -\begin{envForCaption} \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] - {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$ - \nodepart{two} To Come: $b c$}; + {$\stackrel{Bitcoded}{\longrightarrow} \Seq(\Stars[\Char(a), \Char(a)], ???)$ + \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$ $\stackrel{Inj}{\longleftarrow}$}; %\caption{term 1 \ref{term:1}'s matching configuration} \end{tikzpicture} -\caption{Derivative} -\end{envForCaption} \end{center} \noindent If we do this kind of "attachment" @@ -105,19 +114,23 @@ constructed value when taking off a character: \begin{center} -\begin{envForCaption} +\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] + \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] (spPoint) + {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$ + \nodepart{two} Remaining: $b c$}; +\end{tikzpicture}\\ +$\downarrow$\\ \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$ - \nodepart{two} To Come: $c$}; + \nodepart{two} Remaining: $c$}; \end{tikzpicture}\\ +$\downarrow$\\ \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$ \nodepart{two} EOF}; \end{tikzpicture} -\caption{After $\backslash b$ and $\backslash c$} -\end{envForCaption} \end{center} \noindent In the end we could recover the value without a backward phase. @@ -125,14 +138,11 @@ we instead use bit-codes to encode them. Bits and bitcodes (lists of bits) are defined as: -\begin{envForCaption} \begin{center} $b ::= S \mid Z \qquad bs ::= [] \mid b::bs $ \end{center} -\caption{Bit-codes datatype} -\end{envForCaption} \noindent Using $S$ and $Z$ rather than $1$ and $0$ is to avoid @@ -141,7 +151,6 @@ bit-lists) can be used to encode values (or potentially incomplete values) in a compact form. This can be straightforwardly seen in the following coding function from values to bitcodes: -\begin{envForCaption} \begin{center} \begin{tabular}{lcl} $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ @@ -154,9 +163,6 @@ code(\Stars\,vs)$ \end{tabular} \end{center} -\caption{Coding Function for Values} -\end{envForCaption} - \noindent Here $\textit{code}$ encodes a value into a bit-code by converting $\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty @@ -168,9 +174,10 @@ whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$. The reason for choosing this compact way of storing information is that the relatively small size of bits can be easily manipulated and ``moved -around'' in a regular expression. +around" in a regular expression. - +Because of the lossiness, the process of decoding a bitlist requires additionally +a regular expression. The function $\decode$ is defined as: We define the reverse operation of $\code$, which is $\decode$. As expected, $\decode$ not only requires the bit-codes, but also a regular expression to guide the decoding and @@ -178,7 +185,6 @@ %\begin{definition}[Bitdecoding of Values]\mbox{} -\begin{envForCaption} \begin{center} \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ @@ -205,10 +211,10 @@ \textit{else}\;\textit{None}$ \end{tabular} \end{center} -\end{envForCaption} %\end{definition} \noindent +The function $\decode'$ returns a pair consisting of a partially decoded value and some leftover: $\decode'$ does most of the job while $\decode$ throws away leftover bit-codes and returns the value only. $\decode$ is terminating as $\decode'$ is terminating. @@ -251,7 +257,7 @@ The first thing we define related to bit-coded regular expressions -is how we move bits, for instance pasting it at the front of an annotated regex. +is how we move bits, for instance pasting it at the front of an annotated regular expression. The operation $\fuse$ is just to attach bit-codes to the front of an annotated regular expression: \begin{center} @@ -364,7 +370,7 @@ This is done by adding bitcodes to the derivatives, for example when one more star iteratoin is taken (we call the operation of derivatives on annotated regular expressions $\bder$ -because it is derivatives on regexes with \emph{b}itcodes), +because it is derivatives on regular expressiones with \emph{b}itcodes), we need to unfold it into a sequence, and attach an additional bit $Z$ to the front of $r \backslash c$ to indicate one more star iteration. @@ -652,7 +658,7 @@ $\blexer \; r \; s = \decode \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$ \end{lemma} \noindent -$\retrieve$ allows free navigation on the diagram \ref{InjFigure} for annotated regexes of $\blexer$. +$\retrieve$ allows free navigation on the diagram \ref{InjFigure} for annotated regular expressiones of $\blexer$. For plain regular expressions something similar is required as well. \subsection{$\flex$}