lexing: comparison ChengsongTanPhdThesis/Chapters/Bitcoded1.tex


-:e71a6e2aca2d
+:35df9cdd36ca
 %its correctness proof in
 %Chapter 3\ref{Chapter3}.
 In this chapter, we are going to describe the bit-coded algorithm
 introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of
 regular expressions.
+We have implemented their algorithm in Scala and found inefficiencies
+in it, such as de-duplication not working properly and a redundant
+fixpoint construction. We improve the algorithm and verify the improved
+version with the help of formal proofs.
 \section{Bit-coded Algorithm}
-The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure},
+We first recap how
+the lexer algorithm in Chapter \ref{Inj} works:
+\begin{center}
+\begin{tabular}{lcl}
+	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
+	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
+	& & $\quad \phantom{\mid}\; \None \implies \None$\\
+	& & $\quad \mid           \Some(v) \implies \Some(\inj \; r\; c\; v)$
+\end{tabular}
+\end{center}
+\noindent
+The algorithm recursively calls $\lexer$ on
+each new character input,
+and before starting a child call
+it stores information of previous lexing steps
+on a stack, in the form of regular expressions
+and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
+Each descent into deeper recursive calls in $\lexer$
+causes a new pair of $r_i, c_i$ to be pushed to the call stack.
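+\noindent
+As a small illustration (an example of our own choosing, assuming the
+derivative and nullability definitions from Chapter \ref{Inj} with no
+simplification applied, and writing $\mathbf{0}$ and $\mathbf{1}$ for the
+regular expressions matching nothing and the empty string),
+consider lexing the string $ab$ with $r = a \cdot b$:
+\begin{center}
+\begin{tabular}{lcl}
+	$r_1$ & $=$ & $r \backslash a \;=\; \mathbf{1} \cdot b$\\
+	$r_2$ & $=$ & $r_1 \backslash b \;=\; (\mathbf{0} \cdot b) + \mathbf{1}$
+\end{tabular}
+\end{center}
+\noindent
+The first call to $\lexer$ has to remember the pair $r, a$ and the second
+the pair $r_1, b$, so two pairs are on the stack by the time
+$\lexer \; r_2 \; []$ is evaluated.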
+\begin{figure}
+\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick]
+%\draw (-6,-6) grid (6,6);
+\node  [ circle ] (r) at (-6, 5) {$r$};
+%\node (-4, 6) (c1) circle [radius = 0.3] {$c_1$};
+\node  [circle, minimum size = 0.1, draw] (c1) at (-4, 5.4) {$c_1$};
+%
+%\node (-2, 5) (r1) circle [radius = 0.5] {$r_1$};
+\node  [minimum size = 0.5, circle, draw] (r1) at (-2, 5) {$r_1$};
+\node [minimum width = 2cm, rectangle, draw] (stack) at (0, 3) {Stack};
+\path
+	(r)
+edge [->, >=stealth',shorten >=1pt] node[left] {} (r1);
+\path   (r1)
+	edge [bend right, dashed] node {saved} (stack);
+\path   (c1)
+	edge [bend right, dashed] node {} (stack);
+\end{tikzpicture}
+\caption{First Derivative Taken}
+\end{figure}
+\begin{figure}
+\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick]
+%\draw (-6,-6) grid (6,6);
+\node  [ circle ] (r) at (-6, 5) {$r$};
+%\node (-4, 6) (c1) circle [radius = 0.3] {$c_1$};
+\node  [circle, minimum size = 0.1, ] (c1) at (-4, 5.4) {$c_1$};
+%
+%\node (-2, 5) (r1) circle [radius = 0.5] {$r_1$};
+\node  [minimum size = 0.5, circle, ] (r1) at (-2, 5) {$r_1$};
+\node [circle, minimum size = 0.1, draw] (c2) at (0, 5.4) {$c_2$};
+%
+%\node (2, 5) (r2) circle [radius = 0.5] {$r_2$};
+\node [circle, draw] (r2) at (2, 5) {$r_2$};
+\node [minimum width = 3cm, minimum height = 1cm, rectangle, draw] (stack) at (0, 2) {\large Stack};
+\path
+	(r)
+edge [->, >=stealth',shorten >=1pt] node[left] {} (r1);
+\path   (r2)
+	edge [bend right, dashed] node {} (stack);
+\path   (c2)
+	edge [bend right, dashed] node {} (stack);
+\path   (r1)
+	edge [] node {} (r2);
+\end{tikzpicture}
+\caption{Second Derivative Taken}
+\end{figure}
+\noindent
+As the number of derivative steps increases,
+the stack grows:
+\begin{figure}
+\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick]
+%\draw (-6,-6) grid (6,6);
+\node  [ circle ] (r) at (-6, 5) {$r$};
+%\node (-4, 6) (c1) circle [radius = 0.3] {$c_1$};
+\node  [circle, minimum size = 0.1, ] (c1) at (-4, 5.4) {$c_1$};
+%
+%\node (-2, 5) (r1) circle [radius = 0.5] {$r_1$};
+\node  [minimum size = 0.5, circle, ] (r1) at (-2, 5) {$r_1$};
+\node [circle, minimum size = 0.1, ] (c2) at (0, 5.4) {$c_2$};
+%
+%\node (2, 5) (r2) circle [radius = 0.5] {$r_2$};
+\node [circle, ] (r2) at (2, 5) {$r_2$};
+\node [minimum width = 4cm, minimum height = 2.5cm, rectangle, draw] (stack) at (0, 1) { \large Stack};
+\node [] (ldots) at (3.5, 5) {};
+%\node (6, 5) (rn) circle [radius = 0.5] {$r_n$};
+\node [minimum size = 0.5, circle, ] (rn) at (6, 5) {};
+\node (rldots) at ($(ldots)!.4!(rn)$) {\ldots};
+\path
+	(r)
+edge [->, >=stealth',shorten >=1pt] node[left] {} (r1);
+\path   (rldots)
+	edge [bend left, dashed] node {} (stack);
+\path   (r1)
+	edge [] node {} (r2);
+\path   (r2)
+	edge [] node {} (ldots);
+\path   (ldots)
+	edge [bend left, dashed] node {} (stack);
+\path   (5.03, 4.9)
+	edge [bend left, dashed] node {} (stack);
+\end{tikzpicture}
+\caption{More Derivatives Taken}
+\end{figure}
+\noindent
+After all derivatives have been taken, the stack has reached its maximum size,
+and the pairs of regular expressions and characters $r_i, c_{i+1}$
+are then popped off and used in the injection phase:
+\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick]
+%\draw (-6,-6) grid (6,6);
+\node  [radius = 0.5, circle, draw] (r) at (-6, 5) {$r$};
+%\node (-4, 6) (c1) circle [radius = 0.3] {$c_1$};
+\node  [circle, minimum size = 0.1, draw] (c1) at (-4, 5.4) {$c_1$};
+%
+%\node (-2, 5) (r1) circle [radius = 0.5] {$r_1$};
+\node  [minimum size = 0.5, circle, draw] (r1) at (-2, 5) {$r_1$};
+%
+%\node (0, 6)  (c2) circle [radius = 0.3] {$c_2$};
+\node [circle, minimum size = 0.1, draw] (c2) at (0, 5.4) {$c_2$};
+%
+%\node (2, 5) (r2) circle [radius = 0.5] {$r_2$};
+\node [circle, draw] (r2) at (2, 5) {$r_2$};
+%
+%
+\node [] (ldots) at (4.5, 5) {};
+%\node (6, 5) (rn) circle [radius = 0.5] {$r_n$};
+\node [minimum size = 0.5, circle, draw] (rn) at (6, 5) {$r_n$};
+\node at ($(ldots)!.4!(rn)$) {\ldots};
+\node [minimum size = 0.5, circle, draw] (vn) at (6, -5) {$v_n$};
+\node [] (ldots2) at (3.5, -5) {};
+\node  [minimum size = 0.5, circle, draw] (v2) at (2, -5) {$v_2$};
+\node at ($(ldots2)!.4!(v2)$) {\ldots};
+\node [circle, draw] (v1) at (-2, -5) {$v_1$};
+\node  [radius = 0.5, circle, draw] (v) at (-6, -5) {$v$};
+\node [minimum size = 6cm, rectangle, draw] (stack) at (0, 0) {\Huge Stack};
+\path
+	(r)
+edge [->, >=stealth',shorten >=1pt] node[left] {} (r1);
+\path
+	(r1)
+edge [] node {} (r2);
+\path   (r2)
+	edge [] node {} (ldots);
+\path   (rn)
+	edge [] node {$\mkeps$} (vn);
+\path   (vn)
+	edge [] node {} (ldots2);
+\path   (v2)
+	edge [] node {$\inj \; r_1 \; c_2\;v_2$} (v1);
+\path   (v1)
+	edge [] node {$\inj \; r \; c_1 \; v_1$} (v);
+\path   (r)
+	edge [dashed, bend right] node {} (stack);
+\path   (r1)
+	edge [dashed, ] node {} (stack);
+\path   (c1)
+	edge [dashed, bend right] node {} (stack);
+\path   (c2)
+	edge [dashed] node {} (stack);
+\path   (4.5, 5)
+	edge [dashed, bend left] node {} (stack);
+\path   (4.9, 5)
+	edge [dashed, bend left] node {} (stack);
+\path   (5.3, 5)
+	edge [dashed, bend left] node {} (stack);
+\path (r2)
+	edge [dashed, ] node {} (stack);
+\path (rn)
+	edge [dashed, bend left] node {} (stack);
+\end{tikzpicture}
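+\noindent
+As a concrete instance of this phase (an illustrative example of ours,
+assuming the definitions of $\mkeps$ and $\inj$ from Chapter \ref{Inj},
+and writing values with the constructors $\textit{Empty}$, $\textit{Char}$,
+$\textit{Seq}$ and $\textit{Right}$), take $r = a \cdot b$ and the string $ab$.
+Without simplification the derivatives are $r_1 = \mathbf{1} \cdot b$ and
+$r_2 = (\mathbf{0} \cdot b) + \mathbf{1}$, where $\mathbf{0}$ and $\mathbf{1}$
+stand for the regular expressions matching nothing and the empty string.
+The values are then rebuilt in the reverse order of the derivative steps:
+\begin{center}
+\begin{tabular}{lcl}
+	$v_2$ & $=$ & $\mkeps \; r_2 \;=\; \textit{Right}\;(\textit{Empty})$\\
+	$v_1$ & $=$ & $\inj \; r_1 \; b \; v_2 \;=\; \textit{Seq}\;(\textit{Empty})\;(\textit{Char}\;b)$\\
+	$v$ & $=$ & $\inj \; r \; a \; v_1 \;=\; \textit{Seq}\;(\textit{Char}\;a)\;(\textit{Char}\;b)$
+\end{tabular}
+\end{center}
+\noindent
+The pair $r_1, b$, which was pushed last, is the first one to be consumed,
+and the pair $r, a$, which was pushed first, is consumed last.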
+%\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick]
+%%\draw (-6,-6) grid (6,6);
+%\node  [radius = 0.5, circle, draw] (r) at (-6, 5) {$r$};
+%
+%%\node (-4, 6) (c1) circle [radius = 0.3] {$c_1$};
+%\node  [circle, minimum size = 0.1] (c1) at (-4, 5.4) {$c_1$};
+%%
+%%\node (-2, 5) (r1) circle [radius = 0.5] {$r_1$};
+%\node  [minimum size = 0.5, circle, draw] (r1) at (-2, 5) {$r_1$};
+%%
+%%\node (0, 6)  (c2) circle [radius = 0.3] {$c_2$};
+%\node [circle, minimum size = 0.1] (c2) at (0, 5.4) {$c_2$};
+%%
+%%\node (2, 5) (r2) circle [radius = 0.5] {$r_2$};
+%\node [circle, draw] (r2) at (2, 5) {$r_2$};
+%%
+%%
+%\node [] (ldots) at (4.5, 5) {};
+%%\node (6, 5) (rn) circle [radius = 0.5] {$r_n$};
+%
+%\node [minimum size = 0.5, circle, draw] (rn) at (6, 5) {$r_n$};
+%
+%\node at ($(ldots)!.4!(rn)$) {\ldots};
+%
+%\node [minimum size = 0.5, circle, draw] (vn) at (6, -5) {$v_n$};
+%
+%\node [] (ldots2) at (3.5, -5) {};
+%
+%\node  [minimum size = 0.5, circle, draw] (v2) at (2, -5) {$v_2$};
+%
+%\node at ($(ldots2)!.4!(v2)$) {\ldots};
+%
+%
+%\node [circle, draw] (v1) at (-2, -5) {$v_1$};
+%
+%\node  [radius = 0.5, circle, draw] (v) at (-6, -5) {$v$};
+%
+%\node [minimum size = 6cm, rectangle, draw] (stack) at (0, 0) {\Huge Stack};
+%
+%\path
+%	(r)
+%        edge [->, >=stealth',shorten >=1pt] node[left] {} (r1);
+%\path
+%	(r1)
+%        edge [] node {} (r2);
+%\path   (r2)
+%	edge [] node {} (ldots);
+%\path   (rn)
+%	edge [] node {$\mkeps$} (vn);
+%\path   (vn)
+%	edge [] node {} (ldots2);
+%\path   (v2)
+%	edge [] node {} (v1);
+%
+%\path   (v1)
+%	edge [] node {} (v);
+%\path   (r)
+%	edge [] node {saved} (stack);
+%
+%\path   (r1)
+%	edge [] node {saved} (stack);
+%\end{tikzpicture}
+\noindent
+The information stored in characters and regular expressions
+makes the algorithm work in an elegant way, at the expense of
+storing quite a bit of verbose information.
+The lexer algorithm in Chapter \ref{Inj},
 stores information of previous lexing steps
 on a stack, in the form of regular expressions
 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
+The red part represents what we already know during the first
+derivative phase,
+and the blue part represents the unknown part of input.
 \begin{ceqn}
 \begin{equation}%\label{graph:injLexer}
 	\begin{tikzcd}[ampersand replacement=\&, execute at end picture={
 			\begin{scope}[on background layer]
 				\node[rectangle, fill={red!30},
 v_0           \& v_1 \arrow[l,"inj_{r_0} c_0"]                \& v_2 \arrow[l, "inj_{r_1} c_1"]              \& v_n \arrow[l, dashed]         \\
 \end{tikzcd}
 \end{equation}
 \end{ceqn}
 \noindent
-The red part represents what we already know during the first
-derivative phase,
-and the blue part represents the unknown part of input.
 The red area expands as we move towards $r_n$,
-indicating an increasing stack size during lexing.
+indicating more is known about the lexing result.
-Despite having some partial lexing information during
+Despite knowing this partial lexing information during
 the forward derivative phase, we choose to store it
-temporarily, only to convert the information to lexical
+all the way until $r_n$ is reached.
-values at a later stage. In essence we are repeating work we
+Then we reconstruct the value character by character
-have already done.
+at a later stage, using the stored information in a Last-In-First-Out
-This is both inefficient and prone to stack overflow.
+manner. Although the algorithm is elegant and natural,
-A natural question arises as to whether we can store lexing
+it can be inefficient and prone to stack overflow.\\
+It turns out we can store lexing
 information on the fly, while still using regular expression
 derivatives.
+If we remove the individual
-If we remove the details of the individual
 lexing steps, and use red and blue areas as before
 to indicate consumed (seen) input and constructed
 partial value (before recovering the rest of the stack),
 one could see that the seen part's lexical information
 is stored in the form of a regular expression.

changeset 579	35df9cdd36ca
parent 576	3e1b699696b6
child 580	e0f0a81f907b