diff -r e71a6e2aca2d -r 35df9cdd36ca ChengsongTanPhdThesis/Chapters/Bitcoded1.tex --- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Mon Aug 15 17:26:08 2022 +0200 +++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Wed Aug 17 01:09:13 2022 +0100 @@ -11,11 +11,299 @@ In this chapter, we are going to describe the bit-coded algorithm introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of regular expressions. +We have implemented their algorithm in Scala, and found out inefficiencies +in their algorithm such as de-duplication not working properly and redundant +fixpoint construction. Their algorithm is improved and verified with the help of +formal proofs. \section{Bit-coded Algorithm} -The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure}, +We first do a recap of what was going on +in the lexer algorithm in Chapter \ref{Inj}, +\begin{center} +\begin{tabular}{lcl} + $\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\ + $\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\ + & & $\quad \phantom{\mid}\; \None \implies \None$\\ + & & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$ +\end{tabular} +\end{center} +\noindent +The algorithm recursively calls $\lexer$ on +each new character input, +and before starting a child call +it stores information of previous lexing steps +on a stack, in the form of regular expressions +and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc. +Each descent into deeper recursive calls in $\lexer$ +causes a new pair of $r_i, c_i$ to be pushed to the call stack. +\begin{figure} +\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] +%\draw (-6,-6) grid (6,6); +\node [ circle ] (r) at (-6, 5) {$r$}; + +%\node (-4, 6) (c1) circle [radius = 0.3] {$c_1$}; +\node [circle, minimum size = 0.1, draw] (c1) at (-4, 5.4) {$c_1$}; +% +%\node (-2, 5) (r1) circle [radius = 0.5] {$r_1$}; +\node [minimum size = 0.5, circle, draw] (r1) at (-2, 5) {$r_1$}; + +\node [minimum width = 2cm, rectangle, draw] (stack) at (0, 3) {Stack}; + +\path + (r) + edge [->, >=stealth',shorten >=1pt] node[left] {} (r1); + +\path (r1) + edge [bend right, dashed] node {saved} (stack); +\path (c1) + edge [bend right, dashed] node {} (stack); + + +\end{tikzpicture} +\caption{First Derivative Taken} +\end{figure} + + + +\begin{figure} +\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] +%\draw (-6,-6) grid (6,6); +\node [ circle ] (r) at (-6, 5) {$r$}; + +%\node (-4, 6) (c1) circle [radius = 0.3] {$c_1$}; +\node [circle, minimum size = 0.1, ] (c1) at (-4, 5.4) {$c_1$}; +% +%\node (-2, 5) (r1) circle [radius = 0.5] {$r_1$}; +\node [minimum size = 0.5, circle, ] (r1) at (-2, 5) {$r_1$}; + + +\node [circle, minimum size = 0.1, draw] (c2) at (0, 5.4) {$c_2$}; +% +%\node (2, 5) (r2) circle [radius = 0.5] {$r_2$}; +\node [circle, draw] (r2) at (2, 5) {$r_2$}; +\node [minimum width = 3cm, minimum height = 1cm, rectangle, draw] (stack) at (0, 2) {\large Stack}; + +\path + (r) + edge [->, >=stealth',shorten >=1pt] node[left] {} (r1); + +\path (r2) + edge [bend right, dashed] node {} (stack); +\path (c2) + edge [bend right, dashed] node {} (stack); + +\path (r1) + edge [] node {} (r2); + +\end{tikzpicture} +\caption{Second Derivative Taken} +\end{figure} +\noindent +As the number of derivative steps increase, +the stack would increase: + +\begin{figure} +\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] +%\draw (-6,-6) grid (6,6); +\node [ circle ] (r) at (-6, 5) {$r$}; + +%\node (-4, 6) (c1) circle [radius = 0.3] {$c_1$}; +\node [circle, minimum size = 0.1, ] (c1) at (-4, 5.4) {$c_1$}; +% +%\node (-2, 5) (r1) circle [radius = 0.5] {$r_1$}; +\node [minimum size = 0.5, circle, ] (r1) at (-2, 5) {$r_1$}; + + +\node [circle, minimum size = 0.1, ] (c2) at (0, 5.4) {$c_2$}; +% +%\node (2, 5) (r2) circle [radius = 0.5] {$r_2$}; +\node [circle, ] (r2) at (2, 5) {$r_2$}; +\node [minimum width = 4cm, minimum height = 2.5cm, rectangle, draw] (stack) at (0, 1) { \large Stack}; + +\node [] (ldots) at (3.5, 5) {}; +%\node (6, 5) (rn) circle [radius = 0.5] {$r_n$}; + +\node [minimum size = 0.5, circle, ] (rn) at (6, 5) {}; + +\node (rldots) at ($(ldots)!.4!(rn)$) {\ldots}; + +\path + (r) + edge [->, >=stealth',shorten >=1pt] node[left] {} (r1); + +\path (rldots) + edge [bend left, dashed] node {} (stack); + +\path (r1) + edge [] node {} (r2); + +\path (r2) + edge [] node {} (ldots); +\path (ldots) + edge [bend left, dashed] node {} (stack); +\path (5.03, 4.9) + edge [bend left, dashed] node {} (stack); +\end{tikzpicture} +\caption{More Derivatives Taken} +\end{figure} +\noindent +After all derivatives have been taken, the stack grows to a maximum size +and the pair of regular expressions and characters $r_i, c_{i+1}$ +are then popped out and used in the injection phase: + +\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] +%\draw (-6,-6) grid (6,6); +\node [radius = 0.5, circle, draw] (r) at (-6, 5) {$r$}; + +%\node (-4, 6) (c1) circle [radius = 0.3] {$c_1$}; +\node [circle, minimum size = 0.1, draw] (c1) at (-4, 5.4) {$c_1$}; +% +%\node (-2, 5) (r1) circle [radius = 0.5] {$r_1$}; +\node [minimum size = 0.5, circle, draw] (r1) at (-2, 5) {$r_1$}; +% +%\node (0, 6) (c2) circle [radius = 0.3] {$c_2$}; +\node [circle, minimum size = 0.1, draw] (c2) at (0, 5.4) {$c_2$}; +% +%\node (2, 5) (r2) circle [radius = 0.5] {$r_2$}; +\node [circle, draw] (r2) at (2, 5) {$r_2$}; +% +% +\node [] (ldots) at (4.5, 5) {}; +%\node (6, 5) (rn) circle [radius = 0.5] {$r_n$}; + +\node [minimum size = 0.5, circle, draw] (rn) at (6, 5) {$r_n$}; + +\node at ($(ldots)!.4!(rn)$) {\ldots}; + +\node [minimum size = 0.5, circle, draw] (vn) at (6, -5) {$v_n$}; + +\node [] (ldots2) at (3.5, -5) {}; + +\node [minimum size = 0.5, circle, draw] (v2) at (2, -5) {$v_2$}; + +\node at ($(ldots2)!.4!(v2)$) {\ldots}; + + +\node [circle, draw] (v1) at (-2, -5) {$v_1$}; + +\node [radius = 0.5, circle, draw] (v) at (-6, -5) {$v$}; + +\node [minimum size = 6cm, rectangle, draw] (stack) at (0, 0) {\Huge Stack}; + +\path + (r) + edge [->, >=stealth',shorten >=1pt] node[left] {} (r1); +\path + (r1) + edge [] node {} (r2); +\path (r2) + edge [] node {} (ldots); +\path (rn) + edge [] node {$\mkeps$} (vn); +\path (vn) + edge [] node {} (ldots2); +\path (v2) + edge [] node {$\inj \; r_1 \; c_2\;v_2$} (v1); + +\path (v1) + edge [] node {$\inj \; r \; c_1 \; v_1$} (v); +\path (r) + edge [dashed, bend right] node {} (stack); + +\path (r1) + edge [dashed, ] node {} (stack); + +\path (c1) + edge [dashed, bend right] node {} (stack); + +\path (c2) + edge [dashed] node {} (stack); +\path (4.5, 5) + edge [dashed, bend left] node {} (stack); +\path (4.9, 5) + edge [dashed, bend left] node {} (stack); +\path (5.3, 5) + edge [dashed, bend left] node {} (stack); +\path (r2) + edge [dashed, ] node {} (stack); +\path (rn) + edge [dashed, bend left] node {} (stack); +\end{tikzpicture} +%\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] +%%\draw (-6,-6) grid (6,6); +%\node [radius = 0.5, circle, draw] (r) at (-6, 5) {$r$}; +% +%%\node (-4, 6) (c1) circle [radius = 0.3] {$c_1$}; +%\node [circle, minimum size = 0.1] (c1) at (-4, 5.4) {$c_1$}; +%% +%%\node (-2, 5) (r1) circle [radius = 0.5] {$r_1$}; +%\node [minimum size = 0.5, circle, draw] (r1) at (-2, 5) {$r_1$}; +%% +%%\node (0, 6) (c2) circle [radius = 0.3] {$c_2$}; +%\node [circle, minimum size = 0.1] (c2) at (0, 5.4) {$c_2$}; +%% +%%\node (2, 5) (r2) circle [radius = 0.5] {$r_2$}; +%\node [circle, draw] (r2) at (2, 5) {$r_2$}; +%% +%% +%\node [] (ldots) at (4.5, 5) {}; +%%\node (6, 5) (rn) circle [radius = 0.5] {$r_n$}; +% +%\node [minimum size = 0.5, circle, draw] (rn) at (6, 5) {$r_n$}; +% +%\node at ($(ldots)!.4!(rn)$) {\ldots}; +% +%\node [minimum size = 0.5, circle, draw] (vn) at (6, -5) {$v_n$}; +% +%\node [] (ldots2) at (3.5, -5) {}; +% +%\node [minimum size = 0.5, circle, draw] (v2) at (2, -5) {$v_2$}; +% +%\node at ($(ldots2)!.4!(v2)$) {\ldots}; +% +% +%\node [circle, draw] (v1) at (-2, -5) {$v_1$}; +% +%\node [radius = 0.5, circle, draw] (v) at (-6, -5) {$v$}; +% +%\node [minimum size = 6cm, rectangle, draw] (stack) at (0, 0) {\Huge Stack}; +% +%\path +% (r) +% edge [->, >=stealth',shorten >=1pt] node[left] {} (r1); +%\path +% (r1) +% edge [] node {} (r2); +%\path (r2) +% edge [] node {} (ldots); +%\path (rn) +% edge [] node {$\mkeps$} (vn); +%\path (vn) +% edge [] node {} (ldots2); +%\path (v2) +% edge [] node {} (v1); +% +%\path (v1) +% edge [] node {} (v); +%\path (r) +% edge [] node {saved} (stack); +% +%\path (r1) +% edge [] node {saved} (stack); +%\end{tikzpicture} +\noindent +The information stored in characters and regular expressions +make the algorithm work in an elegant way, at the expense of being +storing quite a bit of verbose information. + + + +The lexer algorithm in Chapter \ref{Inj}, stores information of previous lexing steps on a stack, in the form of regular expressions and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc. +The red part represents what we already know during the first +derivative phase, +and the blue part represents the unknown part of input. \begin{ceqn} \begin{equation}%\label{graph:injLexer} \begin{tikzcd}[ampersand replacement=\&, execute at end picture={ @@ -39,22 +327,19 @@ \end{equation} \end{ceqn} \noindent -The red part represents what we already know during the first -derivative phase, -and the blue part represents the unknown part of input. The red area expands as we move towards $r_n$, -indicating an increasing stack size during lexing. -Despite having some partial lexing information during +indicating more is known about the lexing result. +Despite knowing this partial lexing information during the forward derivative phase, we choose to store them -temporarily, only to convert the information to lexical -values at a later stage. In essence we are repeating work we -have already done. -This is both inefficient and prone to stack overflow. -A natural question arises as to whether we can store lexing +all the way until $r_n$ is reached. +Then we reconstruct the value character by character +values at a later stage, using information in a Last-In-First-Out +manner. Although the algorithm is elegant and natural, +it can be inefficient and prone to stack overflow.\\ +It turns out we can store lexing information on the fly, while still using regular expression derivatives. - -If we remove the details of the individual +If we remove the individual lexing steps, and use red and blue areas as before to indicate consumed (seen) input and constructed partial value (before recovering the rest of the stack),