ChengsongTanPhdThesis/Chapters/Bitcoded1.tex
changeset 579 35df9cdd36ca
parent 576 3e1b699696b6
child 580 e0f0a81f907b
equal deleted inserted replaced
578:e71a6e2aca2d 579:35df9cdd36ca
     9 %its correctness proof in 
     9 %its correctness proof in 
    10 %Chapter 3\ref{Chapter3}. 
    10 %Chapter 3\ref{Chapter3}. 
    11 In this chapter, we are going to describe the bit-coded algorithm
    11 In this chapter, we are going to describe the bit-coded algorithm
    12 introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of 
    12 introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of 
    13 regular expressions. 
    13 regular expressions. 
       
    14 We have implemented their algorithm in Scala, and found out inefficiencies
       
    15 in their algorithm such as de-duplication not working properly and redundant
       
    16 fixpoint construction. Their algorithm is improved and verified with the help of
       
    17 formal proofs.
    14 \section{Bit-coded Algorithm}
    18 \section{Bit-coded Algorithm}
    15 The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure},
    19 We first do a recap of what was going on 
       
    20 in the lexer algorithm in Chapter \ref{Inj},
       
    21 \begin{center}
       
    22 \begin{tabular}{lcl}
       
    23 	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
       
    24 	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
       
    25 	& & $\quad \phantom{\mid}\; \None \implies \None$\\
       
    26 	& & $\quad \mid           \Some(v) \implies \Some(\inj \; r\; c\; v)$
       
    27 \end{tabular}
       
    28 \end{center}
       
    29 \noindent
       
    30 The algorithm recursively calls $\lexer$ on 
       
    31 each new character input,
       
    32 and before starting a child call
       
    33 it stores information of previous lexing steps
       
    34 on a stack, in the form of regular expressions
       
    35 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
       
    36 Each descent into deeper recursive calls in $\lexer$
       
    37 causes a new pair of $r_i, c_i$ to be pushed to the call stack.
       
    38 \begin{figure}
       
    39 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] 
       
    40 %\draw (-6,-6) grid (6,6);
       
    41 \node  [ circle ] (r) at (-6, 5) {$r$};
       
    42 
       
    43 %\node (-4, 6) (c1) circle [radius = 0.3] {$c_1$};
       
    44 \node  [circle, minimum size = 0.1, draw] (c1) at (-4, 5.4) {$c_1$};
       
    45 %
       
    46 %\node (-2, 5) (r1) circle [radius = 0.5] {$r_1$};
       
    47 \node  [minimum size = 0.5, circle, draw] (r1) at (-2, 5) {$r_1$};
       
    48 
       
    49 \node [minimum width = 2cm, rectangle, draw] (stack) at (0, 3) {Stack};
       
    50 
       
    51 \path
       
    52 	(r)
       
    53         edge [->, >=stealth',shorten >=1pt] node[left] {} (r1);
       
    54 
       
    55 \path   (r1)
       
    56 	edge [bend right, dashed] node {saved} (stack);
       
    57 \path   (c1)
       
    58 	edge [bend right, dashed] node {} (stack);
       
    59 
       
    60 
       
    61 \end{tikzpicture}
       
    62 \caption{First Derivative Taken}
       
    63 \end{figure}
       
    64 
       
    65 
       
    66 
       
    67 \begin{figure}
       
    68 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] 
       
    69 %\draw (-6,-6) grid (6,6);
       
    70 \node  [ circle ] (r) at (-6, 5) {$r$};
       
    71 
       
    72 %\node (-4, 6) (c1) circle [radius = 0.3] {$c_1$};
       
    73 \node  [circle, minimum size = 0.1, ] (c1) at (-4, 5.4) {$c_1$};
       
    74 %
       
    75 %\node (-2, 5) (r1) circle [radius = 0.5] {$r_1$};
       
    76 \node  [minimum size = 0.5, circle, ] (r1) at (-2, 5) {$r_1$};
       
    77 
       
    78 
       
    79 \node [circle, minimum size = 0.1, draw] (c2) at (0, 5.4) {$c_2$};
       
    80 %
       
    81 %\node (2, 5) (r2) circle [radius = 0.5] {$r_2$};
       
    82 \node [circle, draw] (r2) at (2, 5) {$r_2$};
       
    83 \node [minimum width = 3cm, minimum height = 1cm, rectangle, draw] (stack) at (0, 2) {\large Stack};
       
    84 
       
    85 \path
       
    86 	(r)
       
    87         edge [->, >=stealth',shorten >=1pt] node[left] {} (r1);
       
    88 
       
    89 \path   (r2)
       
    90 	edge [bend right, dashed] node {} (stack);
       
    91 \path   (c2)
       
    92 	edge [bend right, dashed] node {} (stack);
       
    93 
       
    94 \path   (r1)
       
    95 	edge [] node {} (r2);
       
    96 
       
    97 \end{tikzpicture}
       
    98 \caption{Second Derivative Taken}
       
    99 \end{figure}
       
   100 \noindent
       
   101 As the number of derivative steps increase,
       
   102 the stack would increase:
       
   103 
       
   104 \begin{figure}
       
   105 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] 
       
   106 %\draw (-6,-6) grid (6,6);
       
   107 \node  [ circle ] (r) at (-6, 5) {$r$};
       
   108 
       
   109 %\node (-4, 6) (c1) circle [radius = 0.3] {$c_1$};
       
   110 \node  [circle, minimum size = 0.1, ] (c1) at (-4, 5.4) {$c_1$};
       
   111 %
       
   112 %\node (-2, 5) (r1) circle [radius = 0.5] {$r_1$};
       
   113 \node  [minimum size = 0.5, circle, ] (r1) at (-2, 5) {$r_1$};
       
   114 
       
   115 
       
   116 \node [circle, minimum size = 0.1, ] (c2) at (0, 5.4) {$c_2$};
       
   117 %
       
   118 %\node (2, 5) (r2) circle [radius = 0.5] {$r_2$};
       
   119 \node [circle, ] (r2) at (2, 5) {$r_2$};
       
   120 \node [minimum width = 4cm, minimum height = 2.5cm, rectangle, draw] (stack) at (0, 1) { \large Stack};
       
   121 
       
   122 \node [] (ldots) at (3.5, 5) {};
       
   123 %\node (6, 5) (rn) circle [radius = 0.5] {$r_n$};
       
   124 
       
   125 \node [minimum size = 0.5, circle, ] (rn) at (6, 5) {};
       
   126 
       
   127 \node (rldots) at ($(ldots)!.4!(rn)$) {\ldots};
       
   128 
       
   129 \path
       
   130 	(r)
       
   131         edge [->, >=stealth',shorten >=1pt] node[left] {} (r1);
       
   132 
       
   133 \path   (rldots)
       
   134 	edge [bend left, dashed] node {} (stack);
       
   135 
       
   136 \path   (r1)
       
   137 	edge [] node {} (r2);
       
   138 
       
   139 \path   (r2)
       
   140 	edge [] node {} (ldots);
       
   141 \path   (ldots)
       
   142 	edge [bend left, dashed] node {} (stack);
       
   143 \path   (5.03, 4.9)
       
   144 	edge [bend left, dashed] node {} (stack);
       
   145 \end{tikzpicture}
       
   146 \caption{More Derivatives Taken}
       
   147 \end{figure}
       
   148 \noindent
       
   149 After all derivatives have been taken, the stack grows to a maximum size
       
   150 and the pair of regular expressions and characters $r_i, c_{i+1}$ 
       
   151 are then popped out and used in the injection phase:
       
   152 
       
   153 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] 
       
   154 %\draw (-6,-6) grid (6,6);
       
   155 \node  [radius = 0.5, circle, draw] (r) at (-6, 5) {$r$};
       
   156 
       
   157 %\node (-4, 6) (c1) circle [radius = 0.3] {$c_1$};
       
   158 \node  [circle, minimum size = 0.1, draw] (c1) at (-4, 5.4) {$c_1$};
       
   159 %
       
   160 %\node (-2, 5) (r1) circle [radius = 0.5] {$r_1$};
       
   161 \node  [minimum size = 0.5, circle, draw] (r1) at (-2, 5) {$r_1$};
       
   162 %
       
   163 %\node (0, 6)  (c2) circle [radius = 0.3] {$c_2$};
       
   164 \node [circle, minimum size = 0.1, draw] (c2) at (0, 5.4) {$c_2$};
       
   165 %
       
   166 %\node (2, 5) (r2) circle [radius = 0.5] {$r_2$};
       
   167 \node [circle, draw] (r2) at (2, 5) {$r_2$};
       
   168 %
       
   169 %
       
   170 \node [] (ldots) at (4.5, 5) {};
       
   171 %\node (6, 5) (rn) circle [radius = 0.5] {$r_n$};
       
   172 
       
   173 \node [minimum size = 0.5, circle, draw] (rn) at (6, 5) {$r_n$};
       
   174 
       
   175 \node at ($(ldots)!.4!(rn)$) {\ldots};
       
   176 
       
   177 \node [minimum size = 0.5, circle, draw] (vn) at (6, -5) {$v_n$};
       
   178 
       
   179 \node [] (ldots2) at (3.5, -5) {};
       
   180 
       
   181 \node  [minimum size = 0.5, circle, draw] (v2) at (2, -5) {$v_2$};
       
   182 
       
   183 \node at ($(ldots2)!.4!(v2)$) {\ldots};
       
   184 
       
   185 
       
   186 \node [circle, draw] (v1) at (-2, -5) {$v_1$};
       
   187 
       
   188 \node  [radius = 0.5, circle, draw] (v) at (-6, -5) {$v$};
       
   189 
       
   190 \node [minimum size = 6cm, rectangle, draw] (stack) at (0, 0) {\Huge Stack};
       
   191 
       
   192 \path
       
   193 	(r)
       
   194         edge [->, >=stealth',shorten >=1pt] node[left] {} (r1);
       
   195 \path
       
   196 	(r1)
       
   197         edge [] node {} (r2);
       
   198 \path   (r2)
       
   199 	edge [] node {} (ldots);
       
   200 \path   (rn)
       
   201 	edge [] node {$\mkeps$} (vn);
       
   202 \path   (vn) 
       
   203 	edge [] node {} (ldots2);
       
   204 \path   (v2)
       
   205 	edge [] node {$\inj \; r_1 \; c_2\;v_2$} (v1);
       
   206 
       
   207 \path   (v1)
       
   208 	edge [] node {$\inj \; r \; c_1 \; v_1$} (v);
       
   209 \path   (r)
       
   210 	edge [dashed, bend right] node {} (stack);
       
   211 
       
   212 \path   (r1)
       
   213 	edge [dashed, ] node {} (stack);
       
   214 
       
   215 \path   (c1)
       
   216 	edge [dashed, bend right] node {} (stack);
       
   217 
       
   218 \path   (c2)
       
   219 	edge [dashed] node {} (stack);
       
   220 \path   (4.5, 5)
       
   221 	edge [dashed, bend left] node {} (stack);
       
   222 \path   (4.9, 5)
       
   223 	edge [dashed, bend left] node {} (stack);
       
   224 \path   (5.3, 5)
       
   225 	edge [dashed, bend left] node {} (stack);
       
   226 \path (r2)
       
   227 	edge [dashed, ] node {} (stack);
       
   228 \path (rn)
       
   229 	edge [dashed, bend left] node {} (stack);
       
   230 \end{tikzpicture}
       
   231 %\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] 
       
   232 %%\draw (-6,-6) grid (6,6);
       
   233 %\node  [radius = 0.5, circle, draw] (r) at (-6, 5) {$r$};
       
   234 %
       
   235 %%\node (-4, 6) (c1) circle [radius = 0.3] {$c_1$};
       
   236 %\node  [circle, minimum size = 0.1] (c1) at (-4, 5.4) {$c_1$};
       
   237 %%
       
   238 %%\node (-2, 5) (r1) circle [radius = 0.5] {$r_1$};
       
   239 %\node  [minimum size = 0.5, circle, draw] (r1) at (-2, 5) {$r_1$};
       
   240 %%
       
   241 %%\node (0, 6)  (c2) circle [radius = 0.3] {$c_2$};
       
   242 %\node [circle, minimum size = 0.1] (c2) at (0, 5.4) {$c_2$};
       
   243 %%
       
   244 %%\node (2, 5) (r2) circle [radius = 0.5] {$r_2$};
       
   245 %\node [circle, draw] (r2) at (2, 5) {$r_2$};
       
   246 %%
       
   247 %%
       
   248 %\node [] (ldots) at (4.5, 5) {};
       
   249 %%\node (6, 5) (rn) circle [radius = 0.5] {$r_n$};
       
   250 %
       
   251 %\node [minimum size = 0.5, circle, draw] (rn) at (6, 5) {$r_n$};
       
   252 %
       
   253 %\node at ($(ldots)!.4!(rn)$) {\ldots};
       
   254 %
       
   255 %\node [minimum size = 0.5, circle, draw] (vn) at (6, -5) {$v_n$};
       
   256 %
       
   257 %\node [] (ldots2) at (3.5, -5) {};
       
   258 %
       
   259 %\node  [minimum size = 0.5, circle, draw] (v2) at (2, -5) {$v_2$};
       
   260 %
       
   261 %\node at ($(ldots2)!.4!(v2)$) {\ldots};
       
   262 %
       
   263 %
       
   264 %\node [circle, draw] (v1) at (-2, -5) {$v_1$};
       
   265 %
       
   266 %\node  [radius = 0.5, circle, draw] (v) at (-6, -5) {$v$};
       
   267 %
       
   268 %\node [minimum size = 6cm, rectangle, draw] (stack) at (0, 0) {\Huge Stack};
       
   269 %
       
   270 %\path
       
   271 %	(r)
       
   272 %        edge [->, >=stealth',shorten >=1pt] node[left] {} (r1);
       
   273 %\path
       
   274 %	(r1)
       
   275 %        edge [] node {} (r2);
       
   276 %\path   (r2)
       
   277 %	edge [] node {} (ldots);
       
   278 %\path   (rn)
       
   279 %	edge [] node {$\mkeps$} (vn);
       
   280 %\path   (vn) 
       
   281 %	edge [] node {} (ldots2);
       
   282 %\path   (v2)
       
   283 %	edge [] node {} (v1);
       
   284 %
       
   285 %\path   (v1)
       
   286 %	edge [] node {} (v);
       
   287 %\path   (r)
       
   288 %	edge [] node {saved} (stack);
       
   289 %
       
   290 %\path   (r1)
       
   291 %	edge [] node {saved} (stack);
       
   292 %\end{tikzpicture}
       
   293 \noindent
       
   294 The information stored in characters and regular expressions
       
   295 make the algorithm work in an elegant way, at the expense of being
       
   296 storing quite a bit of verbose information.
       
   297 
       
   298 
       
   299 
       
   300 The lexer algorithm in Chapter \ref{Inj},
    16 stores information of previous lexing steps
   301 stores information of previous lexing steps
    17 on a stack, in the form of regular expressions
   302 on a stack, in the form of regular expressions
    18 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
   303 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
       
   304 The red part represents what we already know during the first
       
   305 derivative phase,
       
   306 and the blue part represents the unknown part of input.
    19 \begin{ceqn}
   307 \begin{ceqn}
    20 \begin{equation}%\label{graph:injLexer}
   308 \begin{equation}%\label{graph:injLexer}
    21 	\begin{tikzcd}[ampersand replacement=\&, execute at end picture={
   309 	\begin{tikzcd}[ampersand replacement=\&, execute at end picture={
    22 			\begin{scope}[on background layer]
   310 			\begin{scope}[on background layer]
    23 				\node[rectangle, fill={red!30},
   311 				\node[rectangle, fill={red!30},
    37 v_0           \& v_1 \arrow[l,"inj_{r_0} c_0"]                \& v_2 \arrow[l, "inj_{r_1} c_1"]              \& v_n \arrow[l, dashed]         \\
   325 v_0           \& v_1 \arrow[l,"inj_{r_0} c_0"]                \& v_2 \arrow[l, "inj_{r_1} c_1"]              \& v_n \arrow[l, dashed]         \\
    38 \end{tikzcd}
   326 \end{tikzcd}
    39 \end{equation}
   327 \end{equation}
    40 \end{ceqn}
   328 \end{ceqn}
    41 \noindent
   329 \noindent
    42 The red part represents what we already know during the first
       
    43 derivative phase,
       
    44 and the blue part represents the unknown part of input.
       
    45 The red area expands as we move towards $r_n$, 
   330 The red area expands as we move towards $r_n$, 
    46 indicating an increasing stack size during lexing.
   331 indicating more is known about the lexing result.
    47 Despite having some partial lexing information during
   332 Despite knowing this partial lexing information during
    48 the forward derivative phase, we choose to store them
   333 the forward derivative phase, we choose to store them
    49 temporarily, only to convert the information to lexical
   334 all the way until $r_n$ is reached.
    50 values at a later stage. In essence we are repeating work we
   335 Then we reconstruct the value character by character
    51 have already done.
   336 values at a later stage, using information in a Last-In-First-Out
    52 This is both inefficient and prone to stack overflow.
   337 manner. Although the algorithm is elegant and natural,
    53 A natural question arises as to whether we can store lexing
   338 it can be inefficient and prone to stack overflow.\\
       
   339 It turns out we can store lexing
    54 information on the fly, while still using regular expression 
   340 information on the fly, while still using regular expression 
    55 derivatives.
   341 derivatives.
    56 
   342 If we remove the individual 
    57 If we remove the details of the individual 
       
    58 lexing steps, and use red and blue areas as before
   343 lexing steps, and use red and blue areas as before
    59 to indicate consumed (seen) input and constructed
   344 to indicate consumed (seen) input and constructed
    60 partial value (before recovering the rest of the stack),
   345 partial value (before recovering the rest of the stack),
    61 one could see that the seen part's lexical information
   346 one could see that the seen part's lexical information
    62 is stored in the form of a regular expression.
   347 is stored in the form of a regular expression.