ChengsongTanPhdThesis/Chapters/Bitcoded1.tex
changeset 538 8016a2480704
parent 537 50e590823220
child 542 a7344c9afbaf
equal deleted inserted replaced
537:50e590823220 538:8016a2480704
     1 % Chapter Template
     1 % Chapter Template
     2 
     2 
     3 % Main chapter title
     3 % Main chapter title
     4 \chapter{Correctness of Bit-coded Algorithm without Simplification}
     4 \chapter{Bit-coded Algorithm of Sulzmann and Lu}
     5 
     5 
     6 \label{Bitcoded1} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
     6 \label{Bitcoded1} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
     7 %Then we illustrate how the algorithm without bitcodes falls short for such aggressive 
     7 %Then we illustrate how the algorithm without bitcodes falls short for such aggressive 
     8 %simplifications and therefore introduce our version of the bitcoded algorithm and 
     8 %simplifications and therefore introduce our version of the bitcoded algorithm and 
     9 %its correctness proof in 
     9 %its correctness proof in 
    10 %Chapter 3\ref{Chapter3}. 
    10 %Chapter 3\ref{Chapter3}. 
    11 
    11 In this chapter, we are going to introduce the bit-coded algorithm
       
    12 introduced by Sulzmann and Lu to address the problem of 
       
    13 under-simplified regular expressions. 
    12 \section{Bit-coded Algorithm}
    14 \section{Bit-coded Algorithm}
    13 
       
    14 
       
    15 The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure},
    15 The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure},
    16 stores information of previous lexing steps
    16 stores information of previous lexing steps
    17 on a stack, in the form of regular expressions
    17 on a stack, in the form of regular expressions
    18 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
    18 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
    19 \begin{envForCaption}
    19 \begin{envForCaption}
   133 \end{center}
   133 \end{center}
   134 \caption{Bit-codes datatype}
   134 \caption{Bit-codes datatype}
   135 \end{envForCaption}
   135 \end{envForCaption}
   136 
   136 
   137 \noindent
   137 \noindent
   138 The $S$ and $Z$ are not in bold in order to avoid 
   138 Using $S$ and $Z$ rather than $1$ and $0$ is to avoid
   139 confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
   139 confusion with the regular expressions $\ZERO$ and $\ONE$.
       
   140 Bitcodes (or
   140 bit-lists) can be used to encode values (or potentially incomplete values) in a
   141 bit-lists) can be used to encode values (or potentially incomplete values) in a
   141 compact form. This can be straightforwardly seen in the following
   142 compact form. This can be straightforwardly seen in the following
   142 coding function from values to bitcodes: 
   143 coding function from values to bitcodes: 
   143 \begin{envForCaption}
   144 \begin{envForCaption}
   144 \begin{center}
   145 \begin{center}
   201   $\textit{decode}\,bs\,r$ & $\dn$ &
   202   $\textit{decode}\,bs\,r$ & $\dn$ &
   202      $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
   203      $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
   203   & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
   204   & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
   204        \textit{else}\;\textit{None}$                       
   205        \textit{else}\;\textit{None}$                       
   205 \end{tabular}    
   206 \end{tabular}    
   206 \end{center}
   207 \end{center} 
   207 \caption{Bit-decoding of values}    
       
   208 \end{envForCaption}
   208 \end{envForCaption}
   209 %\end{definition}
   209 %\end{definition}
   210 
   210 
   211 Sulzmann and Lu's integrated the bitcodes into regular expressions to
   211 \noindent
   212 create annotated regular expressions \cite{Sulzmann2014}.
   212 $\decode'$ does most of the job while $\decode$ throws
   213 \emph{Annotated regular expressions} are defined by the following
   213 away leftover bit-codes and returns the value only.
   214 grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
   214 $\decode$ is terminating as $\decode'$ is terminating.
       
   215 We have the property that $\decode$ and $\code$ are
       
   216 reverse operations of one another:
       
   217 \begin{lemma}
       
   218 \[\vdash v : r \implies \decode \; (\code \; v) \; r = \textit{Some}(v) \]
       
   219 \end{lemma}
       
   220 \begin{proof}
       
   221 By proving a more general version of the lemma, on $\decode'$:
       
   222 \[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \]
       
   223 Then setting $ds$ to be $[]$ and unfolding $\decode$ definition
       
   224 we get the lemma.
       
   225 \end{proof}
       
   226 With the $\code$ and $\decode$ functions in hand, we know how to 
       
   227 switch between bit-codes and value--the two different representations of 
       
   228 lexing information. 
       
   229 The next step is to integrate this information into the working regular expression.
       
   230 Attaching bits to the front of regular expressions is the solution Sulzamann and Lu
       
   231 gave for storing partial values on the fly:
   215 
   232 
   216 \begin{center}
   233 \begin{center}
   217 \begin{tabular}{lcl}
   234 \begin{tabular}{lcl}
   218   $\textit{a}$ & $::=$  & $\ZERO$\\
   235   $\textit{a}$ & $::=$  & $\ZERO$\\
   219                   & $\mid$ & $_{bs}\ONE$\\
   236                   & $\mid$ & $_{bs}\ONE$\\
   224 \end{tabular}    
   241 \end{tabular}    
   225 \end{center}  
   242 \end{center}  
   226 %(in \textit{ALTS})
   243 %(in \textit{ALTS})
   227 
   244 
   228 \noindent
   245 \noindent
   229 where $bs$ stands for bitcodes, $a$  for $\mathbf{a}$nnotated regular
   246 We call these regular expressions carrying bit-codes \emph{Annotated regular expressions}.
       
   247 $bs$ stands for bit-codes, $a$  for $\mathbf{a}$nnotated regular
   230 expressions and $as$ for a list of annotated regular expressions.
   248 expressions and $as$ for a list of annotated regular expressions.
   231 The alternative constructor($\sum$) has been generalized to 
   249 The alternative constructor ($\sum$) has been generalised to 
   232 accept a list of annotated regular expressions rather than just 2.
   250 accept a list of annotated regular expressions rather than just 2.
   233 We will show that these bitcodes encode information about
   251 %We will show that these bitcodes encode information about
   234 the (POSIX) value that should be generated by the Sulzmann and Lu
   252 %the ($\POSIX$) value that should be generated by the Sulzmann and Lu
   235 algorithm.
   253 %algorithm.
   236 
   254 The most central question is how these partial lexing information
   237 
   255 represented as bit-codes is augmented and carried around 
   238 To do lexing using annotated regular expressions, we shall first
   256 during a derivative is taken.
   239 transform the usual (un-annotated) regular expressions into annotated
   257 
   240 regular expressions. This operation is called \emph{internalisation} and
   258 This is done by adding bitcodes to the 
   241 defined as follows:
   259 derivatives, for example when one more star iteratoin is taken (we
   242 
   260 call the operation of derivatives on annotated regular expressions $\bder$
   243 %\begin{definition}
   261 because it is derivatives on regexes with bitcodes):
   244 \begin{center}
   262 \begin{center}
   245 \begin{tabular}{lcl}
   263   \begin{tabular}{@{}lcl@{}}
   246   $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
   264   $\bder \; c\; (_{bs}a^*) $ & $\dn$ &
   247   $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
   265       $_{bs}(\textit{fuse}\, [Z] \; \bder \; c \; a)\cdot
   248   $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
   266        (_{[]}a^*))$
   249   $(r_1 + r_2)^\uparrow$ & $\dn$ &
       
   250   $_{[]}\sum[\textit{fuse}\,[Z]\,r_1^\uparrow,\,
       
   251   \textit{fuse}\,[S]\,r_2^\uparrow]$\\
       
   252   $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
       
   253          $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
       
   254   $(r^*)^\uparrow$ & $\dn$ &
       
   255          $_{[]}(r^\uparrow)^*$\\
       
   256 \end{tabular}    
   267 \end{tabular}    
   257 \end{center}    
   268 \end{center}    
   258 %\end{definition}
   269 
   259 
   270 \noindent
   260 \noindent
   271 For most time we use the infix notation $\backslash$ to mean $\bder$ for brevity when
   261 We use up arrows here to indicate that the basic un-annotated regular
   272 there is no danger of confusion with derivatives on plain regular expressions, 
   262 expressions are ``lifted up'' into something slightly more complex. In the
   273 for example, the above can be expressed as
   263 fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
   274 \begin{center}
   264 attach bits to the front of an annotated regular expression. Its
   275   \begin{tabular}{@{}lcl@{}}
   265 definition is as follows:
   276   $(_{bs}a^*)\,\backslash c$ & $\dn$ &
   266 
   277       $_{bs}(\textit{fuse}\, [Z] \; a\,\backslash c)\cdot
       
   278        (_{[]}a^*))$
       
   279 \end{tabular}    
       
   280 \end{center}   
       
   281 
       
   282 
       
   283 Using the picture we used earlier to depict this, the transformation when 
       
   284 taking a derivative w.r.t a star is like below:
       
   285 \centering
       
   286 \begin{tabular}{@{}l@{\hspace{1mm}}l@{\hspace{0mm}}c@{}}
       
   287 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
       
   288     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
       
   289         {$bs$
       
   290          \nodepart{two} $a^*$ };
       
   291 %\caption{term 1 \ref{term:1}'s matching configuration}
       
   292 \end{tikzpicture} 
       
   293 &
       
   294 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
       
   295     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
       
   296         {$v_{\text{previous iterations}}$
       
   297          \nodepart{two} $a^*$};
       
   298 %\caption{term 1 \ref{term:1}'s matching configuration}
       
   299 \end{tikzpicture}
       
   300 \\
       
   301 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
       
   302     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
       
   303         { $bs$ + [Z]
       
   304          \nodepart{two}  $(a\backslash c )\cdot a^*$ };
       
   305 %\caption{term 1 \ref{term:1}'s matching configuration}
       
   306 \end{tikzpicture}
       
   307 &
       
   308 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
       
   309     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
       
   310         {$v_{\text{previous iterations}}$ + 1 more iteration
       
   311          \nodepart{two} $(a\backslash c )\cdot a^*$ };
       
   312 %\caption{term 1 \ref{term:1}'s matching configuration}
       
   313 \end{tikzpicture}
       
   314 \end{tabular}    
       
   315 \noindent
       
   316 The operation $\fuse$ is just to attach bit-codes 
       
   317 to the front of an annotated regular expression:
   267 \begin{center}
   318 \begin{center}
   268 \begin{tabular}{lcl}
   319 \begin{tabular}{lcl}
   269   $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
   320   $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
   270   $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
   321   $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
   271      $_{bs @ bs'}\ONE$\\
   322      $_{bs @ bs'}\ONE$\\
   276   $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
   327   $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
   277      $_{bs@bs'}a_1 \cdot a_2$\\
   328      $_{bs@bs'}a_1 \cdot a_2$\\
   278   $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
   329   $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
   279      $_{bs @ bs'}a^*$
   330      $_{bs @ bs'}a^*$
   280 \end{tabular}    
   331 \end{tabular}    
   281 \end{center}  
   332 \end{center} 
   282 
   333 
   283 \noindent
   334 \noindent
   284 After internalising the regular expression, we perform successive
   335 Another place in the $\bder$ function where it differs
   285 derivative operations on the annotated regular expressions. This
   336 from normal derivatives on un-annotated regular expressions
   286 derivative operation is the same as what we had previously for the
   337 is the sequence case:
   287 basic regular expressions, except that we beed to take care of
       
   288 the bitcodes:
       
   289 
       
   290 
       
   291 \iffalse
       
   292  %\begin{definition}{bder}
       
   293 \begin{center}
   338 \begin{center}
   294   \begin{tabular}{@{}lcl@{}}
   339   \begin{tabular}{@{}lcl@{}}
   295   $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
   340 
   296   $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
   341   $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
   297   $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
       
   298         $\textit{if}\;c=d\; \;\textit{then}\;
       
   299          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
       
   300   $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
       
   301   $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
       
   302   $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
       
   303      $\textit{if}\;\textit{bnullable}\,a_1$\\
   342      $\textit{if}\;\textit{bnullable}\,a_1$\\
   304 					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
   343 					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
   305 					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
   344 					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
   306   & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
   345   & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$
   307   $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
       
   308       $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
       
   309        (\textit{STAR}\,[]\,r)$
       
   310 \end{tabular}    
   346 \end{tabular}    
   311 \end{center}    
   347 \end{center}    
   312 %\end{definition}
   348 Here 
   313 
   349 
   314 \begin{center}
       
   315   \begin{tabular}{@{}lcl@{}}
       
   316   $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
   317   $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
   318   $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
       
   319         $\textit{if}\;c=d\; \;\textit{then}\;
       
   320          _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\  
       
   321   $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
       
   322   $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
       
   323   $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
       
   324      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
   325 					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
       
   326 					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
       
   327   & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
       
   328   $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
       
   329       $_{bs}\textit{SEQ}\;(\textit{fuse}\, [Z] \; r\,\backslash c )\,
       
   330        (_{bs}\textit{STAR}\,[]\,r)$
       
   331 \end{tabular}    
       
   332 \end{center}    
       
   333 %\end{definition}
       
   334 \fi
       
   335 
   350 
   336 \begin{center}
   351 \begin{center}
   337   \begin{tabular}{@{}lcl@{}}
   352   \begin{tabular}{@{}lcl@{}}
   338   $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
   353   $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
   339   $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
   354   $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
   350   $(_{bs}a^*)\,\backslash c$ & $\dn$ &
   365   $(_{bs}a^*)\,\backslash c$ & $\dn$ &
   351       $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot
   366       $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot
   352        (_{[]}r^*))$
   367        (_{[]}r^*))$
   353 \end{tabular}    
   368 \end{tabular}    
   354 \end{center}    
   369 \end{center}    
       
   370 
       
   371 
       
   372 To do lexing using annotated regular expressions, we shall first
       
   373 transform the usual (un-annotated) regular expressions into annotated
       
   374 regular expressions. This operation is called \emph{internalisation} and
       
   375 defined as follows:
       
   376 
       
   377 %\begin{definition}
       
   378 \begin{center}
       
   379 \begin{tabular}{lcl}
       
   380   $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
       
   381   $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
       
   382   $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
       
   383   $(r_1 + r_2)^\uparrow$ & $\dn$ &
       
   384   $_{[]}\sum[\textit{fuse}\,[Z]\,r_1^\uparrow,\,
       
   385   \textit{fuse}\,[S]\,r_2^\uparrow]$\\
       
   386   $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
       
   387          $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
       
   388   $(r^*)^\uparrow$ & $\dn$ &
       
   389          $_{[]}(r^\uparrow)^*$\\
       
   390 \end{tabular}    
       
   391 \end{center}    
       
   392 %\end{definition}
       
   393 
       
   394 \noindent
       
   395 We use up arrows here to indicate that the basic un-annotated regular
       
   396 expressions are ``lifted up'' into something slightly more complex. In the
       
   397 fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
       
   398 attach bits to the front of an annotated regular expression. Its
       
   399 definition is as follows:
       
   400 
       
   401  
       
   402 
       
   403 \noindent
       
   404 After internalising the regular expression, we perform successive
       
   405 derivative operations on the annotated regular expressions. This
       
   406 derivative operation is the same as what we had previously for the
       
   407 basic regular expressions, except that we beed to take care of
       
   408 the bitcodes:
       
   409 
       
   410 
       
   411 
       
   412 
   355 
   413 
   356 %\end{definition}
   414 %\end{definition}
   357 \noindent
   415 \noindent
   358 For instance, when we do derivative of  $_{bs}a^*$ with respect to c,
   416 For instance, when we do derivative of  $_{bs}a^*$ with respect to c,
   359 we need to unfold it into a sequence,
   417 we need to unfold it into a sequence,
   414 
   472 
   415 \noindent
   473 \noindent
   416 In this definition $\_\backslash s$ is the  generalisation  of the derivative
   474 In this definition $\_\backslash s$ is the  generalisation  of the derivative
   417 operation from characters to strings (just like the derivatives for un-annotated
   475 operation from characters to strings (just like the derivatives for un-annotated
   418 regular expressions).
   476 regular expressions).
   419 
       
   420 Now we introduce the simplifications, which is why we introduce the 
       
   421 bitcodes in the first place.
       
   422 
       
   423 \subsection*{Simplification Rules}
       
   424 
       
   425 This section introduces aggressive (in terms of size) simplification rules
       
   426 on annotated regular expressions
       
   427 to keep derivatives small. Such simplifications are promising
       
   428 as we have
       
   429 generated test data that show
       
   430 that a good tight bound can be achieved. We could only
       
   431 partially cover the search space as there are infinitely many regular
       
   432 expressions and strings. 
       
   433 
       
   434 One modification we introduced is to allow a list of annotated regular
       
   435 expressions in the $\sum$ constructor. This allows us to not just
       
   436 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
       
   437 also unnecessary ``copies'' of regular expressions (very similar to
       
   438 simplifying $r + r$ to just $r$, but in a more general setting). Another
       
   439 modification is that we use simplification rules inspired by Antimirov's
       
   440 work on partial derivatives. They maintain the idea that only the first
       
   441 ``copy'' of a regular expression in an alternative contributes to the
       
   442 calculation of a POSIX value. All subsequent copies can be pruned away from
       
   443 the regular expression. A recursive definition of our  simplification function 
       
   444 that looks somewhat similar to our Scala code is given below:
       
   445 %\comment{Use $\ZERO$, $\ONE$ and so on. 
       
   446 %Is it $ALTS$ or $ALTS$?}\\
       
   447 
       
   448 \begin{center}
       
   449   \begin{tabular}{@{}lcl@{}}
       
   450    
       
   451   $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
       
   452    &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
       
   453    &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
       
   454    &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
       
   455    &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
       
   456    &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
       
   457 
       
   458   $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; simp \; as)) \; \textit{match} $ \\
       
   459   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
       
   460    &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
       
   461    &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
       
   462 
       
   463    $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
       
   464 \end{tabular}    
       
   465 \end{center}    
       
   466 
       
   467 \noindent
       
   468 The simplification does a pattern matching on the regular expression.
       
   469 When it detected that the regular expression is an alternative or
       
   470 sequence, it will try to simplify its child regular expressions
       
   471 recursively and then see if one of the children turns into $\ZERO$ or
       
   472 $\ONE$, which might trigger further simplification at the current level.
       
   473 The most involved part is the $\sum$ clause, where we use two
       
   474 auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
       
   475 alternatives and reduce as many duplicates as possible. Function
       
   476 $\textit{distinct}$  keeps the first occurring copy only and removes all later ones
       
   477 when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
       
   478 Its recursive definition is given below:
       
   479 
       
   480  \begin{center}
       
   481   \begin{tabular}{@{}lcl@{}}
       
   482   $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
       
   483      (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
       
   484   $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
       
   485     $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
       
   486 \end{tabular}    
       
   487 \end{center}  
       
   488 
       
   489 \noindent
       
   490 Here $\textit{flatten}$ behaves like the traditional functional programming flatten
       
   491 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
       
   492 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
       
   493 
       
   494 Having defined the $\simp$ function,
       
   495 we can use the previous notation of  natural
       
   496 extension from derivative w.r.t.~character to derivative
       
   497 w.r.t.~string:%\comment{simp in  the [] case?}
       
   498 
       
   499 \begin{center}
       
   500 \begin{tabular}{lcl}
       
   501 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
       
   502 $r \backslash_{simp} [\,] $ & $\dn$ & $r$
       
   503 \end{tabular}
       
   504 \end{center}
       
   505 
       
   506 \noindent
       
   507 to obtain an optimised version of the algorithm:
       
   508 
       
   509  \begin{center}
       
   510 \begin{tabular}{lcl}
       
   511   $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
       
   512       $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
       
   513   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
   514   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
   515   & & $\;\;\textit{else}\;\textit{None}$
       
   516 \end{tabular}
       
   517 \end{center}
       
   518 
       
   519 \noindent
       
   520 This algorithm keeps the regular expression size small, for example,
       
   521 with this simplification our previous $(a + aa)^*$ example's 8000 nodes
       
   522 will be reduced to just 17 and stays constant, no matter how long the
       
   523 input string is.
       
   524 
       
   525 
       
   526 
       
   527 
       
   528 
       
   529 
       
   530 
       
   531 
       
   532 
   477 
   533 
   478 
   534 
   479 
   535 %-----------------------------------
   480 %-----------------------------------
   536 %	SUBSECTION 1
   481 %	SUBSECTION 1