etnms/etnms.tex
changeset 107 b1e365afa29c
parent 106 e0db3242d8b5
child 108 0a0c551bb368
equal deleted inserted replaced
106:e0db3242d8b5 107:b1e365afa29c
    35 \newcommand{\ONE}{\mbox{\bf 1}}
    35 \newcommand{\ONE}{\mbox{\bf 1}}
    36 \def\erase{\textit{erase}}
    36 \def\erase{\textit{erase}}
    37 \def\bders{\textit{bders}}
    37 \def\bders{\textit{bders}}
    38 \def\lexer{\mathit{lexer}}
    38 \def\lexer{\mathit{lexer}}
    39 \def\blexer{\textit{blexer}}
    39 \def\blexer{\textit{blexer}}
       
    40 \def\fuse{\textit{fuse}}
       
    41 \def\flatten{\textit{flatten}}
       
    42 \def\map{\textit{map}}
    40 \def\blexers{\mathit{blexer\_simp}}
    43 \def\blexers{\mathit{blexer\_simp}}
    41 \def\simp{\mathit{simp}}
    44 \def\simp{\mathit{simp}}
    42 \def\mkeps{\mathit{mkeps}}
    45 \def\mkeps{\mathit{mkeps}}
    43 \def\bmkeps{\textit{bmkeps}}
    46 \def\bmkeps{\textit{bmkeps}}
    44 \def\inj{\mathit{inj}}
    47 \def\inj{\mathit{inj}}
    96   formally established their correctness. We have also not yet looked 
    99   formally established their correctness. We have also not yet looked 
    97   at extended regular expressions, such as bounded repetitions,
   100   at extended regular expressions, such as bounded repetitions,
    98   negation and back-references.
   101   negation and back-references.
    99 \end{abstract}
   102 \end{abstract}
   100 
   103 
       
   104 \section{Recapitulation of Concepts From the Last Report}
       
   105 \subsection{The Algorithm by Brzozowski based on Derivatives of Regular
       
   106 Expressions}
       
   107 
       
   108 Suppose (basic) regular expressions are given by the following grammar:
       
   109 \[			r ::=   \ZERO \mid  \ONE
       
   110 			 \mid  c  
       
   111 			 \mid  r_1 \cdot r_2
       
   112 			 \mid  r_1 + r_2   
       
   113 			 \mid r^*         
       
   114 \]
       
   115 
       
   116 \noindent
       
   117 
       
   118 The ingenious contribution by Brzozowski is the notion of
       
   119 \emph{derivatives} of regular expressions.  
       
   120 \begin{center}
       
   121 		\begin{tabular}{lcl}
       
   122 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
       
   123 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
       
   124 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
       
   125 			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
       
   126 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
       
   127 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
       
   128 		\end{tabular}
       
   129 	\end{center}
       
   130 
       
   131 \begin{center}
       
   132 \begin{tabular}{lcl}
       
   133 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
       
   134 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
       
   135 		$d \backslash c$     & $\dn$ & 
       
   136 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
       
   137 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
       
   138 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
       
   139 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
       
   140 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
       
   141 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
       
   142 \end{tabular}
       
   143 \end{center}
       
   144 
       
   145 %Assuming the classic notion of a
       
   146 %\emph{language} of a regular expression, written $L(\_)$, t
       
   147 
       
   148 \noindent
       
   149 The main property of the derivative operation is that
       
   150 
       
   151 \begin{center}
       
   152 $c\!::\!s \in L(r)$ holds
       
   153 if and only if $s \in L(r\backslash c)$.
       
   154 \end{center}
       
   155 
       
   156 \noindent
       
   157 
       
   158 
       
   159 Now we can generalise the derivative operation to strings like this:
       
   160 
       
   161 \begin{center}
       
   162 \begin{tabular}{lcl}
       
   163 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
       
   164 $r \backslash [\,] $ & $\dn$ & $r$
       
   165 \end{tabular}
       
   166 \end{center}
       
   167 
       
   168 \noindent
       
   169 and then define as  regular-expression matching algorithm: 
       
   170 \[
       
   171 match\;s\;r \;\dn\; nullable(r\backslash s)
       
   172 \]
       
   173 
       
   174 \noindent
       
   175 This algorithm looks graphically as follows:
       
   176 \begin{equation}\label{graph:*}
       
   177 \begin{tikzcd}
       
   178 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
       
   179 \end{tikzcd}
       
   180 \end{equation}
       
   181 
       
   182 \noindent
       
   183 where we start with  a regular expression  $r_0$, build successive
       
   184 derivatives until we exhaust the string and then use \textit{nullable}
       
   185 to test whether the result can match the empty string. It can  be
       
   186 relatively  easily shown that this matcher is correct  (that is given
       
   187 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
       
   188 
       
   189  
       
   190 \subsection{Values and the Algorithm by Sulzmann and Lu}
       
   191 
       
   192 One limitation of Brzozowski's algorithm is that it only produces a
       
   193 YES/NO answer for whether a string is being matched by a regular
       
   194 expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
       
   195 to allow generation of an actual matching, called a \emph{value} or
       
   196 sometimes also \emph{lexical value}.  These values and regular
       
   197 expressions correspond to each other as illustrated in the following
       
   198 table:
       
   199 
       
   200 
       
   201 \begin{center}
       
   202 	\begin{tabular}{c@{\hspace{20mm}}c}
       
   203 		\begin{tabular}{@{}rrl@{}}
       
   204 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
       
   205 			$r$ & $::=$  & $\ZERO$\\
       
   206 			& $\mid$ & $\ONE$   \\
       
   207 			& $\mid$ & $c$          \\
       
   208 			& $\mid$ & $r_1 \cdot r_2$\\
       
   209 			& $\mid$ & $r_1 + r_2$   \\
       
   210 			\\
       
   211 			& $\mid$ & $r^*$         \\
       
   212 		\end{tabular}
       
   213 		&
       
   214 		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
       
   215 			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
       
   216 			$v$ & $::=$  & \\
       
   217 			&        & $\Empty$   \\
       
   218 			& $\mid$ & $\Char(c)$          \\
       
   219 			& $\mid$ & $\Seq\,v_1\, v_2$\\
       
   220 			& $\mid$ & $\Left(v)$   \\
       
   221 			& $\mid$ & $\Right(v)$  \\
       
   222 			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
       
   223 		\end{tabular}
       
   224 	\end{tabular}
       
   225 \end{center}
       
   226 
       
   227 \noindent
       
   228 
       
   229 The contribution of Sulzmann and Lu is an extension of Brzozowski's
       
   230 algorithm by a second phase (the first phase being building successive
       
   231 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
       
   232 is generated in case the regular expression matches  the string. 
       
   233 Pictorially, the Sulzmann and Lu algorithm is as follows:
       
   234 
       
   235 \begin{ceqn}
       
   236 \begin{equation}\label{graph:2}
       
   237 \begin{tikzcd}
       
   238 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
       
   239 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
       
   240 \end{tikzcd}
       
   241 \end{equation}
       
   242 \end{ceqn}
       
   243 
       
   244 \noindent
       
   245 For convenience, we shall employ the following notations: the regular
       
   246 expression we start with is $r_0$, and the given string $s$ is composed
       
   247 of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
       
   248 left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
       
   249 to the characters $c_0$, $c_1$  until we exhaust the string and obtain
       
   250 the derivative $r_n$. We test whether this derivative is
       
   251 $\textit{nullable}$ or not. If not, we know the string does not match
       
   252 $r$ and no value needs to be generated. If yes, we start building the
       
   253 values incrementally by \emph{injecting} back the characters into the
       
   254 earlier values $v_n, \ldots, v_0$. This is the second phase of the
       
   255 algorithm from the right to left. For the first value $v_n$, we call the
       
   256 function $\textit{mkeps}$, which builds the lexical value
       
   257 for how the empty string has been matched by the (nullable) regular
       
   258 expression $r_n$. This function is defined as
       
   259 
       
   260 	\begin{center}
       
   261 		\begin{tabular}{lcl}
       
   262 			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
       
   263 			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
       
   264 			& \textit{if} $\nullable(r_{1})$\\ 
       
   265 			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
       
   266 			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
       
   267 			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
       
   268 			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
       
   269 		\end{tabular}
       
   270 	\end{center}
       
   271 
       
   272 
       
   273 \noindent 
       
   274 After the $\mkeps$-call, we inject back the characters one by one in order to build
       
   275 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
       
   276 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
       
   277 After injecting back $n$ characters, we get the lexical value for how $r_0$
       
   278 matches $s$. For this Sulzmann and Lu defined a function that reverses
       
   279 the ``chopping off'' of characters during the derivative phase. The
       
   280 corresponding function is called \emph{injection}, written
       
   281 $\textit{inj}$; it takes three arguments: the first one is a regular
       
   282 expression ${r_{i-1}}$, before the character is chopped off, the second
       
   283 is a character ${c_{i-1}}$, the character we want to inject and the
       
   284 third argument is the value ${v_i}$, into which one wants to inject the
       
   285 character (it corresponds to the regular expression after the character
       
   286 has been chopped off). The result of this function is a new value. The
       
   287 definition of $\textit{inj}$ is as follows: 
       
   288 
       
   289 \begin{center}
       
   290 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   291   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
       
   292   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
       
   293   $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
       
   294   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   295   $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   296   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
       
   297   $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
       
   298 \end{tabular}
       
   299 \end{center}
       
   300 
       
   301 \noindent This definition is by recursion on the ``shape'' of regular
       
   302 expressions and values. 
       
   303 
       
   304 
       
   305 \subsection{Simplification of Regular Expressions}
       
   306 
       
   307 The main drawback of building successive derivatives according
       
   308 to Brzozowski's definition is that they can grow very quickly in size.
       
   309 This is mainly due to the fact that the derivative operation generates
       
   310 often ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result, if
       
   311 implemented naively both algorithms by Brzozowski and by Sulzmann and Lu
       
   312 are excruciatingly slow. For example when starting with the regular
       
   313 expression $(a + aa)^*$ and building 12 successive derivatives
       
   314 w.r.t.~the character $a$, one obtains a derivative regular expression
       
   315 with more than 8000 nodes (when viewed as a tree). Operations like
       
   316 $\textit{der}$ and $\nullable$ need to traverse such trees and
       
   317 consequently the bigger the size of the derivative the slower the
       
   318 algorithm. 
       
   319 
       
   320 Fortunately, one can simplify regular expressions after each derivative
       
   321 step. Various simplifications of regular expressions are possible, such
       
   322 as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
       
   323 \cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
       
   324 affect the answer for whether a regular expression matches a string or
       
   325 not, but fortunately also do not affect the POSIX strategy of how
       
   326 regular expressions match strings---although the latter is much harder
       
   327 to establish. Some initial results in this regard have been
       
   328 obtained in \cite{AusafDyckhoffUrban2016}. 
       
   329 
       
   330 If we want the size of derivatives in Sulzmann and Lu's algorithm to
       
   331 stay below this bound, we would need more aggressive simplifications.
       
   332 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
       
   333 deleting duplicates whenever possible. For example, the parentheses in
       
   334 $(a+b) \cdot c + bc$ can be opened up to get $a\cdot c +  b \cdot c + b
       
   335 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
       
   336 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
       
   337 $a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
       
   338 to achieve the same size bound as that of the partial derivatives. 
       
   339 
       
   340 In order to implement the idea of ``spilling out alternatives'' and to
       
   341 make them compatible with the $\text{inj}$-mechanism, we use
       
   342 \emph{bitcodes}. Bits and bitcodes (lists of bits) are just:
       
   343 
       
   344 
       
   345 \begin{center}
       
   346 		$b ::=   S \mid  Z \qquad
       
   347 bs ::= [] \mid b:bs    
       
   348 $
       
   349 \end{center}
       
   350 
       
   351 \noindent
       
   352 The $S$ and $Z$ are arbitrary names for the bits in order to avoid 
       
   353 confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
       
   354 bit-lists) can be used to encode values (or incomplete values) in a
       
   355 compact form. This can be straightforwardly seen in the following
       
   356 coding function from values to bitcodes: 
       
   357 
       
   358 \begin{center}
       
   359 \begin{tabular}{lcl}
       
   360   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
       
   361   $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
       
   362   $\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\
       
   363   $\textit{code}(\Right\,v)$ & $\dn$ & $\S :: code(v)$\\
       
   364   $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
       
   365   $\textit{code}(\Stars\,[])$ & $\dn$ & $[\Z]$\\
       
   366   $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\S :: code(v) \;@\;
       
   367                                                  code(\Stars\,vs)$
       
   368 \end{tabular}    
       
   369 \end{center} 
       
   370 
       
   371 \noindent
       
   372 Here $\textit{code}$ encodes a value into a bitcodes by converting
       
   373 $\Left$ into $\Z$, $\Right$ into $\S$, the start point of a non-empty
       
   374 star iteration into $\S$, and the border where a local star terminates
       
   375 into $\Z$. This coding is lossy, as it throws away the information about
       
   376 characters, and also does not encode the ``boundary'' between two
       
   377 sequence values. Moreover, with only the bitcode we cannot even tell
       
   378 whether the $\S$s and $\Z$s are for $\Left/\Right$ or $\Stars$. The
       
   379 reason for choosing this compact way of storing information is that the
       
   380 relatively small size of bits can be easily manipulated and ``moved
       
   381 around'' in a regular expression. In order to recover values, we will 
       
   382 need the corresponding regular expression as an extra information. This
       
   383 means the decoding function is defined as:
       
   384 
       
   385 
       
   386 %\begin{definition}[Bitdecoding of Values]\mbox{}
       
   387 \begin{center}
       
   388 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
       
   389   $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
       
   390   $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
       
   391   $\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
       
   392      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       
   393        (\Left\,v, bs_1)$\\
       
   394   $\textit{decode}'\,(\S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
       
   395      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       
   396        (\Right\,v, bs_1)$\\                           
       
   397   $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
       
   398         $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
       
   399   & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
       
   400   & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
       
   401   $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
       
   402   $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & 
       
   403          $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
       
   404   & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
       
   405   & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
       
   406   
       
   407   $\textit{decode}\,bs\,r$ & $\dn$ &
       
   408      $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
       
   409   & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       
   410        \textit{else}\;\textit{None}$                       
       
   411 \end{tabular}    
       
   412 \end{center}    
       
   413 %\end{definition}
       
   414 
       
   415 Sulzmann and Lu's integrated the bitcodes into regular expressions to
       
   416 create annotated regular expressions \cite{Sulzmann2014}.
       
   417 \emph{Annotated regular expressions} are defined by the following
       
   418 grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
       
   419 
       
   420 \begin{center}
       
   421 \begin{tabular}{lcl}
       
   422   $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
       
   423                   & $\mid$ & $\textit{ONE}\;\;bs$\\
       
   424                   & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
       
   425                   & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
       
   426                   & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
       
   427                   & $\mid$ & $\textit{STAR}\;\;bs\,a$
       
   428 \end{tabular}    
       
   429 \end{center}  
       
   430 %(in \textit{ALTS})
       
   431 
       
   432 \noindent
       
   433 where $bs$ stands for bitcodes, $a$  for $\bold{a}$nnotated regular
       
   434 expressions and $as$ for a list of annotated regular expressions.
       
   435 The alternative constructor($\textit{ALTS}$) has been generalized to 
       
   436 accept a list of annotated regular expressions rather than just 2.
       
   437 We will show that these bitcodes encode information about
       
   438 the (POSIX) value that should be generated by the Sulzmann and Lu
       
   439 algorithm.
       
   440 
       
   441 
       
   442 To do lexing using annotated regular expressions, we shall first
       
   443 transform the usual (un-annotated) regular expressions into annotated
       
   444 regular expressions. This operation is called \emph{internalisation} and
       
   445 defined as follows:
       
   446 
       
   447 %\begin{definition}
       
   448 \begin{center}
       
   449 \begin{tabular}{lcl}
       
   450   $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
       
   451   $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
       
   452   $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
       
   453   $(r_1 + r_2)^\uparrow$ & $\dn$ &
       
   454   $\textit{ALTS}\;[]\,List((\textit{fuse}\,[\Z]\,r_1^\uparrow),\,
       
   455   (\textit{fuse}\,[\S]\,r_2^\uparrow))$\\
       
   456   $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
       
   457          $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\
       
   458   $(r^*)^\uparrow$ & $\dn$ &
       
   459          $\textit{STAR}\;[]\,r^\uparrow$\\
       
   460 \end{tabular}    
       
   461 \end{center}    
       
   462 %\end{definition}
       
   463 
       
   464 \noindent
       
   465 We use up arrows here to indicate that the basic un-annotated regular
       
   466 expressions are ``lifted up'' into something slightly more complex. In the
       
   467 fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
       
   468 attach bits to the front of an annotated regular expression. Its
       
   469 definition is as follows:
       
   470 
       
   471 \begin{center}
       
   472 \begin{tabular}{lcl}
       
   473   $\textit{fuse}\;bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
       
   474   $\textit{fuse}\;bs\,(\textit{ONE}\,bs')$ & $\dn$ &
       
   475      $\textit{ONE}\,(bs\,@\,bs')$\\
       
   476   $\textit{fuse}\;bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ &
       
   477      $\textit{CHAR}\,(bs\,@\,bs')\,c$\\
       
   478   $\textit{fuse}\;bs\,(\textit{ALTS}\,bs'\,as)$ & $\dn$ &
       
   479      $\textit{ALTS}\,(bs\,@\,bs')\,as$\\
       
   480   $\textit{fuse}\;bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ &
       
   481      $\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\
       
   482   $\textit{fuse}\;bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ &
       
   483      $\textit{STAR}\,(bs\,@\,bs')\,a$
       
   484 \end{tabular}    
       
   485 \end{center}  
       
   486 
       
   487 \noindent
       
   488 After internalising the regular expression, we perform successive
       
   489 derivative operations on the annotated regular expressions. This
       
   490 derivative operation is the same as what we had previously for the
       
   491 basic regular expressions, except that we beed to take care of
       
   492 the bitcodes:
       
   493 
       
   494  %\begin{definition}{bder}
       
   495 \begin{center}
       
   496   \begin{tabular}{@{}lcl@{}}
       
   497   $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
   498   $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
   499   $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
       
   500         $\textit{if}\;c=d\; \;\textit{then}\;
       
   501          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
       
   502   $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
       
   503   $\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
       
   504   $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
       
   505      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
   506 					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
       
   507 					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
       
   508   & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
       
   509   $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
       
   510       $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
       
   511        (\textit{STAR}\,[]\,r)$
       
   512 \end{tabular}    
       
   513 \end{center}    
       
   514 %\end{definition}
       
   515 
       
   516 \noindent
       
   517 For instance, when we unfold $\textit{STAR} \; bs \; a$ into a sequence,
       
   518 we need to attach an additional bit $Z$ to the front of $r \backslash c$
       
   519 to indicate that there is one more star iteration. Also the $SEQ$ clause
       
   520 is more subtle---when $a_1$ is $\textit{bnullable}$ (here
       
   521 \textit{bnullable} is exactly the same as $\textit{nullable}$, except
       
   522 that it is for annotated regular expressions, therefore we omit the
       
   523 definition). Assume that $bmkeps$ correctly extracts the bitcode for how
       
   524 $a_1$ matches the string prior to character $c$ (more on this later),
       
   525 then the right branch of $ALTS$, which is $fuse \; bmkeps \;  a_1 (a_2
       
   526 \backslash c)$ will collapse the regular expression $a_1$(as it has
       
   527 already been fully matched) and store the parsing information at the
       
   528 head of the regular expression $a_2 \backslash c$ by fusing to it. The
       
   529 bitsequence $bs$, which was initially attached to the head of $SEQ$, has
       
   530 now been elevated to the top-level of $ALTS$, as this information will be
       
   531 needed whichever way the $SEQ$ is matched---no matter whether $c$ belongs
       
   532 to $a_1$ or $ a_2$. After building these derivatives and maintaining all
       
   533 the lexing information, we complete the lexing by collecting the
       
   534 bitcodes using a generalised version of the $\textit{mkeps}$ function
       
   535 for annotated regular expressions, called $\textit{bmkeps}$:
       
   536 
       
   537 
       
   538 %\begin{definition}[\textit{bmkeps}]\mbox{}
       
   539 \begin{center}
       
   540 \begin{tabular}{lcl}
       
   541   $\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\
       
   542   $\textit{bmkeps}\,(\textit{ALTS}\;bs\,a::as)$ & $\dn$ &
       
   543      $\textit{if}\;\textit{bnullable}\,a$\\
       
   544   & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
       
   545   & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(\textit{ALTS}\;bs\,as)$\\
       
   546   $\textit{bmkeps}\,(\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ &
       
   547      $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
       
   548   $\textit{bmkeps}\,(\textit{STAR}\;bs\,a)$ & $\dn$ &
       
   549      $bs \,@\, [\S]$
       
   550 \end{tabular}    
       
   551 \end{center}    
       
   552 %\end{definition}
       
   553 
       
   554 \noindent
       
   555 This function completes the value information by travelling along the
       
   556 path of the regular expression that corresponds to a POSIX value and
       
   557 collecting all the bitcodes, and using $S$ to indicate the end of star
       
   558 iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
       
   559 decode them, we get the value we expect. The corresponding lexing
       
   560 algorithm looks as follows:
       
   561 
       
   562 \begin{center}
       
   563 \begin{tabular}{lcl}
       
   564   $\textit{blexer}\;r\,s$ & $\dn$ &
       
   565       $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
       
   566   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
   567   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
   568   & & $\;\;\textit{else}\;\textit{None}$
       
   569 \end{tabular}
       
   570 \end{center}
       
   571 
       
   572 \noindent
       
   573 In this definition $\_\backslash s$ is the  generalisation  of the derivative
       
   574 operation from characters to strings (just like the derivatives for un-annotated
       
   575 regular expressions).
       
   576 
       
   577 The main point of the bitcodes and annotated regular expressions is that
       
   578 we can apply rather aggressive (in terms of size) simplification rules
       
   579 in order to keep derivatives small. We have developed such
       
   580 ``aggressive'' simplification rules and generated test data that show
       
   581 that the expected bound can be achieved. Obviously we could only
       
   582 partially cover  the search space as there are infinitely many regular
       
   583 expressions and strings. 
       
   584 
       
   585 One modification we introduced is to allow a list of annotated regular
       
   586 expressions in the \textit{ALTS} constructor. This allows us to not just
       
   587 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
       
   588 also unnecessary ``copies'' of regular expressions (very similar to
       
   589 simplifying $r + r$ to just $r$, but in a more general setting). Another
       
   590 modification is that we use simplification rules inspired by Antimirov's
       
   591 work on partial derivatives. They maintain the idea that only the first
       
   592 ``copy'' of a regular expression in an alternative contributes to the
       
   593 calculation of a POSIX value. All subsequent copies can be pruned away from
       
   594 the regular expression. A recursive definition of our  simplification function 
       
   595 that looks somewhat similar to our Scala code is given below:
       
   596 %\comment{Use $\ZERO$, $\ONE$ and so on. 
       
   597 %Is it $ALTS$ or $ALTS$?}\\
       
   598 
       
   599 \begin{center}
       
   600   \begin{tabular}{@{}lcl@{}}
       
   601    
       
   602   $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
       
   603    &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
       
   604    &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
       
   605    &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
       
   606    &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
       
   607    &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow  \textit{SEQ} \; bs \; a_1' \;  a_2'$ \\
       
   608 
       
   609   $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\
       
   610   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
       
   611    &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
       
   612    &&$\quad\textit{case} \;  as' \Rightarrow  \textit{ALTS}\;bs\;as'$\\ 
       
   613 
       
   614    $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
       
   615 \end{tabular}    
       
   616 \end{center}    
       
   617 
       
   618 \noindent
       
   619 The simplification does a pattern matching on the regular expression.
       
   620 When it detected that the regular expression is an alternative or
       
   621 sequence, it will try to simplify its children regular expressions
       
   622 recursively and then see if one of the children turn into $\ZERO$ or
       
   623 $\ONE$, which might trigger further simplification at the current level.
       
   624 The most involved part is the $\textit{ALTS}$ clause, where we use two
       
   625 auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
       
   626 $\textit{ALTS}$ and reduce as many duplicates as possible. Function
       
   627 $\textit{distinct}$  keeps the first occurring copy only and remove all later ones
       
   628 when detected duplicates. Function $\textit{flatten}$ opens up nested \textit{ALTS}.
       
   629 Its recursive definition is given below:
       
   630 
       
   631  \begin{center}
       
   632   \begin{tabular}{@{}lcl@{}}
       
   633   $\textit{flatten} \; (\textit{ALTS}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \;
       
   634      (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
       
   635   $\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $ \textit{flatten} \;  as' $ \\
       
   636     $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise) 
       
   637 \end{tabular}    
       
   638 \end{center}  
       
   639 
       
   640 \noindent
       
   641 Here $\textit{flatten}$ behaves like the traditional functional programming flatten
       
   642 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
       
   643 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
       
   644 
       
   645 Suppose we apply simplification after each derivative step, and view
       
   646 these two operations as an atomic one: $a \backslash_{simp}\,c \dn
       
   647 \textit{simp}(a \backslash c)$. Then we can use the previous natural
       
   648 extension from derivative w.r.t.~character to derivative
       
   649 w.r.t.~string:%\comment{simp in  the [] case?}
       
   650 
       
   651 \begin{center}
       
   652 \begin{tabular}{lcl}
       
   653 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
       
   654 $r \backslash_{simp} [\,] $ & $\dn$ & $r$
       
   655 \end{tabular}
       
   656 \end{center}
       
   657 
       
   658 \noindent
       
   659 we obtain an optimised version of the algorithm:
       
   660 
       
   661  \begin{center}
       
   662 \begin{tabular}{lcl}
       
   663   $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
       
   664       $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
       
   665   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
   666   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
   667   & & $\;\;\textit{else}\;\textit{None}$
       
   668 \end{tabular}
       
   669 \end{center}
       
   670 
       
   671 \noindent
       
   672 This algorithm keeps the regular expression size small, for example,
       
   673 with this simplification our previous $(a + aa)^*$ example's 8000 nodes
       
   674 will be reduced to just 6 and stays constant, no matter how long the
       
   675 input string is.
       
   676 
       
   677 
       
   678 
   101 \section{Introduction}
   679 \section{Introduction}
   102 
   680 
   103 While we believe derivatives of regular expressions, written
   681 While we believe derivatives of regular expressions, written
   104 $r\backslash s$, are a beautiful concept (in terms of ease of
   682 $r\backslash s$, are a beautiful concept (in terms of ease of
   105 implementing them in functional programming languages and in terms of
   683 implementing them in functional programming languages and in terms of
   284 This property allows one to rewrite an induction hypothesis like 
   862 This property allows one to rewrite an induction hypothesis like 
   285 
   863 
   286 \begin{center} 
   864 \begin{center} 
   287 $ \flex \; r\; id\; (s@[c])\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; (\inj \; (r\backslash s) \;c\;v)\;) r$
   865 $ \flex \; r\; id\; (s@[c])\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; (\inj \; (r\backslash s) \;c\;v)\;) r$
   288 \end{center}
   866 \end{center}
       
   867 
       
   868 
       
   869 \subsubsection{Retrieve Function}
       
   870 The crucial point is to find the
       
   871 $\textit{POSIX}$  information of a regular expression and how it is modified,
       
   872 augmented and propagated 
       
   873 during simplification in parallel with the regular expression that
       
   874 has not been simplified in the subsequent derivative operations.  To aid this,
       
   875 we use the helper function retrieve described by Sulzmann and Lu:
       
   876 \begin{center}
       
   877 \begin{tabular}{@{}l@{\hspace{2mm}}c@{\hspace{2mm}}l@{}}
       
   878   $\textit{retrieve}\,(\textit{ONE}\,bs)\,\Empty$ & $\dn$ & $bs$\\
       
   879   $\textit{retrieve}\,(\textit{CHAR}\,bs\,c)\,(\Char\,d)$ & $\dn$ & $bs$\\
       
   880   $\textit{retrieve}\,(\textit{ALTS}\,bs\,a::as)\,(\Left\,v)$ & $\dn$ &
       
   881      $bs \,@\, \textit{retrieve}\,a\,v$\\
       
   882   $\textit{retrieve}\,(\textit{ALTS}\,bs\,a::as)\,(\Right\,v)$ & $\dn$ &
       
   883   $bs \,@\, \textit{retrieve}\,(\textit{ALTS}\,bs\,as)\,v$\\
       
   884   $\textit{retrieve}\,(\textit{SEQ}\,bs\,a_1\,a_2)\,(\Seq\,v_1\,v_2)$ & $\dn$ &
       
   885      $bs \,@\,\textit{retrieve}\,a_1\,v_1\,@\, \textit{retrieve}\,a_2\,v_2$\\
       
   886   $\textit{retrieve}\,(\textit{STAR}\,bs\,a)\,(\Stars\,[])$ & $\dn$ &
       
   887      $bs \,@\, [\S]$\\
       
   888   $\textit{retrieve}\,(\textit{STAR}\,bs\,a)\,(\Stars\,(v\!::\!vs))$ & $\dn$ &\\
       
   889   \multicolumn{3}{l}{
       
   890      \hspace{3cm}$bs \,@\, [\Z] \,@\, \textit{retrieve}\,a\,v\,@\,
       
   891                     \textit{retrieve}\,(\textit{STAR}\,[]\,a)\,(\Stars\,vs)$}\\
       
   892 \end{tabular}
       
   893 \end{center}
       
   894 %\comment{Did not read further}\\
       
   895 This function assembles the bitcode 
       
   896 %that corresponds to a lexical value for how
       
   897 %the current derivative matches the suffix of the string(the characters that
       
   898 %have not yet appeared, but will appear as the successive derivatives go on.
       
   899 %How do we get this "future" information? By the value $v$, which is
       
   900 %computed by a pass of the algorithm that uses
       
   901 %$inj$ as described in the previous section).  
       
   902 using information from both the derivative regular expression and the
       
   903 value. Sulzmann and Lu poroposed this function, but did not prove
       
   904 anything about it. Ausaf and Urban used it to connect the bitcoded
       
   905 algorithm to the older algorithm by the following equation:
       
   906  
       
   907  \begin{center} $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\;
       
   908 	 (r^\uparrow)\backslash_{simp} \,c)\,v)$ 
       
   909  \end{center} 
       
   910 
       
   911 \noindent
       
   912 whereby $r^\uparrow$ stands for the internalised version of $r$. Ausaf
       
   913 and Urban also used this fact to prove  the correctness of bitcoded
       
   914 algorithm without simplification.  Our purpose of using this, however,
       
   915 is to establish 
       
   916 
       
   917 \begin{center}
       
   918 $ \textit{retrieve} \;
       
   919 a \; v \;=\; \textit{retrieve}  \; (\textit{simp}\,a) \; v'.$
       
   920 \end{center}
       
   921 The idea is that using $v'$, a simplified version of $v$ that had gone
       
   922 through the same simplification step as $\textit{simp}(a)$, we are able
       
   923 to extract the bitcode that gives the same parsing information as the
       
   924 unsimplified one. However, we noticed that constructing such a  $v'$
       
   925 from $v$ is not so straightforward. The point of this is that  we might
       
   926 be able to finally bridge the gap by proving
       
   927 
   289 
   928 
   290 \noindent\rule[1.5ex]{\linewidth}{4pt}
   929 \noindent\rule[1.5ex]{\linewidth}{4pt}
   291 There is no mention of retrieve yet .... this is the second trick in the
   930 There is no mention of retrieve yet .... this is the second trick in the
   292 existing proof. I am not sure whether you need to explain annotated regular
   931 existing proof. I am not sure whether you need to explain annotated regular
   293 expressions much earlier - maybe before the ``existing proof'' section, or
   932 expressions much earlier - maybe before the ``existing proof'' section, or
   354 \end{center}
   993 \end{center}
   355 \noindent
   994 \noindent
   356 $f$ rectifies $r\backslash s$ so the value $\mkeps(f(r\backslash s))$ becomes
   995 $f$ rectifies $r\backslash s$ so the value $\mkeps(f(r\backslash s))$ becomes
   357  something simpler
   996  something simpler
   358 to make the retrieve function defined.\\
   997 to make the retrieve function defined.\\
       
   998 \subsubsection{Ways to Rectify Value}
   359 One way to do this is to prove the following:
   999 One way to do this is to prove the following:
   360 \begin{center}
  1000 \begin{center}
   361 $\retrieve \; \rup\backslash_{simp} s \; \mkeps(\simp(r\backslash s))=\textit{retrieve} \; \rup\backslash s \; \mkeps(r\backslash s)$
  1001 $\retrieve \; \rup\backslash_{simp} s \; \mkeps(\simp(r\backslash s))=\textit{retrieve} \; \rup\backslash s \; \mkeps(r\backslash s)$
   362 \end{center}
  1002 \end{center}
   363 \noindent
  1003 \noindent
   372 $\erase$ in the above equality means to remove the bit-codes
  1012 $\erase$ in the above equality means to remove the bit-codes
   373 in an annotated regular expression and only keep the original
  1013 in an annotated regular expression and only keep the original
   374 regular expression(just like "erasing" the bits). Its definition is omitted.
  1014 regular expression(just like "erasing" the bits). Its definition is omitted.
   375 $\rup\backslash_{simp} \, s$ and $\simp(\rup\backslash s)$
  1015 $\rup\backslash_{simp} \, s$ and $\simp(\rup\backslash s)$
   376 are very closely related, but not identical.
  1016 are very closely related, but not identical.
       
  1017 \subsubsection{Example for $\rup\backslash_{simp} \, s \neq \simp(\rup\backslash s)$}
   377 For example, let $r$ be the regular expression
  1018 For example, let $r$ be the regular expression
   378 $(a+b)(a+a*)$ and $s$  be the string $aa$, then
  1019 $(a+b)(a+a*)$ and $s$  be the string $aa$, then
   379 both $\erase (\rup\backslash_{simp} \, s)$ and $\erase (\simp (\rup\backslash s))$
  1020 both $\erase (\rup\backslash_{simp} \, s)$ and $\erase (\simp (\rup\backslash s))$
   380 are $\ONE + a^*$. However, without $\erase$ 
  1021 are $\ONE + a^*$. However, without $\erase$ 
   381 \begin{center}
  1022 \begin{center}
   385 whereas
  1026 whereas
   386 \begin{center}
  1027 \begin{center}
   387 $\simp(\rup\backslash  s)$ is equal to $(_{00}\ONE +_{011}a^*)$
  1028 $\simp(\rup\backslash  s)$ is equal to $(_{00}\ONE +_{011}a^*)$
   388 \end{center}
  1029 \end{center}
   389 \noindent
  1030 \noindent
   390 For the sake of visual simplicity, we use numbers to denote the bits
  1031 (For the sake of visual simplicity, we use numbers to denote the bits
   391 in bitcodes as we have previously defined for annotated 
  1032 in bitcodes as we have previously defined for annotated 
   392 regular expressions. $\S$ is replaced by 
  1033 regular expressions. $\S$ is replaced by 
   393 subscript $_1$ and $\Z$ by $_0$.
  1034 subscript $_1$ and $\Z$ by $_0$.)
   394 
  1035 
   395 Two "rules" might be inferred from the above example.
  1036 What makes the difference?
   396 
  1037 
   397 First, after erasing the bits the two regular expressions
  1038 %Two "rules" might be inferred from the above example.
   398 are exactly the same: both become $1+a^*$. Here the 
  1039 
   399 function $\simp$ exhibits the "one in the end equals many times
  1040 %First, after erasing the bits the two regular expressions
   400 at the front"
  1041 %are exactly the same: both become $1+a^*$. Here the 
   401 property: one simplification in the end causes the 
  1042 %function $\simp$ exhibits the "one in the end equals many times
   402 same regular expression structure as
  1043 %at the front"
   403 successive simplifications done alongside derivatives.
  1044 %property: one simplification in the end causes the 
   404 $\rup\backslash_{simp} \, s$ unfolds to 
  1045 %same regular expression structure as
   405 $\simp((\simp(r\backslash a))\backslash a)$
  1046 %successive simplifications done alongside derivatives.
   406 and $\simp(\rup\backslash s)$ unfolds to 
  1047 %$\rup\backslash_{simp} \, s$ unfolds to 
   407 $\simp((r\backslash a)\backslash a)$. The one simplification
  1048 %$\simp((\simp(r\backslash a))\backslash a)$
   408 in the latter causes the resulting regular expression to 
  1049 %and $\simp(\rup\backslash s)$ unfolds to 
   409 become $1+a^*$, exactly the same as the former with
  1050 %$\simp((r\backslash a)\backslash a)$. The one simplification
   410 two simplifications.
  1051 %in the latter causes the resulting regular expression to 
   411 
  1052 %become $1+a^*$, exactly the same as the former with
   412 Second, the bit-codes are different, but they are essentially
  1053 %two simplifications.
   413 the same: if we push the outmost bits ${\bf_0}(_0\ONE +_{11}a^*)$ of $\rup\backslash_{simp} \, s$
  1054 
   414 inside then we get $(_{00}\ONE +_{011}a^*)$, exactly the 
  1055 %Second, the bit-codes are different, but they are essentially
   415 same as that of $\rup\backslash \, s$. And this difference 
  1056 %the same: if we push the outmost bits ${\bf_0}(_0\ONE +_{11}a^*)$ of $\rup\backslash_{simp} \, s$
   416 does not matter when we try to apply $\bmkeps$ or $\retrieve$
  1057 %inside then we get $(_{00}\ONE +_{011}a^*)$, exactly the 
   417 to it. This seems a good news if we want to use $\retrieve$
  1058 %same as that of $\rup\backslash \, s$. And this difference 
   418 to prove things.
  1059 %does not matter when we try to apply $\bmkeps$ or $\retrieve$
   419 
  1060 %to it. This seems a good news if we want to use $\retrieve$
   420 If we look into the difference above, we could see that the
  1061 %to prove things.
   421 difference is not fundamental: the bits are just being moved
  1062 
   422 around in a way that does not hurt the correctness.
  1063 %If we look into the difference above, we could see that the
       
  1064 %difference is not fundamental: the bits are just being moved
       
  1065 %around in a way that does not hurt the correctness.
   423 During the first derivative operation, 
  1066 During the first derivative operation, 
   424 $\rup\backslash a=(_0\ONE  + \ZERO)(_0a  +  _1a^*)$  is
  1067 $\rup\backslash a=(_0\ONE  + \ZERO)(_0a  +  _1a^*)$  is
   425 in the form of a sequence regular expression with the first
  1068 in the form of a sequence regular expression with
   426 part being nullable. 
  1069 two components, the first
       
  1070 one $\ONE + \ZERO$ being nullable. 
   427 Recall the simplification function definition:
  1071 Recall the simplification function definition:
   428 \begin{center}
  1072 \begin{center}
   429   \begin{tabular}{@{}lcl@{}}
  1073   \begin{tabular}{@{}lcl@{}}
   430    
  1074    
   431   $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
  1075   $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
   443    $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
  1087    $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
   444 \end{tabular}    
  1088 \end{tabular}    
   445 \end{center}    
  1089 \end{center}    
   446 
  1090 
   447 \noindent
  1091 \noindent
   448 If we call $\simp$ on $\rup$, just as $\backslash_{simp}$
  1092 
       
  1093 and the difinition of $\flatten$:
       
  1094  \begin{center}
       
  1095  \begin{tabular}{c c c}
       
  1096  $\flatten \; []$ & $\dn$ & $[]$\\
       
  1097  $\flatten \; \ZERO::rs$ & $\dn$ & $rs$\\
       
  1098  $\flatten \;(_{\textit{bs}_1}\oplus \textit{rs}_1 ::rs)$ & $\dn$ & $(\map \, (\fuse \, \textit{bs}_1) \,\textit{rs}_1) ::: \flatten(rs)$\\
       
  1099  $\flatten \; r :: rs$ & $\dn$ & $r::\flatten(rs)$
       
  1100  \end{tabular}
       
  1101  \end{center}
       
  1102  
       
  1103  \noindent
       
  1104 If we call $\simp$ on $\rup\backslash a$, just as $\backslash_{simp}$
   449 requires, then we would go throught the third clause of 
  1105 requires, then we would go throught the third clause of 
   450 the sequence case:$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$.
  1106 the sequence case:$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$.
   451 The $\ZERO$ of $(_0\ONE  + \ZERO)$ is simplified away and 
  1107 The $\ZERO$ of $(_0\ONE  + \ZERO)$ is thrown away 
   452 $_0\ONE$ merged into $_0a  +  _1a^*$ by simply
  1108 by $\flatten$ and 
       
  1109 $_0\ONE$ merged into $(_0a  +  _1a^*)$ by simply
   453 putting its bits($_0$) to the front of the second component:
  1110 putting its bits($_0$) to the front of the second component:
   454  ${\bf_0}(_0a  +  _1a^*)$. 
  1111  ${\bf_0}(_0a  +  _1a^*)$. 
   455  After a second derivative operation,
  1112  After a second derivative operation,
   456  namely, $(_0(_0a  +  _1a^*))\backslash a$, we get 
  1113  namely, $(_0(_0a  +  _1a^*))\backslash a$, we get 
   457  $
  1114  $
   458  _0(_0 \ONE  +  _1(_1\ONE \cdot a^*))
  1115  _0(_0 \ONE  +  _1(_1\ONE \cdot a^*))
   459  $, and this simplifies to $_0(_0 \ONE  +  _{11} a^*)$
  1116  $, and this simplifies to $_0(_0 \ONE  +  _{11} a^*)$
   460  by the third clause of the alternative case:
  1117  by the third clause of the alternative case:
   461  $\quad\textit{case} \;  as' \Rightarrow  \textit{ALTS}\;bs\;as'$.
  1118  $\quad\textit{case} \;  as' \Rightarrow  \textit{ALTS}\;bs\;as'$.
   462 The outmost bit $_0$ remains unchanged and stays with 
  1119 The outmost bit $_0$ stays with 
   463 the outmost regular expression. However, things are a bit
  1120 the outmost regular expression, rather than being fused to
   464 different when it comes to $\simp(\rup\backslash \, s)$, because
  1121 its child regular expressions, as what we will later see happens
   465 without simplification,  first term of the sequence 
  1122 to $\simp(\rup\backslash \, s)$.
   466 $\rup\backslash a=(_0\ONE  + \ZERO)(_0a  +  _1a^*)$ 
  1123 If we choose to not simplify in the midst of derivative operations,
   467 is not merged into the second component
  1124 but only do it at the end after the string has been exhausted, 
   468 and is nullable.
  1125 namely, $\simp(\rup\backslash \, s)=\simp((\rup\backslash a)\backslash a)$,
       
  1126 then at the {\bf second} derivative of 
       
  1127 $(\rup\backslash a)\bf{\backslash a}$
       
  1128 we will go throught the clause of $\backslash$:
       
  1129 \begin{center}
       
  1130 \begin{tabular}{lcl}
       
  1131 $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
       
  1132      $(when \; \textit{bnullable}\,a_1)$\\
       
  1133 					       & &$\textit{ALTS}\,bs\,List(\;\;(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
       
  1134 					       & &$(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))\;\;)$\\
       
  1135 \end{tabular}
       
  1136 \end{center}
       
  1137 
       
  1138 because
       
  1139 $\rup\backslash a = (_0\ONE  + \ZERO)(_0a  +  _1a^*)$  
       
  1140 is a sequence
       
  1141 with the first component being nullable
       
  1142 (unsimplified, unlike the first round of running$\backslash_{simp}$).
   469 Therefore $((_0\ONE  + \ZERO)(_0a  +  _1a^*))\backslash a$ splits into
  1143 Therefore $((_0\ONE  + \ZERO)(_0a  +  _1a^*))\backslash a$ splits into
   470 $([(\ZERO + \ZERO)\cdot(_0a  +  _1a^*)] + _0( _0\ONE  + _1[_1\ONE \cdot a^*]))$.
  1144 $([(\ZERO + \ZERO)\cdot(_0a  +  _1a^*)] + _0( _0\ONE  + _1[_1\ONE \cdot a^*]))$.
   471 After these two successive derivatives without simplification,
  1145 After these two successive derivatives without simplification,
   472 we apply $\simp$ to this regular expression, which goes through
  1146 we apply $\simp$ to this regular expression, which goes through
   473 the alternative clause, and each component of $([(\ZERO + \ZERO)\cdot(_0a  +  _1a^*)] + _0( _0\ONE  + _1[_1\ONE \cdot a^*]))$ will be simplified, giving us the list:$[\ZERO, _0(_0\ONE  + _{11}a^*]$,this
  1147 the alternative clause, and each component of 
   474 list is then flattened--for
  1148 $([(\ZERO + \ZERO)\cdot(_0a  +  _1a^*)] + _0( _0\ONE  + _1[_1\ONE \cdot a^*]))$ 
   475 $([(\ZERO + \ZERO)\cdot(_0a  +  _1a^*)]$ it will be simplified into $\ZERO$
  1149 will be simplified, giving us the list:$[\ZERO, _0(_0\ONE  + _{11}a^*)]$
   476 and then thrown away by $\textit{flatten}$; $ _0( _0\ONE  + _1[_1\ONE \cdot a^*]))$
  1150 This list is then "flattened"--$\ZERO$ will be
   477  becomes  $ _{00}\ONE  + _{011}a^*]))$ because flatten opens up the alternative
  1151 thrown away by $\textit{flatten}$; $ _0(_0\ONE  + _{11}a^*)$
   478  $\ONE + a^*$ and fuses the front bit(s) $_0$ to the front of $_0\ONE $ and $_{11}a^*$
  1152 is opened up to make the list consisting of two separate elements 
   479  and get $_{00}$ and $_{011}$.
  1153 $_{00}\ONE$ and $_{011}a^*$, note that $flatten$ 
   480  %CONSTRUCTION SITE
  1154 $\fuse$s the bit(s) $_0$ to the front of $_0\ONE $ and $_{11}a^*$.
       
  1155 In a nutshell, the order of simplification causes
       
  1156 the bits to be moved differently.
       
  1157  
       
  1158  \subsubsection{A Failed Attempt To Remedy the Problem Above}
       
  1159 A simple class of regular expressions and strings 
       
  1160 pairs can be deduced 
       
  1161 from the above example which 
       
  1162 trigger the difference between 
       
  1163 $\rup\backslash_{simp} \, s$
       
  1164 and  $\simp(\rup\backslash s)$:
       
  1165 \[
       
  1166 D = (c_1c_2, \{r_1\cdot r_2 \mid \text{$ c_1\in L(r_1), r_1 \; not \; \nullable, c_2 \in L(r_2),
       
  1167 r_2 \backslash c_2$ is of shape alternative  and $c_1c_2 \notin L(r_1)$}\})
       
  1168 \]
   481 and we can use this to construct a set of examples based 
  1169 and we can use this to construct a set of examples based 
   482 on this type of behaviour of two operations:
  1170 on this type of behaviour of two operations:
   483 $r_1$
  1171 $r_1$
   484 that is to say, despite the bits are being moved around on the regular expression
  1172 that is to say, despite the bits are being moved around on the regular expression
   485 (difference in bits), the structure of the (unannotated)regular expression
  1173 (difference in bits), the structure of the (unannotated)regular expression
   486 after one simplification is exactly the same after the 
  1174 after one simplification is exactly the same after the 
   487 same sequence of derivative operations 
  1175 same sequence of derivative operations 
   488 regardless of whether we did simplification
  1176 regardless of whether we did simplification
   489 along the way.
  1177 along the way.
       
  1178  One way would be to give a function that calls
       
  1179  %CONSTRUCTION SITE
       
  1180 fuse is the culprit: it causes the order in which alternatives
       
  1181 are opened up to be remembered and finally the difference
       
  1182 appear in $\simp(\rup \backslash s)$ and $\rup \backslash{simp} \,s$.
       
  1183 but we have to use them as they are essential in the simplification:
       
  1184 flatten needs them.
       
  1185 
       
  1186 
       
  1187 
   490 However, without erase the above equality does not hold:
  1188 However, without erase the above equality does not hold:
   491 for the regular expression  
  1189 for the regular expression  
   492 $(a+b)(a+a*)$,
  1190 $(a+b)(a+a*)$,
   493 if we do derivative with respect to string $aa$,
  1191 if we do derivative with respect to string $aa$,
   494 we get
  1192 we get