% Chapter Template

\chapter{Regular Expressions and Sulzmann and Lu's Lexing Algorithm Without Bitcodes} % Main chapter title

\label{Chapter2} % In chapter 2 \ref{Chapter2} we will introduce the concepts
       
     6 %and notations we 
       
     7 %use for describing the lexing algorithm by Sulzmann and Lu,
       
     8 %and then give the algorithm and its variant, and discuss
       
     9 %why more aggressive simplifications are needed. 
       
    10 
       
    11 
       
    12 \section{Preliminaries}
       
    13 
       
Suppose we have an alphabet $\Sigma$; the set of all strings
whose characters are drawn from $\Sigma$
is written $\Sigma^*$.
       
    17 
       
We use patterns to define sets of strings concisely. Regular expressions
are one such system of patterns.
The basic regular expressions are defined inductively
by the following grammar:
       
\[			r ::=   \ZERO \mid  \ONE
			 \mid  c  
			 \mid  r_1 \cdot r_2
			 \mid  r_1 + r_2   
			 \mid r^*         
\]
       
    28 
       
The language, i.e.\ the set of strings, of a regular expression is defined as
\begin{center}
\begin{tabular}{lcl}
$L \; (\ZERO)$ & $\dn$ & $\emptyset$\\
$L \; (\ONE)$ & $\dn$ & $\{[]\}$\\
$L \; (c)$ & $\dn$ & $\{[c]\}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ \{s_1 @ s_2 \mid s_1 \in L \; (r_1) \wedge s_2 \in L \; (r_2)\}$\\
$L \; (r^*)$ & $\dn$ & $\bigcup_{0 \leq n} \; (L \; (r))^n$\\
\end{tabular}
\end{center}
This mapping is also called the ``language interpretation'' of regular expressions.
       
    38 
       
    39 
       
    40 
       
The Brzozowski derivative w.r.t.\ a character $c$ is an operation on regular
expressions; it transforms a regular expression into a new one whose language
consists of the strings of the original language with the head character $c$
removed.

Formally, we first define such a transformation on an arbitrary string set,
which we call the semantic derivative:
       
    47 \begin{center}
       
    48 $\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$
       
    49 \end{center}
       
If the string set $A$ happens to have some structure, for example
if it is regular, i.e.\ the language of some regular expression, then
the semantic derivative $\Der \; c\; A$ is also regular, and it can be
computed syntactically on the regular expression itself.
       
    55 % Derivatives of a
       
    56 %regular expression, written $r \backslash c$, give a simple solution
       
    57 %to the problem of matching a string $s$ with a regular
       
    58 %expression $r$: if the derivative of $r$ w.r.t.\ (in
       
    59 %succession) all the characters of the string matches the empty string,
       
    60 %then $r$ matches $s$ (and {\em vice versa}).  
       
    61 
       
The derivative of a regular expression, denoted
$r \backslash c$, is a function that takes a regular expression
$r$ and a character $c$, and returns another regular expression $r'$,
computed by the following recursive function:
       
    66 
       
\begin{center}
\begin{tabular}{lcl}
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
		$d \backslash c$     & $\dn$ & 
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, \nullable(r_1)$\\
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
       
\noindent
The $\nullable$ function tests whether the empty string $""$ 
is in the language of $r$:
       
    85 
       
    86 
       
\begin{center}
		\begin{tabular}{lcl}
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
		\end{tabular}
\end{center}
       
\noindent
The empty set does not contain any string and
therefore not the empty string. The empty string 
regular expression contains the empty string
by definition. The character regular expression
denotes the singleton language containing just that character,
and therefore does not contain the empty string.
An alternative regular expression (or ``or'' expression)
is nullable if either of its children is nullable.
A sequence regular expression
requires both children to contain the empty string
in order to compose the empty string, and the Kleene star
naturally contains the empty string (by zero iterations).
       
   111 
       
We can give the meaning of the derivative operation
in terms of the language interpretation: the language of the syntactic
derivative is exactly the semantic derivative of the language,

\begin{center}
$L \; (r \backslash c) = \Der \; c \; (L \; (r))$.
\end{center}
       
\noindent
The derivative function, written $\backslash c$, 
defines how a regular expression evolves into
a new regular expression after the head character $c$ is chopped off
all the strings it contains.
The most involved cases are the sequence 
and star cases.
The sequence case says that if the first regular expression
contains the empty string, then the second component of the sequence
may also be chosen as the target regular expression to have its
head character chopped off.
The derivative of a star regular expression unwraps one iteration
and attaches a copy of the star regular expression
as the second element of the resulting sequence, so that
further iterations remain possible in later phases of lexing.
       
   147 
       
   148 
       
   149 The main property of the derivative operation
       
   150 that enables us to reason about the correctness of
       
   151 an algorithm using derivatives is 
       
   152 
       
   153 \begin{center}
       
   154 $c\!::\!s \in L(r)$ holds
       
   155 if and only if $s \in L(r\backslash c)$.
       
   156 \end{center}
       
   157 
       
   158 \noindent
       
   159 We can generalise the derivative operation shown above for single characters
       
   160 to strings as follows:
       
   161 
       
\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}
       
   168 
       
   169 \noindent
       
   170 and then define Brzozowski's  regular-expression matching algorithm as:
       
   171 
       
\[
match\;s\;r \;\dn\; nullable(r\backslash s)
\]
       
   175 
       
   176 \noindent
       
Assuming a string is given as a sequence of characters, say $c_0c_1 \ldots c_{n-1}$, 
this algorithm can be presented graphically as follows:
       
   179 
       
\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}
       
   185 
       
   186 \noindent
       
   187 where we start with  a regular expression  $r_0$, build successive
       
   188 derivatives until we exhaust the string and then use \textit{nullable}
       
   189 to test whether the result can match the empty string. It can  be
       
   190 relatively  easily shown that this matcher is correct  (that is given
       
   191 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
       
   192 
       
This is a beautiful and simple definition.
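The matcher just described can be sketched in a few lines of code. Below is a minimal Python sketch; the datatype encoding and names are our own, purely illustrative, and not the formal Isabelle definitions:

```python
from dataclasses import dataclass

# Regular expressions as a small algebraic datatype.
class Re: pass

@dataclass(frozen=True)
class Zero(Re): pass            # ZERO: matches nothing

@dataclass(frozen=True)
class One(Re): pass             # ONE: matches only the empty string

@dataclass(frozen=True)
class Chr(Re):
    c: str                      # a single character

@dataclass(frozen=True)
class Alt(Re):
    r1: Re
    r2: Re                      # r1 + r2

@dataclass(frozen=True)
class Seq(Re):
    r1: Re
    r2: Re                      # r1 . r2

@dataclass(frozen=True)
class Star(Re):
    r: Re                       # r*

def nullable(r: Re) -> bool:
    """Does r match the empty string?"""
    if isinstance(r, (Zero, Chr)): return False
    if isinstance(r, (One, Star)): return True
    if isinstance(r, Alt): return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq): return nullable(r.r1) and nullable(r.r2)

def der(c: str, r: Re) -> Re:
    """Brzozowski derivative of r w.r.t. the character c."""
    if isinstance(r, (Zero, One)): return Zero()
    if isinstance(r, Chr): return One() if r.c == c else Zero()
    if isinstance(r, Alt): return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        if nullable(r.r1):
            return Alt(Seq(der(c, r.r1), r.r2), der(c, r.r2))
        return Seq(der(c, r.r1), r.r2)
    if isinstance(r, Star): return Seq(der(c, r.r), r)

def matches(s: str, r: Re) -> bool:
    """match s r = nullable(r \\ s): derive by each character, then test."""
    for c in s:
        r = der(c, r)
    return nullable(r)
```

For instance, `matches("abab", Star(Seq(Chr('a'), Chr('b'))))` holds, while the same regular expression rejects `"aba"`.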
       
   194 
       
   195 If we implement the above algorithm naively, however,
       
   196 the algorithm can be excruciatingly slow. 
       
   197 
       
   198 
       
\begin{figure}
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=10000,
    ytick={0,1000,...,10000},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={JavaScript},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
           of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}    
\caption{EightThousandNodes} \label{fig:EightThousandNodes}
\end{figure}
       
   227 
       
   228 
       
For example, when starting with the regular
expression $(a + aa)^*$ and building a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree), see Figure~\ref{fig:EightThousandNodes}.
       
   234 The reason why $(a + aa) ^*$ explodes so drastically is that without
       
   235 pruning, the algorithm will keep records of all possible ways of matching:
       
   236 \begin{center}
       
   237 $(a + aa) ^* \backslash (aa) = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
       
   238 \end{center}
       
   239 
       
   240 \noindent
       
Each of the above alternative branches corresponds to a way of matching
the string $aa$: as one iteration $aa$, as two iterations $a$ and $a$, or as
an incomplete iteration of $aa$ of which only the first $a$ has been matched.
       
   243 These different ways of matching will grow exponentially with the string length,
       
   244 and without simplifications that throw away some of these very similar matchings,
       
   245 it is no surprise that these expressions grow so quickly.
       
   246 Operations like
       
   247 $\backslash$ and $\nullable$ need to traverse such trees and
       
   248 consequently the bigger the size of the derivative the slower the
       
   249 algorithm. 
       
   250 
       
Brzozowski was quick to notice that during this process a lot of useless
$\ONE$s and $\ZERO$s are generated, making the resulting regular expressions
far from optimal.
He therefore introduced ``similarity rules'', such
as $P+(Q+R) = (P+Q)+R$, to merge syntactically 
different but language-equivalent sub-regexes and further decrease the size
of the intermediate regexes. 
       
   257 
       
More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step and compose
these two operations into an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can build
a matcher without cumbersome regular expressions.
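To illustrate the effect, the following Python sketch (our own tuple encoding, not the formal definitions) fuses a basic simplification, deleting useless $\ZERO$s and $\ONE$s and adjacent duplicates, with the derivative step, and lets us compare sizes with and without it:

```python
# Regexes are tuples: ('0',), ('1',), ('c', ch), ('+', r1, r2),
# ('.', r1, r2), ('*', r).  Illustrative only.
ZERO, ONE = ('0',), ('1',)

def nullable(r):
    tag = r[0]
    if tag in ('0', 'c'): return False
    if tag in ('1', '*'): return True
    if tag == '+': return nullable(r[1]) or nullable(r[2])
    if tag == '.': return nullable(r[1]) and nullable(r[2])

def der(c, r):
    tag = r[0]
    if tag in ('0', '1'): return ZERO
    if tag == 'c': return ONE if r[1] == c else ZERO
    if tag == '+': return ('+', der(c, r[1]), der(c, r[2]))
    if tag == '.':
        left = ('.', der(c, r[1]), r[2])
        return ('+', left, der(c, r[2])) if nullable(r[1]) else left
    if tag == '*': return ('.', der(c, r[1]), r)

def simp(r):
    """Delete useless ZEROs/ONEs and duplicate alternatives."""
    tag = r[0]
    if tag == '+':
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ZERO: return r2
        if r2 == ZERO: return r1
        if r1 == r2: return r1          # delete duplicates
        return ('+', r1, r2)
    if tag == '.':
        r1, r2 = simp(r[1]), simp(r[2])
        if ZERO in (r1, r2): return ZERO
        if r1 == ONE: return r2
        if r2 == ONE: return r1
        return ('.', r1, r2)
    return r

def der_simp(c, r):
    return simp(der(c, r))              # a \_simp c = simp(a \ c)

def size(r):
    """Number of nodes of the regex tree."""
    return 1 + sum(size(x) for x in r[1:] if isinstance(x, tuple))
```

Taking ten derivatives of $(a+aa)^*$ w.r.t.\ $a$ with `der_simp` keeps the regex much smaller than with plain `der`, while agreeing on nullability.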
       
   264 
       
   265 
       
If we want the size of derivatives in the algorithm to
stay even lower, we need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, which can then be simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
achieve a very tight size bound, namely,
 the same size bound as that of the \emph{partial derivatives}. 
       
   276 
       
Building derivatives and then simplifying them: so far, so good.
But what if we want to 
do lexing instead of just obtaining a YES/NO answer?
This requires us to go back, for a moment, to the world 
without simplification.
Sulzmann and Lu~\cite{Sulzmann2014} came up with a nice and 
elegant (arguably as beautiful as the original
derivatives definition) solution for this.
       
   285 
       
   286 \subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
       
   287 
       
   288 
       
   289 They first defined the datatypes for storing the 
       
   290 lexing information called a \emph{value} or
       
   291 sometimes also \emph{lexical value}.  These values and regular
       
   292 expressions correspond to each other as illustrated in the following
       
   293 table:
       
   294 
       
\begin{center}
	\begin{tabular}{c@{\hspace{20mm}}c}
		\begin{tabular}{@{}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
			$r$ & $::=$  & $\ZERO$\\
			& $\mid$ & $\ONE$   \\
			& $\mid$ & $c$          \\
			& $\mid$ & $r_1 \cdot r_2$\\
			& $\mid$ & $r_1 + r_2$   \\
			\\
			& $\mid$ & $r^*$         \\
		\end{tabular}
		&
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
			$v$ & $::=$  & \\
			&        & $\Empty$   \\
			& $\mid$ & $\Char(c)$          \\
			& $\mid$ & $\Seq\,v_1\, v_2$\\
			& $\mid$ & $\Left(v)$   \\
			& $\mid$ & $\Right(v)$  \\
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
		\end{tabular}
	\end{tabular}
\end{center}
       
   320 
       
   321 \noindent
       
   322 
       
Building on top of Sulzmann and Lu's attempt to formalize the 
notion of POSIX lexing rules \parencite{Sulzmann2014}, 
Ausaf and Urban~\parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
natural deduction style.
With the formally specified rules for what a POSIX matching is,
they proved in Isabelle/HOL that the algorithm gives correct results.

But having a correct result is still not enough;
we also want at least some degree of \emph{efficiency}.
       
   333 
       
   334 
       
   335 
       
One regular expression can have multiple lexical values. For example,
the regular expression $(a+b)^*$ has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
and so on.
Even when we fix the string that a regular expression matches, there can 
still be more than one value corresponding to it.
Take the example where $r= (a^*\cdot a^*)^*$ and the string 
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
The number of different ways of matching 
without allowing any value under a star to be flattened
to an empty string can be given by the following formula:
       
\begin{equation}
	C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}
\end{equation}
and a closed-form formula can be calculated to be
\begin{equation}
	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
\end{equation}
which clearly grows exponentially.
       
   356 
       
A lexer aimed at getting all the possible values has an exponential
worst-case runtime. Therefore it is impractical to try to generate
all possible matches in a run. In practice, we are usually 
interested in POSIX values, which by intuition always
       
   361 \begin{itemize}
       
   362 \item
       
   363 match the leftmost regular expression when multiple options of matching
       
   364 are available  
       
   365 \item 
       
   366 always match a subpart as much as possible before proceeding
       
   367 to the next token.
       
   368 \end{itemize}
       
   369 
       
   370 
       
 For example, the above example has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
       
   373 The output of an algorithm we want would be a POSIX matching
       
   374 encoded as a value.
       
   375 The reason why we are interested in $\POSIX$ values is that they can
       
   376 be practically used in the lexing phase of a compiler front end.
       
   377 For instance, when lexing a code snippet 
       
   378 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
       
   379 as an identifier rather than a keyword.
       
   380 
       
   381 The contribution of Sulzmann and Lu is an extension of Brzozowski's
       
   382 algorithm by a second phase (the first phase being building successive
       
   383 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
       
   384 is generated in case the regular expression matches  the string. 
       
   385 Pictorially, the Sulzmann and Lu algorithm is as follows:
       
   386 
       
\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
\end{tikzcd}
\end{equation}
\end{ceqn}
       
   395 
       
   396 
       
   397 \noindent
       
   398 For convenience, we shall employ the following notations: the regular
       
   399 expression we start with is $r_0$, and the given string $s$ is composed
       
   400 of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
       
   401 left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
       
   402 to the characters $c_0$, $c_1$  until we exhaust the string and obtain
       
   403 the derivative $r_n$. We test whether this derivative is
       
   404 $\textit{nullable}$ or not. If not, we know the string does not match
       
   405 $r$ and no value needs to be generated. If yes, we start building the
       
   406 values incrementally by \emph{injecting} back the characters into the
       
   407 earlier values $v_n, \ldots, v_0$. This is the second phase of the
       
   408 algorithm from the right to left. For the first value $v_n$, we call the
       
   409 function $\textit{mkeps}$, which builds a POSIX lexical value
       
   410 for how the empty string has been matched by the (nullable) regular
       
   411 expression $r_n$. This function is defined as
       
   412 
       
	\begin{center}
		\begin{tabular}{lcl}
			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
			& \textit{if} $\nullable(r_{1})$\\ 
			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
			$\mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
		\end{tabular}
	\end{center}
       
   424 
       
   425 
       
   426 \noindent 
       
   427 After the $\mkeps$-call, we inject back the characters one by one in order to build
       
   428 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
       
   429 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
       
   430 After injecting back $n$ characters, we get the lexical value for how $r_0$
       
   431 matches $s$. The POSIX value is maintained throught out the process.
       
   432 For this Sulzmann and Lu defined a function that reverses
       
   433 the ``chopping off'' of characters during the derivative phase. The
       
   434 corresponding function is called \emph{injection}, written
       
   435 $\textit{inj}$; it takes three arguments: the first one is a regular
       
   436 expression ${r_{i-1}}$, before the character is chopped off, the second
       
   437 is a character ${c_{i-1}}$, the character we want to inject and the
       
   438 third argument is the value ${v_i}$, into which one wants to inject the
       
   439 character (it corresponds to the regular expression after the character
       
   440 has been chopped off). The result of this function is a new value. The
       
   441 definition of $\textit{inj}$ is as follows: 
       
   442 
       
\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,\Empty$            & $\dn$ & $\Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$  & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$         & $\dn$  & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}
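Mechanically, $\mkeps$ and $\textit{inj}$ combine with derivatives into the two-phase lexer pictured in the diagram above. A minimal Python sketch of this combination follows; the tuple encoding of regexes and values is our own, not Sulzmann and Lu's original code:

```python
# Two-phase lexer: derivatives forward, mkeps + inj backward.
# Regexes: ('0',), ('1',), ('c', ch), ('+', r1, r2), ('.', r1, r2), ('*', r).
# Values:  ('Empty',), ('Char', ch), ('Seq', v1, v2), ('Left', v),
#          ('Right', v), ('Stars', [v, ...]).
ZERO, ONE = ('0',), ('1',)

def nullable(r):
    t = r[0]
    return (t in ('1', '*') or
            (t == '+' and (nullable(r[1]) or nullable(r[2]))) or
            (t == '.' and nullable(r[1]) and nullable(r[2])))

def der(c, r):
    t = r[0]
    if t in ('0', '1'): return ZERO
    if t == 'c': return ONE if r[1] == c else ZERO
    if t == '+': return ('+', der(c, r[1]), der(c, r[2]))
    if t == '.':
        left = ('.', der(c, r[1]), r[2])
        return ('+', left, der(c, r[2])) if nullable(r[1]) else left
    if t == '*': return ('.', der(c, r[1]), r)

def mkeps(r):
    """POSIX value for how a nullable r matches the empty string."""
    t = r[0]
    if t == '1': return ('Empty',)
    if t == '+':
        return ('Left', mkeps(r[1])) if nullable(r[1]) else ('Right', mkeps(r[2]))
    if t == '.': return ('Seq', mkeps(r[1]), mkeps(r[2]))
    if t == '*': return ('Stars', [])

def inj(r, c, v):
    """Inject character c back into value v (a value for r \\ c)."""
    t = r[0]
    if t == 'c' and v[0] == 'Empty': return ('Char', c)
    if t == '+':
        if v[0] == 'Left':  return ('Left',  inj(r[1], c, v[1]))
        if v[0] == 'Right': return ('Right', inj(r[2], c, v[1]))
    if t == '.':
        if v[0] == 'Seq':   return ('Seq', inj(r[1], c, v[1]), v[2])
        if v[0] == 'Left':  return ('Seq', inj(r[1], c, v[1][1]), v[1][2])
        if v[0] == 'Right': return ('Seq', mkeps(r[1]), inj(r[2], c, v[1]))
    if t == '*':  # v must be Seq(v1, Stars vs): a freshly unfolded iteration
        return ('Stars', [inj(r[1], c, v[1])] + v[2][1])

def lexer(r, s):
    """Return the POSIX value for r matching s, or None."""
    if not s:
        return mkeps(r) if nullable(r) else None
    v = lexer(der(s[0], r), s[1:])
    return None if v is None else inj(r, s[0], v)

def flatten(v):
    """Recover the matched string from a value."""
    t = v[0]
    if t == 'Empty': return ''
    if t == 'Char': return v[1]
    if t in ('Left', 'Right'): return flatten(v[1])
    if t == 'Seq': return flatten(v[1]) + flatten(v[2])
    if t == 'Stars': return ''.join(flatten(x) for x in v[1])
```

For instance, lexing `"ab"` with $(a+b)^*$ yields the value $\Stars\,[\Left(\Char(a)),\Right(\Char(b))]$, which flattens back to the input string.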
       
   454 
       
\noindent This definition is by recursion on the ``shape'' of regular
expressions and values. 
The clauses basically do one thing: they identify the ``hole'' in the
value into which the character can be injected back.
For instance, in the last clause, the value to be turned into
a star value must be a sequence value. We know that its first 
component corresponds to the child regex of the star
with the head character chopped off, that is, the iteration of the star
that has just been unfolded. This value is followed by the already
matched star iterations collected so far. So we inject the character 
back into the first value and form a new value with this new iteration
added to the front of the previous list of iterations, all under the $\Stars$
constructor.
       
   469 
       
We have mentioned before that derivatives without simplification 
can get clumsy, and this is true for values as well: by definition
they mirror the size of the regular expressions.
       
   473 
       
One can introduce simplification on the regexes and values, but has to
be careful not to break correctness, as the injection 
function relies heavily on the structure of the regexes and values
matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications 
during the injection phase.
One can then prove that the POSIX value of how
regular expressions match strings is not affected, although this
is much harder to establish. 
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}. 
       
   486 
       
   487 
       
   488 
       
   489 %Brzozowski, after giving the derivatives and simplification,
       
   490 %did not explore lexing with simplification or he may well be 
       
   491 %stuck on an efficient simplificaiton with a proof.
       
   492 %He went on to explore the use of derivatives together with 
       
   493 %automaton, and did not try lexing using derivatives.
       
   494 
       
We want to get rid of the complex and fragile rectification of values.
Can we avoid creating those intermediate values $v_1,\ldots, v_n$
altogether and obtain the lexing information, which in principle is
already there, while doing derivatives in one pass, without a second
phase of injection?
Can we at the same time make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
adding additional information to the 
regular expressions, called \emph{bitcodes}.
       
   505 
       
   506 \subsection*{Bit-coded Algorithm}
       
   507 Bits and bitcodes (lists of bits) are defined as:
       
   508 
       
\begin{center}
		$b ::=   1 \mid  0 \qquad
bs ::= [] \mid b::bs    
$
\end{center}
       
   514 
       
   515 \noindent
       
   516 The $1$ and $0$ are not in bold in order to avoid 
       
   517 confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
       
   518 bit-lists) can be used to encode values (or potentially incomplete values) in a
       
   519 compact form. This can be straightforwardly seen in the following
       
   520 coding function from values to bitcodes: 
       
   521 
       
\begin{center}
\begin{tabular}{lcl}
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
                                                 code(\Stars\,vs)$
\end{tabular}
\end{center}
       
   534 
       
\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $0$, $\Right$ into $1$, and marking the start of a non-empty
star iteration by $1$. The border where a local star terminates
is marked by $0$. This coding is lossy, as it throws away the information about
characters, and it also does not encode the ``boundary'' between two
sequence values. Moreover, from the bitcode alone we cannot even tell
whether the $1$s and $0$s stand for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that
bits can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we 
need the corresponding regular expression as extra information. This
means the decoding function is defined as:
       
   548 
       
   549 
       
   550 %\begin{definition}[Bitdecoding of Values]\mbox{}
       
   551 \begin{center}
       
   552 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
       
   553   $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
       
   554   $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
       
   555   $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
       
   556      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       
   557        (\Left\,v, bs_1)$\\
       
   558   $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
       
   559      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       
   560        (\Right\,v, bs_1)$\\                           
       
   561   $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
       
   562         $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
       
   563   & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
       
   564   & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
       
   565   $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
       
   566   $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & 
       
   567          $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
       
   568   & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
       
   569   & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
       
   570   
       
   571   $\textit{decode}\,bs\,r$ & $\dn$ &
       
   572      $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
       
   573   & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       
   574        \textit{else}\;\textit{None}$                       
       
   575 \end{tabular}    
       
   576 \end{center}    
       
   577 %\end{definition}
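To make these clauses concrete, the following is a sketch of $\textit{decode}$ in Scala. The datatype and constructor names (such as \textit{Chr} and \textit{Sequ} for values) are our own choices here and are not taken from Sulzmann and Lu's implementation:

```scala
// Plain regular expressions and values, as in the introduction
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Stars(vs: List[Val]) extends Val

type Bits = List[Int]

// decodeAux consumes just enough bits to rebuild one value for r
// and returns the remaining bits
def decodeAux(bs: Bits, r: Rexp): (Val, Bits) = (bs, r) match {
  case (bs, ONE) => (Empty, bs)
  case (bs, CHAR(c)) => (Chr(c), bs)
  case (0 :: bs, ALT(r1, _)) => { val (v, b) = decodeAux(bs, r1); (Left(v), b) }
  case (1 :: bs, ALT(_, r2)) => { val (v, b) = decodeAux(bs, r2); (Right(v), b) }
  case (bs, SEQ(r1, r2)) =>
    val (v1, b1) = decodeAux(bs, r1)
    val (v2, b2) = decodeAux(b1, r2)
    (Sequ(v1, v2), b2)
  case (0 :: bs, STAR(_)) => (Stars(Nil), bs)   // 0 closes the star
  case (1 :: bs, STAR(r1)) =>                   // 1 opens one more iteration
    val (v, b1) = decodeAux(bs, r1)
    decodeAux(b1, STAR(r1)) match {
      case (Stars(vs), b2) => (Stars(v :: vs), b2)
      case _ => sys.error("ill-formed bitcode")
    }
  case _ => sys.error("ill-formed bitcode")
}

// decode succeeds only if all bits are consumed
def decode(bs: Bits, r: Rexp): Option[Val] =
  decodeAux(bs, r) match { case (v, Nil) => Some(v) case _ => None }
```

The auxiliary function threads the still-unconsumed bits through the recursion, so the top-level $\textit{decode}$ can check that the bit sequence is used up completely.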
       
   578 
       
579 Sulzmann and Lu integrated the bitcodes into regular expressions to
       
   580 create annotated regular expressions \cite{Sulzmann2014}.
       
   581 \emph{Annotated regular expressions} are defined by the following
       
   582 grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
       
   583 
       
   584 \begin{center}
       
   585 \begin{tabular}{lcl}
       
   586   $\textit{a}$ & $::=$  & $\ZERO$\\
       
   587                   & $\mid$ & $_{bs}\ONE$\\
       
   588                   & $\mid$ & $_{bs}{\bf c}$\\
       
   589                   & $\mid$ & $_{bs}\sum\,as$\\
       
   590                   & $\mid$ & $_{bs}a_1\cdot a_2$\\
       
   591                   & $\mid$ & $_{bs}a^*$
       
   592 \end{tabular}    
       
   593 \end{center}  
       
   594 %(in \textit{ALTS})
       
   595 
       
   596 \noindent
       
   597 where $bs$ stands for bitcodes, $a$  for $\mathbf{a}$nnotated regular
       
   598 expressions and $as$ for a list of annotated regular expressions.
       
599 The alternative constructor ($\sum$) has been generalised to

600 accept a list of annotated regular expressions rather than just two.
       
   601 We will show that these bitcodes encode information about
       
   602 the (POSIX) value that should be generated by the Sulzmann and Lu
       
   603 algorithm.
       
   604 
       
   605 
       
   606 To do lexing using annotated regular expressions, we shall first
       
   607 transform the usual (un-annotated) regular expressions into annotated
       
   608 regular expressions. This operation is called \emph{internalisation} and
       
   609 defined as follows:
       
   610 
       
   611 %\begin{definition}
       
   612 \begin{center}
       
   613 \begin{tabular}{lcl}
       
   614   $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
       
   615   $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
       
   616   $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
       
   617   $(r_1 + r_2)^\uparrow$ & $\dn$ &
       
   618   $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
       
   619   \textit{fuse}\,[1]\,r_2^\uparrow]$\\
       
   620   $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
       
   621          $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
       
   622   $(r^*)^\uparrow$ & $\dn$ &
       
   623          $_{[]}(r^\uparrow)^*$\\
       
   624 \end{tabular}    
       
   625 \end{center}    
       
   626 %\end{definition}
       
   627 
       
   628 \noindent
       
   629 We use up arrows here to indicate that the basic un-annotated regular
       
   630 expressions are ``lifted up'' into something slightly more complex. In the
       
   631 fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
       
   632 attach bits to the front of an annotated regular expression. Its
       
   633 definition is as follows:
       
   634 
       
   635 \begin{center}
       
   636 \begin{tabular}{lcl}
       
   637   $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
       
   638   $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
       
   639      $_{bs @ bs'}\ONE$\\
       
   640   $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
       
   641      $_{bs@bs'}{\bf c}$\\
       
   642   $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
       
   643      $_{bs@bs'}\sum\textit{as}$\\
       
   644   $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
       
   645      $_{bs@bs'}a_1 \cdot a_2$\\
       
   646   $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
       
   647      $_{bs @ bs'}a^*$
       
   648 \end{tabular}    
       
   649 \end{center}  
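Both functions can be sketched in Scala as follows. The annotated constructors (\textit{AZERO}, \textit{AONE}, \ldots) carry the bitcode list as their first argument; the names are again our own and the $0$/$1$ bits attached to the two alternatives correspond to the $\Left$/$\Right$ convention of $\textit{decode}$:

```scala
type Bits = List[Int]

// Plain regular expressions
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// Annotated regular expressions
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp
case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: Bits, a: ARexp) extends ARexp

// fuse attaches bits to the front of an annotated regular expression
def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ++ bs1)
  case ACHAR(bs1, c) => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as) => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ++ bs1, a1)
}

// internalise decorates a plain regular expression with empty bitcodes,
// marking the left alternative with 0 and the right with 1
def internalise(r: Rexp): ARexp = r match {
  case ZERO => AZERO
  case ONE => AONE(Nil)
  case CHAR(c) => ACHAR(Nil, c)
  case ALT(r1, r2) => AALTS(Nil, List(fuse(List(0), internalise(r1)),
                                      fuse(List(1), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1) => ASTAR(Nil, internalise(r1))
}
```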
       
   650 
       
   651 \noindent
       
   652 After internalising the regular expression, we perform successive
       
   653 derivative operations on the annotated regular expressions. This
       
   654 derivative operation is the same as what we had previously for the
       
655 basic regular expressions, except that we need to take care of
       
   656 the bitcodes:
       
   657 
       
   658 
       
   659 \iffalse
       
   660  %\begin{definition}{bder}
       
   661 \begin{center}
       
   662   \begin{tabular}{@{}lcl@{}}
       
   663   $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
   664   $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
   665   $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
       
   666         $\textit{if}\;c=d\; \;\textit{then}\;
       
   667          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
       
   668   $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
       
   669   $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
       
   670   $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
       
   671      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
   672 					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
       
   673 					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
       
   674   & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
       
   675   $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
       
   676       $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
       
   677        (\textit{STAR}\,[]\,r)$
       
   678 \end{tabular}    
       
   679 \end{center}    
       
   680 %\end{definition}
       
   681 
       
   682 \begin{center}
       
   683   \begin{tabular}{@{}lcl@{}}
       
   684   $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
   685   $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
   686   $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
       
   687         $\textit{if}\;c=d\; \;\textit{then}\;
       
   688          _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\  
       
   689   $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
       
   690   $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
       
   691   $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
       
   692      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
   693 					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
       
   694 					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
       
   695   & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
       
   696   $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
       
   697       $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
       
   698        (_{bs}\textit{STAR}\,[]\,r)$
       
   699 \end{tabular}    
       
   700 \end{center}    
       
   701 %\end{definition}
       
   702 \fi
       
   703 
       
   704 \begin{center}
       
   705   \begin{tabular}{@{}lcl@{}}
       
   706   $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
       
   707   $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
       
   708   $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
       
   709         $\textit{if}\;c=d\; \;\textit{then}\;
       
   710          _{bs}\ONE\;\textit{else}\;\ZERO$\\  
       
   711   $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
       
   712   $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
       
   713   $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
       
   714      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
   715 					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
       
   716 					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
       
   717   & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
       
718   $(_{bs}a^*)\,\backslash c$ & $\dn$ &

719       $_{bs}(\textit{fuse}\, [1] \; (a\,\backslash c))\cdot

720        (_{[]}a^*)$
       
   721 \end{tabular}    
       
   722 \end{center}    
       
   723 
       
   724 %\end{definition}
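A Scala sketch of this bitcoded derivative is given below, together with $\textit{bnullable}$ and $\textit{bmkeps}$, on which the sequence clause depends. Two hedges: in the star clause we attach the bit $1$ to the unfolded iteration, following the convention of $\textit{decode}$ above where $1$ marks the start of an iteration, and in the alternative clause of $\textit{bmkeps}$ we prepend the bitcode $bs$ only once:

```scala
type Bits = List[Int]

abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp
case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: Bits, a: ARexp) extends ARexp

def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ++ bs1)
  case ACHAR(bs1, c) => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as) => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ++ bs1, a1)
}

// bnullable is nullable lifted to annotated regular expressions
def bnullable(a: ARexp): Boolean = a match {
  case AONE(_) => true
  case ASTAR(_, _) => true
  case AALTS(_, as) => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case _ => false
}

// bmkeps collects the bits along a nullable path
def bmkeps(a: ARexp): Bits = a match {
  case AONE(bs) => bs
  case AALTS(bs, a1 :: as) =>
    if (bnullable(a1)) bs ++ bmkeps(a1) else bmkeps(AALTS(bs, as))
  case ASEQ(bs, a1, a2) => bs ++ bmkeps(a1) ++ bmkeps(a2)
  case ASTAR(bs, _) => bs ++ List(0)   // 0 terminates the star
  case _ => sys.error("bmkeps: not nullable")
}

def bder(c: Char, a: ARexp): ARexp = a match {
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  // unfold one star iteration, marked by bit 1
  case ASTAR(bs, a1) => ASEQ(bs, fuse(List(1), bder(c, a1)), ASTAR(Nil, a1))
  case _ => AZERO   // covers AZERO and AONE
}
```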
       
   725 \noindent
       
726 For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,

727 we need to unfold it into a sequence,

728 and attach an additional bit $1$ to the front of $a \backslash c$

729 to indicate one more star iteration. Also the sequence clause
       
730 is more subtle---when $a_1$ is $\textit{bnullable}$ (here

731 \textit{bnullable} is exactly the same as $\textit{nullable}$, except

732 that it operates on annotated regular expressions; we therefore omit its

733 definition), the derivative considers both branches. Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
       
   734 $a_1$ matches the string prior to character $c$ (more on this later),
       
735 then the right branch of the alternative, which is $\textit{fuse} \; (\bmkeps \;  a_1)\; (a_2

736 \backslash c)$, will collapse the regular expression $a_1$ (as it has
       
   737 already been fully matched) and store the parsing information at the
       
   738 head of the regular expression $a_2 \backslash c$ by fusing to it. The
       
   739 bitsequence $\textit{bs}$, which was initially attached to the
       
   740 first element of the sequence $a_1 \cdot a_2$, has
       
   741 now been elevated to the top-level of $\sum$, as this information will be
       
   742 needed whichever way the sequence is matched---no matter whether $c$ belongs
       
   743 to $a_1$ or $ a_2$. After building these derivatives and maintaining all
       
   744 the lexing information, we complete the lexing by collecting the
       
   745 bitcodes using a generalised version of the $\textit{mkeps}$ function
       
   746 for annotated regular expressions, called $\textit{bmkeps}$:
       
   747 
       
   748 
       
   749 %\begin{definition}[\textit{bmkeps}]\mbox{}
       
   750 \begin{center}
       
   751 \begin{tabular}{lcl}
       
   752   $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
       
   753   $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
       
   754      $\textit{if}\;\textit{bnullable}\,a$\\
       
   755   & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
       
756   & &$\textit{else}\;\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
       
   757   $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
       
   758      $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
       
   759   $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
       
   760      $bs \,@\, [0]$
       
   761 \end{tabular}    
       
   762 \end{center}    
       
   763 %\end{definition}
       
   764 
       
   765 \noindent
       
   766 This function completes the value information by travelling along the
       
   767 path of the regular expression that corresponds to a POSIX value and
       
768 collecting all the bitcodes, using $0$ to indicate the end of star
       
   769 iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
       
   770 decode them, we get the value we expect. The corresponding lexing
       
   771 algorithm looks as follows:
       
   772 
       
   773 \begin{center}
       
   774 \begin{tabular}{lcl}
       
   775   $\textit{blexer}\;r\,s$ & $\dn$ &
       
   776       $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
       
   777   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
   778   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
   779   & & $\;\;\textit{else}\;\textit{None}$
       
   780 \end{tabular}
       
   781 \end{center}
       
   782 
       
   783 \noindent
       
   784 In this definition $\_\backslash s$ is the  generalisation  of the derivative
       
   785 operation from characters to strings (just like the derivatives for un-annotated
       
   786 regular expressions).
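Putting the pieces together, the following self-contained Scala sketch implements the whole $\textit{blexer}$ pipeline (internalise, take derivatives, extract the bitcode with $\textit{bmkeps}$, decode). It repeats the earlier definitions so that it can be run on its own; all names are ours:

```scala
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Stars(vs: List[Val]) extends Val

type Bits = List[Int]
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp
case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: Bits, a: ARexp) extends ARexp

def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ++ bs1)
  case ACHAR(bs1, c) => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as) => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ++ bs1, a1)
}

def internalise(r: Rexp): ARexp = r match {
  case ZERO => AZERO
  case ONE => AONE(Nil)
  case CHAR(c) => ACHAR(Nil, c)
  case ALT(r1, r2) => AALTS(Nil, List(fuse(List(0), internalise(r1)),
                                      fuse(List(1), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1) => ASTAR(Nil, internalise(r1))
}

def bnullable(a: ARexp): Boolean = a match {
  case AONE(_) => true
  case ASTAR(_, _) => true
  case AALTS(_, as) => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case _ => false
}

def bmkeps(a: ARexp): Bits = a match {
  case AONE(bs) => bs
  case AALTS(bs, a1 :: as) =>
    if (bnullable(a1)) bs ++ bmkeps(a1) else bmkeps(AALTS(bs, as))
  case ASEQ(bs, a1, a2) => bs ++ bmkeps(a1) ++ bmkeps(a2)
  case ASTAR(bs, _) => bs ++ List(0)
  case _ => sys.error("not nullable")
}

def bder(c: Char, a: ARexp): ARexp = a match {
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1) => ASEQ(bs, fuse(List(1), bder(c, a1)), ASTAR(Nil, a1))
  case _ => AZERO
}

def decodeAux(bs: Bits, r: Rexp): (Val, Bits) = (bs, r) match {
  case (bs, ONE) => (Empty, bs)
  case (bs, CHAR(c)) => (Chr(c), bs)
  case (0 :: bs, ALT(r1, _)) => { val (v, b) = decodeAux(bs, r1); (Left(v), b) }
  case (1 :: bs, ALT(_, r2)) => { val (v, b) = decodeAux(bs, r2); (Right(v), b) }
  case (bs, SEQ(r1, r2)) =>
    val (v1, b1) = decodeAux(bs, r1); val (v2, b2) = decodeAux(b1, r2)
    (Sequ(v1, v2), b2)
  case (0 :: bs, STAR(_)) => (Stars(Nil), bs)
  case (1 :: bs, STAR(r1)) =>
    val (v, b1) = decodeAux(bs, r1)
    decodeAux(b1, STAR(r1)) match { case (Stars(vs), b2) => (Stars(v :: vs), b2)
                                    case _ => sys.error("bad bits") }
  case _ => sys.error("bad bits")
}

def decode(bs: Bits, r: Rexp): Option[Val] =
  decodeAux(bs, r) match { case (v, Nil) => Some(v) case _ => None }

def blexer(r: Rexp, s: String): Option[Val] = {
  val a = s.foldLeft(internalise(r))((a1, c) => bder(c, a1))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}
```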
       
   787 
       
788 Now we introduce the simplifications, which are the reason why we introduced the

789 bitcodes in the first place.
       
   790 
       
   791 \subsection*{Simplification Rules}
       
   792 
       
793 This section introduces aggressive (in terms of size) simplification rules

794 on annotated regular expressions

795 to keep derivatives small. Such simplifications are promising,

796 as we have

797 generated test data showing

798 that a tight size bound can be achieved. We could only

799 partially cover the search space, as there are infinitely many regular

800 expressions and strings.
       
   801 
       
   802 One modification we introduced is to allow a list of annotated regular
       
   803 expressions in the $\sum$ constructor. This allows us to not just
       
   804 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
       
   805 also unnecessary ``copies'' of regular expressions (very similar to
       
   806 simplifying $r + r$ to just $r$, but in a more general setting). Another
       
   807 modification is that we use simplification rules inspired by Antimirov's
       
   808 work on partial derivatives. They maintain the idea that only the first
       
   809 ``copy'' of a regular expression in an alternative contributes to the
       
   810 calculation of a POSIX value. All subsequent copies can be pruned away from
       
   811 the regular expression. A recursive definition of our  simplification function 
       
   812 that looks somewhat similar to our Scala code is given below:
       
   813 %\comment{Use $\ZERO$, $\ONE$ and so on. 
       
   814 %Is it $ALTS$ or $ALTS$?}\\
       
   815 
       
   816 \begin{center}
       
   817   \begin{tabular}{@{}lcl@{}}
       
   818    
       
   819   $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
       
   820    &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
       
   821    &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
       
   822    &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
       
   823    &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
       
   824    &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
       
   825 
       
   826   $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
       
   827   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
       
   828    &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
       
   829    &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
       
   830 
       
   831    $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
       
   832 \end{tabular}    
       
   833 \end{center}    
       
   834 
       
   835 \noindent
       
836 The simplification performs a pattern match on the regular expression.

837 When it detects that the regular expression is an alternative or

838 sequence, it simplifies its child regular expressions

839 recursively and then checks whether one of the children has turned into $\ZERO$ or

840 $\ONE$, which might trigger further simplification at the current level.
       
   841 The most involved part is the $\sum$ clause, where we use two
       
   842 auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
       
   843 alternatives and reduce as many duplicates as possible. Function
       
844 $\textit{distinct}$ keeps only the first occurring copy and removes all

845 later duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
       
   846 Its recursive definition is given below:
       
   847 
       
   848  \begin{center}
       
   849   \begin{tabular}{@{}lcl@{}}
       
   850   $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
       
   851      (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
       
   852   $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
       
   853     $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
       
   854 \end{tabular}    
       
   855 \end{center}  
       
   856 
       
   857 \noindent
       
   858 Here $\textit{flatten}$ behaves like the traditional functional programming flatten
       
859 function, except that it also removes $\ZERO$s. In terms of regular expressions, it
       
   860 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
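A Scala sketch of $\simp$ and $\textit{flatten}$ is given below. One caveat: for $\textit{distinct}$ we compare the $\erase$-ed (un-annotated) form of the children, so that two copies differing only in their bitcodes count as duplicates; also, in the first $\ONE$ clause for sequences we fuse the $\ONE$'s own bits rather than discarding them:

```scala
// Plain regular expressions, needed only for erasing annotations
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

type Bits = List[Int]
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp
case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: Bits, a: ARexp) extends ARexp

def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ++ bs1)
  case ACHAR(bs1, c) => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as) => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ++ bs1, a1)
}

def erase(a: ARexp): Rexp = a match {
  case AZERO => ZERO
  case AONE(_) => ONE
  case ACHAR(_, c) => CHAR(c)
  case AALTS(_, Nil) => ZERO
  case AALTS(_, a1 :: Nil) => erase(a1)
  case AALTS(bs, a1 :: as) => ALT(erase(a1), erase(AALTS(bs, as)))
  case ASEQ(_, a1, a2) => SEQ(erase(a1), erase(a2))
  case ASTAR(_, a1) => STAR(erase(a1))
}

// flatten opens up nested alternatives and removes ZEROs
def flatten(as: List[ARexp]): List[ARexp] = as match {
  case Nil => Nil
  case AZERO :: rest => flatten(rest)
  case AALTS(bs, as1) :: rest => as1.map(fuse(bs, _)) ++ flatten(rest)
  case a :: rest => a :: flatten(rest)
}

def simp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
    case (AZERO, _) => AZERO
    case (_, AZERO) => AZERO
    case (AONE(bs1), a2s) => fuse(bs ++ bs1, a2s) // keep the ONE's bits
    case (a1s, AONE(bs1)) => fuse(bs, a1s)        // as in the rule above
    case (a1s, a2s) => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) => flatten(as.map(simp)).distinctBy(erase) match {
    case Nil => AZERO
    case a1 :: Nil => fuse(bs, a1)
    case as1 => AALTS(bs, as1)
  }
  case a => a
}
```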
       
   861 
       
   862 Having defined the $\simp$ function,
       
863 we can, as before, naturally extend

864 the derivative w.r.t.~a character to the derivative

865 w.r.t.~a string:%\comment{simp in  the [] case?}
       
   866 
       
   867 \begin{center}
       
   868 \begin{tabular}{lcl}
       
   869 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
       
   870 $r \backslash_{simp} [\,] $ & $\dn$ & $r$
       
   871 \end{tabular}
       
   872 \end{center}
       
   873 
       
   874 \noindent
       
   875 to obtain an optimised version of the algorithm:
       
   876 
       
   877  \begin{center}
       
   878 \begin{tabular}{lcl}
       
   879   $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
       
   880       $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
       
   881   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
   882   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
   883   & & $\;\;\textit{else}\;\textit{None}$
       
   884 \end{tabular}
       
   885 \end{center}
       
   886 
       
   887 \noindent
       
888 This algorithm keeps the regular expression size small. For example,

889 with this simplification the derivative of our previous $(a + aa)^*$ example is reduced

890 from 8000 nodes to just 6, and its size stays constant no matter how long the

891 input string is.
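The size behaviour can be observed experimentally. The following self-contained sketch repeats our definitions, builds the internalised form of $(a + aa)^*$ by hand, and measures the size of the simplified derivatives. With the rules sketched here the size plateaus at a small constant (under 20 nodes, not necessarily the exact 6 quoted above, since this sketch compares bit-stripped rather than erased regular expressions when removing duplicates), instead of growing without bound:

```scala
type Bits = List[Int]
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp
case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: Bits, a: ARexp) extends ARexp

def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ++ bs1)
  case ACHAR(bs1, c) => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as) => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ++ bs1, a1)
}

// strip removes all bits; used here in place of erase to detect duplicates
def strip(a: ARexp): ARexp = a match {
  case AONE(_) => AONE(Nil)
  case ACHAR(_, c) => ACHAR(Nil, c)
  case AALTS(_, as) => AALTS(Nil, as.map(strip))
  case ASEQ(_, a1, a2) => ASEQ(Nil, strip(a1), strip(a2))
  case ASTAR(_, a1) => ASTAR(Nil, strip(a1))
  case a => a
}

def bnullable(a: ARexp): Boolean = a match {
  case AONE(_) => true
  case ASTAR(_, _) => true
  case AALTS(_, as) => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case _ => false
}

def bmkeps(a: ARexp): Bits = a match {
  case AONE(bs) => bs
  case AALTS(bs, a1 :: as) =>
    if (bnullable(a1)) bs ++ bmkeps(a1) else bmkeps(AALTS(bs, as))
  case ASEQ(bs, a1, a2) => bs ++ bmkeps(a1) ++ bmkeps(a2)
  case ASTAR(bs, _) => bs ++ List(0)
  case _ => sys.error("not nullable")
}

def bder(c: Char, a: ARexp): ARexp = a match {
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1) => ASEQ(bs, fuse(List(1), bder(c, a1)), ASTAR(Nil, a1))
  case _ => AZERO
}

def flatten(as: List[ARexp]): List[ARexp] = as match {
  case Nil => Nil
  case AZERO :: rest => flatten(rest)
  case AALTS(bs, as1) :: rest => as1.map(fuse(bs, _)) ++ flatten(rest)
  case a :: rest => a :: flatten(rest)
}

def simp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
    case (AZERO, _) => AZERO
    case (_, AZERO) => AZERO
    case (AONE(bs1), a2s) => fuse(bs ++ bs1, a2s)
    case (a1s, a2s) => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) => flatten(as.map(simp)).distinctBy(strip) match {
    case Nil => AZERO
    case a1 :: Nil => fuse(bs, a1)
    case as1 => AALTS(bs, as1)
  }
  case a => a
}

// size counts nodes, ignoring the bitcodes
def asize(a: ARexp): Int = a match {
  case AALTS(_, as) => 1 + as.map(asize).sum
  case ASEQ(_, a1, a2) => 1 + asize(a1) + asize(a2)
  case ASTAR(_, a1) => 1 + asize(a1)
  case _ => 1
}

def bdersSimp(a: ARexp, s: String): ARexp =
  s.foldLeft(a)((a1, c) => simp(bder(c, a1)))

// the internalised form of (a + aa)*
val r = ASTAR(Nil, AALTS(Nil, List(ACHAR(List(0), 'a'),
          ASEQ(List(1), ACHAR(Nil, 'a'), ACHAR(Nil, 'a')))))
```

Running $\textit{bdersSimp}$ on ever longer strings of $a$s and printing the sizes shows the plateau directly.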
       
   892 
       
   893 
     6 
   894 
     7 
   895 
     8 
   896 
     9 
   897 
    10 
   898 
    15 To be completed.
   903 To be completed.
    16 
   904 
    17 
   905 
    18 
   906 
    19 
   907 
    20 %-----------------------------------
   908 %----------------------------------------------------------------------------------------
    21 %	SECTION ?
   909 %	SECTION  correctness proof
    22 %-----------------------------------
   910 %----------------------------------------------------------------------------------------
23 \section{Preparatory Properties for Getting a Finiteness Bound}
911 \section{Correctness of Bit-coded Algorithm (Without Simplification)}
    24 Before we get to the proof that says the intermediate result of our lexer will
912 We now give the proof of the correctness of the algorithm with bit-codes.
    25 remain finitely bounded, which is an important efficiency/liveness guarantee,
   913 
    26 we shall first develop a few preparatory properties and definitions to 
914 Ausaf and Urban cleverly introduced an auxiliary function called $\flex$,
27 make the process of proving it a breeze.
   915 defined as
    28 
   916 \[
    29 We define rewriting relations for $\rrexp$s, which allows us to do the 
   917 \flex \; r \; f \; [] \; v \; = \; f\; v
    30 same trick as we did for the correctness proof,
918 \flex \; r \; f \; (c :: s) \; v \; = \;  \flex \; (r\backslash c) \; (\lambda v. \, f \, (\inj \; r\; c\; v))\; s \; v
    31 but this time we will have stronger equalities established.
   919 \]
32 \subsection{``hrewrite'' relation}
920 which accumulates the characters that need to be injected back, 
    33 List of 1-step rewrite rules for regular expressions simplification without bitcodes:
   921 and does the injection in a stack-like manner (last taken derivative first injected).
    34 \begin{center}
   922 $\flex$ is connected to the $\lexer$:
    35 \begin{tabular}{lcl}
       
    36 $r_1 \cdot \ZERO$ & $\rightarrow$ & $\ZERO$
       
    37 \end{tabular}
       
    38 \end{center}
       
    39 
       
40 And we define a ``grewrite'' relation that works on lists:
       
    41 \begin{center}
       
    42 \begin{tabular}{lcl}
       
    43 $ \ZERO :: rs$ & $\rightarrow_g$ & $rs$
       
    44 \end{tabular}
       
    45 \end{center}
       
    46 
       
    47 
       
    48 
       
    49 With these relations we prove
       
    50 \begin{lemma}
   923 \begin{lemma}
    51 $rs \rightarrow rs'  \implies \RALTS{rs} \rightarrow \RALTS{rs'}$
   924 $\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$
    52 \end{lemma}
   925 \end{lemma}
    53 which enables us to prove "closed forms" equalities such as
926 $\flex$ provides us with a bridge between $\lexer$ and $\blexer$.
    54 \begin{lemma}
   927 What is even better about $\flex$ is that it allows us to 
    55 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
928 directly operate on the value $\mkeps \, (r\backslash s)$,
       
   929 which is pivotal in the definition of  $\lexer $ and $\blexer$, but not visible as an argument.
       
   930 When the value created by $\mkeps$ becomes available, one can 
       
   931 prove some stepwise properties of lexing nicely:
       
   932 \begin{lemma}\label{flexStepwise}
       
   933 $\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $
    56 \end{lemma}
   934 \end{lemma}
    57 
   935 
    58 But the most involved part of the above lemma is proving the following:
   936 And for $\blexer$ we have something with stepwise properties like $\flex$ as well,
    59 \begin{lemma}
   937 called $\retrieve$:
    60 $\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ 
   938 \[
       
939 \retrieve \; (_{bs}\mathbf{c}) \; (\Char \; c) \; = \; \textit{bs}
       
   940 \]
       
   941 $\retrieve$ takes bit-codes from annotated regular expressions
       
   942 guided by a value.
       
   943 $\retrieve$ is connected to the $\blexer$ in the following way:
       
   944 \begin{lemma}\label{blexer_retrieve}
       
   945 $\blexer \; r \; s = \decode  \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
    61 \end{lemma}
   946 \end{lemma}
    62 which is needed in proving things like 
947 If you take the derivative of an annotated regular expression, 
    63 \begin{lemma}
   948 you can $\retrieve$ the same bit-codes as before the derivative took place,
    64 $r \rightarrow_f r'  \implies \rsimp{r} \rightarrow \rsimp{r'}$
   949 provided that you use the corresponding value:
       
   950 
       
   951 \begin{lemma}\label{retrieveStepwise}
       
   952 $\retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
    65 \end{lemma}
   953 \end{lemma}
    66 
   954 The other good thing about $\retrieve$ is that it can be connected to $\flex$:
67 Fortunately this is provable by induction on the list $rs$.
   955 %centralLemma1
    68 
   956 \begin{lemma}\label{flex_retrieve}
    69 
   957 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
    70 
   958 \end{lemma}
    71 %-----------------------------------
   959 \begin{proof}
    72 %	SECTION 2
   960 By induction on $s$. The induction tactic is reverse induction on strings.
    73 %-----------------------------------
   961 $v$ is allowed to be arbitrary.
    74 
   962 The crucial point is to rewrite 
    75 \section{Finiteness Property}
   963 \[
    76 In this section let us describe our argument for why the size of the simplified
   964 \retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) 
    77 derivatives with the aggressive simplification function is finitely bounded.
       
    78  Suppose
       
    79 we have a size function for bitcoded regular expressions, written
       
    80 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
       
    81 
       
    82 \begin{center}
       
    83 \begin{tabular}{ccc}
       
    84 $\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\
       
    85 \end{tabular}
       
    86 \end{center}
       
    87 (TODO: COMPLETE this defn and for $rs$)
       
    88 
       
    89 The size is based on a recursive function on the structure of the regex,
       
    90 not the bitcodes.
       
    91 Therefore we may as well talk about size of an annotated regular expression 
       
    92 in their un-annotated form:
       
    93 \begin{center}
       
    94 $\asize(a) = \size(\erase(a))$. 
       
    95 \end{center}
       
    96 (TODO: turn equals to roughly equals)
       
    97 
       
    98 But there is a minor nuisance here:
       
    99 the erase function actually messes with the structure of the regular expression:
       
   100 \begin{center}
       
   101 \begin{tabular}{ccc}
       
   102 $\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\
       
   103 \end{tabular}
       
   104 \end{center}
       
   105 (TODO: complete this definition with singleton r in alts)
       
   106 
       
   107 An alternative regular expression with an empty list of children
       
   108  is turned into an $\ZERO$ during the
       
   109 $\erase$ function, thereby changing the size and structure of the regex.
       
   110 These will likely be fixable if we really want to use plain $\rexp$s for dealing
       
   111 with size, but we choose a more straightforward (or stupid) method by 
       
   112 defining a new datatype that is similar to plain $\rexp$s but can take
       
   113 non-binary arguments for its alternative constructor,
       
   114  which we call $\rrexp$ to denote
       
   115 the difference between it and plain regular expressions. 
       
   116 \[			\rrexp ::=   \RZERO \mid  \RONE
       
   117 			 \mid  \RCHAR{c}  
       
   118 			 \mid  \RSEQ{r_1}{r_2}
       
   119 			 \mid  \RALTS{rs}
       
   120 			 \mid \RSTAR{r}        
       
   121 \]
   965 \]
   122 For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, 
   966 as
   123 but keep everything else intact.
   967 \[
   124 It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
   968 \retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\;  \mkeps (r \backslash s@[c]))
   125 (the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
   969 \].
   126 $\ALTS$).
   970 This enables us to equate 
   127 We denote the operation of erasing the bits and turning an annotated regular expression 
   971 \[
   128 into an $\rrexp{}$ as $\rerase{}$.
   972 \retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) 
   129 \begin{center}
   973 \] 
   130 \begin{tabular}{lcl}
   974 with 
   131 $\rerase{\AZERO}$ & $=$ & $\RZERO$\\
   975 \[
   132 $\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
   976 \flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c])))
133 $\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{\map \; \rerase{} \; rs}$
   977 \],
   134 \end{tabular}
   978 which in turn can be rewritten as
   135 \end{center}
   979 \[
   136 %TODO: FINISH definition of rerase
   980 \flex \; r \; \textit{id} \; s@[c] \;  (\mkeps (r\backslash s@[c]))
   137 Similarly we could define the derivative  and simplification on 
   981 \].
   138 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, 
   982 \end{proof}
   139 except that now they can operate on alternatives taking multiple arguments.
   983 
   140 
   984 With the above lemma we can now link $\flex$ and $\blexer$.
   141 \begin{center}
   985 
   142 \begin{tabular}{lcr}
   986 \begin{lemma}\label{flex_blexer}
   143 $\RALTS{rs} \backslash c$ & $=$ &  $\RALTS{map\; (\_ \backslash c) \;rs}$\\
   987 $\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s)  = \blexer \; r \; s$
   144 (other clauses omitted)
       
   145 \end{tabular}
       
   146 \end{center}
       
   147 
       
   148 Now that $\rrexp$s do not have bitcodes on them, we can do the 
       
   149 duplicate removal without  an equivalence relation:
       
   150 \begin{center}
       
   151 \begin{tabular}{lcl}
       
   152 $\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
       
   153            			    &        & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
       
   154 \end{tabular}
       
   155 \end{center}
       
   156 %TODO: definition of rsimp (maybe only the alternative clause)
       
   157 
       
   158 
       
159 The reason why these definitions mirror precisely the corresponding operations

160 on annotated regular expressions is that we can calculate sizes more easily:
       
   161 
       
   162 \begin{lemma}
       
   163 $\rsize{\rerase a} = \asize a$
       
   164 \end{lemma}
   988 \end{lemma}
   165 This lemma says that the size of an r-erased regex is the same as the annotated regex.
       
166 This does not hold for plain $\rexp$s due to the way alternatives are represented.
       
   167 \begin{lemma}
       
   168 $\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
       
   169 \end{lemma}
       
   170 Putting these two together we get a property that allows us to estimate the 
       
   171 size of an annotated regular expression derivative using its un-annotated counterpart:
       
   172 \begin{lemma}
       
   173 $\asize{\bderssimp{r}{s}} =  \rsize{\rderssimp{\rerase{r}}{s}}$
       
   174 \end{lemma}
       
   175 Unless stated otherwise in this chapter all $\textit{rexp}$s without
       
   176  bitcodes are seen as $\rrexp$s.
       
177  We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchangeably
       
   178  as the former suits people's intuitive way of stating a binary alternative
       
   179  regular expression.
       


\begin{theorem}
For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$
\end{theorem}

\noindent
\begin{proof}
We prove this by induction on $r$. The base cases for $\AZERO$,
$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
the one for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r_1}{s} \rrbracket \leq N_1$ and
$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows:
%
\begin{center}
\begin{tabular}{lcll}
& & $ \llbracket   \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
    [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
& $\leq$ &
    $\llbracket\distinctWith\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}) ::
    [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
             \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
      \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
\end{tabular}
\end{center}


\noindent
where in (1) $\textit{Suffix}(r_1, s)$ is the list of all suffixes $s'$ of $s$ such that, for the prefix $s''$ with $s'' @ s' = s$, the derivative $\bderssimp{r_1}{s''}$ is nullable.
The reason why we can write the derivative of a sequence in this way is described in Section 2.
The term (2) is used to bound the term (1). 
That is because one can obtain an overall
smaller regex list
by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
Section 3 is dedicated to its proof.
In (3) we know that  $\llbracket \ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is 
bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
than $N_2$. The list length after  $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
We reason similarly for  $\STAR$.\medskip
\end{proof}

What guarantee does this bound give us?

It ensures that whatever the regex is, its derivatives will not grow indefinitely.
Take our previous example $(a + aa)^*$:
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={number of $a$'s},
    x label style={at={(1.05,-0.05)}},
    ylabel={regex size},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax= 40,
    ytick={0,10,...,40},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={$(a + aa)^*$},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}
With our simplification rules we are able to limit the size of the
derivatives of the regex $(a + aa)^*$
very effectively.
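This levelling-off behaviour can be reproduced in a small Python model of derivatives with an $\rsimp$-style simplification (flattening alternatives, removing $\ZERO$s and duplicates). The tuple encoding and all names below are my own illustrative sketch, not the formal Isabelle development:

```python
# Regexes as tuples: ("0",) for ZERO, ("1",) for ONE, ("c", ch),
# ("+", r1, r2), (".", r1, r2), ("*", r).

def nullable(r):
    t = r[0]
    if t == "+": return nullable(r[1]) or nullable(r[2])
    if t == ".": return nullable(r[1]) and nullable(r[2])
    return t in ("1", "*")

def der(c, r):
    # plain Brzozowski derivative, no simplification
    t = r[0]
    if t in ("0", "1"): return ("0",)
    if t == "c": return ("1",) if r[1] == c else ("0",)
    if t == "+": return ("+", der(c, r[1]), der(c, r[2]))
    if t == ".":
        left = (".", der(c, r[1]), r[2])
        return ("+", left, der(c, r[2])) if nullable(r[1]) else left
    return (".", der(c, r[1]), r)  # star

def flts(r):
    # flatten nested alternatives into a list, dropping ZEROs
    if r[0] == "+": return flts(r[1]) + flts(r[2])
    return [] if r[0] == "0" else [r]

def simp(r):
    # rsimp-like: simplify children, flatten + deduplicate alternatives
    if r[0] == ".":
        r1, r2 = simp(r[1]), simp(r[2])
        if ("0",) in (r1, r2): return ("0",)
        if r1 == ("1",): return r2
        if r2 == ("1",): return r1
        return (".", r1, r2)
    if r[0] == "+":
        rs, seen = [], set()
        for x in flts(("+", simp(r[1]), simp(r[2]))):
            if x not in seen:
                seen.add(x); rs.append(x)
        if not rs: return ("0",)
        acc = rs[0]
        for x in rs[1:]: acc = ("+", acc, x)
        return acc
    return r

def size(r):
    return 1 + sum(size(x) for x in r[1:] if isinstance(x, tuple))

a = ("c", "a")
r = ("*", ("+", a, (".", a, a)))   # (a + aa)*
sizes, cur = [], r
for _ in range(30):
    cur = simp(der("a", cur))
    sizes.append(size(cur))
# the sizes settle into a small constant after the first step
```

Running this, the sizes of the simplified derivatives of $(a+aa)^*$ stabilise after the first derivative step instead of growing with the input length.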
       


As the graphs of some more randomly generated regexes show, the size of 
the derivative might grow quickly at the start of the input,
but after a sufficiently long input string, the growth levels off.


%a few sample regular expressions' derivatives
%size change
%TODO: giving graphs showing a few regexes' size changes 
%w.r.t the input characters number
%a*, aa*, aaa*, .....
%randomly generated regexes
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={number of $a$'s},
    x label style={at={(1.05,-0.05)}},
    ylabel={regex size},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=1000,
    ytick={0,100,...,1000},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={regex1},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=1000,
    ytick={0,100,...,1000},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={regex2},  
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=1000,
    ytick={0,100,...,1000},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={regex3},  
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes w.r.t.\ input string length.}
\end{tabular}    
\end{center}  
       




\noindent
Clearly the bound we gave in this finiteness argument (Step (5)) is very loose and
far from the actual bound we can expect. 
In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound
is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.
Given that $l_{N_2}$ is roughly of size $4^{N_2}$, the size bound for $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ by the function
$f(x) = x * 4^x$.
This means the bound we have will surge up at least
tower-exponentially with a linear increase of the depth:
for a regex of depth $n$, the bound
is approximately a tower of exponentials of height $n$.

We can do better than this, but a tighter bound does not improve
the finiteness property we are proving. If we are interested in a polynomial bound,
one would hope to obtain a similarly tight bound as for partial
derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
 $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in
partial derivatives). 
To obtain the exact same bound we would need
to introduce simplifications, such as
 $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
which exist for partial derivatives. 

However, if we introduce them in our
setting we would lose the POSIX property of our calculated values. 
A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$.
If we split this regex up into $a\cdot(b\cdot c + c) + (a\cdot b) \cdot (b\cdot c + c)$, the lexer 
would give back $\Left(\Seq(\Char(a), \Left(\Seq(\Char(b), \Char(c)))))$ instead of
what we want: $\Seq(\Right(\Seq(\Char(a), \Char(b))), \Right(\Char(c)))$. Unless we can store the structural information
in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$
occurs, and apply the stored transformations in the right order once we get 
a result of the ``aggressively simplified'' regex, it would be impossible to still get a $\POSIX$ value.
This is unlike the simplification we had before, where rewriting rules 
such as $\ONE \cdot r \rightsquigarrow r$ preserve the value computed by our lexer.
We will discuss better
bounds in the last section of this chapter.\\[-6.5mm]
       




%----------------------------------------------------------------------------------------
%	SECTION ??
%----------------------------------------------------------------------------------------

\section{``Closed Forms'' of Regular Expressions' Derivatives w.r.t.\ Strings}
To embark on getting the ``closed forms'' of regexes, we first
give a few auxiliary definitions that allow us to express them smoothly.

 \begin{center}  
 \begin{tabular}{ccc}
 $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
$\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
$\sflataux r$ & $=$ & $ [r]$
\end{tabular}
\end{center}
The intuition behind $\sflataux{\_}$ is to break up nested regexes 
of the left-associated shape $(\ldots((r_1 + r_2) + r_3) + \ldots )$
into a more ``balanced'' list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
It returns the singleton list $[r]$ otherwise.

$\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps
the output type a regular expression, not a list:
 \begin{center} 
 \begin{tabular}{ccc}
 $\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\AALTS{ }{\sflataux{r} @ rs}$\\
$\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\
$\sflat r$ & $=$ & $ r$
\end{tabular}
\end{center}
$\sflataux{\_}$ and $\sflat{\_}$ are only recursive in the
first element of the list of children of
an alternative regex ($\AALTS{ }{rs}$), and are purposefully built for dealing with the regular expression
$r_1 \cdot r_2 \backslash s$.
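The two functions can be sketched in a few lines of Python, using my own illustrative encoding where an n-ary alternative is the tuple \texttt{("alts", [...])} and a character regex is \texttt{("c", ch)} (this is a sketch, not the formal definition):

```python
# Illustrative sketch of sflat_aux / sflat: open up a left-nested
# alternative by recursing only into the FIRST child of an "alts" node.

def sflat_aux(r):
    """Break a left-nested alternative into a flat list of children."""
    if r[0] == "alts":
        rs = r[1]
        if not rs:
            return []
        return sflat_aux(rs[0]) + rs[1:]
    return [r]  # any non-alternative stays a singleton

def sflat(r):
    """Same flattening, but the result stays a regular expression."""
    if r[0] == "alts" and r[1]:
        return ("alts", sflat_aux(r[1][0]) + r[1][1:])
    return r

r1, r2, r3, r4 = (("c", ch) for ch in "abcd")
# the shape (((r1 + r2) + r3) + r4) produced by sequence derivatives:
nested = ("alts", [("alts", [("alts", [r1, r2]), r3]), r4])
assert sflat_aux(nested) == [r1, r2, r3, r4]
assert sflat(nested) == ("alts", [r1, r2, r3, r4])
```

Note that only the first child is recursed into: a nested alternative in any other position is left untouched, mirroring the shape that $r_1 \cdot r_2 \backslash s$ actually produces.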
       

With $\sflat{\_}$ and $\sflataux{\_}$ we make 
precise what ``closed forms'' we have for the sequence derivatives and their simplifications,
in other words, how we can construct $(r_1 \cdot r_2) \backslash s$
and $\bderssimp{(r_1\cdot r_2)}{s}$,
if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)
and  $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$  ranges over 
the substrings of $s$.
First let's look at a series of derivative steps on a sequence 
regular expression, assuming that each time the first
component of the sequence is nullable:
\begin{center}

$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c}  \quad   r_1  \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c''   \longrightarrow_{\backslash c'''} \quad
 \ldots$

\end{center}
%TODO: cite indian paper
The ``Indian paper'' puts the above process slightly more formally:
\begin{center}
$(r_1 \cdot r_2) \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv (r_1 \backslash (c_1 :: c_2:: \ldots c_n)) \cdot r_2 +
\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots
+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$
\end{center}
where  $\delta(b, r)$ produces $r$
if $b$ evaluates to true, 
and $\ZERO$ otherwise.

 But the $\myequiv$ sign in the above equation means language equivalence instead of syntactic
 equivalence. To make this intuition useful 
 for a formal proof, we need something
stronger than language equivalence.
With the help of $\sflat{\_}$ we can state the above equation
more rigorously:
       
\begin{lemma}
$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
\end{lemma}

The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:

\begin{center}
\begin{tabular}{lcl}
$\vsuf{[]}{\_} $ & $=$ &  $[]$\\
$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} \; (\rnullable{r_1}) \; \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
                                     && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) })  $
\end{tabular}
\end{center}                   
It takes into account which prefixes $s'$ of $s$ make $r_1 \backslash s'$ nullable,
and keeps a list of the corresponding suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in
the order in which the $r_2\backslash s''$ appear in $(r_1\cdot r_2)\backslash s$.
In essence, $\vsuf{\_}{\_}$ performs a ``virtual derivative'' of $r_1 \cdot r_2$: instead of producing 
the entire result of  $(r_1 \cdot r_2) \backslash s$, 
it only stores the $s''$ such that $r_2 \backslash s''$ occur as terms in $(r_1\cdot r_2)\backslash s$.
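The recursion can be sketched directly in the Python model used in this chapter (my own illustrative tuple encoding, not the formal definition):

```python
# Sketch of vsuf: collect the suffixes s'' of s whose corresponding
# prefix s' makes r1\s' nullable, so that r2\s'' appears as a term
# in (r1 . r2)\s.  Regexes as tuples: ("0",), ("1",), ("c", ch),
# ("+", r1, r2), (".", r1, r2), ("*", r).

def nullable(r):
    t = r[0]
    if t == "+": return nullable(r[1]) or nullable(r[2])
    if t == ".": return nullable(r[1]) and nullable(r[2])
    return t in ("1", "*")

def der(c, r):
    t = r[0]
    if t in ("0", "1"): return ("0",)
    if t == "c": return ("1",) if r[1] == c else ("0",)
    if t == "+": return ("+", der(c, r[1]), der(c, r[2]))
    if t == ".":
        left = (".", der(c, r[1]), r[2])
        return ("+", left, der(c, r[2])) if nullable(r[1]) else left
    return (".", der(c, r[1]), r)  # star

def vsuf(s, r1):
    if not s:
        return []
    rest = vsuf(s[1:], der(s[0], r1))
    # keep the whole remaining string as a suffix iff r1 is
    # nullable BEFORE consuming the next character
    return rest + [s] if nullable(r1) else rest

a_star = ("*", ("c", "a"))
# a* is nullable before each character of "aa", so both suffixes survive:
assert vsuf("aa", a_star) == ["a", "aa"]
```

The order of the returned list matches the order in which the terms $r_2 \backslash s''$ appear in the flattened derivative.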
       

With this we can also add simplifications to both sides and get
\begin{lemma}
$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
\end{lemma}
Together with the idempotency property of $\rsimp$,
\begin{lemma}
$\rsimp{r} = \rsimp{\rsimp{r}}$
\end{lemma}
it is possible to convert the above lemma into a ``closed form''
for derivatives nested with simplification:
\begin{lemma}
$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
\end{lemma}
And now the reason we have (1) in Section 1 is clear:
$\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$
is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r_1, s))))}}}$,
    as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}(r_1, s)$.
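The closed form can be checked empirically in the Python model (an illustrative sketch in my own tuple encoding, with a flatten-and-deduplicate \texttt{simp} standing in for $\rsimp$; this is evidence, not the formal proof):

```python
# Check that simp((r1 . r2)\s) agrees syntactically with the simplified
# closed form  ALTs((r1\s) . r2 :: map (r2\_) (vsuf s r1)).

def nullable(r):
    t = r[0]
    if t == "+": return nullable(r[1]) or nullable(r[2])
    if t == ".": return nullable(r[1]) and nullable(r[2])
    return t in ("1", "*")

def der(c, r):
    t = r[0]
    if t in ("0", "1"): return ("0",)
    if t == "c": return ("1",) if r[1] == c else ("0",)
    if t == "+": return ("+", der(c, r[1]), der(c, r[2]))
    if t == ".":
        left = (".", der(c, r[1]), r[2])
        return ("+", left, der(c, r[2])) if nullable(r[1]) else left
    return (".", der(c, r[1]), r)  # star

def ders(s, r):
    for c in s:
        r = der(c, r)
    return r

def flts(r):
    if r[0] == "+": return flts(r[1]) + flts(r[2])
    return [] if r[0] == "0" else [r]

def simp(r):
    # rsimp-like: simplify children, flatten alternatives, dedup, drop ZEROs
    if r[0] == ".":
        r1, r2 = simp(r[1]), simp(r[2])
        if ("0",) in (r1, r2): return ("0",)
        if r1 == ("1",): return r2
        if r2 == ("1",): return r1
        return (".", r1, r2)
    if r[0] == "+":
        rs, seen = [], set()
        for x in flts(("+", simp(r[1]), simp(r[2]))):
            if x not in seen:
                seen.add(x); rs.append(x)
        if not rs: return ("0",)
        acc = rs[0]
        for x in rs[1:]: acc = ("+", acc, x)
        return acc
    return r

def vsuf(s, r1):
    if not s: return []
    rest = vsuf(s[1:], der(s[0], r1))
    return rest + [s] if nullable(r1) else rest

def alts(rs):
    acc = rs[0]
    for x in rs[1:]: acc = ("+", acc, x)
    return acc

a, b = ("c", "a"), ("c", "b")
r1, r2 = ("*", a), b              # r1 = a*, r2 = b
seq = (".", r1, r2)
for s in ["a", "aa", "aaa", "ab", "aab"]:
    closed = alts([(".", ders(s, r1), r2)] +
                  [ders(s2, r2) for s2 in vsuf(s, r1)])
    assert simp(ders(s, seq)) == simp(closed)
```

Every tested string gives syntactically equal results after simplification, which is exactly what the closed-form lemma promises.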
       

%----------------------------------------------------------------------------------------
%	SECTION 3
%----------------------------------------------------------------------------------------

\section{Interaction Between $\distinctWith$ and $\flts$}
Note that it is not immediately obvious that 
\begin{center}
$\llbracket \distinctWith\,(\flts \; \textit{rs})\,\approx\,\phi \rrbracket   \leq \llbracket \distinctWith\,\textit{rs}\,\approx\,\phi \rrbracket  $.
\end{center}
The intuition is that, because $\flts$ opens up the alternatives in $\textit{rs}$ on the $\textit{LHS}$,
$\distinctWith$ can remove at least as many duplicates there as it removes from the unflattened list on the $\textit{RHS}$.
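A concrete instance, in the Python model from before (my own sketch; plain equality stands in for $\approx$ and an empty accumulator for $\phi$):

```python
# Flattening before deduplication exposes duplicates that deduplication
# alone would miss; sizes are then compared as in the inequality above.

def flts(r):
    # open alternatives and drop ZEROs
    if r[0] == "+": return flts(r[1]) + flts(r[2])
    return [] if r[0] == "0" else [r]

def distinct(rs):
    # distinctWith with plain equality and an empty starting set
    seen, out = set(), []
    for x in rs:
        if x not in seen:
            seen.add(x); out.append(x)
    return out

def size(r):
    return 1 + sum(size(x) for x in r[1:] if isinstance(x, tuple))

def list_size(rs):
    return 1 + sum(size(r) for r in rs)  # one ALTs node plus children

a = ("c", "a")
rs = [("0",), ("+", a, a), a]
flattened = [x for r in rs for x in flts(r)]     # [a, a, a]
assert distinct(flattened) == [a]
assert list_size(distinct(flattened)) <= list_size(distinct(rs))
```

Without flattening, the three elements of \texttt{rs} are pairwise distinct and nothing is removed; after flattening, all three collapse to a single element.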
       


%----------------------------------------------------------------------------------------
%	SECTION ALTS CLOSED FORM
%----------------------------------------------------------------------------------------
\section{A Closed Form for \textit{ALTS}}
Now we prove that  $rsimp \; (rders\_simp \; (RALTS \; rs) \; s) = rsimp \; (RALTS \; (map \; (\lambda r.\; rders\_simp \; r \; s) \; rs))$.


There are a few key steps, one of these steps being
$rsimp \; (rsimp\_ALTs \; (map \; (rder \; x) \; (rdistinct \; (rflts \; (map \; (rsimp \circ (\lambda r.\; rders\_simp \; r \; xs)) \; rs)) \; \phi)))
= rsimp \; (rsimp\_ALTs \; (rdistinct \; (map \; (rder \; x) \; (rflts \; (map \; (rsimp \circ (\lambda r.\; rders\_simp \; r \; xs)) \; rs))) \; \phi))$

One might want to prove this by a simple statement like: 
$map \; (rder \; x) \; (rdistinct \; rs \; rset) = rdistinct \; (map \; (rder \; x) \; rs) \; ((rder \; x) \; ` \; rset)$.

For this to hold we want the $\textit{distinct}$ function to pick up
the elements before and after derivatives correctly:
$r \in rset \equiv (rder \; x \; r) \in ((rder \; x) \; ` \; rset)$,
which essentially requires that the function $\backslash$ is an injective mapping.

Unfortunately the function $\backslash c$ is not an injective mapping.

\subsection{Function $\backslash c$ is not injective (1-to-1)}
\begin{center}
The derivative w.r.t.\ character $c$ is not one-to-one.
Formally,
	$\exists r_1 \;r_2.\; r_1 \neq r_2 \;\textit{and}\; r_1 \backslash c = r_2 \backslash c$
\end{center}
This property is trivially true for the
character regex example:
\begin{center}
	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
\end{center}
But apart from the cases where the derivative
output is $\ZERO$, are there non-trivial pairs of
distinct regular expressions with the same derivative?
The answer is yes.
For example,
\begin{center}
	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$,\\
	where $a$ is not nullable.\\
	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot b + b \backslash c$\\
	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot b + b \backslash c$
\end{center}
We start with two syntactically different regexes
and end up with the same derivative result.
This is not surprising, as we have the following
equality in the style of Arden's lemma:\\
\begin{center}
	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
\end{center}
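The example above can be checked concretely in the Python model (my own illustrative tuple encoding): $a^*b$ and $(a\cdot a^*)\cdot b + b$ are syntactically different but have syntactically identical derivatives w.r.t.\ $a$.

```python
# Two distinct regexes with syntactically identical derivatives,
# demonstrating that \c is not injective.

def nullable(r):
    t = r[0]
    if t == "+": return nullable(r[1]) or nullable(r[2])
    if t == ".": return nullable(r[1]) and nullable(r[2])
    return t in ("1", "*")

def der(c, r):
    # plain derivative, no simplification
    t = r[0]
    if t in ("0", "1"): return ("0",)
    if t == "c": return ("1",) if r[1] == c else ("0",)
    if t == "+": return ("+", der(c, r[1]), der(c, r[2]))
    if t == ".":
        left = (".", der(c, r[1]), r[2])
        return ("+", left, der(c, r[2])) if nullable(r[1]) else left
    return (".", der(c, r[1]), r)  # star

a, b = ("c", "a"), ("c", "b")
r1 = (".", ("*", a), b)                        # a* . b
r2 = ("+", (".", (".", a, ("*", a)), b), b)    # (a . a*) . b  +  b

assert r1 != r2
assert der("a", r1) == der("a", r2)  # derivatives coincide syntactically
```

Both derivatives come out as $((\ONE \cdot a^*)\cdot b) + \ZERO$ in this encoding, so no simplification is even needed to make them collide.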
       

%----------------------------------------------------------------------------------------
%	SECTION 4
%----------------------------------------------------------------------------------------
\section{A Bound for the Star Regular Expression}
We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
the ``closed form'' of $(r_1 \cdot r_2) \backslash s$ and then 
the property of the $\distinct$ function.
Now we try to get a bound on $r^* \backslash s$ as well.
Again, we first look at how a star's derivatives evolve, if they grow maximally: 
\begin{center}

$r^* \quad \longrightarrow_{\backslash c}  \quad   (r\backslash c)  \cdot  r^* \quad \longrightarrow_{\backslash c'}  \quad
r \backslash cc'  \cdot r^* + r \backslash c' \cdot r^*  \quad \longrightarrow_{\backslash c''} \quad 
(r \backslash cc'c'' \cdot r^* + r \backslash c'' \cdot r^*) + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)   \quad \longrightarrow_{\backslash c'''}
\quad \ldots$

\end{center}
When we have a string $s = c :: c' :: c'' \ldots$  such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, 
$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ and so on are all nullable,
the number of terms in $r^* \backslash s$ grows exponentially, causing the size
of the derivative $r^* \backslash s$ to grow exponentially, even if we do not 
count the possible size explosions of $r \backslash c$ themselves.

Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like
$(r \backslash cc'c'' \cdot r^* + r \backslash c'' \cdot r^*) + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
into $\RALTS{[r \backslash cc'c'' \cdot r^*, \; r \backslash c'' \cdot r^*, \; r \backslash c'c'' \cdot r^*, \; r \backslash c'' \cdot r^*]}$
and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$:
 \begin{center}  
 \begin{tabular}{ccc}
 $\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
$\hflataux r$ & $=$ & $ [r]$
\end{tabular}
\end{center}

 \begin{center}  
 \begin{tabular}{ccc}
 $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\
$\hflat r$ & $=$ & $ r$
\end{tabular}
\end{center}
Again these definitions are tailor-made for dealing with alternatives that have
originated from a star's derivatives, so we do not attempt to open up all possible 
regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
elements.
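Unlike $\sflataux{\_}$, which only recurses into the first child, $\hflataux{\_}$ opens up binary-nested alternatives completely, which is exactly the shape a star's derivatives produce. A quick sketch in the tuple encoding (my own illustration, not the formal definition):

```python
# hflat_aux flattens BOTH children of a binary alternative,
# matching the ((x + y) + (z + w)) shape created by star derivatives.

def hflat_aux(r):
    if r[0] == "+":
        return hflat_aux(r[1]) + hflat_aux(r[2])
    return [r]

r1, r2, r3, r4 = (("c", ch) for ch in "abcd")
nested = ("+", ("+", r1, r2), ("+", r3, r4))
assert hflat_aux(nested) == [r1, r2, r3, r4]
```

$\hflat{\_}$ would then simply wrap this list back into an n-ary alternative.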
       
We give a predicate for such ``star-created'' regular expressions:
\begin{center}
\begin{tabular}{lcr}
         &    &       $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
 $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$
 \end{tabular}
 \end{center}
 
 These definitions allow us the flexibility to talk about 
 regular expressions in their most convenient format,
 for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $
 instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.
 These definitions help express that certain classes of syntactically 
 distinct regular expressions are actually the same under simplification.
 This is not entirely true for annotated regular expressions: 
 %TODO: bsimp bders \neq bderssimp
 \begin{center}
 $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
 \end{center}
 For bit-codes, the order in which simplification is applied
 might cause a difference in the location they are placed.
 If we want something like
 \begin{center}
 $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
 \end{center}
 some ``canonicalization'' procedure is required,
 which either pushes all the common bitcodes to nodes
  as high up in the tree as possible:
  \begin{center}
  $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
  \end{center}
 or does the reverse. However, bitcodes are not of interest if we are talking about
 the size $\llbracket r \rrbracket$ of a regex.
 Therefore, for the ease and simplicity of producing a
 proof for a size bound, we are happy to restrict ourselves to 
 unannotated regular expressions, and obtain equalities such as
 \begin{lemma}
 $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
 \end{lemma}
 
 \begin{proof}
 By using the rewriting relation $\rightsquigarrow$.
 \end{proof}
 %TODO: rsimp sflat
From this we obtain that a star's derivative behaves the same under $\rsimp$
as if all the nested alternatives created during derivation had been flattened out:
 \begin{lemma}
 $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
 \end{lemma}
 \begin{proof}
 By structural induction on $r$, where the induction rules are those of $\createdByStar{\_}$.
 \end{proof}
% The simplification of a flattened out regular expression, provided it comes
%from the derivative of a star, is the same as the one nested.
       




One might wonder what the actual bound is, rather than the loose bound we gave
for the convenience of an easier proof.
How much can the regex $r^* \backslash s$ grow? 
As earlier graphs have shown,
%TODO: reference that graph where size grows quickly
 the derivatives can grow at a speed that is at most
  exponential w.r.t.\ the number of input characters, 
but will eventually level off when the string $s$ is long enough.
If they grow to a size exponential w.r.t.\ the original regex, our algorithm
would still be slow.
And unfortunately, we have concrete examples
where such regexes grow exponentially large before levelling off:
$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
(\underbrace{a \ldots a}_{\text{n a's}})^*)^*$ will already have a maximum
 size that is exponential in the number $n$ 
under our current simplification rules:
%TODO: graph of a regex whose size increases exponentially.
\begin{center}
\begin{tikzpicture}
    \begin{axis}[
        height=0.5\textwidth,
        width=\textwidth,
        xlabel=number of a's,
        xtick={0,...,9},
        ylabel=maximum size,
        ymode=log,
       log basis y={2}
]
        \addplot[mark=*,blue] table {re-chengsong.data};
    \end{axis}
\end{tikzpicture}
\end{center}
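The rapid growth with $n$ can also be observed in the Python model used earlier in this chapter (an illustrative sketch with flatten-and-deduplicate simplification; the encoding and all names are my own, and the exact sizes are artefacts of this model):

```python
# Maximum simplified-derivative size of (a* + (aa)* + ... + (a^n)*)*
# over inputs a^k grows rapidly with n, even with deduplication.

def nullable(r):
    t = r[0]
    if t == "+": return nullable(r[1]) or nullable(r[2])
    if t == ".": return nullable(r[1]) and nullable(r[2])
    return t in ("1", "*")

def der(c, r):
    t = r[0]
    if t in ("0", "1"): return ("0",)
    if t == "c": return ("1",) if r[1] == c else ("0",)
    if t == "+": return ("+", der(c, r[1]), der(c, r[2]))
    if t == ".":
        left = (".", der(c, r[1]), r[2])
        return ("+", left, der(c, r[2])) if nullable(r[1]) else left
    return (".", der(c, r[1]), r)  # star

def flts(r):
    if r[0] == "+": return flts(r[1]) + flts(r[2])
    return [] if r[0] == "0" else [r]

def simp(r):
    if r[0] == ".":
        r1, r2 = simp(r[1]), simp(r[2])
        if ("0",) in (r1, r2): return ("0",)
        if r1 == ("1",): return r2
        if r2 == ("1",): return r1
        return (".", r1, r2)
    if r[0] == "+":
        rs, seen = [], set()
        for x in flts(("+", simp(r[1]), simp(r[2]))):
            if x not in seen:
                seen.add(x); rs.append(x)
        if not rs: return ("0",)
        acc = rs[0]
        for x in rs[1:]: acc = ("+", acc, x)
        return acc
    return r

def size(r):
    return 1 + sum(size(x) for x in r[1:] if isinstance(x, tuple))

def chars(i):                      # the sequence of i a's
    acc = ("c", "a")
    for _ in range(i - 1):
        acc = (".", acc, ("c", "a"))
    return acc

def sum_of_stars(n):               # (a* + (aa)* + ... + (a^n)*)*
    acc = ("*", chars(1))
    for i in range(2, n + 1):
        acc = ("+", acc, ("*", chars(i)))
    return ("*", acc)

def max_size(r, steps=30):
    m = 0
    for _ in range(steps):
        r = simp(der("a", r))
        m = max(m, size(r))
    return m

m2, m3, m4 = (max_size(sum_of_stars(n)) for n in (2, 3, 4))
```

In this model the maximum sizes increase quickly with $n$, mirroring the measured data plotted above.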
       
   697 
       
   698 For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
       
   699 to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
       
   700 (\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.
       
   701 The exponential size is triggered by that the regex
       
   702 $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
       
   703 inside the $(\ldots) ^*$ having exponentially many
       
   704 different derivatives, despite those difference being minor.
       
   705 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
       
   706 will therefore contain the following terms (after flattening out all nested 
       
   707 alternatives):
       
   708 \begin{center}
       
   709 $(\oplus_{i = 1]{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
       
   710 $(1 \leq m' \leq m )$
       
   711 \end{center}
       
   712 These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix).
       
   713  With each new input character taking the derivative against the intermediate result, more and more such distinct
       
   714  terms will accumulate, 
       
   715 until the length reaches $L.C.M.(1, \ldots, n)$.
       
   716 $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms 
       
   717 $(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
       
   718 
       
   719 $(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
       
   720  where $m' \neq m''$ \\
       
   721  as they are slightly different.
       
   722  This means that with our current simplification methods,
       
   723  we will not be able to control the derivative so that
       
   724  $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$
       
   725  as there are already exponentially many terms.
       
These terms are similar in the sense that the heads of these terms
all consist of sub-terms of the form:
$(\underbrace{a \ldots a}_{\text{j a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* $.
For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
$n * (n + 1) / 2$ such terms.
For example, the derivatives of $(a^* + (aa)^* + (aaa)^*)^*$
can be described by 6 such terms:
$a^*$, $a\cdot (aa)^*$, $(aa)^*$,
$aa \cdot (aaa)^*$, $a \cdot (aaa)^*$ and $(aaa)^*$.
The total number of different ``head terms'', $n * (n + 1) / 2$,
is proportional to the number of characters in the regular expression
$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
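A quick sketch confirming the $n * (n + 1) / 2$ count (the function name is ours, for illustration only): each component $(\underbrace{a \ldots a}_{\text{i a's}})^*$ contributes the heads $a^j \cdot (a^i)^*$ for $0 \leq j < i$.

```python
def head_terms(n):
    """All head terms a^j . (a^i)* with 1 <= i <= n and 0 <= j < i,
    represented as pairs (j, i)."""
    return [(j, i) for i in range(1, n + 1) for j in range(i)]

# n = 3 gives the 6 terms listed above: a^*, a.(aa)^*, (aa)^*, ...
assert len(head_terms(3)) == 6
assert all(len(head_terms(n)) == n * (n + 1) // 2 for n in range(1, 20))
```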
       
This suggests a slightly different notion of size, which we call the
alphabetic width: the alphabetic width $\textit{awidth}(r)$ of a regular
expression $r$ counts the number of occurrences of characters in $r$:
\begin{center}
\begin{tabular}{lcl}
$\textit{awidth} \; (\ZERO)$ & $\dn$ & $0$\\
$\textit{awidth} \; (\ONE)$ & $\dn$ & $0$\\
$\textit{awidth} \; (c)$ & $\dn$ & $1$\\
$\textit{awidth} \; (r_1 + r_2)$ & $\dn$ & $\textit{awidth} \; (r_1) + \textit{awidth} \; (r_2)$\\
$\textit{awidth} \; (r_1 \cdot r_2)$ & $\dn$ & $\textit{awidth} \; (r_1) + \textit{awidth} \; (r_2)$\\
$\textit{awidth} \; (r^*)$ & $\dn$ & $\textit{awidth} \; (r)$\\
\end{tabular}
\end{center}
       
Antimirov \parencite{Antimirov95} has proven that
$|\textit{PDER}_{UNIV}(r)| \leq \textit{awidth}(r)$,
where $\textit{PDER}_{UNIV}(r)$ is the set of all possible subterms
created by taking partial derivatives of $r$ against all possible strings.
If we can make sure that at any moment in our lexing algorithm the
intermediate result holds at most one copy of each of these
subterms, then we can obtain the same bound as Antimirov's.
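The bound can be checked on the running example with a short Python sketch (a sanity check only: the tuple encoding of regular expressions and all function names are our own, and $\ONE \cdot r$ is identified with $r$, following Antimirov's convention):

```python
# Regular expressions as nested tuples:
# ('0',) ZERO, ('1',) ONE, ('c', x) character,
# ('+', r1, r2) alternative, ('.', r1, r2) sequence, ('*', r) star.

def awidth(r):
    """Alphabetic width: the number of character occurrences in r."""
    tag = r[0]
    if tag in ('0', '1'): return 0
    if tag == 'c':        return 1
    if tag == '*':        return awidth(r[1])
    return awidth(r[1]) + awidth(r[2])            # '+' and '.'

def nullable(r):
    tag = r[0]
    if tag in ('0', 'c'): return False
    if tag in ('1', '*'): return True
    if tag == '+': return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])      # '.'

def seq(r1, r2):
    """Sequence constructor identifying ONE . r with r (Antimirov's convention)."""
    return r2 if r1 == ('1',) else ('.', r1, r2)

def pder(c, r):
    """Partial derivative of r w.r.t. the character c: a set of regexes."""
    tag = r[0]
    if tag in ('0', '1'):
        return set()
    if tag == 'c':
        return {('1',)} if r[1] == c else set()
    if tag == '+':
        return pder(c, r[1]) | pder(c, r[2])
    if tag == '.':
        left = {seq(p, r[2]) for p in pder(c, r[1])}
        return left | pder(c, r[2]) if nullable(r[1]) else left
    return {seq(p, r) for p in pder(c, r[1])}     # '*'

def pder_univ(r, alphabet):
    """Partial derivatives of r w.r.t. all nonempty strings (a finite set)."""
    result, todo = set(), {r}
    while todo:
        q = todo.pop()
        for c in alphabet:
            for p in pder(c, q):
                if p not in result:
                    result.add(p)
                    todo.add(p)
    return result

# The example from above: (a* + (aa)* + (aaa)*)*
a = ('c', 'a')
r = ('*', ('+', ('*', a), ('+', ('*', ('.', a, a)), ('*', ('.', a, ('.', a, a))))))
assert awidth(r) == 6
assert len(pder_univ(r, 'a')) <= awidth(r)
```

For this $r$ the closure in fact contains exactly the 6 head terms from above, each concatenated with $r$ itself, so Antimirov's bound is tight here.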
       
This leads to the algorithm in the next section.
       
   752 
       
%----------------------------------------------------------------------------------------
%	SECTION 5
%----------------------------------------------------------------------------------------
\section{A Stronger Version of Simplification Inspired by Antimirov}
%TODO: search for isabelle proofs of algorithms that check equivalence
       