% ChengsongTanPhdThesis/Chapters/Finite.tex
       
     1 % Chapter Template
       
     2 
       
     3 \chapter{Finiteness Bound} % Main chapter title
       
     4 
       
     5 \label{Finite} 
       
     6 %  In Chapter 4 \ref{Chapter4} we give the second guarantee
       
     7 %of our bitcoded algorithm, that is a finite bound on the size of any 
       
     8 %regex's derivatives. 
       
     9 
       
In this chapter we give a guarantee about the complexity of our algorithm:
for any regular expression $r$ and any string $s$,
the internal data structures of our algorithm remain finitely bounded in size.
       
    13 To obtain such a proof, we need to 
       
\begin{itemize}
\item
Define a new datatype for regular expressions that makes it easy
to reason about the size of an annotated regular expression.
\item
Define a set of equalities for this new datatype that allows one to
rewrite $\bderssimp{r_1 \cdot r_2}{s}$, $\bderssimp{r^*}{s}$, etc.\
in terms of their sub-expressions $r_1$, $r_2$ and $r$.
\item
Use those equalities to obtain the rewriting equations, which we call
``closed forms''.
\item
Bound the closed forms, thereby bounding the size of $\blexersimp$'s
internal data structures.
\end{itemize}
       
    29 
       
\section{The $\mathbf{r}$-rexp Datatype and the Size Functions}
       
    31 
       
We have a size function for bitcoded regular expressions, written
$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree:

\begin{center}
\begin{tabular}{lcl}
$\llbracket \AZERO \rrbracket $ & $\dn$ & $1$\\
$\llbracket \AONE{bs} \rrbracket $ & $\dn$ & $1$\\
$\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\
$\llbracket \AALTS{bs}{rs} \rrbracket $ & $\dn$ & $1 + \sum_{r \in rs}\, \llbracket r \rrbracket$\\
$\llbracket \ASEQ{bs}{r_1}{r_2} \rrbracket $ & $\dn$ & $1 + \llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket$\\
$\llbracket \ASTAR{bs}{r} \rrbracket $ & $\dn$ & $1 + \llbracket r \rrbracket$
\end{tabular}
\end{center}
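As a small sanity check of this definition, an annotated sequence whose left child is a
character and whose right child is a two-element alternative has five nodes:
\begin{center}
$\llbracket \ASEQ{bs}{\ACHAR{bs_1}{a}}{\AALTS{bs_2}{[\ACHAR{bs_3}{b},\, \AONE{bs_4}]}} \rrbracket \;=\; 1 + 1 + (1 + 1 + 1) \;=\; 5$
\end{center}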
       
    41 
       
The size is computed by recursion over the structure of the regular expression,
ignoring the bitcodes.
Therefore we can (almost) talk about the size of an annotated regular expression 
in terms of its un-annotated form:
\begin{center}
$\asize(a) \approx \size(\erase(a))$. 
\end{center}
       
    50 
       
But there is a minor nuisance here:
the $\erase$ function messes with the structure of the regular expression
in the cases of empty and singleton alternatives:
\begin{center}
\begin{tabular}{lcl}
$\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\
$\erase(\AALTS{bs}{[a]})$ & $\dn$ & $\erase(a)$\\
$\erase(\AALTS{bs}{a :: as})$ & $\dn$ & $\erase(a) + \erase(\AALTS{bs}{as})$
\end{tabular}
\end{center}
       
    59 
       
An alternative regular expression with an empty list of children
is turned into a $\ZERO$ by the
$\erase$ function, and a singleton alternative is turned into its sole child,
thereby changing the size and structure of the regular expression.
These issues are likely fixable if we really wanted to use plain $\rexp$s for dealing
with sizes, but we choose a more straightforward method: we
define a new datatype that is similar to plain $\rexp$s but whose
alternative constructor takes a list of children rather than just two.
We call this datatype $\rrexp$ to distinguish it from
plain regular expressions.
       
    69 \[			\rrexp ::=   \RZERO \mid  \RONE
       
    70 			 \mid  \RCHAR{c}  
       
    71 			 \mid  \RSEQ{r_1}{r_2}
       
    72 			 \mid  \RALTS{rs}
       
    73 			 \mid \RSTAR{r}        
       
    74 \]
       
For $\rrexp$s we throw away the bitcodes of the annotated regular expressions, 
but keep everything else intact.
This is similar to $\erase$-ing an annotated regular expression, except that the structure is fully preserved
(recall that the $\erase$ function does not preserve structure in the cases of empty and singleton
$\ALTS$).
We denote the operation of erasing the bits and turning an annotated regular expression 
into an $\rrexp{}$ by $\rerase{}$.
       
\begin{center}
\begin{tabular}{lcl}
$\rerase{\AZERO}$ & $=$ & $\RZERO$\\
$\rerase{\AONE{bs}}$ & $=$ & $\RONE$\\
$\rerase{\ACHAR{bs}{c}}$ & $=$ & $\RCHAR{c}$\\
$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{\map\; \rerase{\_}\; rs}$\\
$\rerase{\ASTAR{bs}{r}}$ & $=$ & $\RSTAR{\rerase{r}}$
\end{tabular}
\end{center}
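For readers who prefer code, the following Scala sketch illustrates the $\rrexp$ datatype
and the $\rerase{}$ translation. This is purely an illustration: the \texttt{ARexp}
constructors for annotated regular expressions are assumed to be defined as in the earlier
chapters, and none of the names refer to the actual Isabelle formalisation.
{\small
\begin{verbatim}
// r-regular expressions: like plain regexes, but with list alternatives
abstract class RRexp
case object RZERO extends RRexp
case object RONE extends RRexp
case class RCHAR(c: Char) extends RRexp
case class RSEQ(r1: RRexp, r2: RRexp) extends RRexp
case class RALTS(rs: List[RRexp]) extends RRexp
case class RSTAR(r: RRexp) extends RRexp

// drop the bitcodes of an annotated regular expression, keep the structure
def rerase(a: ARexp): RRexp = a match {
  case AZERO           => RZERO
  case AONE(_)         => RONE
  case ACHAR(_, c)     => RCHAR(c)
  case ASEQ(_, a1, a2) => RSEQ(rerase(a1), rerase(a2))
  case AALTS(_, as)    => RALTS(as.map(rerase))
  case ASTAR(_, a1)    => RSTAR(rerase(a1))
}
\end{verbatim}
}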
       
Similarly we can define the derivative and simplification operations on 
$\rrexp$s; these are identical to the ones defined earlier for plain $\rexp$s, 
except that they now operate on alternatives with a list of children.
       
    93 
       
    94 \begin{center}
       
    95 \begin{tabular}{lcr}
       
    96 $\RALTS{rs} \backslash c$ & $=$ &  $\RALTS{map\; (\_ \backslash c) \;rs}$\\
       
    97 (other clauses omitted)
       
    98 \end{tabular}
       
    99 \end{center}
       
   100 
       
   101 Now that $\rrexp$s do not have bitcodes on them, we can do the 
       
   102 duplicate removal without  an equivalence relation:
       
   103 \begin{center}
       
   104 \begin{tabular}{lcl}
       
$\rdistinct{[]}{rset}$ & $=$ & $[]$\\
$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}\;(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
       
   106            			    &        & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
       
   107 \end{tabular}
       
   108 \end{center}
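As a reference point, here is an illustrative Scala version of $\rdistinct{}{}$ (again only
a sketch, not the formalised definition): it keeps the first occurrence of each regular
expression and drops later duplicates.
{\small
\begin{verbatim}
def rdistinct(rs: List[RRexp], acc: Set[RRexp]): List[RRexp] = rs match {
  case Nil => Nil
  case r :: rest =>
    if (acc.contains(r)) rdistinct(rest, acc)   // duplicate: drop it
    else r :: rdistinct(rest, acc + r)          // first occurrence: keep it
}
\end{verbatim}
}
For example, \texttt{rdistinct(List(r1, r2, r1), Set())} evaluates to \texttt{List(r1, r2)}.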
       
   109 %TODO: definition of rsimp (maybe only the alternative clause)
       
   110 
       
   111 
       
The reason why these definitions mirror precisely the corresponding operations
on annotated regular expressions is that we can then calculate sizes more easily:
       
   114 
       
   115 \begin{lemma}
       
   116 $\rsize{\rerase a} = \asize a$
       
   117 \end{lemma}
       
This lemma says that the size of an $\rerase$-ed regular expression is the same as that of the annotated one.
This does not hold for plain $\rexp$s because of how their alternatives are represented.
       
   120 \begin{lemma}
       
   121 $\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
       
   122 \end{lemma}
       
   123 Putting these two together we get a property that allows us to estimate the 
       
   124 size of an annotated regular expression derivative using its un-annotated counterpart:
       
   125 \begin{lemma}
       
   126 $\asize{\bderssimp{r}{s}} =  \rsize{\rderssimp{\rerase{r}}{s}}$
       
   127 \end{lemma}
       
Unless stated otherwise, in this chapter all $\textit{rexp}$s without
 bitcodes are taken to be $\rrexp$s.
 We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchangeably,
 as the former is the more intuitive way of writing a binary alternative
 regular expression.
       
   133 
       
   134 
       
   135 
       
   136 %-----------------------------------
       
   137 %	SECTION ?
       
   138 %-----------------------------------
       
\section{Preparatory Properties for the Finiteness Bound}
       
Before we get to the proof that the intermediate results of our lexer
remain finitely bounded, which is an important efficiency/liveness guarantee,
we first develop a few preparatory properties and definitions that 
make the proof easier.

We define rewriting relations for $\rrexp$s, which allow us to play the 
same trick as in the correctness proof,
but this time with stronger equalities.
       
\subsection{The ``hrewrite'' Relation}
Below we list some of the one-step rewrite rules for regular expression simplification
without bitcodes:
\begin{figure}
\caption{The ``h-rewrite'' rules}
\[
r_1 \cdot \ZERO \rightarrow_h \ZERO 
\]

\[
\ZERO \cdot r_2 \rightarrow_h \ZERO 
\]
\end{figure}
We also define a ``grewrite'' relation $\rightarrow_g$ that works on lists of regular expressions:
\begin{center}
\begin{tabular}{lcl}
$ \ZERO :: rs$ & $\rightarrow_g$ & $rs$
\end{tabular}
\end{center}
       
   165 
       
   166 
       
   167 
       
With these relations we prove
\begin{lemma}
$rs \rightarrow_g rs'  \implies \RALTS{rs} \rightarrow_h \RALTS{rs'}$
\end{lemma}
\noindent
which enables us to prove ``closed form'' equalities such as
\begin{lemma}
$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\suffix \; s \; r_1 ))}$
\end{lemma}

\noindent
The most involved part of proving the above lemma is the following fact:
\begin{lemma}
$\bsimp{\RALTS{rs \,@\, [\RALTS{rs_1}] \,@\, rs'}} = \bsimp{\RALTS{rs \,@\, rs_1 \,@\, rs'}} $ 
\end{lemma}
\noindent
which is needed for proving properties such as 
\begin{lemma}
$r \rightarrow_f r'  \implies \rsimp{r} \rightarrow \rsimp{r'}$
\end{lemma}

\noindent
Fortunately this is provable by induction on the list $rs$.
       
   187 
       
   188 
       
   189 
       
   190 %-----------------------------------
       
   191 %	SECTION 2
       
   192 %-----------------------------------
       
   193 
       
   194 \begin{theorem}
       
   195 For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$
       
   196 \end{theorem}
       
   197 
       
   198 \noindent
       
   199 \begin{proof}
       
We prove this by induction on $r$. The base cases for $\AZERO$,
$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r_1}{s} \rrbracket \leq N_1$ and
$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
       
   205 %
       
   206 \begin{center}
       
   207 \begin{tabular}{lcll}
       
   208 & & $ \llbracket   \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
       
   209 & $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
       
   210     [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
       
   211 & $\leq$ &
       
   212     $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
       
   213     [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
       
   214 & $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
       
   215              \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
       
   216 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
       
   217       \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
       
   218 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
       
   219 \end{tabular}
       
   220 \end{center}
       
   221 
       
   222 
       
   223 \noindent
       
where in (1) $\textit{Suffix}(r_1, s)$ is the list of all suffixes $s'$ of $s$ for which
the derivative of $r_1$ w.r.t.\ the corresponding prefix (that is, the $s_1$ with $s = s_1 \,@\, s'$) is nullable.
The reason why we can write the derivative of a sequence in this way is explained in Section 2.
The term (2) is used to bound the term (1): one obtains an overall
smaller list of regular expressions
by flattening it and removing $\ZERO$s first, before applying $\distinctWith$ to it.
Section 3 is dedicated to the proof of this fact.
In (3) we use the fact that  $\llbracket \ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is 
bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we use the fact that the list comprehension contains only regular expressions of size no larger
than $N_2$. The length of the list after  $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
for the number of distinct regular expressions of size up to $N_2$ (of which there can only be finitely many).
Hence $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ can serve as the bound for the sequence case.
We reason similarly for  $\STAR$.\medskip
       
   236 \end{proof}
       
   237 
       
What guarantee does this bound give us?
It ensures that, no matter how long the input string is, the derivatives of a given
regular expression do not grow indefinitely.
Consider again our previous example $(a + aa)^*$:
       
   242 \begin{center}
       
   243 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
   244 \begin{tikzpicture}
       
   245 \begin{axis}[
       
   246     xlabel={number of $a$'s},
       
   247     x label style={at={(1.05,-0.05)}},
       
   248     ylabel={regex size},
       
   249     enlargelimits=false,
       
   250     xtick={0,5,...,30},
       
   251     xmax=33,
       
   252     ymax= 40,
       
   253     ytick={0,10,...,40},
       
   254     scaled ticks=false,
       
   255     axis lines=left,
       
   256     width=5cm,
       
   257     height=4cm, 
       
   258     legend entries={$(a + aa)^*$},  
       
   259     legend pos=north west,
       
   260     legend cell align=left]
       
   261 \addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
       
   262 \end{axis}
       
   263 \end{tikzpicture}
       
   264 \end{tabular}
       
   265 \end{center}
       
As can be seen, our simplification rules limit the size of the derivatives of
$(a + aa)^*$ very effectively.
       
   269 
       
   270 
       
In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound
is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.
Given that $l_{N_2}$ is roughly of size $4^{N_2}$, the bound for $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
inflates the bound for $\llbracket \bderssimp{r_2}{s} \rrbracket$ by the function
$f(x) = x * 4^{x}$.
This means our bound grows at least
tower-exponentially with the depth of the regular expression:
for a regex of depth $n$, the bound
is approximately a tower of exponentials of height $n$.
       
   280 
       
The test data in the graphs below, obtained from randomly generated regular expressions,
shows that these gigantic bounds are far from being reached in practice.
       
   283 %a few sample regular experessions' derivatives
       
   284 %size change
       
   285 %TODO: giving regex1_size_change.data showing a few regexes' size changes 
       
   286 %w;r;t the input characters number, where the size is usually cubic in terms of original size
       
   287 %a*, aa*, aaa*, .....
       
   288 %randomly generated regexes
       
   289 \begin{center}
       
   290 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
   291 \begin{tikzpicture}
       
   292 \begin{axis}[
       
   293     xlabel={number of $a$'s},
       
   294     x label style={at={(1.05,-0.05)}},
       
   295     ylabel={regex size},
       
   296     enlargelimits=false,
       
   297     xtick={0,5,...,30},
       
   298     xmax=33,
       
   299     ymax=1000,
       
   300     ytick={0,100,...,1000},
       
   301     scaled ticks=false,
       
   302     axis lines=left,
       
   303     width=5cm,
       
   304     height=4cm, 
       
   305     legend entries={regex1},  
       
   306     legend pos=north west,
       
   307     legend cell align=left]
       
   308 \addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
       
   309 \end{axis}
       
   310 \end{tikzpicture}
       
   311   &
       
   312 \begin{tikzpicture}
       
   313 \begin{axis}[
       
   314     xlabel={$n$},
       
   315     x label style={at={(1.05,-0.05)}},
       
   316     %ylabel={time in secs},
       
   317     enlargelimits=false,
       
   318     xtick={0,5,...,30},
       
   319     xmax=33,
       
   320     ymax=1000,
       
   321     ytick={0,100,...,1000},
       
   322     scaled ticks=false,
       
   323     axis lines=left,
       
   324     width=5cm,
       
   325     height=4cm, 
       
   326     legend entries={regex2},  
       
   327     legend pos=north west,
       
   328     legend cell align=left]
       
   329 \addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
       
   330 \end{axis}
       
   331 \end{tikzpicture}
       
   332   &
       
   333 \begin{tikzpicture}
       
   334 \begin{axis}[
       
   335     xlabel={$n$},
       
   336     x label style={at={(1.05,-0.05)}},
       
   337     %ylabel={time in secs},
       
   338     enlargelimits=false,
       
   339     xtick={0,5,...,30},
       
   340     xmax=33,
       
   341     ymax=1000,
       
   342     ytick={0,100,...,1000},
       
   343     scaled ticks=false,
       
   344     axis lines=left,
       
   345     width=5cm,
       
   346     height=4cm, 
       
   347     legend entries={regex3},  
       
   348     legend pos=north west,
       
   349     legend cell align=left]
       
   350 \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
       
   351 \end{axis}
       
   352 \end{tikzpicture}\\
       
\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regular expressions w.r.t.\ input string length.}
       
   354 \end{tabular}    
       
   355 \end{center}  
       
   356 
       
   357 
       
   358 
       
   359 
       
   360 
       
   361 \noindent
       
Most of the derivatives' sizes seem to stay within a polynomial bound w.r.t.\ the 
original size.
This suggests a link towards ``partial derivatives''
 introduced by Antimirov \cite{Antimirov95}.
       
   366  
       
 \section{Antimirov's Partial Derivatives}
       
 The idea behind Antimirov's partial derivatives
is to compute derivatives in a similar way as suggested by Brzozowski, 
but to maintain a set of regular expressions instead of a single one:

 \begin{center}  
 \begin{tabular}{lcl}
 $\partial_x(\ZERO)$ & $=$ & $\phi$\\
 $\partial_x(\ONE)$ & $=$ & $\phi$\\
 $\partial_x(c)$ & $=$ & $\textit{if} \; c = x \; \textit{then} \; \{\ONE\} \; \textit{else} \; \phi$\\
 $\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\
 $\partial_x(a \cdot b)$ & $=$ & $\textit{if}\; \nullable(a) \;\textit{then}\; \partial_x(a) \cdot b \,\cup\, \partial_x(b) \;\textit{else}\; \partial_x(a) \cdot b$\\
 $\partial_x(a^*)$ & $=$ & $\partial_x(a) \cdot a^*$
\end{tabular}
\end{center}
       
   379 
       
Rather than joining the calculated derivatives $\partial_x a$ and $\partial_x b$ together
using the alternatives constructor, Antimirov put them into
a set.  This breaks up the terms of a derivative, 
allowing us to see what its ``atomic'' components are.
For example, to see what the derivative of $x^*\cdot(xx + y)^*$ 
w.r.t.\ $x$ is made of, one can take its partial derivative
and obtain the two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$
from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$, respectively.
       
To get all the ``atomic'' components of a regular expression's possible derivatives,
Antimirov defined a procedure called $\textit{lf}$, short for ``linear forms'', which takes
each character that can appear at the head of a string in the language of a
regular expression and pairs it with the corresponding derivative regular expression
(such a pair he called a ``monomial''):
 \begin{center}  
 \begin{tabular}{lcl}
 $\lf(\ZERO)$ & $=$ & $\phi$\\
 $\lf(\ONE)$ & $=$ & $\phi$\\
$\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\
 $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
 $\lf(a \cdot b)$ & $=$ & $\textit{if}\; \nullable(a) \;\textit{then}\; \lf(a) \bigodot b \,\cup\, \lf(b) \;\textit{else}\; \lf(a) \bigodot b$\\
 $\lf(r^*)$ & $=$ & $\lf(r) \bigodot r^*$
\end{tabular}
\end{center}
       
   402 
       
The difference to $\partial$ lies in the clauses for sequences and stars:
instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular 
expression $r$ to every element of the set $\textit{rset}$ to create a set of 
sequences, $\lf$ uses the ``circle dot'' operator $\bigodot$, which takes 
a set of monomials (which Antimirov called a ``linear form'') and a regular 
expression, and returns a linear form:
       
 \begin{center}  
 \begin{tabular}{lcl}
 $l \bigodot (\ZERO)$ & $=$ & $\phi$\\
 $l \bigodot (\ONE)$ & $=$ & $l$\\
 $\phi \bigodot t$ & $=$ & $\phi$\\
 $\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\
 $\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\
  $\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\
 $(\{ (x, p) \} \cup l) \bigodot t$ & $=$ & $(\{(x, p)\} \bigodot t) \cup (l \bigodot t)$
\end{tabular}
\end{center}
       
   422 
       
 Some degree of simplification is applied when doing $\bigodot$: for example,
 $l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$,
 $l \bigodot (\ONE) = l$ to $r \cdot \ONE \rightsquigarrow r$,
  $\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot t \rightsquigarrow \ZERO$,
  and so on.
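For example, with the clauses above we can calculate the linear form of $a\cdot b + c$
(where $a$, $b$ and $c$ are characters):
\begin{center}
$\lf(a\cdot b + c) \;=\; (\lf(a) \bigodot b) \,\cup\, \lf(c) \;=\; (\{(a, \ONE)\} \bigodot b) \,\cup\, \{(c, \ONE)\} \;=\; \{(a, b),\, (c, \ONE)\}$
\end{center}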
       
   428   
       
  With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regular expression $r$ with 
  an iterative procedure, starting from $\PD_0 = \{r\}$ and $\Delta_0 = \{r\}$:
   \begin{center}  
 \begin{tabular}{llll}
$\textit{while}$ & $(\Delta_i \neq \phi)$  &                &          \\
 		       &  $\Delta_{i+1}$           &        $ =$ & $\lf(\Delta_i) - \PD_i$ \\
		       &  $\PD_{i+1}$              &         $ =$ & $\Delta_{i+1} \cup \PD_i$ \\
$\partial_{UNIV}(r)$ & $=$ & $\PD$ &		     
\end{tabular}
\end{center}
\noindent
Here $\lf$ is lifted to sets of regular expressions in the obvious way, and only the
regular expression components of the resulting monomials are kept.
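The following Scala sketch shows one possible rendering of this fixpoint computation.
It reuses the \texttt{RRexp} type from above; \texttt{rnullable} is the nullability test
on $\rrexp$s assumed from the earlier definitions; the code is only illustrative and is
neither Antimirov's nor the thesis's actual implementation.
{\small
\begin{verbatim}
// Antimirov's partial derivative of r w.r.t. a single character c
def pder(c: Char, r: RRexp): Set[RRexp] = r match {
  case RZERO | RONE => Set()
  case RCHAR(d)     => if (c == d) Set(RONE) else Set()
  case RALTS(rs)    => rs.toSet.flatMap((r1: RRexp) => pder(c, r1))
  case RSEQ(r1, r2) =>
    val left = pder(c, r1).map(p => RSEQ(p, r2): RRexp)
    if (rnullable(r1)) left ++ pder(c, r2) else left
  case RSTAR(r1)    => pder(c, r1).map(p => RSEQ(p, r): RRexp)
}

// all partial derivatives of r reachable by any string over the alphabet
def pderUNIV(r: RRexp, alphabet: Set[Char]): Set[RRexp] = {
  var pd    = Set(r)   // PD_0
  var delta = Set(r)   // Delta_0
  while (delta.nonEmpty) {
    val next = for (p <- delta; c <- alphabet; q <- pder(c, p)) yield q
    delta = next -- pd   // Delta_{i+1}
    pd = pd ++ delta     // PD_{i+1}
  }
  pd
}
\end{verbatim}
}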
       
   439 
       
   440 
       
A distinguishing feature of Antimirov's partial derivatives is that they implicitly
apply the distributivity rule
$(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$.
However, if we introduced such a rule in our
setting, we would lose the POSIX property of our calculated values. 
A simple example for this is the regular expression $(a + a\cdot b)\cdot(b\cdot c + c)$.
If we split this regular expression up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, then for the string $abc$ the lexer 
would give back $\Left(\Seq(\Char(a), \Left(\Seq(\Char(b), \Char(c)))))$ instead of
what we want: $\Seq(\Right(\Seq(\Char(a), \Char(b))), \Right(\Char(c)))$. Unless we can store the structural information
for all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$
occurs, and re-apply it in the right order once we have 
a result for the ``aggressively simplified'' regular expression, it is impossible to still obtain a $\POSIX$ value.
This is unlike our simplification from before, whose rewriting rules, 
such  as $\ONE \cdot r \rightsquigarrow r$, do not change the value computed by the lexer.
We will discuss better
bounds in the last section of this chapter.\\[-6.5mm]
       
   457 
       
   458 
       
   459 
       
   460 
       
   461 %----------------------------------------------------------------------------------------
       
   462 %	SECTION ??
       
   463 %----------------------------------------------------------------------------------------
       
   464 
       
\section{``Closed Forms'' of Regular Expressions' Derivatives w.r.t.\ Strings}
To obtain the ``closed forms'' of regular expressions, we first
define a few auxiliary notions that allow us to express them smoothly.
       
   468 
       
   469  \begin{center}  
       
   470  \begin{tabular}{ccc}
       
   471  $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
       
   472 $\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
       
   473 $\sflataux r$ & $=$ & $ [r]$
       
   474 \end{tabular}
       
   475 \end{center}
       
The intuition behind $\sflataux{\_}$ is to break up nested alternatives 
of the left-associated shape $(\ldots((r_1 + r_2) + r_3) + \ldots )$
into a flattened list $[r_1,\, r_2 ,\, r_3, \ldots]$.
On anything that is not an alternative it returns the singleton list $[r]$.
       
   480 
       
$\sflat{\_}$ works the same  as $\sflataux{\_}$, except that its output type is
a regular expression rather than a list:
 \begin{center} 
 \begin{tabular}{ccc}
 $\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\AALTS{ }{\sflataux{r} @ rs}$\\
$\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\
$\sflat r$ & $=$ & $ r$
\end{tabular}
\end{center}
       
$\sflataux{\_}$  and $\sflat{\_}$ only recurse into the
 first element of the list of children of
an alternative ($\AALTS{ }{rs}$); they are purposefully built for dealing with the regular expression
$(r_1 \cdot r_2) \backslash s$.
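As a concrete instance, writing $+$ for a binary $\AALTS{ }{\ldots}$:
\begin{center}
$\sflat{(((r_1 + r_2) + r_3) + r_4)} \;=\; \AALTS{ }{[r_1,\, r_2,\, r_3,\, r_4]}$
\end{center}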
       
   494 
       
With $\sflat{\_}$ and $\sflataux{\_}$ we can make 
precise what  ``closed forms'' we have for the sequence derivatives and their simplifications;
in other words, how we can construct $(r_1 \cdot r_2) \backslash s$
and $\bderssimp{(r_1\cdot r_2)}{s}$
using only a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)
and  $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$  ranges over 
the substrings of $s$.
First let us look at a series of derivative steps on a sequence 
regular expression,  assuming that each time the first
component of the sequence is nullable:
       
   505 \begin{center}
       
   506 
       
   507 $r_1 \cdot r_2 \quad \longrightarrow_{\backslash c}  \quad   r_1  \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
       
$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c''   \longrightarrow_{\backslash c'''} \quad
       
   509  \ldots$
       
   510 
       
   511 \end{center}
       
   512 %TODO: cite indian paper
       
   513 Indianpaper have  come up with a slightly more formal way of putting the above process:
       
   514 \begin{center}
       
$(r_1 \cdot r_2) \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv ((r_1 \backslash (c_1 :: c_2:: \ldots c_n)) \cdot r_2) +
\delta(\nullable(r_1 \backslash (c_1 :: c_2 :: \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots
+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$
       
   518 \end{center}
       
   519 where  $\delta(b, r)$ will produce $r$
       
   520 if $b$ evaluates to true, 
       
   521 and $\ZERO$ otherwise.
       
   522 
       
 But the $\myequiv$ sign in the above equation means language equivalence, not syntactic
 equivalence. To make this intuition useful 
 for a formal proof, we need something
stronger than language equivalence.
With the help of $\sflat{\_}$ we can state the equation from Indianpaper
more rigorously:
       
   529 \begin{lemma}
       
   530 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
       
   531 \end{lemma}
       
   532 
       
   533 The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
       
   534 
       
   535 \begin{center}
       
   536 \begin{tabular}{lcl}
       
   537 $\vsuf{[]}{\_} $ & $=$ &  $[]$\\
       
   538 $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
       
   539                                      && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) })  $
       
   540 \end{tabular}
       
   541 \end{center}                   
       
It takes into account which prefixes $s'$ of $s$ make $r \backslash s'$ nullable,
and keeps a list of the corresponding suffixes $s''$ (those with $s' @ s'' = s$). The list is sorted in
the order in which the $r_2\backslash s''$ appear in $(r_1\cdot r_2)\backslash s$.
In essence, $\vsuf{\_}{\_}$ performs a ``virtual derivative'' of $r_1 \cdot r_2$: instead of producing 
the entire result of  $(r_1 \cdot r_2) \backslash s$, 
it only stores the suffixes $s''$ for which $r_2 \backslash s''$ occurs as a term in $(r_1\cdot r_2)\backslash s$.
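An illustrative Scala rendering of $\vsuf{\_}{\_}$ (again only a sketch over the
\texttt{RRexp} type from above; \texttt{rder} and \texttt{rnullable} are the derivative and
nullability functions on $\rrexp$s assumed from the earlier definitions):
{\small
\begin{verbatim}
// collect the suffixes c::cs of s at which the prefix-derivative of r1
// is nullable; these are the s'' with r2\s'' occurring in (r1 r2)\s
def vsuf(s: List[Char], r1: RRexp): List[List[Char]] = s match {
  case Nil => Nil
  case c :: cs =>
    if (rnullable(r1)) vsuf(cs, rder(c, r1)) :+ (c :: cs)
    else vsuf(cs, rder(c, r1))
}
\end{verbatim}
}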
       
   548 
       
With this we can also add simplifications to both sides and get
\begin{lemma}
$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
\end{lemma}
Together with the idempotency property of $\rsimp$,
       
   554 %TODO: state the idempotency property of rsimp
       
   555 \begin{lemma}
       
   556 $\rsimp{r} = \rsimp{\rsimp{r}}$
       
   557 \end{lemma}
       
it is possible to convert the above lemma into a ``closed form''
for  derivatives nested with simplification:
\begin{lemma}
$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
\end{lemma}
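For instance, for a two-character string $c_1 c_2$ where both $r_1$ and $r_1 \backslash c_1$
are nullable, we have $\vsuf{[c_1, c_2]}{r_1} = [[c_2], [c_1, c_2]]$ and the closed form
instantiates to
\begin{center}
$\rderssimp{(r_1 \cdot r_2)}{[c_1, c_2]} \;=\; \rsimp{\RALTS{[(r_1 \backslash c_1 c_2)\cdot r_2,\;\; r_2 \backslash c_2,\;\; r_2 \backslash c_1 c_2]}}$
\end{center}
mirroring the informal derivation steps shown at the beginning of this section.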
       
And now the reason we have (1) in Section 1 is clear:
$\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$ 
is equal to $\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r_1, s)))}}}$,
    as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$.
       
   567 
       
   568 %----------------------------------------------------------------------------------------
       
   569 %	SECTION 3
       
   570 %----------------------------------------------------------------------------------------
       
   571 
       
\section{Interaction Between $\distinctWith$ and $\flts$}
Note that it is not immediately obvious that 
\begin{center}
$\llbracket \distinctWith \; (\flts \; \textit{rs}) \; \phi \rrbracket   \leq \llbracket \distinctWith \; \textit{rs} \; \phi \rrbracket  $.
\end{center}
The intuition is that the flattened list $\flts\;\textit{rs}$ on the LHS exposes at least as many 
duplicates as the un-flattened list $\textit{rs}$ on the RHS, so $\distinctWith$ removes at least as much. 
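A small example illustrates the intuition. Take $\textit{rs} = [r_1 + r_2,\; r_1,\; r_2]$
(with $r_1$, $r_2$ distinct and not themselves alternatives): flattening first gives
$[r_1,\, r_2,\, r_1,\, r_2]$, which $\distinctWith$ reduces to $[r_1,\, r_2]$, whereas
applying $\distinctWith$ directly to $\textit{rs}$ removes nothing, since all three
elements are syntactically distinct; so the left-hand side ends up strictly smaller.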
       
   579 
       
   580 
       
   581 %-----------------------------------
       
   582 %	SECTION syntactic equivalence under simp
       
   583 %-----------------------------------
       
   584 \section{Syntactic Equivalence Under $\simp$}
       
We prove that minor differences between regular expressions can be annihilated
by $\simp$.
       
   587 For example,
       
   588 \begin{center}
       
   589 $\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
       
   590  \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
       
   591 \end{center}
       
   592 
       
   593 
       
   594 %----------------------------------------------------------------------------------------
       
   595 %	SECTION ALTS CLOSED FORM
       
   596 %----------------------------------------------------------------------------------------
       
   597 \section{A Closed Form for \textit{ALTS}}
       
   598 Now we prove that  $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
       
   599 
       
   600 
       
There are a few key steps; one of them is the equality
\begin{center}
$rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) \phi)))$\\
$= rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) \phi))$
\end{center}

One might hope to prove this with a simple statement like 
\begin{center}
$map \; (rder \; x) \; (rdistinct \; rs \; rset) = rdistinct \; (map \; (rder\; x)\; rs) \; ((rder \; x) \,`\, rset)$.
\end{center}

For this to hold we would need the $\textit{distinct}$ function to pick out
the same elements before and after taking derivatives, that is,
$r \in rset \equiv (rder\; x\; r) \in ((rder\; x) \,`\, rset)$,
which essentially requires the derivative operation $\backslash c$ to be an injective mapping.
       
   615 
       
   616 Unfortunately the function $\backslash c$ is not an injective mapping.
       
   617 
       
\subsection{The Function $\backslash c$ Is Not Injective (1-to-1)}
The derivative w.r.t.\ a character $c$ is not one-to-one.
Formally,
\begin{center}
	$\exists r_1 \;r_2.\; r_1 \neq r_2 \;\textit{and}\; r_1 \backslash c = r_2 \backslash c$
\end{center}
This property is trivially true for
character regular expressions:
\begin{center}
	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
\end{center}
       
But apart from the cases where the derivative
output is $\ZERO$, are there different regular expressions that are mapped to
the same non-trivial derivative (one that can still match strings)?
The answer is yes.
For example,
\begin{center}
	let $r_1 = a^*\cdot b$ and $r_2 = (a\cdot a^*)\cdot b + b$,\\
	where $a$ is not nullable. Then\\
	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot b + b \backslash c$\\
	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot b + b \backslash c$
\end{center}
       
We start with two syntactically different regular expressions
and end up with the same derivative result.
This is not surprising, as we have the following 
equality in the style of Arden's lemma:\\
       
   644 \begin{center}
       
   645 	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
       
   646 \end{center}
       
   647 
       
   648 %----------------------------------------------------------------------------------------
       
   649 %	SECTION 4
       
   650 %----------------------------------------------------------------------------------------
       
   651 \section{A Bound for the Star Regular Expression}
       
We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
the ``closed form'' of $(r_1 \cdot r_2) \backslash s$ and 
the properties of the $\distinct$ function.
Now we would like to obtain a bound for $r^* \backslash s$ as well.
Again, we first look at how a star's derivatives evolve when they grow maximally: 
       
   657 \begin{center}
       
   658 
       
$r^* \quad \longrightarrow_{\backslash c}  \quad   (r\backslash c)  \cdot  r^* \quad \longrightarrow_{\backslash c'}  \quad
r \backslash cc'  \cdot r^* + r \backslash c' \cdot r^*  \quad \longrightarrow_{\backslash c''} \quad 
(r \backslash cc'c'' \cdot r^* + r \backslash c'' \cdot r^*) + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)   \quad \longrightarrow_{\backslash c'''}
\quad \ldots$
       
   663 
       
   664 \end{center}
       
   665 When we have a string $s = c :: c' :: c'' \ldots$  such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, 
       
   666 $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
       
   667 the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
       
   668 of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not 
       
   669 count the possible size explosions of $r \backslash c$ themselves.
       
   670 
       
Thanks to $\flts$ and $\distinctWith$, we are able to open up regular expressions like
$(r \backslash cc'c'' \cdot r^* + r \backslash c'' \cdot r^*) + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
into $\RALTS{[r \backslash cc'c'' \cdot r^*,\; r \backslash c'' \cdot r^*,\; r \backslash c'c'' \cdot r^*,\; r \backslash c'' \cdot r^*]}$
and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
       
   675 For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$:
       
   676 %TODO: definitions of  and \hflataux \hflat
       
   677  \begin{center}  
       
   678  \begin{tabular}{ccc}
       
   679  $\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
       
   680 $\hflataux r$ & $=$ & $ [r]$
       
   681 \end{tabular}
       
   682 \end{center}
       
   683 
       
   684  \begin{center}  
       
   685  \begin{tabular}{ccc}
       
   686  $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\
       
   687 $\hflat r$ & $=$ & $ r$
       
   688 \end{tabular}
       
\end{center}
       
Again these definitions are tailor-made for dealing with alternatives that have
originated from a star's derivatives, so we do not attempt to open up all possible 
regular expressions of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely two
elements.
We give a predicate for such ``star-created'' regular expressions:
       
   695 \begin{center}
       
   696 \begin{tabular}{lcr}
       
   697          &    &       $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
       
   698  $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$
       
   699  \end{tabular}
       
   700  \end{center}
       
   701  
       
 These definitions give us the flexibility to talk about 
 regular expressions in their most convenient format,
 for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $
 instead of binary-nested $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.
 These definitions help express that certain classes of syntactically 
 distinct regular expressions are actually the same under simplification.
 This is not entirely true for annotated regular expressions; consider for example
 %TODO: bsimp bders \neq bderssimp
 \begin{center}
 $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
 \end{center}
       
 For bit-coded regular expressions, the order in which simplification is applied
 might cause a difference in where the bits end up.
 If we want something like
 \begin{center}
 $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
 \end{center}
 then some ``canonicalization'' procedure is required,
 one that either pushes all the common bitcodes to nodes
  as high up in the tree as possible:
  \begin{center}
  $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
  \end{center}
 or does the reverse. However, bitcodes are of no interest when we are talking about
 the size $\llbracket r \rrbracket$ of a regular expression.
 Therefore, for ease and simplicity of the
 size-bound proof, we are happy to restrict ourselves to 
 unannotated regular expressions, where we obtain equalities such as
       
   730  \begin{lemma}
       
   731  $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
       
   732  \end{lemma}
       
   733  
       
   734  \begin{proof}
       
 By using the rewriting relation $\rightsquigarrow$.
       
   736  \end{proof}
       
   737  %TODO: rsimp sflat
       
And from this we obtain a proof that a star's derivative is, after simplification, the same
as if all the nested alternatives created during the derivation had been flattened out:
       
   741  \begin{lemma}
       
   742  $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
       
   743  \end{lemma}
       
   744  \begin{proof}
       
 By structural induction on $r$, where the induction rules are those of $\createdByStar{\_}$.
       
   746  \end{proof}
       
   747 % The simplification of a flattened out regular expression, provided it comes
       
   748 %from the derivative of a star, is the same as the one nested.
       
   749  
       
   750  
       
   751  
       
   752  
       
   753  
       
   754  
       
   755  
       
   756  
       
   757  
       
One might wonder about the actual bound, rather than the loose bound we gave
for the convenience of an easier proof.
How much can the regular expression $r^* \backslash s$ grow? 
As  earlier graphs have shown,
%TODO: reference that graph where size grows quickly
 derivatives can grow at a speed
  exponential in the number of input characters, 
but they eventually level off when the string $s$ is long enough.
However, if they grow to a size exponential in the size of the original regular expression, our algorithm
would still be slow.
And unfortunately, we have concrete examples
where the derivatives grow exponentially large before levelling off:
$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
(\underbrace{a \ldots a}_{\text{n a's}})^*)^*$ already has a maximum
 derivative size that is  exponential in the number $n$ 
under our current simplification rules:
       
   774 %TODO: graph of a regex whose size increases exponentially.
       
   775 \begin{center}
       
   776 \begin{tikzpicture}
       
   777     \begin{axis}[
       
   778         height=0.5\textwidth,
       
   779         width=\textwidth,
       
   780         xlabel=number of a's,
       
   781         xtick={0,...,9},
       
   782         ylabel=maximum size,
       
   783         ymode=log,
       
   784        log basis y={2}
       
   785 ]
       
   786         \addplot[mark=*,blue] table {re-chengsong.data};
       
   787     \end{axis}
       
   788 \end{tikzpicture}
       
   789 \end{center}
       
   790 
       
For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
(\underbrace{a \ldots a}_{\text{n a's}})^*)^*$ in the discussion below.
The exponential size is caused by the regular expression
$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
inside the $(\ldots) ^*$ having exponentially many
different derivatives, despite those differences being minor.
       
$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
will therefore contain the following terms (after flattening out all nested 
alternatives):
\begin{center}
$(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
$(1 \leq m' \leq m )$
\end{center}
       
These terms are distinct as long as $m' \leq L.C.M.(1, \ldots, n)$ (this will be explained in the appendix).
 With each new input character that the intermediate result is derived against, more and more such distinct
 terms accumulate, 
until the length of the string reaches $L.C.M.(1, \ldots, n)$.
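For instance, for $n = 3$ the terms only start repeating themselves after
$L.C.M.(1,2,3) = 6$ characters, and for $n = 5$ only after
$L.C.M.(1,\ldots,5) = 60$ characters, so the number of distinct terms grows very
quickly with $n$.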
       
$\textit{distinctBy}$ is not able to de-duplicate any two of the terms 
$(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
and\\
$(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
 with $m' \neq m''$, \\
 because, although similar, they are syntactically different regular expressions.
       
 This means that with our current simplification methods,
 we will not be able to control the size of the derivatives so that
 $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial, %\leq O((\llbracket r\rrbacket)^c)$
 as there are already exponentially many terms.
 These terms are similar in the sense that their heads
 all consist of sub-terms of the form 
 $(\underbrace{a \ldots a}_{\text{j a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* $.
       
   822  For  $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
       
   823  $n * (n + 1) / 2$ such terms. 
       
   824  For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives
       
   825  can be described by 6 terms:
       
   826  $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, 
       
   827  $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
       
The total number of different ``head terms'',  $n * (n + 1) / 2$,
 is equal to the number of characters in the regular expression 
$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
       
This suggests a slightly different notion of size, which we call the 
alphabetic width: the number of occurrences of characters (literals) in a regular
expression, disregarding the structural operators.
For example, the alphabetic width of $(a+aa)^*$ is $3$.
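In code, the alphabetic width is a straightforward recursion (again an illustrative Scala
sketch over the \texttt{RRexp} type from above, not the formalised definition):
{\small
\begin{verbatim}
// count the character literals occurring in a regular expression
def awidth(r: RRexp): Int = r match {
  case RZERO | RONE => 0
  case RCHAR(_)     => 1
  case RSEQ(r1, r2) => awidth(r1) + awidth(r2)
  case RALTS(rs)    => rs.map(awidth).sum
  case RSTAR(r1)    => awidth(r1)
}
\end{verbatim}
}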
       
   835 
       
   836  
       
Antimirov \parencite{Antimirov95} has proven that 
$|\textit{PDER}_{UNIV}(r)| \leq \textit{awidth}(r)$,
where $\textit{PDER}_{UNIV}(r)$ is the set of all subterms
created by taking derivatives of $r$ w.r.t.\ all possible strings.
If we can make sure that at any moment in our lexing algorithm the 
intermediate result holds at most one copy of each of these 
subterms, then we can obtain the same bound as Antimirov's.
This leads to the algorithm presented in the next chapter.
       
   845 
       
   846 
       
   847 
       
   848 
       
   849 
       
   850 %----------------------------------------------------------------------------------------
       
   851 %	SECTION 1
       
   852 %----------------------------------------------------------------------------------------
       
   853 
       
   854 \section{Idempotency of $\simp$}
       
   855 
       
   856 \begin{equation}
       
   857 	\simp \;r = \simp\; \simp \; r 
       
   858 \end{equation}
       
   859 This property means we do not have to repeatedly
       
   860 apply simplification in each step, which justifies
       
   861 our definition of $\blexersimp$.
       
   862 It will also be useful in future proofs where properties such as 
       
   863 closed forms are needed.
       
   864 The proof is by structural induction on $r$.
       
   865 
       
       
   877