ChengsongTanPhdThesis/Chapters/ChapterFinite.tex
changeset 527 2c907b118f78
child 528 28751de4b4ba
equal deleted inserted replaced
526:cb702fb4227f 527:2c907b118f78
       
     1 % Chapter Template
       
     2 
       
     3 \chapter{Finiteness Bound} % Main chapter title
       
     4 
       
     5 \label{ChapterFinite} 
       
     6 %  In Chapter 4 \ref{Chapter4} we give the second guarantee
       
     7 %of our bitcoded algorithm, that is a finite bound on the size of any 
       
     8 %regex's derivatives. 
       
     9 
       
    10 
       
    11 %-----------------------------------
       
    12 %	SECTION ?
       
    13 %-----------------------------------
       
    14 \section{preparatory properties for getting a finiteness bound}
       
    15 Before we get to the proof that says the intermediate result of our lexer will
       
    16 remain finitely bounded, which is an important efficiency/liveness guarantee,
       
    17 we shall first develop a few preparatory properties and definitions to 
       
    18 make the process of proving that a breeze.
       
    19 
       
    20 We define rewriting relations for $\rrexp$s, which allows us to do the 
       
    21 same trick as we did for the correctness proof,
       
    22 but this time we will have stronger equalities established.
       
    23 \subsection{"hrewrite" relation}
       
    24 List of 1-step rewrite rules for regular expressions simplification without bitcodes:
       
    25 \begin{center}
       
    26 \begin{tabular}{lcl}
       
    27 $r_1 \cdot \ZERO$ & $\rightarrow$ & $\ZERO$
       
    28 \end{tabular}
       
    29 \end{center}
       
    30 
       
    31 And we define an "grewrite" relation that works on lists:
       
    32 \begin{center}
       
    33 \begin{tabular}{lcl}
       
    34 $ \ZERO :: rs$ & $\rightarrow_g$ & $rs$
       
    35 \end{tabular}
       
    36 \end{center}
       
    37 
       
    38 
       
    39 
       
    40 With these relations we prove
       
    41 \begin{lemma}
       
    42 $rs \rightarrow rs'  \implies \RALTS{rs} \rightarrow \RALTS{rs'}$
       
    43 \end{lemma}
       
    44 which enables us to prove "closed forms" equalities such as
       
    45 \begin{lemma}
       
    46 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
       
    47 \end{lemma}
       
    48 
       
    49 But the most involved part of the above lemma is proving the following:
       
    50 \begin{lemma}
       
    51 $\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ 
       
    52 \end{lemma}
       
    53 which is needed in proving things like 
       
    54 \begin{lemma}
       
    55 $r \rightarrow_f r'  \implies \rsimp{r} \rightarrow \rsimp{r'}$
       
    56 \end{lemma}
       
    57 
       
    58 Fortunately this is provable by induction on the list $rs$
       
    59 
       
    60 
       
    61 
       
    62 %-----------------------------------
       
    63 %	SECTION 2
       
    64 %-----------------------------------
       
    65 
       
    66 \section{Finiteness Property}
       
    67 In this section let us describe our argument for why the size of the simplified
       
    68 derivatives with the aggressive simplification function is finitely bounded.
       
    69  Suppose
       
    70 we have a size function for bitcoded regular expressions, written
       
    71 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
       
    72 
       
    73 \begin{center}
       
    74 \begin{tabular}{ccc}
       
    75 $\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\
       
    76 \end{tabular}
       
    77 \end{center}
       
    78 (TODO: COMPLETE this defn and for $rs$)
       
    79 
       
    80 The size is based on a recursive function on the structure of the regex,
       
    81 not the bitcodes.
       
    82 Therefore we may as well talk about size of an annotated regular expression 
       
    83 in their un-annotated form:
       
    84 \begin{center}
       
    85 $\asize(a) = \size(\erase(a))$. 
       
    86 \end{center}
       
    87 (TODO: turn equals to roughly equals)
       
    88 
       
    89 But there is a minor nuisance here:
       
    90 the erase function actually messes with the structure of the regular expression:
       
    91 \begin{center}
       
    92 \begin{tabular}{ccc}
       
    93 $\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\
       
    94 \end{tabular}
       
    95 \end{center}
       
    96 (TODO: complete this definition with singleton r in alts)
       
    97 
       
    98 An alternative regular expression with an empty list of children
       
    99  is turned into an $\ZERO$ during the
       
   100 $\erase$ function, thereby changing the size and structure of the regex.
       
   101 These will likely be fixable if we really want to use plain $\rexp$s for dealing
       
   102 with size, but we choose a more straightforward (or stupid) method by 
       
   103 defining a new datatype that is similar to plain $\rexp$s but can take
       
   104 non-binary arguments for its alternative constructor,
       
   105  which we call $\rrexp$ to denote
       
   106 the difference between it and plain regular expressions. 
       
   107 \[			\rrexp ::=   \RZERO \mid  \RONE
       
   108 			 \mid  \RCHAR{c}  
       
   109 			 \mid  \RSEQ{r_1}{r_2}
       
   110 			 \mid  \RALTS{rs}
       
   111 			 \mid \RSTAR{r}        
       
   112 \]
       
   113 For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, 
       
   114 but keep everything else intact.
       
   115 It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
       
   116 (the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
       
   117 $\ALTS$).
       
   118 We denote the operation of erasing the bits and turning an annotated regular expression 
       
   119 into an $\rrexp{}$ as $\rerase{}$.
       
   120 \begin{center}
       
   121 \begin{tabular}{lcl}
       
   122 $\rerase{\AZERO}$ & $=$ & $\RZERO$\\
       
   123 $\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
       
   124 $\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
       
   125 \end{tabular}
       
   126 \end{center}
       
   127 %TODO: FINISH definition of rerase
       
   128 Similarly we could define the derivative  and simplification on 
       
   129 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, 
       
   130 except that now they can operate on alternatives taking multiple arguments.
       
   131 
       
   132 \begin{center}
       
   133 \begin{tabular}{lcr}
       
   134 $\RALTS{rs} \backslash c$ & $=$ &  $\RALTS{map\; (\_ \backslash c) \;rs}$\\
       
   135 (other clauses omitted)
       
   136 \end{tabular}
       
   137 \end{center}
       
   138 
       
   139 Now that $\rrexp$s do not have bitcodes on them, we can do the 
       
   140 duplicate removal without  an equivalence relation:
       
   141 \begin{center}
       
   142 \begin{tabular}{lcl}
       
   143 $\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
       
   144            			    &        & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
       
   145 \end{tabular}
       
   146 \end{center}
       
   147 %TODO: definition of rsimp (maybe only the alternative clause)
       
   148 
       
   149 
       
   150 The reason why these definitions  mirror precisely the corresponding operations
       
   151 on annotated regualar expressions is that we can calculate sizes more easily:
       
   152 
       
   153 \begin{lemma}
       
   154 $\rsize{\rerase a} = \asize a$
       
   155 \end{lemma}
       
   156 This lemma says that the size of an r-erased regex is the same as the annotated regex.
       
   157 this does not hold for plain $\rexp$s due to their ways of how alternatives are represented.
       
   158 \begin{lemma}
       
   159 $\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
       
   160 \end{lemma}
       
   161 Putting these two together we get a property that allows us to estimate the 
       
   162 size of an annotated regular expression derivative using its un-annotated counterpart:
       
   163 \begin{lemma}
       
   164 $\asize{\bderssimp{r}{s}} =  \rsize{\rderssimp{\rerase{r}}{s}}$
       
   165 \end{lemma}
       
   166 Unless stated otherwise in this chapter all $\textit{rexp}$s without
       
   167  bitcodes are seen as $\rrexp$s.
       
   168  We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
       
   169  as the former suits people's intuitive way of stating a binary alternative
       
   170  regular expression.
       
   171 
       
   172 
       
   173 \begin{theorem}
       
   174 For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$
       
   175 \end{theorem}
       
   176 
       
   177 \noindent
       
   178 \begin{proof}
       
   179 We prove this by induction on $r$. The base cases for $\AZERO$,
       
   180 $\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
       
   181 for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
       
   182 hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
       
   183 $\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
       
   184 %
       
   185 \begin{center}
       
   186 \begin{tabular}{lcll}
       
   187 & & $ \llbracket   \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
       
   188 & $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
       
   189     [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
       
   190 & $\leq$ &
       
   191     $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
       
   192     [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
       
   193 & $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
       
   194              \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
       
   195 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
       
   196       \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
       
   197 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
       
   198 \end{tabular}
       
   199 \end{center}
       
   200 
       
   201 
       
   202 \noindent
       
   203 where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
       
   204 The reason why we could write the derivatives of a sequence this way is described in section 2.
       
   205 The term (2) is used to control (1). 
       
   206 That is because one can obtain an overall
       
   207 smaller regex list
       
   208 by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
       
   209 Section 3 is dedicated to its proof.
       
   210 In (3) we know that  $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is 
       
   211 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
       
   212 than $N_2$. The list length after  $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
       
   213 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
       
   214 We reason similarly for  $\STAR$.\medskip
       
   215 \end{proof}
       
   216 
       
   217 What guarantee does this bound give us?
       
   218 
       
   219 Whatever the regex is, it will not grow indefinitely.
       
   220 Take our previous example $(a + aa)^*$ as an example:
       
   221 \begin{center}
       
   222 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
   223 \begin{tikzpicture}
       
   224 \begin{axis}[
       
   225     xlabel={number of $a$'s},
       
   226     x label style={at={(1.05,-0.05)}},
       
   227     ylabel={regex size},
       
   228     enlargelimits=false,
       
   229     xtick={0,5,...,30},
       
   230     xmax=33,
       
   231     ymax= 40,
       
   232     ytick={0,10,...,40},
       
   233     scaled ticks=false,
       
   234     axis lines=left,
       
   235     width=5cm,
       
   236     height=4cm, 
       
   237     legend entries={$(a + aa)^*$},  
       
   238     legend pos=north west,
       
   239     legend cell align=left]
       
   240 \addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
       
   241 \end{axis}
       
   242 \end{tikzpicture}
       
   243 \end{tabular}
       
   244 \end{center}
       
   245 We are able to limit the size of the regex $(a + aa)^*$'s derivatives
       
   246  with our simplification
       
   247 rules very effectively.
       
   248 
       
   249 
       
   250 As the graphs of  some more randomly generated regexes show, the size of 
       
   251 the derivative might grow quickly at the start of the input,
       
   252  but after a sufficiently long  input string, the trend will stop.
       
   253 
       
   254 
       
   255 %a few sample regular experessions' derivatives
       
   256 %size change
       
   257 %TODO: giving graphs showing a few regexes' size changes 
       
   258 %w;r;t the input characters number
       
   259 %a*, aa*, aaa*, .....
       
   260 %randomly generated regexes
       
   261 \begin{center}
       
   262 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
   263 \begin{tikzpicture}
       
   264 \begin{axis}[
       
   265     xlabel={number of $a$'s},
       
   266     x label style={at={(1.05,-0.05)}},
       
   267     ylabel={regex size},
       
   268     enlargelimits=false,
       
   269     xtick={0,5,...,30},
       
   270     xmax=33,
       
   271     ymax=1000,
       
   272     ytick={0,100,...,1000},
       
   273     scaled ticks=false,
       
   274     axis lines=left,
       
   275     width=5cm,
       
   276     height=4cm, 
       
   277     legend entries={regex1},  
       
   278     legend pos=north west,
       
   279     legend cell align=left]
       
   280 \addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
       
   281 \end{axis}
       
   282 \end{tikzpicture}
       
   283   &
       
   284 \begin{tikzpicture}
       
   285 \begin{axis}[
       
   286     xlabel={$n$},
       
   287     x label style={at={(1.05,-0.05)}},
       
   288     %ylabel={time in secs},
       
   289     enlargelimits=false,
       
   290     xtick={0,5,...,30},
       
   291     xmax=33,
       
   292     ymax=1000,
       
   293     ytick={0,100,...,1000},
       
   294     scaled ticks=false,
       
   295     axis lines=left,
       
   296     width=5cm,
       
   297     height=4cm, 
       
   298     legend entries={regex2},  
       
   299     legend pos=north west,
       
   300     legend cell align=left]
       
   301 \addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
       
   302 \end{axis}
       
   303 \end{tikzpicture}
       
   304   &
       
   305 \begin{tikzpicture}
       
   306 \begin{axis}[
       
   307     xlabel={$n$},
       
   308     x label style={at={(1.05,-0.05)}},
       
   309     %ylabel={time in secs},
       
   310     enlargelimits=false,
       
   311     xtick={0,5,...,30},
       
   312     xmax=33,
       
   313     ymax=1000,
       
   314     ytick={0,100,...,1000},
       
   315     scaled ticks=false,
       
   316     axis lines=left,
       
   317     width=5cm,
       
   318     height=4cm, 
       
   319     legend entries={regex3},  
       
   320     legend pos=north west,
       
   321     legend cell align=left]
       
   322 \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
       
   323 \end{axis}
       
   324 \end{tikzpicture}\\
       
   325 \multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.}
       
   326 \end{tabular}    
       
   327 \end{center}  
       
   328 
       
   329 
       
   330 
       
   331 
       
   332 
       
   333 \noindent
       
   334 Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
       
   335 far from the actual bound we can expect. 
       
   336 In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound
       
   337 is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.
       
   338 Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
       
   339 inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function
       
   340 $f(x) = x * 2^x$.
       
   341 This means the bound we have will surge up at least
       
   342 tower-exponentially with a linear increase of the depth.
       
   343 For a regex of depth $n$, the bound
       
   344 would be approximately $4^n$.
       
   345 %TODO: change this to tower exponential!
       
   346 
       
   347 We can do better than this, but this does not improve
       
   348 the finiteness property we are proving. If we are interested in a polynomial bound,
       
   349 one would hope to obtain a similar tight bound as for partial
       
   350 derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
       
   351  $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in
       
   352 partial derivatives). 
       
   353 To obtain the exact same bound would mean
       
   354 we need to introduce simplifications, such as
       
   355  $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
       
   356 which exist for partial derivatives. 
       
   357 
       
   358 However, if we introduce them in our
       
   359 setting we would lose the POSIX property of our calculated values. 
       
   360 A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$.
       
   361 If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer 
       
   362 would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of
       
   363 what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information
       
   364 in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$
       
   365 occurs, and apply them in the right order once we get 
       
   366 a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value.
       
   367 This is unlike the simplification we had before, where the rewriting rules 
       
   368 such  as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value.
       
   369 We will discuss better
       
   370 bounds in the last section of this chapter.\\[-6.5mm]
       
   371 
       
   372 
       
   373 
       
   374 
       
   375 %----------------------------------------------------------------------------------------
       
   376 %	SECTION ??
       
   377 %----------------------------------------------------------------------------------------
       
   378 
       
   379 \section{"Closed Forms" of regular expressions' derivatives w.r.t strings}
       
   380 To embark on getting the "closed forms" of regexes, we first
       
   381 define a few auxiliary definitions to make expressing them smoothly.
       
   382 
       
   383  \begin{center}  
       
   384  \begin{tabular}{ccc}
       
   385  $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
       
   386 $\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
       
   387 $\sflataux r$ & $=$ & $ [r]$
       
   388 \end{tabular}
       
   389 \end{center}
       
   390 The intuition behind $\sflataux{\_}$ is to break up nested regexes 
       
   391 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
       
   392 into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
       
   393 It will return the singleton list $[r]$ otherwise.
       
   394 
       
   395 $\sflat{\_}$ works the same  as $\sflataux{\_}$, except that it keeps
       
   396 the output type a regular expression, not a list:
       
   397  \begin{center} 
       
   398  \begin{tabular}{ccc}
       
   399  $\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
       
   400 $\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\
       
   401 $\sflat r$ & $=$ & $ [r]$
       
   402 \end{tabular}
       
   403 \end{center}
       
   404 $\sflataux{\_}$  and $\sflat{\_}$ is only recursive in terms of the
       
   405  first element of the list of children of
       
   406 an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for  dealing with the regular expression
       
   407 $r_1 \cdot r_2 \backslash s$.
       
   408 
       
   409 With $\sflat{\_}$ and $\sflataux{\_}$ we make 
       
   410 precise what  "closed forms" we have for the sequence derivatives and their simplifications,
       
   411 in other words, how can we construct $(r_1 \cdot r_2) \backslash s$
       
   412 and $\bderssimp{(r_1\cdot r_2)}{s}$,
       
   413 if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)
       
   414 and  $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$  ranges over 
       
   415 the substring of $s$?
       
   416 First let's look at a series of derivatives steps on a sequence 
       
   417 regular expression,  (assuming) that each time the first
       
   418 component of the sequence is always nullable):
       
   419 \begin{center}
       
   420 
       
   421 $r_1 \cdot r_2 \quad \longrightarrow_{\backslash c}  \quad   r_1  \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
       
   422 $((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c''   \longrightarrow_{\backslash c''} \quad
       
   423  \ldots$
       
   424 
       
   425 \end{center}
       
   426 %TODO: cite indian paper
       
   427 Indianpaper have  come up with a slightly more formal way of putting the above process:
       
   428 \begin{center}
       
   429 $r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) +
       
   430 \delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots
       
   431 + \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$
       
   432 \end{center}
       
   433 where  $\delta(b, r)$ will produce $r$
       
   434 if $b$ evaluates to true, 
       
   435 and $\ZERO$ otherwise.
       
   436 
       
   437  But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical
       
   438  equivalence. To make this intuition useful 
       
   439  for a formal proof, we need something
       
   440 stronger than language equivalence.
       
   441 With the help of $\sflat{\_}$ we can state the equation in Indianpaper
       
   442 more rigorously:
       
   443 \begin{lemma}
       
   444 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
       
   445 \end{lemma}
       
   446 
       
   447 The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
       
   448 
       
   449 \begin{center}
       
   450 \begin{tabular}{lcl}
       
   451 $\vsuf{[]}{\_} $ & $=$ &  $[]$\\
       
   452 $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
       
   453                                      && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) })  $
       
   454 \end{tabular}
       
   455 \end{center}                   
       
   456 It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable,
       
   457 and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in
       
   458 the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.
       
   459 In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing 
       
   460 the entire result of  $(r_1 \cdot r_2) \backslash s$, 
       
   461 it only stores all the $s''$ such that $r_2 \backslash s''$  are occurring terms in $(r_1\cdot r_2)\backslash s$.
       
   462 
       
   463 With this we can also add simplifications to both sides and get
       
   464 \begin{lemma}
       
   465 $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
       
   466 \end{lemma}
       
   467 Together with the idempotency property of $\rsimp$,
       
   468 %TODO: state the idempotency property of rsimp
       
   469 \begin{lemma}
       
   470 $\rsimp{r} = \rsimp{\rsimp{r}}$
       
   471 \end{lemma}
       
   472 it is possible to convert the above lemma to obtain a "closed form"
       
   473 for  derivatives nested with simplification:
       
   474 \begin{lemma}
       
   475 $\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
       
   476 \end{lemma}
       
   477 And now the reason we have (1) in section 1 is clear:
       
   478 $\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$, 
       
   479 is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$
       
   480     as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$.
       
   481 
       
   482 %----------------------------------------------------------------------------------------
       
   483 %	SECTION 3
       
   484 %----------------------------------------------------------------------------------------
       
   485 
       
   486 \section{interaction between $\distinctWith$ and $\flts$}
       
   487 Note that it is not immediately obvious that 
       
   488 \begin{center}
       
   489 $\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket   \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket  $.
       
   490 \end{center}
       
   491 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of 
       
   492 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. 
       
   493 
       
   494 
       
   495 %-----------------------------------
       
   496 %	SECTION syntactic equivalence under simp
       
   497 %-----------------------------------
       
   498 \section{Syntactic Equivalence Under $\simp$}
       
   499 We prove that minor differences can be annhilated
       
   500 by $\simp$.
       
   501 For example,
       
   502 \begin{center}
       
   503 $\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
       
   504  \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
       
   505 \end{center}
       
   506 
       
   507 
       
   508 %----------------------------------------------------------------------------------------
       
   509 %	SECTION ALTS CLOSED FORM
       
   510 %----------------------------------------------------------------------------------------
       
   511 \section{A Closed Form for \textit{ALTS}}
       
   512 Now we prove that  $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
       
   513 
       
   514 
       
   515 There are a few key steps, one of these steps is
       
   516 \[
       
   517 rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {})))
       
   518 = rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {}))
       
   519 \]
       
   520 
       
   521 
       
   522 One might want to prove this by something a simple statement like: 
       
   523 $map (rder x) (rdistinct rs rset) = rdistinct (map (rder x) rs) (rder x) ` rset)$.
       
   524 
       
   525 For this to hold we want the $\textit{distinct}$ function to pick up
       
   526 the elements before and after derivatives correctly:
       
   527 $r \in rset \equiv (rder x r) \in (rder x rset)$.
       
   528 which essentially requires that the function $\backslash$ is an injective mapping.
       
   529 
       
   530 Unfortunately the function $\backslash c$ is not an injective mapping.
       
   531 
       
   532 \subsection{function $\backslash c$ is not injective (1-to-1)}
       
   533 \begin{center}
       
   534 The derivative $w.r.t$ character $c$ is not one-to-one.
       
   535 Formally,
       
   536 	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
       
   537 \end{center}
       
   538 This property is trivially true for the
       
   539 character regex example:
       
   540 \begin{center}
       
   541 	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
       
   542 \end{center}
       
   543 But apart from the cases where the derivative
       
   544 output is $\ZERO$, are there non-trivial results
       
   545 of derivatives which contain strings?
       
   546 The answer is yes.
       
   547 For example,
       
   548 \begin{center}
       
   549 	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
       
   550 	where $a$ is not nullable.\\
       
   551 	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
       
   552 	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
       
   553 \end{center}
       
   554 We start with two syntactically different regexes,
       
   555 and end up with the same derivative result.
       
   556 This is not surprising as we have such 
       
   557 equality as below in the style of Arden's lemma:\\
       
   558 \begin{center}
       
   559 	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
       
   560 \end{center}
       
   561 
       
   562 %----------------------------------------------------------------------------------------
       
   563 %	SECTION 4
       
   564 %----------------------------------------------------------------------------------------
       
   565 \section{A Bound for the Star Regular Expression}
       
   566 We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
       
   567 the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then 
       
   568 the property of the $\distinct$ function.
       
   569 Now we try to get a bound on $r^* \backslash s$ as well.
       
   570 Again, we first look at how a star's derivatives evolve, if they grow maximally: 
       
   571 \begin{center}
       
   572 
       
   573 $r^* \quad \longrightarrow_{\backslash c}  \quad   (r\backslash c)  \cdot  r^* \quad \longrightarrow_{\backslash c'}  \quad
       
   574 r \backslash cc'  \cdot r^* + r \backslash c' \cdot r^*  \quad \longrightarrow_{\backslash c''} \quad 
       
   575 (r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)   \quad \longrightarrow_{\backslash c'''}
       
   576 \quad \ldots$
       
   577 
       
   578 \end{center}
       
   579 When we have a string $s = c :: c' :: c'' \ldots$  such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, 
       
   580 $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
       
   581 the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
       
   582 of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not 
       
   583 count the possible size explosions of $r \backslash c$ themselves.
       
   584 
       
   585 Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like
       
   586 $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
       
   587 into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
       
   588 and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
       
   589 For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$:
       
   590 %TODO: definitions of  and \hflataux \hflat
       
   591  \begin{center}  
       
   592  \begin{tabular}{ccc}
       
   593  $\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
       
   594 $\hflataux r$ & $=$ & $ [r]$
       
   595 \end{tabular}
       
   596 \end{center}
       
   597 
       
   598  \begin{center}  
       
   599  \begin{tabular}{ccc}
       
   600  $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\
       
   601 $\hflat r$ & $=$ & $ r$
       
   602 \end{tabular}
       
   603 \end{center}s
       
   604 Again these definitions are tailor-made for dealing with alternatives that have
       
   605 originated from a star's derivatives, so we don't attempt to open up all possible 
       
   606 regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
       
   607 elements.
       
   608 We give a predicate for such "star-created" regular expressions:
       
   609 \begin{center}
       
   610 \begin{tabular}{lcr}
       
   611          &    &       $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
       
   612  $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$
       
   613  \end{tabular}
       
   614  \end{center}
       
   615  
       
   616  These definitions allows us the flexibility to talk about 
       
   617  regular expressions in their most convenient format,
       
   618  for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $
       
   619  instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.
       
   620  These definitions help express that certain classes of syntatically 
       
   621  distinct regular expressions are actually the same under simplification.
       
   622  This is not entirely true for annotated regular expressions: 
       
   623  %TODO: bsimp bders \neq bderssimp
       
   624  \begin{center}
       
   625  $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
       
   626  \end{center}
       
   627  For bit-codes, the order in which simplification is applied
       
   628  might cause a difference in the location they are placed.
       
   629  If we want something like
       
   630  \begin{center}
       
   631  $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
       
   632  \end{center}
       
   633  Some "canonicalization" procedure is required,
       
   634  which either pushes all the common bitcodes to nodes
       
   635   as senior as possible:
       
   636   \begin{center}
       
   637   $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
       
   638   \end{center}
       
   639  or does the reverse. However bitcodes are not of interest if we are talking about
       
   640  the $\llbracket r \rrbracket$ size of a regex.
       
   641  Therefore for the ease and simplicity of producing a
       
   642  proof for a size bound, we are happy to restrict ourselves to 
       
   643  unannotated regular expressions, and obtain such equalities as
       
   644  \begin{lemma}
       
   645  $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
       
   646  \end{lemma}
       
   647  
       
   648  \begin{proof}
       
   649  By using the rewriting relation $\rightsquigarrow$
       
   650  \end{proof}
       
   651  %TODO: rsimp sflat
       
   652 And from this we obtain a proof that a star's derivative will be the same
       
   653 as if it had all its nested alternatives created during deriving being flattened out:
       
   654  For example,
       
   655  \begin{lemma}
       
   656  $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
       
   657  \end{lemma}
       
   658  \begin{proof}
       
   659  By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$.
       
   660  \end{proof}
       
   661 % The simplification of a flattened out regular expression, provided it comes
       
   662 %from the derivative of a star, is the same as the one nested.
       
   663  
       
   664  
       
   665  
       
   666  
       
   667  
       
   668  
       
   669  
       
   670  
       
   671  
       
   672 One might wonder the actual bound rather than the loose bound we gave
       
   673 for the convenience of an easier proof.
       
   674 How much can the regex $r^* \backslash s$ grow? 
       
   675 As  earlier graphs have shown,
       
   676 %TODO: reference that graph where size grows quickly
       
   677  they can grow at a maximum speed
       
   678   exponential $w.r.t$ the number of characters, 
       
   679 but will eventually level off when the string $s$ is long enough.
       
   680 If they grow to a size exponential $w.r.t$ the original regex, our algorithm
       
   681 would still be slow.
       
   682 And unfortunately, we have concrete examples
       
   683 where such regexes grew exponentially large before levelling off:
       
   684 $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
       
   685 (\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
       
   686  size that is  exponential on the number $n$ 
       
   687 under our current simplification rules:
       
   688 %TODO: graph of a regex whose size increases exponentially.
       
   689 \begin{center}
       
   690 \begin{tikzpicture}
       
   691     \begin{axis}[
       
   692         height=0.5\textwidth,
       
   693         width=\textwidth,
       
   694         xlabel=number of a's,
       
   695         xtick={0,...,9},
       
   696         ylabel=maximum size,
       
   697         ymode=log,
       
   698        log basis y={2}
       
   699 ]
       
   700         \addplot[mark=*,blue] table {re-chengsong.data};
       
   701     \end{axis}
       
   702 \end{tikzpicture}
       
   703 \end{center}
       
   704 
       
   705 For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
       
   706 to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
       
   707 (\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.
       
   708 The exponential size is triggered by that the regex
       
   709 $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
       
   710 inside the $(\ldots) ^*$ having exponentially many
       
   711 different derivatives, despite those difference being minor.
       
   712 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
       
   713 will therefore contain the following terms (after flattening out all nested 
       
   714 alternatives):
       
   715 \begin{center}
       
   716 $(\oplus_{i = 1]{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
       
   717 $(1 \leq m' \leq m )$
       
   718 \end{center}
       
   719 These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix).
       
   720  With each new input character taking the derivative against the intermediate result, more and more such distinct
       
   721  terms will accumulate, 
       
   722 until the length reaches $L.C.M.(1, \ldots, n)$.
       
   723 $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms 
       
   724 $(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
       
   725 
       
   726 $(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
       
   727  where $m' \neq m''$ \\
       
   728  as they are slightly different.
       
   729  This means that with our current simplification methods,
       
   730  we will not be able to control the derivative so that
       
   731  $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$
       
   732  as there are already exponentially many terms.
       
   733  These terms are similar in the sense that the head of those terms
       
   734  are all consisted of sub-terms of the form: 
       
   735  $(\underbrace{a \ldots a}_{\text{j a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* $.
       
   736  For  $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
       
   737  $n * (n + 1) / 2$ such terms. 
       
   738  For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives
       
   739  can be described by 6 terms:
       
   740  $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, 
       
   741  $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
       
   742 The total number of different "head terms",  $n * (n + 1) / 2$,
       
   743  is proportional to the number of characters in the regex 
       
   744 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
       
   745 This suggests a slightly different notion of size, which we call the 
       
   746 alphabetic width:
       
   747 %TODO:
       
   748 (TODO: Alphabetic width def.)
       
   749 
       
   750  
       
   751 Antimirov\parencite{Antimirov95} has proven that 
       
   752 $\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$.
       
   753 where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms
       
   754 created by doing derivatives of $r$ against all possible strings.
       
   755 If we can make sure that at any moment in our lexing algorithm our 
       
   756 intermediate result hold at most one copy of each of the 
       
   757 subterms then we can get the same bound as Antimirov's.
       
   758 This leads to the algorithm in the next chapter.
       
   759 
       
   760 
       
   761 
       
   762 
       
   763 
       
   764 %----------------------------------------------------------------------------------------
       
   765 %	SECTION 1
       
   766 %----------------------------------------------------------------------------------------
       
   767 
       
   768 \section{Idempotency of $\simp$}
       
   769 
       
   770 \begin{equation}
       
   771 	\simp \;r = \simp\; \simp \; r 
       
   772 \end{equation}
       
   773 This property means we do not have to repeatedly
       
   774 apply simplification in each step, which justifies
       
   775 our definition of $\blexersimp$.
       
   776 It will also be useful in future proofs where properties such as 
       
   777 closed forms are needed.
       
   778 The proof is by structural induction on $r$.
       
   779 
       
   780 %-----------------------------------
       
   781 %	SUBSECTION 1
       
   782 %-----------------------------------
       
   783 \subsection{Syntactic Equivalence Under $\simp$}
       
   784 We prove that minor differences can be annhilated
       
   785 by $\simp$.
       
   786 For example,
       
   787 \begin{center}
       
   788 $\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
       
   789  \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
       
   790 \end{center}
       
   791