ChengsongTanPhdThesis/Chapters/Chapter2.tex
changeset 518 ff7945a988a3
parent 516 6fecb7fe8cd0
child 519 856d025dbc15
equal deleted inserted replaced
517:3c5e58d08939 518:ff7945a988a3
     5 \label{ChapterX} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
     5 \label{ChapterX} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
     6 
     6 
     7 
     7 
     8 
     8 
     9 
     9 
    10 %----------------------------------------------------------------------------------------
    10 
    11 %	SECTION 1
       
    12 %----------------------------------------------------------------------------------------
       
    13 
       
    14 \section{Properties of $\backslash c$}
       
    15 
       
    16 To have a clear idea of what we can and 
       
    17 need to prove about the algorithms involving
       
    18 Brzozowski's derivatives, there are a few 
       
    19 properties we need to be clear about .
       
    20 \subsection{function $\backslash c$ is not 1-to-1}
       
    21 \begin{center}
       
    22 The derivative $w.r.t$ character $c$ is not one-to-one.
       
    23 Formally,
       
    24 	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
       
    25 \end{center}
       
    26 This property is trivially true for the
       
    27 character regex example:
       
    28 \begin{center}
       
    29 	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
       
    30 \end{center}
       
    31 But apart from the cases where the derivative
       
    32 output is $\ZERO$, are there non-trivial results
       
    33 of derivatives which contain strings?
       
    34 The answer is yes.
       
    35 For example,
       
    36 \begin{center}
       
    37 	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
       
    38 	where $a$ is not nullable.\\
       
    39 	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
       
    40 	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
       
    41 \end{center}
       
    42 We start with two syntactically different regexes,
       
    43 and end up with the same derivative result, which is
       
    44 a "meaningful" regex because it contains strings.
       
    45 We have rediscovered Arden's lemma:\\
       
    46 \begin{center}
       
    47 	$A^*B = A\cdot A^* \cdot B + B$
       
    48 \end{center}
       
    49 
       
    50 	
       
    51 %-----------------------------------
    11 %-----------------------------------
    52 %	SUBSECTION 1
    12 %	SUBSECTION 1
    53 %-----------------------------------
    13 %-----------------------------------
    54 \subsection{Subsection 1}
    14 \section{Specifications of Certain Functions to be Used}
    55 To be completed.
    15 To be completed.
    56 
    16 
    57 
    17 
    58 
    18 
    59 
    19 
       
    20 %-----------------------------------
       
    21 %	SECTION ?
       
    22 %-----------------------------------
       
    23 \section{preparatory properties for getting a finiteness bound}
       
    24 Before we get to the proof that says the intermediate result of our lexer will
       
    25 remain finitely bounded, which is an important efficiency/liveness guarantee,
       
    26 we shall first develop a few preparatory properties and definitions to 
       
    27 make the process of proving that a breeze.
       
    28 
       
    29 We define rewriting relations for $\rrexp$s, which allows us to do the 
       
    30 same trick as we did for the correctness proof,
       
    31 but this time we will have stronger equalities established.
       
    32 \subsection{"hrewrite" relation}
       
    33 List of 1-step rewrite rules for regular expressions simplification without bitcodes:
       
    34 \begin{center}
       
    35 \begin{tabular}{lcl}
       
    36 $r_1 \cdot \ZERO$ & $\rightarrow$ & $\ZERO$
       
    37 \end{tabular}
       
    38 \end{center}
       
    39 
       
    40 And we define an "grewrite" relation that works on lists:
       
    41 \begin{center}
       
    42 \begin{tabular}{lcl}
       
    43 $ \ZERO :: rs$ & $\rightarrow_g$ & $rs$
       
    44 \end{tabular}
       
    45 \end{center}
       
    46 
       
    47 
       
    48 
       
    49 With these relations we prove
       
    50 \begin{lemma}
       
    51 $rs \rightarrow rs'  \implies \RALTS{rs} \rightarrow \RALTS{rs'}$
       
    52 \end{lemma}
       
    53 which enables us to prove "closed forms" equalities such as
       
    54 \begin{lemma}
       
    55 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
       
    56 \end{lemma}
       
    57 
       
    58 But the most involved part of the above lemma is proving the following:
       
    59 \begin{lemma}
       
    60 $\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ 
       
    61 \end{lemma}
       
    62 which is needed in proving things like 
       
    63 \begin{lemma}
       
    64 $r \rightarrow_f r'  \implies \rsimp{r} \rightarrow \rsimp{r'}$
       
    65 \end{lemma}
       
    66 
       
    67 Fortunately this is provable by induction on the list $rs$
    60 
    68 
    61 
    69 
    62 
    70 
    63 %-----------------------------------
    71 %-----------------------------------
    64 %	SECTION 2
    72 %	SECTION 2
    65 %-----------------------------------
    73 %-----------------------------------
    66 
    74 
    67 \section{Finiteness Property}
    75 \section{Finiteness Property}
    68 In this section let us sketch our argument for why the size of the simplified
    76 In this section let us describe our argument for why the size of the simplified
    69 derivatives with the aggressive simplification function can be finitely bounded.
    77 derivatives with the aggressive simplification function is finitely bounded.
    70 
    78  Suppose
    71 For convenience, we use a new datatype which we call $\rrexp$ to denote
    79 we have a size function for bitcoded regular expressions, written
    72 the difference between it and annotated regular expressions. 
    80 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
       
    81 
       
    82 \begin{center}
       
    83 \begin{tabular}{ccc}
       
    84 $\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\
       
    85 \end{tabular}
       
    86 \end{center}
       
    87 (TODO: COMPLETE this defn and for $rs$)
       
    88 
       
    89 The size is based on a recursive function on the structure of the regex,
       
    90 not the bitcodes.
       
    91 Therefore we may as well talk about size of an annotated regular expression 
       
    92 in their un-annotated form:
       
    93 \begin{center}
       
    94 $\asize(a) = \size(\erase(a))$. 
       
    95 \end{center}
       
    96 (TODO: turn equals to roughly equals)
       
    97 
       
    98 But there is a minor nuisance here:
       
    99 the erase function actually messes with the structure of the regular expression:
       
   100 \begin{center}
       
   101 \begin{tabular}{ccc}
       
   102 $\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\
       
   103 \end{tabular}
       
   104 \end{center}
       
   105 (TODO: complete this definition with singleton r in alts)
       
   106 
       
   107 An alternative regular expression with an empty list of children
       
   108  is turned into an $\ZERO$ during the
       
   109 $\erase$ function, thereby changing the size and structure of the regex.
       
   110 These will likely be fixable if we really want to use plain $\rexp$s for dealing
       
   111 with size, but we choose a more straightforward (or stupid) method by 
       
   112 defining a new datatype that is similar to plain $\rexp$s but can take
       
   113 non-binary arguments for its alternative constructor,
       
   114  which we call $\rrexp$ to denote
       
   115 the difference between it and plain regular expressions. 
    73 \[			\rrexp ::=   \RZERO \mid  \RONE
   116 \[			\rrexp ::=   \RZERO \mid  \RONE
    74 			 \mid  \RCHAR{c}  
   117 			 \mid  \RCHAR{c}  
    75 			 \mid  \RSEQ{r_1}{r_2}
   118 			 \mid  \RSEQ{r_1}{r_2}
    76 			 \mid  \RALTS{rs}
   119 			 \mid  \RALTS{rs}
    77 			 \mid \RSTAR{r}        
   120 			 \mid \RSTAR{r}        
    82 (the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
   125 (the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
    83 $\ALTS$).
   126 $\ALTS$).
    84 We denote the operation of erasing the bits and turning an annotated regular expression 
   127 We denote the operation of erasing the bits and turning an annotated regular expression 
    85 into an $\rrexp{}$ as $\rerase{}$.
   128 into an $\rrexp{}$ as $\rerase{}$.
    86 \begin{center}
   129 \begin{center}
    87 \begin{tabular}{lcr}
   130 \begin{tabular}{lcl}
    88 $\rerase{\AZERO}$ & $=$ & $\RZERO$\\
   131 $\rerase{\AZERO}$ & $=$ & $\RZERO$\\
    89 $\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
   132 $\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
    90 $\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
   133 $\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
    91 \end{tabular}
   134 \end{tabular}
    92 \end{center}
   135 \end{center}
   117 on annotated regualar expressions is that we can calculate sizes more easily:
   160 on annotated regualar expressions is that we can calculate sizes more easily:
   118 
   161 
   119 \begin{lemma}
   162 \begin{lemma}
   120 $\rsize{\rerase a} = \asize a$
   163 $\rsize{\rerase a} = \asize a$
   121 \end{lemma}
   164 \end{lemma}
   122 A similar equation holds for annotated regular expressions' simplification
   165 This lemma says that the size of an r-erased regex is the same as the annotated regex.
   123 and the plain regular expressions' simplification:
   166 this does not hold for plain $\rexp$s due to their ways of how alternatives are represented.
   124 \begin{lemma}
   167 \begin{lemma}
   125 $\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
   168 $\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
   126 \end{lemma}
   169 \end{lemma}
   127 Putting these two together we get a property that allows us to estimate the 
   170 Putting these two together we get a property that allows us to estimate the 
   128 size of an annotated regular expression derivative using its un-annotated counterpart:
   171 size of an annotated regular expression derivative using its un-annotated counterpart:
   133  bitcodes are seen as $\rrexp$s.
   176  bitcodes are seen as $\rrexp$s.
   134  We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
   177  We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
   135  as the former suits people's intuitive way of stating a binary alternative
   178  as the former suits people's intuitive way of stating a binary alternative
   136  regular expression.
   179  regular expression.
   137 
   180 
   138  Suppose
   181 
   139 we have a size function for bitcoded regular expressions, written
   182 \begin{theorem}
   140 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
   183 For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$
   141 (we omit the precise definition; ditto for lists $\llbracket r\!s\rrbracket$). For this we show that for every $r$
   184 \end{theorem}
   142 there exists a bound $N$
       
   143 such that 
       
   144 
       
   145 \begin{center}
       
   146 $\forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N$
       
   147 \end{center}
       
   148 
   185 
   149 \noindent
   186 \noindent
       
   187 \begin{proof}
   150 We prove this by induction on $r$. The base cases for $\AZERO$,
   188 We prove this by induction on $r$. The base cases for $\AZERO$,
   151 $\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
   189 $\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
   152 for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
   190 for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
   153 hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
   191 hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
   154 $\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
   192 $\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
   167       \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
   205       \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
   168 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
   206 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
   169 \end{tabular}
   207 \end{tabular}
   170 \end{center}
   208 \end{center}
   171 
   209 
   172 % tell Chengsong about Indian paper of closed forms of derivatives
       
   173 
   210 
   174 \noindent
   211 \noindent
   175 where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
   212 where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
   176 The reason why we could write the derivatives of a sequence this way is described in section 2.
   213 The reason why we could write the derivatives of a sequence this way is described in section 2.
   177 The term (2) is used to control (1). 
   214 The term (2) is used to control (1). 
   182 In (3) we know that  $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is 
   219 In (3) we know that  $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is 
   183 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
   220 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
   184 than $N_2$. The list length after  $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
   221 than $N_2$. The list length after  $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
   185 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
   222 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
   186 We reason similarly for  $\STAR$.\medskip
   223 We reason similarly for  $\STAR$.\medskip
       
   224 \end{proof}
       
   225 
       
   226 What guarantee does this bound give us?
       
   227 
       
   228 Whatever the regex is, it will not grow indefinitely.
       
   229 Take our previous example $(a + aa)^*$ as an example:
       
   230 \begin{center}
       
   231 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
   232 \begin{tikzpicture}
       
   233 \begin{axis}[
       
   234     xlabel={number of $a$'s},
       
   235     x label style={at={(1.05,-0.05)}},
       
   236     ylabel={regex size},
       
   237     enlargelimits=false,
       
   238     xtick={0,5,...,30},
       
   239     xmax=33,
       
   240     ymax= 40,
       
   241     ytick={0,10,...,40},
       
   242     scaled ticks=false,
       
   243     axis lines=left,
       
   244     width=5cm,
       
   245     height=4cm, 
       
   246     legend entries={$(a + aa)^*$},  
       
   247     legend pos=north west,
       
   248     legend cell align=left]
       
   249 \addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
       
   250 \end{axis}
       
   251 \end{tikzpicture}
       
   252 \end{tabular}
       
   253 \end{center}
       
   254 We are able to limit the size of the regex $(a + aa)^*$'s derivatives
       
   255  with our simplification
       
   256 rules very effectively.
       
   257 
       
   258 
       
   259 As the graphs of  some more randomly generated regexes show, the size of 
       
   260 the derivative might grow quickly at the start of the input,
       
   261  but after a sufficiently long  input string, the trend will stop.
       
   262 
       
   263 
       
   264 %a few sample regular experessions' derivatives
       
   265 %size change
       
   266 %TODO: giving graphs showing a few regexes' size changes 
       
   267 %w;r;t the input characters number
       
   268 %a*, aa*, aaa*, .....
       
   269 %randomly generated regexes
       
   270 \begin{center}
       
   271 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
   272 \begin{tikzpicture}
       
   273 \begin{axis}[
       
   274     xlabel={number of $a$'s},
       
   275     x label style={at={(1.05,-0.05)}},
       
   276     ylabel={regex size},
       
   277     enlargelimits=false,
       
   278     xtick={0,5,...,30},
       
   279     xmax=33,
       
   280     ymax=1000,
       
   281     ytick={0,100,...,1000},
       
   282     scaled ticks=false,
       
   283     axis lines=left,
       
   284     width=5cm,
       
   285     height=4cm, 
       
   286     legend entries={regex1},  
       
   287     legend pos=north west,
       
   288     legend cell align=left]
       
   289 \addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
       
   290 \end{axis}
       
   291 \end{tikzpicture}
       
   292   &
       
   293 \begin{tikzpicture}
       
   294 \begin{axis}[
       
   295     xlabel={$n$},
       
   296     x label style={at={(1.05,-0.05)}},
       
   297     %ylabel={time in secs},
       
   298     enlargelimits=false,
       
   299     xtick={0,5,...,30},
       
   300     xmax=33,
       
   301     ymax=1000,
       
   302     ytick={0,100,...,1000},
       
   303     scaled ticks=false,
       
   304     axis lines=left,
       
   305     width=5cm,
       
   306     height=4cm, 
       
   307     legend entries={regex2},  
       
   308     legend pos=north west,
       
   309     legend cell align=left]
       
   310 \addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
       
   311 \end{axis}
       
   312 \end{tikzpicture}
       
   313   &
       
   314 \begin{tikzpicture}
       
   315 \begin{axis}[
       
   316     xlabel={$n$},
       
   317     x label style={at={(1.05,-0.05)}},
       
   318     %ylabel={time in secs},
       
   319     enlargelimits=false,
       
   320     xtick={0,5,...,30},
       
   321     xmax=33,
       
   322     ymax=1000,
       
   323     ytick={0,100,...,1000},
       
   324     scaled ticks=false,
       
   325     axis lines=left,
       
   326     width=5cm,
       
   327     height=4cm, 
       
   328     legend entries={regex3},  
       
   329     legend pos=north west,
       
   330     legend cell align=left]
       
   331 \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
       
   332 \end{axis}
       
   333 \end{tikzpicture}\\
       
   334 \multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.}
       
   335 \end{tabular}    
       
   336 \end{center}  
       
   337 
       
   338 
       
   339 
       
   340 
   187 
   341 
   188 \noindent
   342 \noindent
   189 Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
   343 Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
   190 far from the actual bound we can expect. We can do better than this, but this does not improve
   344 far from the actual bound we can expect. 
       
   345 In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound
       
   346 is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.
       
   347 Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
       
   348 inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function
       
   349 $f(x) = x * 2^x$.
       
   350 This means the bound we have will surge up at least
       
   351 tower-exponentially with a linear increase of the depth.
       
   352 For a regex of depth $n$, the bound
       
   353 would be approximately $4^n$.
       
   354 %TODO: change this to tower exponential!
       
   355 
       
   356 We can do better than this, but this does not improve
   191 the finiteness property we are proving. If we are interested in a polynomial bound,
   357 the finiteness property we are proving. If we are interested in a polynomial bound,
   192 one would hope to obtain a similar tight bound as for partial
   358 one would hope to obtain a similar tight bound as for partial
   193 derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
   359 derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
   194  $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in
   360  $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in
   195 partial derivatives). Unfortunately to obtain the exact same bound would mean
   361 partial derivatives). 
       
   362 To obtain the exact same bound would mean
   196 we need to introduce simplifications, such as
   363 we need to introduce simplifications, such as
   197  $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
   364  $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
   198 which exist for partial derivatives. However, if we introduce them in our
   365 which exist for partial derivatives. 
   199 setting we would lose the POSIX property of our calculated values. We leave better
   366 
   200 bounds for future work.\\[-6.5mm]
   367 However, if we introduce them in our
   201 
   368 setting we would lose the POSIX property of our calculated values. 
   202 
   369 A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$.
   203 
   370 If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer 
   204 %-----------------------------------
   371 would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of
   205 %	SECTION ?
   372 what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information
   206 %-----------------------------------
   373 in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$
   207 \section{preparatory properties for getting a finiteness bound}
   374 occurs, and apply them in the right order once we get 
   208 Before we get to the proof that says the intermediate result of our lexer will
   375 a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value.
   209 remain finitely bounded, which is an important efficiency/liveness guarantee,
   376 This is unlike the simplification we had before, where the rewriting rules 
   210 we shall first develop a few preparatory properties and definitions to 
   377 such  as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value.
   211 make the process of proving that a breeze.
   378 We will discuss better
   212 
   379 bounds in the last section of this chapter.\\[-6.5mm]
   213 We define rewriting relations for $\rrexp$s, which allows us to do the 
   380 
   214 same trick as we did for the correctness proof,
   381 
   215 but this time we will have stronger equalities established.
   382 
   216 \subsection{"hrewrite" relation}
       
   217 List of 1-step rewrite rules for regular expressions simplification without bitcodes:
       
   218 \begin{center}
       
   219 \begin{tabular}{lcl}
       
   220 $r_1 \cdot \ZERO$ & $\rightarrow$ & $\ZERO$
       
   221 \end{tabular}
       
   222 \end{center}
       
   223 
       
   224 And we define an "grewrite" relation that works on lists:
       
   225 \begin{center}
       
   226 \begin{tabular}{lcl}
       
   227 $ \ZERO :: rs$ & $\rightarrow_g$ & $rs$
       
   228 \end{tabular}
       
   229 \end{center}
       
   230 
       
   231 
       
   232 
       
   233 With these relations we prove
       
   234 \begin{lemma}
       
   235 $rs \rightarrow rs'  \implies \RALTS{rs} \rightarrow \RALTS{rs'}$
       
   236 \end{lemma}
       
   237 which enables us to prove "closed forms" equalities such as
       
   238 \begin{lemma}
       
   239 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
       
   240 \end{lemma}
       
   241 
       
   242 But the most involved part of the above lemma is proving the following:
       
   243 \begin{lemma}
       
   244 $\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ 
       
   245 \end{lemma}
       
   246 which is needed in proving things like 
       
   247 \begin{lemma}
       
   248 $r \rightarrow_f r'  \implies \rsimp{r} \rightarrow \rsimp{r'}$
       
   249 \end{lemma}
       
   250 .
       
   251 Fortunately this is provable by induction on the list $rs$
       
   252 
   383 
   253 %----------------------------------------------------------------------------------------
   384 %----------------------------------------------------------------------------------------
   254 %	SECTION ??
   385 %	SECTION ??
   255 %----------------------------------------------------------------------------------------
   386 %----------------------------------------------------------------------------------------
   256 
   387 
   367 $\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket   \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket  $.
   498 $\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket   \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket  $.
   368 \end{center}
   499 \end{center}
   369 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of 
   500 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of 
   370 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. 
   501 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. 
   371 
   502 
       
   503 
       
   504 %----------------------------------------------------------------------------------------
       
   505 %	SECTION ALTS CLOSED FORM
       
   506 %----------------------------------------------------------------------------------------
       
   507 \section{A Closed Form for \textit{ALTS}}
       
   508 Now we prove that  $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
       
   509 
       
   510 
       
   511 There are a few key steps, one of these steps is
       
   512 $rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {})))
       
   513 = rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {}))$
       
   514 
       
   515 One might want to prove this by something a simple statement like: 
       
   516 $map (rder x) (rdistinct rs rset) = rdistinct (map (rder x) rs) (rder x) ` rset)$.
       
   517 
       
   518 For this to hold we want the $\textit{distinct}$ function to pick up
       
   519 the elements before and after derivatives correctly:
       
   520 $r \in rset \equiv (rder x r) \in (rder x rset)$.
       
   521 which essentially requires that the function $\backslash$ is an injective mapping.
       
   522 
       
   523 Unfortunately the function $\backslash c$ is not an injective mapping.
       
   524 
       
   525 \subsection{function $\backslash c$ is not injective (1-to-1)}
       
   526 \begin{center}
       
   527 The derivative $w.r.t$ character $c$ is not one-to-one.
       
   528 Formally,
       
   529 	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
       
   530 \end{center}
       
   531 This property is trivially true for the
       
   532 character regex example:
       
   533 \begin{center}
       
   534 	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
       
   535 \end{center}
       
   536 But apart from the cases where the derivative
       
   537 output is $\ZERO$, are there non-trivial results
       
   538 of derivatives which contain strings?
       
   539 The answer is yes.
       
   540 For example,
       
   541 \begin{center}
       
   542 	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
       
   543 	where $a$ is not nullable.\\
       
   544 	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
       
   545 	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
       
   546 \end{center}
       
   547 We start with two syntactically different regexes,
       
   548 and end up with the same derivative result.
       
   549 This is not surprising as we have such 
       
   550 equality as below in the style of Arden's lemma:\\
       
   551 \begin{center}
       
   552 	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
       
   553 \end{center}
       
   554 
   372 %----------------------------------------------------------------------------------------
   555 %----------------------------------------------------------------------------------------
   373 %	SECTION 4
   556 %	SECTION 4
   374 %----------------------------------------------------------------------------------------
   557 %----------------------------------------------------------------------------------------
   375 \section{a bound for the star regular expression}
   558 \section{A Bound for the Star Regular Expression}
   376 We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
   559 We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
   377 the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then 
   560 the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then 
   378 the property of the $\distinct$ function.
   561 the property of the $\distinct$ function.
   379 Now we try to get a bound on $r^* \backslash s$ as well.
   562 Now we try to get a bound on $r^* \backslash s$ as well.
   380 Again, we first look at how a star's derivatives evolve, if they grow maximally: 
   563 Again, we first look at how a star's derivatives evolve, if they grow maximally: 
   408  \begin{center}  
   591  \begin{center}  
   409  \begin{tabular}{ccc}
   592  \begin{tabular}{ccc}
   410  $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\
   593  $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\
   411 $\hflat r$ & $=$ & $ r$
   594 $\hflat r$ & $=$ & $ r$
   412 \end{tabular}
   595 \end{tabular}
   413 \end{center}
   596 \end{center}s
   414 Again these definitions are tailor-made for dealing with alternatives that have
   597 Again these definitions are tailor-made for dealing with alternatives that have
   415 originated from a star's derivatives, so we don't attempt to open up all possible 
   598 originated from a star's derivatives, so we don't attempt to open up all possible 
   416 regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
   599 regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
   417 elements.
   600 elements.
   418 We give a predicate for such "star-created" regular expressions:
   601 We give a predicate for such "star-created" regular expressions:
   450  the $\llbracket r \rrbracket$ size of a regex.
   633  the $\llbracket r \rrbracket$ size of a regex.
   451  Therefore for the ease and simplicity of producing a
   634  Therefore for the ease and simplicity of producing a
   452  proof for a size bound, we are happy to restrict ourselves to 
   635  proof for a size bound, we are happy to restrict ourselves to 
   453  unannotated regular expressions, and obtain such equalities as
   636  unannotated regular expressions, and obtain such equalities as
   454  \begin{lemma}
   637  \begin{lemma}
   455  $\rsimp{(r_1 + r_2)} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
   638  $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
   456  \end{lemma}
   639  \end{lemma}
   457  
   640  
   458  \begin{proof}
   641  \begin{proof}
   459  By using the rewriting relation $\rightsquigarrow$
   642  By using the rewriting relation $\rightsquigarrow$
   460  \end{proof}
   643  \end{proof}
   478  
   661  
   479  
   662  
   480  
   663  
   481  
   664  
   482 One might wonder the actual bound rather than the loose bound we gave
   665 One might wonder the actual bound rather than the loose bound we gave
   483 for the convenience of a easier proof.
   666 for the convenience of an easier proof.
   484 How much can the regex $r^* \backslash s$ grow? As hinted earlier,
   667 How much can the regex $r^* \backslash s$ grow? 
       
   668 As  earlier graphs have shown,
       
   669 %TODO: reference that graph where size grows quickly
   485  they can grow at a maximum speed
   670  they can grow at a maximum speed
   486   exponential $w.r.t$ the number of characters, 
   671   exponential $w.r.t$ the number of characters, 
   487 but will eventually level off when the string $s$ is long enough,
   672 but will eventually level off when the string $s$ is long enough.
   488 as suggested by the finiteness bound proof.
   673 If they grow to a size exponential $w.r.t$ the original regex, our algorithm
   489 And unfortunately we have concrete examples
   674 would still be slow.
       
   675 And unfortunately, we have concrete examples
   490 where such regexes grew exponentially large before levelling off:
   676 where such regexes grew exponentially large before levelling off:
   491 $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
   677 $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
   492 (\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
   678 (\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
   493 size that is  exponential on the number $n$.
   679  size that is  exponential on the number $n$ 
   494 %TODO: give out a graph showing how the size of the regex grows and levels off at a size exponential to the original regex
   680 under our current simplification rules:
   495 
   681 %TODO: graph of a regex whose size increases exponentially.
   496 
   682 \begin{center}
   497 While the tight upper bound will definitely be a lot lower than 
   683 \begin{tikzpicture}
   498 the one we gave for the finiteness proof, 
   684     \begin{axis}[
   499 it will still be at least expoential, under our current simplification rules.
   685         height=0.5\textwidth,
   500 
   686         width=\textwidth,
   501 This suggests stronger simplifications that keep the tight bound polynomial.
   687         xlabel=number of a's,
       
   688         xtick={0,...,9},
       
   689         ylabel=maximum size,
       
   690         ymode=log,
       
   691        log basis y={2}
       
   692 ]
       
   693         \addplot[mark=*,blue] table {re-chengsong.data};
       
   694     \end{axis}
       
   695 \end{tikzpicture}
       
   696 \end{center}
       
   697 
       
   698 For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
       
   699 to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
       
   700 (\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.
       
   701 The exponential size is triggered by that the regex
       
   702 $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
       
   703 inside the $(\ldots) ^*$ having exponentially many
       
   704 different derivatives, despite those difference being minor.
       
   705 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
       
   706 will therefore contain the following terms (after flattening out all nested 
       
   707 alternatives):
       
   708 \begin{center}
       
   709 $(\oplus_{i = 1]{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
       
   710 $(1 \leq m' \leq m )$
       
   711 \end{center}
       
   712 These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix).
       
   713  With each new input character taking the derivative against the intermediate result, more and more such distinct
       
   714  terms will accumulate, 
       
   715 until the length reaches $L.C.M.(1, \ldots, n)$.
       
   716 $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms 
       
   717 $(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
       
   718 
       
   719 $(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
       
   720  where $m' \neq m''$ \\
       
   721  as they are slightly different.
       
   722  This means that with our current simplification methods,
       
   723  we will not be able to control the derivative so that
       
   724  $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$
       
   725  as there are already exponentially many terms.
       
   726  These terms are similar in the sense that the head of those terms
       
   727  are all consisted of sub-terms of the form: 
       
   728  $(\underbrace{a \ldots a}_{\text{j a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* $.
       
   729  For  $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
       
   730  $n * (n + 1) / 2$ such terms. 
       
   731  For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives
       
   732  can be described by 6 terms:
       
   733  $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, 
       
   734  $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
       
   735 The total number of different "head terms",  $n * (n + 1) / 2$,
       
   736  is proportional to the number of characters in the regex 
       
   737 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
       
   738 This suggests a slightly different notion of size, which we call the 
       
   739 alphabetic width:
       
   740 %TODO:
       
   741 (TODO: Alphabetic width def.)
       
   742 
       
   743  
       
   744 Antimirov\parencite{Antimirov95} has proven that 
       
   745 $\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$.
       
   746 where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms
       
   747 created by doing derivatives of $r$ against all possible strings.
       
   748 If we can make sure that at any moment in our lexing algorithm our 
       
   749 intermediate result hold at most one copy of each of the 
       
   750 subterms then we can get the same bound as Antimirov's.
       
   751 This leads to the algorithm in the next section.
   502 
   752 
   503 %----------------------------------------------------------------------------------------
   753 %----------------------------------------------------------------------------------------
   504 %	SECTION 5
   754 %	SECTION 5
   505 %----------------------------------------------------------------------------------------
   755 %----------------------------------------------------------------------------------------
   506 \section{A Stronger Version of Simplification}
   756 \section{A Stronger Version of Simplification Inspired by Antimirov}
   507 %TODO: search for isabelle proofs of algorithms that check equivalence 
   757 %TODO: search for isabelle proofs of algorithms that check equivalence 
   508 
   758 
   509 
   759 
       
   760 
       
   761 
       
   762 
       
   763 
       
   764 
       
   765 
       
   766