--- a/ChengsongTanPhdThesis/Chapters/Chapter2.tex	Fri May 20 18:52:03 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Chapter2.tex	Thu May 26 20:51:40 2022 +0100
@@ -7,200 +7,16 @@
 
 
 
-%----------------------------------------------------------------------------------------
-%	SECTION 1
-%----------------------------------------------------------------------------------------
 
-\section{Properties of $\backslash c$}
-
-To have a clear idea of what we can and 
-need to prove about the algorithms involving
-Brzozowski's derivatives, there are a few 
-properties we need to be clear about .
-\subsection{function $\backslash c$ is not 1-to-1}
-\begin{center}
-The derivative $w.r.t$ character $c$ is not one-to-one.
-Formally,
-	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
-\end{center}
-This property is trivially true for the
-character regex example:
-\begin{center}
-	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
-\end{center}
-But apart from the cases where the derivative
-output is $\ZERO$, are there non-trivial results
-of derivatives which contain strings?
-The answer is yes.
-For example,
-\begin{center}
-	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
-	where $a$ is not nullable.\\
-	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
-	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
-\end{center}
-We start with two syntactically different regexes,
-and end up with the same derivative result, which is
-a "meaningful" regex because it contains strings.
-We have rediscovered Arden's lemma:\\
-\begin{center}
-	$A^*B = A\cdot A^* \cdot B + B$
-\end{center}
-
-	
 %-----------------------------------
 %	SUBSECTION 1
 %-----------------------------------
-\subsection{Subsection 1}
+\section{Specifications of Certain Functions to be Used}
 To be completed.
 
 
 
 
-
-
-
-%-----------------------------------
-%	SECTION 2
-%-----------------------------------
-
-\section{Finiteness Property}
-In this section let us sketch our argument for why the size of the simplified
-derivatives with the aggressive simplification function can be finitely bounded.
-
-For convenience, we use a new datatype which we call $\rrexp$ to denote
-the difference between it and annotated regular expressions. 
-\[			\rrexp ::=   \RZERO \mid  \RONE
-			 \mid  \RCHAR{c}  
-			 \mid  \RSEQ{r_1}{r_2}
-			 \mid  \RALTS{rs}
-			 \mid \RSTAR{r}        
-\]
-For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, 
-but keep everything else intact.
-It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
-(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
-$\ALTS$).
-We denote the operation of erasing the bits and turning an annotated regular expression 
-into an $\rrexp{}$ as $\rerase{}$.
-\begin{center}
-\begin{tabular}{lcr}
-$\rerase{\AZERO}$ & $=$ & $\RZERO$\\
-$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
-$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
-\end{tabular}
-\end{center}
-%TODO: FINISH definition of rerase
-Similarly we could define the derivative  and simplification on 
-$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, 
-except that now they can operate on alternatives taking multiple arguments.
-
-\begin{center}
-\begin{tabular}{lcr}
-$\RALTS{rs} \backslash c$ & $=$ &  $\RALTS{map\; (\_ \backslash c) \;rs}$\\
-(other clauses omitted)
-\end{tabular}
-\end{center}
-
-Now that $\rrexp$s do not have bitcodes on them, we can do the 
-duplicate removal without  an equivalence relation:
-\begin{center}
-\begin{tabular}{lcl}
-$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
-           			    &        & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
-\end{tabular}
-\end{center}
-%TODO: definition of rsimp (maybe only the alternative clause)
-
-
-The reason why these definitions  mirror precisely the corresponding operations
-on annotated regualar expressions is that we can calculate sizes more easily:
-
-\begin{lemma}
-$\rsize{\rerase a} = \asize a$
-\end{lemma}
-A similar equation holds for annotated regular expressions' simplification
-and the plain regular expressions' simplification:
-\begin{lemma}
-$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
-\end{lemma}
-Putting these two together we get a property that allows us to estimate the 
-size of an annotated regular expression derivative using its un-annotated counterpart:
-\begin{lemma}
-$\asize{\bderssimp{r}{s}} =  \rsize{\rderssimp{\rerase{r}}{s}}$
-\end{lemma}
-Unless stated otherwise in this chapter all $\textit{rexp}$s without
- bitcodes are seen as $\rrexp$s.
- We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
- as the former suits people's intuitive way of stating a binary alternative
- regular expression.
-
- Suppose
-we have a size function for bitcoded regular expressions, written
-$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
-(we omit the precise definition; ditto for lists $\llbracket r\!s\rrbracket$). For this we show that for every $r$
-there exists a bound $N$
-such that 
-
-\begin{center}
-$\forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N$
-\end{center}
-
-\noindent
-We prove this by induction on $r$. The base cases for $\AZERO$,
-$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
-for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
-hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
-$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
-%
-\begin{center}
-\begin{tabular}{lcll}
-& & $ \llbracket   \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
-& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
-    [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
-& $\leq$ &
-    $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
-    [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
-& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
-             \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
-& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
-      \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
-& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
-\end{tabular}
-\end{center}
-
-% tell Chengsong about Indian paper of closed forms of derivatives
-
-\noindent
-where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
-The reason why we could write the derivatives of a sequence this way is described in section 2.
-The term (2) is used to control (1). 
-That is because one can obtain an overall
-smaller regex list
-by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
-Section 3 is dedicated to its proof.
-In (3) we know that  $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is 
-bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
-than $N_2$. The list length after  $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
-for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
-We reason similarly for  $\STAR$.\medskip
-
-\noindent
-Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
-far from the actual bound we can expect. We can do better than this, but this does not improve
-the finiteness property we are proving. If we are interested in a polynomial bound,
-one would hope to obtain a similar tight bound as for partial
-derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
- $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in
-partial derivatives). Unfortunately to obtain the exact same bound would mean
-we need to introduce simplifications, such as
- $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
-which exist for partial derivatives. However, if we introduce them in our
-setting we would lose the POSIX property of our calculated values. We leave better
-bounds for future work.\\[-6.5mm]
-
-
-
 %-----------------------------------
 %	SECTION ?
 %-----------------------------------
@@ -247,9 +63,324 @@
 \begin{lemma}
 $r \rightarrow_f r'  \implies \rsimp{r} \rightarrow \rsimp{r'}$
 \end{lemma}
-.
+
 Fortunately this is provable by induction on the list $rs$
 
+
+
+%-----------------------------------
+%	SECTION 2
+%-----------------------------------
+
+\section{Finiteness Property}
+In this section let us describe our argument for why the size of the simplified
+derivatives with the aggressive simplification function is finitely bounded.
+ Suppose
+we have a size function for bitcoded regular expressions, written
+$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
+
+\begin{center}
+\begin{tabular}{ccc}
+$\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\
+\end{tabular}
+\end{center}
+(TODO: COMPLETE this defn and for $rs$)
+
+The size is based on a recursive function on the structure of the regex,
+not the bitcodes.
+Therefore we may as well talk about size of an annotated regular expression 
+in their un-annotated form:
+\begin{center}
+$\asize(a) = \size(\erase(a))$. 
+\end{center}
+(TODO: turn equals to roughly equals)
+
+But there is a minor nuisance here:
+the erase function actually messes with the structure of the regular expression:
+\begin{center}
+\begin{tabular}{ccc}
+$\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\
+\end{tabular}
+\end{center}
+(TODO: complete this definition with singleton r in alts)
+
+An alternative regular expression with an empty list of children
+ is turned into an $\ZERO$ during the
+$\erase$ function, thereby changing the size and structure of the regex.
+These will likely be fixable if we really want to use plain $\rexp$s for dealing
+with size, but we choose a more straightforward (or stupid) method by 
+defining a new datatype that is similar to plain $\rexp$s but can take
+non-binary arguments for its alternative constructor,
+ which we call $\rrexp$ to denote
+the difference between it and plain regular expressions. 
+\[			\rrexp ::=   \RZERO \mid  \RONE
+			 \mid  \RCHAR{c}  
+			 \mid  \RSEQ{r_1}{r_2}
+			 \mid  \RALTS{rs}
+			 \mid \RSTAR{r}        
+\]
+For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, 
+but keep everything else intact.
+It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
+(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
+$\ALTS$).
+We denote the operation of erasing the bits and turning an annotated regular expression 
+into an $\rrexp{}$ as $\rerase{}$.
+\begin{center}
+\begin{tabular}{lcl}
+$\rerase{\AZERO}$ & $=$ & $\RZERO$\\
+$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
+$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
+\end{tabular}
+\end{center}
+%TODO: FINISH definition of rerase
+Similarly we could define the derivative  and simplification on 
+$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, 
+except that now they can operate on alternatives taking multiple arguments.
+
+\begin{center}
+\begin{tabular}{lcr}
+$\RALTS{rs} \backslash c$ & $=$ &  $\RALTS{map\; (\_ \backslash c) \;rs}$\\
+(other clauses omitted)
+\end{tabular}
+\end{center}
+
+Now that $\rrexp$s do not have bitcodes on them, we can do the 
+duplicate removal without  an equivalence relation:
+\begin{center}
+\begin{tabular}{lcl}
+$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
+           			    &        & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
+\end{tabular}
+\end{center}
+%TODO: definition of rsimp (maybe only the alternative clause)
+
+
+The reason why these definitions  mirror precisely the corresponding operations
+on annotated regualar expressions is that we can calculate sizes more easily:
+
+\begin{lemma}
+$\rsize{\rerase a} = \asize a$
+\end{lemma}
+This lemma says that the size of an r-erased regex is the same as the annotated regex.
+this does not hold for plain $\rexp$s due to their ways of how alternatives are represented.
+\begin{lemma}
+$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
+\end{lemma}
+Putting these two together we get a property that allows us to estimate the 
+size of an annotated regular expression derivative using its un-annotated counterpart:
+\begin{lemma}
+$\asize{\bderssimp{r}{s}} =  \rsize{\rderssimp{\rerase{r}}{s}}$
+\end{lemma}
+Unless stated otherwise in this chapter all $\textit{rexp}$s without
+ bitcodes are seen as $\rrexp$s.
+ We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
+ as the former suits people's intuitive way of stating a binary alternative
+ regular expression.
+
+
+\begin{theorem}
+For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$
+\end{theorem}
+
+\noindent
+\begin{proof}
+We prove this by induction on $r$. The base cases for $\AZERO$,
+$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
+for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
+hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
+$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
+%
+\begin{center}
+\begin{tabular}{lcll}
+& & $ \llbracket   \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
+& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
+    [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
+& $\leq$ &
+    $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
+    [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
+& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
+             \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
+& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
+      \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
+& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
+\end{tabular}
+\end{center}
+
+
+\noindent
+where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
+The reason why we could write the derivatives of a sequence this way is described in section 2.
+The term (2) is used to control (1). 
+That is because one can obtain an overall
+smaller regex list
+by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
+Section 3 is dedicated to its proof.
+In (3) we know that  $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is 
+bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
+than $N_2$. The list length after  $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
+for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
+We reason similarly for  $\STAR$.\medskip
+\end{proof}
+
+What guarantee does this bound give us?
+
+Whatever the regex is, it will not grow indefinitely.
+Take our previous example $(a + aa)^*$ as an example:
+\begin{center}
+\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={number of $a$'s},
+    x label style={at={(1.05,-0.05)}},
+    ylabel={regex size},
+    enlargelimits=false,
+    xtick={0,5,...,30},
+    xmax=33,
+    ymax= 40,
+    ytick={0,10,...,40},
+    scaled ticks=false,
+    axis lines=left,
+    width=5cm,
+    height=4cm, 
+    legend entries={$(a + aa)^*$},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
+\end{axis}
+\end{tikzpicture}
+\end{tabular}
+\end{center}
+We are able to limit the size of the regex $(a + aa)^*$'s derivatives
+ with our simplification
+rules very effectively.
+
+
+As the graphs of  some more randomly generated regexes show, the size of 
+the derivative might grow quickly at the start of the input,
+ but after a sufficiently long  input string, the trend will stop.
+
+
+%a few sample regular experessions' derivatives
+%size change
+%TODO: giving graphs showing a few regexes' size changes 
+%w;r;t the input characters number
+%a*, aa*, aaa*, .....
+%randomly generated regexes
+\begin{center}
+\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={number of $a$'s},
+    x label style={at={(1.05,-0.05)}},
+    ylabel={regex size},
+    enlargelimits=false,
+    xtick={0,5,...,30},
+    xmax=33,
+    ymax=1000,
+    ytick={0,100,...,1000},
+    scaled ticks=false,
+    axis lines=left,
+    width=5cm,
+    height=4cm, 
+    legend entries={regex1},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
+\end{axis}
+\end{tikzpicture}
+  &
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={$n$},
+    x label style={at={(1.05,-0.05)}},
+    %ylabel={time in secs},
+    enlargelimits=false,
+    xtick={0,5,...,30},
+    xmax=33,
+    ymax=1000,
+    ytick={0,100,...,1000},
+    scaled ticks=false,
+    axis lines=left,
+    width=5cm,
+    height=4cm, 
+    legend entries={regex2},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
+\end{axis}
+\end{tikzpicture}
+  &
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={$n$},
+    x label style={at={(1.05,-0.05)}},
+    %ylabel={time in secs},
+    enlargelimits=false,
+    xtick={0,5,...,30},
+    xmax=33,
+    ymax=1000,
+    ytick={0,100,...,1000},
+    scaled ticks=false,
+    axis lines=left,
+    width=5cm,
+    height=4cm, 
+    legend entries={regex3},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
+\end{axis}
+\end{tikzpicture}\\
+\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.}
+\end{tabular}    
+\end{center}  
+
+
+
+
+
+\noindent
+Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
+far from the actual bound we can expect. 
+In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound
+is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.
+Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
+inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function
+$f(x) = x * 2^x$.
+This means the bound we have will surge up at least
+tower-exponentially with a linear increase of the depth.
+For a regex of depth $n$, the bound
+would be approximately $4^n$.
+%TODO: change this to tower exponential!
+
+We can do better than this, but this does not improve
+the finiteness property we are proving. If we are interested in a polynomial bound,
+one would hope to obtain a similar tight bound as for partial
+derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
+ $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in
+partial derivatives). 
+To obtain the exact same bound would mean
+we need to introduce simplifications, such as
+ $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
+which exist for partial derivatives. 
+
+However, if we introduce them in our
+setting we would lose the POSIX property of our calculated values. 
+A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$.
+If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer 
+would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of
+what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information
+in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$
+occurs, and apply them in the right order once we get 
+a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value.
+This is unlike the simplification we had before, where the rewriting rules 
+such  as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value.
+We will discuss better
+bounds in the last section of this chapter.\\[-6.5mm]
+
+
+
+
 %----------------------------------------------------------------------------------------
 %	SECTION ??
 %----------------------------------------------------------------------------------------
@@ -369,10 +500,62 @@
 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of 
 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. 
 
+
+%----------------------------------------------------------------------------------------
+%	SECTION ALTS CLOSED FORM
+%----------------------------------------------------------------------------------------
+\section{A Closed Form for \textit{ALTS}}
+Now we prove that  $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
+
+
+There are a few key steps, one of these steps is
+$rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {})))
+= rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {}))$
+
+One might want to prove this by something a simple statement like: 
+$map (rder x) (rdistinct rs rset) = rdistinct (map (rder x) rs) (rder x) ` rset)$.
+
+For this to hold we want the $\textit{distinct}$ function to pick up
+the elements before and after derivatives correctly:
+$r \in rset \equiv (rder x r) \in (rder x rset)$.
+which essentially requires that the function $\backslash$ is an injective mapping.
+
+Unfortunately the function $\backslash c$ is not an injective mapping.
+
+\subsection{function $\backslash c$ is not injective (1-to-1)}
+\begin{center}
+The derivative $w.r.t$ character $c$ is not one-to-one.
+Formally,
+	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
+\end{center}
+This property is trivially true for the
+character regex example:
+\begin{center}
+	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
+\end{center}
+But apart from the cases where the derivative
+output is $\ZERO$, are there non-trivial results
+of derivatives which contain strings?
+The answer is yes.
+For example,
+\begin{center}
+	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
+	where $a$ is not nullable.\\
+	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
+	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
+\end{center}
+We start with two syntactically different regexes,
+and end up with the same derivative result.
+This is not surprising as we have such 
+equality as below in the style of Arden's lemma:\\
+\begin{center}
+	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
+\end{center}
+
 %----------------------------------------------------------------------------------------
 %	SECTION 4
 %----------------------------------------------------------------------------------------
-\section{a bound for the star regular expression}
+\section{A Bound for the Star Regular Expression}
 We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
 the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then 
 the property of the $\distinct$ function.
@@ -410,7 +593,7 @@
  $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\
 $\hflat r$ & $=$ & $ r$
 \end{tabular}
-\end{center}
+\end{center}s
 Again these definitions are tailor-made for dealing with alternatives that have
 originated from a star's derivatives, so we don't attempt to open up all possible 
 regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
@@ -452,7 +635,7 @@
  proof for a size bound, we are happy to restrict ourselves to 
  unannotated regular expressions, and obtain such equalities as
  \begin{lemma}
- $\rsimp{(r_1 + r_2)} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
+ $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
  \end{lemma}
  
  \begin{proof}
@@ -480,30 +663,104 @@
  
  
 One might wonder the actual bound rather than the loose bound we gave
-for the convenience of a easier proof.
-How much can the regex $r^* \backslash s$ grow? As hinted earlier,
+for the convenience of an easier proof.
+How much can the regex $r^* \backslash s$ grow? 
+As  earlier graphs have shown,
+%TODO: reference that graph where size grows quickly
  they can grow at a maximum speed
   exponential $w.r.t$ the number of characters, 
-but will eventually level off when the string $s$ is long enough,
-as suggested by the finiteness bound proof.
-And unfortunately we have concrete examples
+but will eventually level off when the string $s$ is long enough.
+If they grow to a size exponential $w.r.t$ the original regex, our algorithm
+would still be slow.
+And unfortunately, we have concrete examples
 where such regexes grew exponentially large before levelling off:
 $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
 (\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
-size that is  exponential on the number $n$.
-%TODO: give out a graph showing how the size of the regex grows and levels off at a size exponential to the original regex
-
+ size that is  exponential on the number $n$ 
+under our current simplification rules:
+%TODO: graph of a regex whose size increases exponentially.
+\begin{center}
+\begin{tikzpicture}
+    \begin{axis}[
+        height=0.5\textwidth,
+        width=\textwidth,
+        xlabel=number of a's,
+        xtick={0,...,9},
+        ylabel=maximum size,
+        ymode=log,
+       log basis y={2}
+]
+        \addplot[mark=*,blue] table {re-chengsong.data};
+    \end{axis}
+\end{tikzpicture}
+\end{center}
 
-While the tight upper bound will definitely be a lot lower than 
-the one we gave for the finiteness proof, 
-it will still be at least expoential, under our current simplification rules.
+For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
+to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
+(\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.
+The exponential size is triggered by that the regex
+$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
+inside the $(\ldots) ^*$ having exponentially many
+different derivatives, despite those difference being minor.
+$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
+will therefore contain the following terms (after flattening out all nested 
+alternatives):
+\begin{center}
+$(\oplus_{i = 1]{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
+$(1 \leq m' \leq m )$
+\end{center}
+These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix).
+ With each new input character taking the derivative against the intermediate result, more and more such distinct
+ terms will accumulate, 
+until the length reaches $L.C.M.(1, \ldots, n)$.
+$\textit{distinctBy}$ will not be able to de-duplicate any two of these terms 
+$(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
 
-This suggests stronger simplifications that keep the tight bound polynomial.
+$(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
+ where $m' \neq m''$ \\
+ as they are slightly different.
+ This means that with our current simplification methods,
+ we will not be able to control the derivative so that
+ $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$
+ as there are already exponentially many terms.
+ These terms are similar in the sense that the head of those terms
+ are all consisted of sub-terms of the form: 
+ $(\underbrace{a \ldots a}_{\text{j a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* $.
+ For  $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
+ $n * (n + 1) / 2$ such terms. 
+ For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives
+ can be described by 6 terms:
+ $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, 
+ $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
+The total number of different "head terms",  $n * (n + 1) / 2$,
+ is proportional to the number of characters in the regex 
+$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
+This suggests a slightly different notion of size, which we call the 
+alphabetic width:
+%TODO:
+(TODO: Alphabetic width def.)
+
+ 
+Antimirov\parencite{Antimirov95} has proven that 
+$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$.
+where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms
+created by doing derivatives of $r$ against all possible strings.
+If we can make sure that at any moment in our lexing algorithm our 
+intermediate result hold at most one copy of each of the 
+subterms then we can get the same bound as Antimirov's.
+This leads to the algorithm in the next section.
 
 %----------------------------------------------------------------------------------------
 %	SECTION 5
 %----------------------------------------------------------------------------------------
-\section{A Stronger Version of Simplification}
+\section{A Stronger Version of Simplification Inspired by Antimirov}
 %TODO: search for isabelle proofs of algorithms that check equivalence 
 
 
+
+
+
+
+
+
+
changeset 518	ff7945a988a3
parent 516	6fecb7fe8cd0
child 519	856d025dbc15