--- a/ChengsongTanPhdThesis/Chapters/Chapter2.tex Fri May 20 18:52:03 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Chapter2.tex Thu May 26 20:51:40 2022 +0100
@@ -7,200 +7,16 @@
-%----------------------------------------------------------------------------------------
-% SECTION 1
-%----------------------------------------------------------------------------------------
-\section{Properties of $\backslash c$}
-
-To have a clear idea of what we can and
-need to prove about the algorithms involving
-Brzozowski's derivatives, there are a few
-properties we need to be clear about .
-\subsection{function $\backslash c$ is not 1-to-1}
-\begin{center}
-The derivative $w.r.t$ character $c$ is not one-to-one.
-Formally,
- $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
-\end{center}
-This property is trivially true for the
-character regex example:
-\begin{center}
- $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
-\end{center}
-But apart from the cases where the derivative
-output is $\ZERO$, are there non-trivial results
-of derivatives which contain strings?
-The answer is yes.
-For example,
-\begin{center}
- Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
- where $a$ is not nullable.\\
- $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
- $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
-\end{center}
-We start with two syntactically different regexes,
-and end up with the same derivative result, which is
-a "meaningful" regex because it contains strings.
-We have rediscovered Arden's lemma:\\
-\begin{center}
- $A^*B = A\cdot A^* \cdot B + B$
-\end{center}
-
-
%-----------------------------------
% SUBSECTION 1
%-----------------------------------
-\subsection{Subsection 1}
+\section{Specifications of Certain Functions to be Used}
To be completed.
-
-
-
-%-----------------------------------
-% SECTION 2
-%-----------------------------------
-
-\section{Finiteness Property}
-In this section let us sketch our argument for why the size of the simplified
-derivatives with the aggressive simplification function can be finitely bounded.
-
-For convenience, we use a new datatype which we call $\rrexp$ to denote
-the difference between it and annotated regular expressions.
-\[ \rrexp ::= \RZERO \mid \RONE
- \mid \RCHAR{c}
- \mid \RSEQ{r_1}{r_2}
- \mid \RALTS{rs}
- \mid \RSTAR{r}
-\]
-For $\rrexp$ we throw away the bitcodes on the annotated regular expressions,
-but keep everything else intact.
-It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
-(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
-$\ALTS$).
-We denote the operation of erasing the bits and turning an annotated regular expression
-into an $\rrexp{}$ as $\rerase{}$.
-\begin{center}
-\begin{tabular}{lcr}
-$\rerase{\AZERO}$ & $=$ & $\RZERO$\\
-$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
-$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
-\end{tabular}
-\end{center}
-%TODO: FINISH definition of rerase
-Similarly we could define the derivative and simplification on
-$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1,
-except that now they can operate on alternatives taking multiple arguments.
-
-\begin{center}
-\begin{tabular}{lcr}
-$\RALTS{rs} \backslash c$ & $=$ & $\RALTS{map\; (\_ \backslash c) \;rs}$\\
-(other clauses omitted)
-\end{tabular}
-\end{center}
-
-Now that $\rrexp$s do not have bitcodes on them, we can do the
-duplicate removal without an equivalence relation:
-\begin{center}
-\begin{tabular}{lcl}
-$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
- & & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
-\end{tabular}
-\end{center}
-%TODO: definition of rsimp (maybe only the alternative clause)
-
-
-The reason why these definitions mirror precisely the corresponding operations
-on annotated regualar expressions is that we can calculate sizes more easily:
-
-\begin{lemma}
-$\rsize{\rerase a} = \asize a$
-\end{lemma}
-A similar equation holds for annotated regular expressions' simplification
-and the plain regular expressions' simplification:
-\begin{lemma}
-$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
-\end{lemma}
-Putting these two together we get a property that allows us to estimate the
-size of an annotated regular expression derivative using its un-annotated counterpart:
-\begin{lemma}
-$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$
-\end{lemma}
-Unless stated otherwise in this chapter all $\textit{rexp}$s without
- bitcodes are seen as $\rrexp$s.
- We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
- as the former suits people's intuitive way of stating a binary alternative
- regular expression.
-
- Suppose
-we have a size function for bitcoded regular expressions, written
-$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
-(we omit the precise definition; ditto for lists $\llbracket r\!s\rrbracket$). For this we show that for every $r$
-there exists a bound $N$
-such that
-
-\begin{center}
-$\forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N$
-\end{center}
-
-\noindent
-We prove this by induction on $r$. The base cases for $\AZERO$,
-$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
-for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
-hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
-$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
-%
-\begin{center}
-\begin{tabular}{lcll}
-& & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
-& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
- [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
-& $\leq$ &
- $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
- [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
-& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
- \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
-& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
- \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
-& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
-\end{tabular}
-\end{center}
-
-% tell Chengsong about Indian paper of closed forms of derivatives
-
-\noindent
-where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
-The reason why we could write the derivatives of a sequence this way is described in section 2.
-The term (2) is used to control (1).
-That is because one can obtain an overall
-smaller regex list
-by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
-Section 3 is dedicated to its proof.
-In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is
-bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
-than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
-for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
-We reason similarly for $\STAR$.\medskip
-
-\noindent
-Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
-far from the actual bound we can expect. We can do better than this, but this does not improve
-the finiteness property we are proving. If we are interested in a polynomial bound,
-one would hope to obtain a similar tight bound as for partial
-derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
- $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in
-partial derivatives). Unfortunately to obtain the exact same bound would mean
-we need to introduce simplifications, such as
- $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
-which exist for partial derivatives. However, if we introduce them in our
-setting we would lose the POSIX property of our calculated values. We leave better
-bounds for future work.\\[-6.5mm]
-
-
-
%-----------------------------------
% SECTION ?
%-----------------------------------
@@ -247,9 +63,324 @@
\begin{lemma}
$r \rightarrow_f r' \implies \rsimp{r} \rightarrow \rsimp{r'}$
\end{lemma}
-.
+
Fortunately this is provable by induction on the list $rs$
+
+
+%-----------------------------------
+% SECTION 2
+%-----------------------------------
+
+\section{Finiteness Property}
+In this section let us describe our argument for why the size of the simplified
+derivatives with the aggressive simplification function is finitely bounded.
+ Suppose
+we have a size function for bitcoded regular expressions, written
+$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
+
+\begin{center}
+\begin{tabular}{ccc}
+$\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\
+\end{tabular}
+\end{center}
+(TODO: COMPLETE this defn and for $rs$)
+
+The size is based on a recursive function on the structure of the regex,
+not the bitcodes.
+Therefore we may as well talk about size of an annotated regular expression
+in their un-annotated form:
+\begin{center}
+$\asize(a) = \size(\erase(a))$.
+\end{center}
+(TODO: turn equals to roughly equals)
+
+But there is a minor nuisance here:
+the erase function actually messes with the structure of the regular expression:
+\begin{center}
+\begin{tabular}{ccc}
+$\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\
+\end{tabular}
+\end{center}
+(TODO: complete this definition with singleton r in alts)
+
+An alternative regular expression with an empty list of children
+ is turned into an $\ZERO$ during the
+$\erase$ function, thereby changing the size and structure of the regex.
+These will likely be fixable if we really want to use plain $\rexp$s for dealing
+with size, but we choose a more straightforward (or stupid) method by
+defining a new datatype that is similar to plain $\rexp$s but can take
+non-binary arguments for its alternative constructor,
+ which we call $\rrexp$ to denote
+the difference between it and plain regular expressions.
+\[ \rrexp ::= \RZERO \mid \RONE
+ \mid \RCHAR{c}
+ \mid \RSEQ{r_1}{r_2}
+ \mid \RALTS{rs}
+ \mid \RSTAR{r}
+\]
+For $\rrexp$ we throw away the bitcodes on the annotated regular expressions,
+but keep everything else intact.
+It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
+(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
+$\ALTS$).
+We denote the operation of erasing the bits and turning an annotated regular expression
+into an $\rrexp{}$ as $\rerase{}$.
+\begin{center}
+\begin{tabular}{lcl}
+$\rerase{\AZERO}$ & $=$ & $\RZERO$\\
+$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
+$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
+\end{tabular}
+\end{center}
+%TODO: FINISH definition of rerase
+Similarly we could define the derivative and simplification on
+$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1,
+except that now they can operate on alternatives taking multiple arguments.
+
+\begin{center}
+\begin{tabular}{lcr}
+$\RALTS{rs} \backslash c$ & $=$ & $\RALTS{map\; (\_ \backslash c) \;rs}$\\
+(other clauses omitted)
+\end{tabular}
+\end{center}
+
+Now that $\rrexp$s do not have bitcodes on them, we can do the
+duplicate removal without an equivalence relation:
+\begin{center}
+\begin{tabular}{lcl}
+$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
+ & & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
+\end{tabular}
+\end{center}
+%TODO: definition of rsimp (maybe only the alternative clause)
+
+
+The reason why these definitions mirror precisely the corresponding operations
+on annotated regualar expressions is that we can calculate sizes more easily:
+
+\begin{lemma}
+$\rsize{\rerase a} = \asize a$
+\end{lemma}
+This lemma says that the size of an r-erased regex is the same as the annotated regex.
+this does not hold for plain $\rexp$s due to their ways of how alternatives are represented.
+\begin{lemma}
+$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
+\end{lemma}
+Putting these two together we get a property that allows us to estimate the
+size of an annotated regular expression derivative using its un-annotated counterpart:
+\begin{lemma}
+$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$
+\end{lemma}
+Unless stated otherwise in this chapter all $\textit{rexp}$s without
+ bitcodes are seen as $\rrexp$s.
+ We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
+ as the former suits people's intuitive way of stating a binary alternative
+ regular expression.
+
+
+\begin{theorem}
+For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$
+\end{theorem}
+
+\noindent
+\begin{proof}
+We prove this by induction on $r$. The base cases for $\AZERO$,
+$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
+for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
+hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
+$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
+%
+\begin{center}
+\begin{tabular}{lcll}
+& & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
+& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
+ [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
+& $\leq$ &
+ $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
+ [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
+& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
+ \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
+& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
+ \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
+& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
+\end{tabular}
+\end{center}
+
+
+\noindent
+where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
+The reason why we could write the derivatives of a sequence this way is described in section 2.
+The term (2) is used to control (1).
+That is because one can obtain an overall
+smaller regex list
+by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
+Section 3 is dedicated to its proof.
+In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is
+bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
+than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
+for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
+We reason similarly for $\STAR$.\medskip
+\end{proof}
+
+What guarantee does this bound give us?
+
+Whatever the regex is, it will not grow indefinitely.
+Take our previous example $(a + aa)^*$ as an example:
+\begin{center}
+\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={number of $a$'s},
+ x label style={at={(1.05,-0.05)}},
+ ylabel={regex size},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax= 40,
+ ytick={0,10,...,40},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={$(a + aa)^*$},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
+\end{axis}
+\end{tikzpicture}
+\end{tabular}
+\end{center}
+We are able to limit the size of the regex $(a + aa)^*$'s derivatives
+ with our simplification
+rules very effectively.
+
+
+As the graphs of some more randomly generated regexes show, the size of
+the derivative might grow quickly at the start of the input,
+ but after a sufficiently long input string, the trend will stop.
+
+
+%a few sample regular experessions' derivatives
+%size change
+%TODO: giving graphs showing a few regexes' size changes
+%w;r;t the input characters number
+%a*, aa*, aaa*, .....
+%randomly generated regexes
+\begin{center}
+\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={number of $a$'s},
+ x label style={at={(1.05,-0.05)}},
+ ylabel={regex size},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=1000,
+ ytick={0,100,...,1000},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={regex1},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
+\end{axis}
+\end{tikzpicture}
+ &
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ %ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=1000,
+ ytick={0,100,...,1000},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={regex2},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
+\end{axis}
+\end{tikzpicture}
+ &
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ %ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=1000,
+ ytick={0,100,...,1000},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={regex3},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
+\end{axis}
+\end{tikzpicture}\\
+\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.}
+\end{tabular}
+\end{center}
+
+
+
+
+
+\noindent
+Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
+far from the actual bound we can expect.
+In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound
+is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.
+Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
+inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function
+$f(x) = x * 2^x$.
+This means the bound we have will surge up at least
+tower-exponentially with a linear increase of the depth.
+For a regex of depth $n$, the bound
+would be approximately $4^n$.
+%TODO: change this to tower exponential!
+
+We can do better than this, but this does not improve
+the finiteness property we are proving. If we are interested in a polynomial bound,
+one would hope to obtain a similar tight bound as for partial
+derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
+ $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in
+partial derivatives).
+To obtain the exact same bound would mean
+we need to introduce simplifications, such as
+ $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
+which exist for partial derivatives.
+
+However, if we introduce them in our
+setting we would lose the POSIX property of our calculated values.
+A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$.
+If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer
+would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of
+what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information
+in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$
+occurs, and apply them in the right order once we get
+a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value.
+This is unlike the simplification we had before, where the rewriting rules
+such as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value.
+We will discuss better
+bounds in the last section of this chapter.\\[-6.5mm]
+
+
+
+
%----------------------------------------------------------------------------------------
% SECTION ??
%----------------------------------------------------------------------------------------
@@ -369,10 +500,62 @@
The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of
duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$.
+
+%----------------------------------------------------------------------------------------
+% SECTION ALTS CLOSED FORM
+%----------------------------------------------------------------------------------------
+\section{A Closed Form for \textit{ALTS}}
+Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
+
+
+There are a few key steps, one of these steps is
+$rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {})))
+= rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {}))$
+
+One might want to prove this by something a simple statement like:
+$map (rder x) (rdistinct rs rset) = rdistinct (map (rder x) rs) (rder x) ` rset)$.
+
+For this to hold we want the $\textit{distinct}$ function to pick up
+the elements before and after derivatives correctly:
+$r \in rset \equiv (rder x r) \in (rder x rset)$.
+which essentially requires that the function $\backslash$ is an injective mapping.
+
+Unfortunately the function $\backslash c$ is not an injective mapping.
+
+\subsection{function $\backslash c$ is not injective (1-to-1)}
+\begin{center}
+The derivative $w.r.t$ character $c$ is not one-to-one.
+Formally,
+ $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
+\end{center}
+This property is trivially true for the
+character regex example:
+\begin{center}
+ $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
+\end{center}
+But apart from the cases where the derivative
+output is $\ZERO$, are there non-trivial results
+of derivatives which contain strings?
+The answer is yes.
+For example,
+\begin{center}
+ Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
+ where $a$ is not nullable.\\
+ $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
+ $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
+\end{center}
+We start with two syntactically different regexes,
+and end up with the same derivative result.
+This is not surprising as we have such
+equality as below in the style of Arden's lemma:\\
+\begin{center}
+ $L(A^*B) = L(A\cdot A^* \cdot B + B)$
+\end{center}
+
%----------------------------------------------------------------------------------------
% SECTION 4
%----------------------------------------------------------------------------------------
-\section{a bound for the star regular expression}
+\section{A Bound for the Star Regular Expression}
We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then
the property of the $\distinct$ function.
@@ -410,7 +593,7 @@
$\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\
$\hflat r$ & $=$ & $ r$
\end{tabular}
-\end{center}
+\end{center}s
Again these definitions are tailor-made for dealing with alternatives that have
originated from a star's derivatives, so we don't attempt to open up all possible
regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
@@ -452,7 +635,7 @@
proof for a size bound, we are happy to restrict ourselves to
unannotated regular expressions, and obtain such equalities as
\begin{lemma}
- $\rsimp{(r_1 + r_2)} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
+ $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
\end{lemma}
\begin{proof}
@@ -480,30 +663,104 @@
One might wonder the actual bound rather than the loose bound we gave
-for the convenience of a easier proof.
-How much can the regex $r^* \backslash s$ grow? As hinted earlier,
+for the convenience of an easier proof.
+How much can the regex $r^* \backslash s$ grow?
+As earlier graphs have shown,
+%TODO: reference that graph where size grows quickly
they can grow at a maximum speed
exponential $w.r.t$ the number of characters,
-but will eventually level off when the string $s$ is long enough,
-as suggested by the finiteness bound proof.
-And unfortunately we have concrete examples
+but will eventually level off when the string $s$ is long enough.
+If they grow to a size exponential $w.r.t$ the original regex, our algorithm
+would still be slow.
+And unfortunately, we have concrete examples
where such regexes grew exponentially large before levelling off:
$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
(\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
-size that is exponential on the number $n$.
-%TODO: give out a graph showing how the size of the regex grows and levels off at a size exponential to the original regex
-
+ size that is exponential on the number $n$
+under our current simplification rules:
+%TODO: graph of a regex whose size increases exponentially.
+\begin{center}
+\begin{tikzpicture}
+ \begin{axis}[
+ height=0.5\textwidth,
+ width=\textwidth,
+ xlabel=number of a's,
+ xtick={0,...,9},
+ ylabel=maximum size,
+ ymode=log,
+ log basis y={2}
+]
+ \addplot[mark=*,blue] table {re-chengsong.data};
+ \end{axis}
+\end{tikzpicture}
+\end{center}
-While the tight upper bound will definitely be a lot lower than
-the one we gave for the finiteness proof,
-it will still be at least expoential, under our current simplification rules.
+For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
+to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
+(\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.
+The exponential size is triggered by that the regex
+$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
+inside the $(\ldots) ^*$ having exponentially many
+different derivatives, despite those difference being minor.
+$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
+will therefore contain the following terms (after flattening out all nested
+alternatives):
+\begin{center}
+$(\oplus_{i = 1]{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
+$(1 \leq m' \leq m )$
+\end{center}
+These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix).
+ With each new input character taking the derivative against the intermediate result, more and more such distinct
+ terms will accumulate,
+until the length reaches $L.C.M.(1, \ldots, n)$.
+$\textit{distinctBy}$ will not be able to de-duplicate any two of these terms
+$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
-This suggests stronger simplifications that keep the tight bound polynomial.
+$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
+ where $m' \neq m''$ \\
+ as they are slightly different.
+ This means that with our current simplification methods,
+ we will not be able to control the derivative so that
+ $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$
+ as there are already exponentially many terms.
+ These terms are similar in the sense that the head of those terms
+ are all consisted of sub-terms of the form:
+ $(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $.
+ For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
+ $n * (n + 1) / 2$ such terms.
+ For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives
+ can be described by 6 terms:
+ $a^*$, $a\cdot (aa)^*$, $ (aa)^*$,
+ $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
+The total number of different "head terms", $n * (n + 1) / 2$,
+ is proportional to the number of characters in the regex
+$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
+This suggests a slightly different notion of size, which we call the
+alphabetic width:
+%TODO:
+(TODO: Alphabetic width def.)
+
+
+Antimirov\parencite{Antimirov95} has proven that
+$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$.
+where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms
+created by doing derivatives of $r$ against all possible strings.
+If we can make sure that at any moment in our lexing algorithm our
+intermediate result hold at most one copy of each of the
+subterms then we can get the same bound as Antimirov's.
+This leads to the algorithm in the next section.
%----------------------------------------------------------------------------------------
% SECTION 5
%----------------------------------------------------------------------------------------
-\section{A Stronger Version of Simplification}
+\section{A Stronger Version of Simplification Inspired by Antimirov}
%TODO: search for isabelle proofs of algorithms that check equivalence
+
+
+
+
+
+
+