lexing: comparison ChengsongTanPhdThesis/Chapters/Chapter2.tex

equal deleted inserted replaced

-:3c5e58d08939
+:ff7945a988a3
 \label{ChapterX} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
-%----------------------------------------------------------------------------------------
-%	SECTION 1
-%----------------------------------------------------------------------------------------
-\section{Properties of $\backslash c$}
-To have a clear idea of what we can and
-need to prove about the algorithms involving
-Brzozowski's derivatives, there are a few
-properties we need to be clear about .
-\subsection{function $\backslash c$ is not 1-to-1}
-\begin{center}
-The derivative $w.r.t$ character $c$ is not one-to-one.
-Formally,
-	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
-\end{center}
-This property is trivially true for the
-character regex example:
-\begin{center}
-	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
-\end{center}
-But apart from the cases where the derivative
-output is $\ZERO$, are there non-trivial results
-of derivatives which contain strings?
-The answer is yes.
-For example,
-\begin{center}
-	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
-	where $a$ is not nullable.\\
-	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
-	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
-\end{center}
-We start with two syntactically different regexes,
-and end up with the same derivative result, which is
-a "meaningful" regex because it contains strings.
-We have rediscovered Arden's lemma:\\
-\begin{center}
-	$A^*B = A\cdot A^* \cdot B + B$
-\end{center}
 %-----------------------------------
 %	SUBSECTION 1
 %-----------------------------------
-\subsection{Subsection 1}
+\section{Specifications of Certain Functions to be Used}
 To be completed.
+%-----------------------------------
+%	SECTION ?
+%-----------------------------------
+\section{preparatory properties for getting a finiteness bound}
+Before we get to the proof that says the intermediate result of our lexer will
+remain finitely bounded, which is an important efficiency/liveness guarantee,
+we shall first develop a few preparatory properties and definitions to
+make the process of proving that a breeze.
+We define rewriting relations for $\rrexp$s, which allows us to do the
+same trick as we did for the correctness proof,
+but this time we will have stronger equalities established.
+\subsection{"hrewrite" relation}
+List of 1-step rewrite rules for regular expressions simplification without bitcodes:
+\begin{center}
+\begin{tabular}{lcl}
+$r_1 \cdot \ZERO$ & $\rightarrow$ & $\ZERO$
+\end{tabular}
+\end{center}
+And we define an "grewrite" relation that works on lists:
+\begin{center}
+\begin{tabular}{lcl}
+$ \ZERO :: rs$ & $\rightarrow_g$ & $rs$
+\end{tabular}
+\end{center}
+With these relations we prove
+\begin{lemma}
+$rs \rightarrow rs'  \implies \RALTS{rs} \rightarrow \RALTS{rs'}$
+\end{lemma}
+which enables us to prove "closed forms" equalities such as
+\begin{lemma}
+$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
+\end{lemma}
+But the most involved part of the above lemma is proving the following:
+\begin{lemma}
+$\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $
+\end{lemma}
+which is needed in proving things like
+\begin{lemma}
+$r \rightarrow_f r'  \implies \rsimp{r} \rightarrow \rsimp{r'}$
+\end{lemma}
+Fortunately this is provable by induction on the list $rs$
 %-----------------------------------
 %	SECTION 2
 %-----------------------------------
 \section{Finiteness Property}
-In this section let us sketch our argument for why the size of the simplified
+In this section let us describe our argument for why the size of the simplified
-derivatives with the aggressive simplification function can be finitely bounded.
+derivatives with the aggressive simplification function is finitely bounded.
+Suppose
-For convenience, we use a new datatype which we call $\rrexp$ to denote
+we have a size function for bitcoded regular expressions, written
-the difference between it and annotated regular expressions.
+$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
+\begin{center}
+\begin{tabular}{ccc}
+$\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\
+\end{tabular}
+\end{center}
+(TODO: COMPLETE this defn and for $rs$)
+The size is based on a recursive function on the structure of the regex,
+not the bitcodes.
+Therefore we may as well talk about size of an annotated regular expression
+in their un-annotated form:
+\begin{center}
+$\asize(a) = \size(\erase(a))$.
+\end{center}
+(TODO: turn equals to roughly equals)
+But there is a minor nuisance here:
+the erase function actually messes with the structure of the regular expression:
+\begin{center}
+\begin{tabular}{ccc}
+$\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\
+\end{tabular}
+\end{center}
+(TODO: complete this definition with singleton r in alts)
+An alternative regular expression with an empty list of children
+is turned into an $\ZERO$ during the
+$\erase$ function, thereby changing the size and structure of the regex.
+These will likely be fixable if we really want to use plain $\rexp$s for dealing
+with size, but we choose a more straightforward (or stupid) method by
+defining a new datatype that is similar to plain $\rexp$s but can take
+non-binary arguments for its alternative constructor,
+which we call $\rrexp$ to denote
+the difference between it and plain regular expressions.
 \[			\rrexp ::=   \RZERO \mid  \RONE
 			 \mid  \RCHAR{c}
 			 \mid  \RSEQ{r_1}{r_2}
 			 \mid  \RALTS{rs}
 			 \mid \RSTAR{r}
 (the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
 $\ALTS$).
 We denote the operation of erasing the bits and turning an annotated regular expression
 into an $\rrexp{}$ as $\rerase{}$.
 \begin{center}
-\begin{tabular}{lcr}
+\begin{tabular}{lcl}
 $\rerase{\AZERO}$ & $=$ & $\RZERO$\\
 $\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
 $\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
 \end{tabular}
 \end{center}
 on annotated regualar expressions is that we can calculate sizes more easily:
 \begin{lemma}
 $\rsize{\rerase a} = \asize a$
 \end{lemma}
-A similar equation holds for annotated regular expressions' simplification
+This lemma says that the size of an r-erased regex is the same as the annotated regex.
-and the plain regular expressions' simplification:
+this does not hold for plain $\rexp$s due to their ways of how alternatives are represented.
 \begin{lemma}
 $\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
 \end{lemma}
 Putting these two together we get a property that allows us to estimate the
 size of an annotated regular expression derivative using its un-annotated counterpart:
 bitcodes are seen as $\rrexp$s.
 We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
 as the former suits people's intuitive way of stating a binary alternative
 regular expression.
-Suppose
-we have a size function for bitcoded regular expressions, written
+\begin{theorem}
-$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
+For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$
-(we omit the precise definition; ditto for lists $\llbracket r\!s\rrbracket$). For this we show that for every $r$
+\end{theorem}
-there exists a bound $N$
-such that
-\begin{center}
-$\forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N$
-\end{center}
 \noindent
+\begin{proof}
 We prove this by induction on $r$. The base cases for $\AZERO$,
 $\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
 for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
 hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
 $\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
 \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
 \end{tabular}
 \end{center}
-% tell Chengsong about Indian paper of closed forms of derivatives
 \noindent
 where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
 The reason why we could write the derivatives of a sequence this way is described in section 2.
 The term (2) is used to control (1).
 In (3) we know that  $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is
 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
 than $N_2$. The list length after  $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
 We reason similarly for  $\STAR$.\medskip
+\end{proof}
+What guarantee does this bound give us?
+Whatever the regex is, it will not grow indefinitely.
+Take our previous example $(a + aa)^*$ as an example:
+\begin{center}
+\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+xlabel={number of $a$'s},
+x label style={at={(1.05,-0.05)}},
+ylabel={regex size},
+enlargelimits=false,
+xtick={0,5,...,30},
+xmax=33,
+ymax= 40,
+ytick={0,10,...,40},
+scaled ticks=false,
+axis lines=left,
+width=5cm,
+height=4cm,
+legend entries={$(a + aa)^*$},
+legend pos=north west,
+legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
+\end{axis}
+\end{tikzpicture}
+\end{tabular}
+\end{center}
+We are able to limit the size of the regex $(a + aa)^*$'s derivatives
+with our simplification
+rules very effectively.
+As the graphs of  some more randomly generated regexes show, the size of
+the derivative might grow quickly at the start of the input,
+but after a sufficiently long  input string, the trend will stop.
+%a few sample regular experessions' derivatives
+%size change
+%TODO: giving graphs showing a few regexes' size changes
+%w;r;t the input characters number
+%a*, aa*, aaa*, .....
+%randomly generated regexes
+\begin{center}
+\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+xlabel={number of $a$'s},
+x label style={at={(1.05,-0.05)}},
+ylabel={regex size},
+enlargelimits=false,
+xtick={0,5,...,30},
+xmax=33,
+ymax=1000,
+ytick={0,100,...,1000},
+scaled ticks=false,
+axis lines=left,
+width=5cm,
+height=4cm,
+legend entries={regex1},
+legend pos=north west,
+legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
+\end{axis}
+\end{tikzpicture}
+&
+\begin{tikzpicture}
+\begin{axis}[
+xlabel={$n$},
+x label style={at={(1.05,-0.05)}},
+%ylabel={time in secs},
+enlargelimits=false,
+xtick={0,5,...,30},
+xmax=33,
+ymax=1000,
+ytick={0,100,...,1000},
+scaled ticks=false,
+axis lines=left,
+width=5cm,
+height=4cm,
+legend entries={regex2},
+legend pos=north west,
+legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
+\end{axis}
+\end{tikzpicture}
+&
+\begin{tikzpicture}
+\begin{axis}[
+xlabel={$n$},
+x label style={at={(1.05,-0.05)}},
+%ylabel={time in secs},
+enlargelimits=false,
+xtick={0,5,...,30},
+xmax=33,
+ymax=1000,
+ytick={0,100,...,1000},
+scaled ticks=false,
+axis lines=left,
+width=5cm,
+height=4cm,
+legend entries={regex3},
+legend pos=north west,
+legend cell align=left]
+\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
+\end{axis}
+\end{tikzpicture}\\
+\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.}
+\end{tabular}
+\end{center}
 \noindent
 Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
-far from the actual bound we can expect. We can do better than this, but this does not improve
+far from the actual bound we can expect.
+In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound
+is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.
+Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
+inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function
+$f(x) = x * 2^x$.
+This means the bound we have will surge up at least
+tower-exponentially with a linear increase of the depth.
+For a regex of depth $n$, the bound
+would be approximately $4^n$.
+%TODO: change this to tower exponential!
+We can do better than this, but this does not improve
 the finiteness property we are proving. If we are interested in a polynomial bound,
 one would hope to obtain a similar tight bound as for partial
 derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
 $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in
-partial derivatives). Unfortunately to obtain the exact same bound would mean
+partial derivatives).
+To obtain the exact same bound would mean
 we need to introduce simplifications, such as
 $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
-which exist for partial derivatives. However, if we introduce them in our
+which exist for partial derivatives.
-setting we would lose the POSIX property of our calculated values. We leave better
-bounds for future work.\\[-6.5mm]
+However, if we introduce them in our
+setting we would lose the POSIX property of our calculated values.
+A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$.
+If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer
-%-----------------------------------
+would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of
-%	SECTION ?
+what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information
-%-----------------------------------
+in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$
-\section{preparatory properties for getting a finiteness bound}
+occurs, and apply them in the right order once we get
-Before we get to the proof that says the intermediate result of our lexer will
+a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value.
-remain finitely bounded, which is an important efficiency/liveness guarantee,
+This is unlike the simplification we had before, where the rewriting rules
-we shall first develop a few preparatory properties and definitions to
+such  as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value.
-make the process of proving that a breeze.
+We will discuss better
+bounds in the last section of this chapter.\\[-6.5mm]
-We define rewriting relations for $\rrexp$s, which allows us to do the
-same trick as we did for the correctness proof,
-but this time we will have stronger equalities established.
-\subsection{"hrewrite" relation}
-List of 1-step rewrite rules for regular expressions simplification without bitcodes:
-\begin{center}
-\begin{tabular}{lcl}
-$r_1 \cdot \ZERO$ & $\rightarrow$ & $\ZERO$
-\end{tabular}
-\end{center}
-And we define an "grewrite" relation that works on lists:
-\begin{center}
-\begin{tabular}{lcl}
-$ \ZERO :: rs$ & $\rightarrow_g$ & $rs$
-\end{tabular}
-\end{center}
-With these relations we prove
-\begin{lemma}
-$rs \rightarrow rs'  \implies \RALTS{rs} \rightarrow \RALTS{rs'}$
-\end{lemma}
-which enables us to prove "closed forms" equalities such as
-\begin{lemma}
-$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
-\end{lemma}
-But the most involved part of the above lemma is proving the following:
-\begin{lemma}
-$\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $
-\end{lemma}
-which is needed in proving things like
-\begin{lemma}
-$r \rightarrow_f r'  \implies \rsimp{r} \rightarrow \rsimp{r'}$
-\end{lemma}
-.
-Fortunately this is provable by induction on the list $rs$
 %----------------------------------------------------------------------------------------
 %	SECTION ??
 %----------------------------------------------------------------------------------------
 $\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket   \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket  $.
 \end{center}
 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of
 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$.
+%----------------------------------------------------------------------------------------
+%	SECTION ALTS CLOSED FORM
+%----------------------------------------------------------------------------------------
+\section{A Closed Form for \textit{ALTS}}
+Now we prove that  $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
+There are a few key steps, one of these steps is
+$rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {})))
+= rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {}))$
+One might want to prove this by something a simple statement like:
+$map (rder x) (rdistinct rs rset) = rdistinct (map (rder x) rs) (rder x) ` rset)$.
+For this to hold we want the $\textit{distinct}$ function to pick up
+the elements before and after derivatives correctly:
+$r \in rset \equiv (rder x r) \in (rder x rset)$.
+which essentially requires that the function $\backslash$ is an injective mapping.
+Unfortunately the function $\backslash c$ is not an injective mapping.
+\subsection{function $\backslash c$ is not injective (1-to-1)}
+\begin{center}
+The derivative $w.r.t$ character $c$ is not one-to-one.
+Formally,
+	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
+\end{center}
+This property is trivially true for the
+character regex example:
+\begin{center}
+	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
+\end{center}
+But apart from the cases where the derivative
+output is $\ZERO$, are there non-trivial results
+of derivatives which contain strings?
+The answer is yes.
+For example,
+\begin{center}
+	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
+	where $a$ is not nullable.\\
+	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
+	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
+\end{center}
+We start with two syntactically different regexes,
+and end up with the same derivative result.
+This is not surprising as we have such
+equality as below in the style of Arden's lemma:\\
+\begin{center}
+	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
+\end{center}
 %----------------------------------------------------------------------------------------
 %	SECTION 4
 %----------------------------------------------------------------------------------------
-\section{a bound for the star regular expression}
+\section{A Bound for the Star Regular Expression}
 We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
 the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then
 the property of the $\distinct$ function.
 Now we try to get a bound on $r^* \backslash s$ as well.
 Again, we first look at how a star's derivatives evolve, if they grow maximally:
 \begin{center}
 \begin{tabular}{ccc}
 $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\
 $\hflat r$ & $=$ & $ r$
 \end{tabular}
-\end{center}
+\end{center}s
 Again these definitions are tailor-made for dealing with alternatives that have
 originated from a star's derivatives, so we don't attempt to open up all possible
 regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
 elements.
 We give a predicate for such "star-created" regular expressions:
 the $\llbracket r \rrbracket$ size of a regex.
 Therefore for the ease and simplicity of producing a
 proof for a size bound, we are happy to restrict ourselves to
 unannotated regular expressions, and obtain such equalities as
 \begin{lemma}
-$\rsimp{(r_1 + r_2)} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
+$\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
 \end{lemma}
 \begin{proof}
 By using the rewriting relation $\rightsquigarrow$
 \end{proof}
 One might wonder the actual bound rather than the loose bound we gave
-for the convenience of a easier proof.
+for the convenience of an easier proof.
-How much can the regex $r^* \backslash s$ grow? As hinted earlier,
+How much can the regex $r^* \backslash s$ grow?
+As  earlier graphs have shown,
+%TODO: reference that graph where size grows quickly
 they can grow at a maximum speed
 exponential $w.r.t$ the number of characters,
-but will eventually level off when the string $s$ is long enough,
+but will eventually level off when the string $s$ is long enough.
-as suggested by the finiteness bound proof.
+If they grow to a size exponential $w.r.t$ the original regex, our algorithm
-And unfortunately we have concrete examples
+would still be slow.
+And unfortunately, we have concrete examples
 where such regexes grew exponentially large before levelling off:
 $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
 (\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
-size that is  exponential on the number $n$.
+size that is  exponential on the number $n$
-%TODO: give out a graph showing how the size of the regex grows and levels off at a size exponential to the original regex
+under our current simplification rules:
+%TODO: graph of a regex whose size increases exponentially.
+\begin{center}
-While the tight upper bound will definitely be a lot lower than
+\begin{tikzpicture}
-the one we gave for the finiteness proof,
+\begin{axis}[
-it will still be at least expoential, under our current simplification rules.
+height=0.5\textwidth,
+width=\textwidth,
-This suggests stronger simplifications that keep the tight bound polynomial.
+xlabel=number of a's,
+xtick={0,...,9},
+ylabel=maximum size,
+ymode=log,
+log basis y={2}
+]
+\addplot[mark=*,blue] table {re-chengsong.data};
+\end{axis}
+\end{tikzpicture}
+\end{center}
+For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
+to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
+(\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.
+The exponential size is triggered by that the regex
+$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
+inside the $(\ldots) ^*$ having exponentially many
+different derivatives, despite those difference being minor.
+$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
+will therefore contain the following terms (after flattening out all nested
+alternatives):
+\begin{center}
+$(\oplus_{i = 1]{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
+$(1 \leq m' \leq m )$
+\end{center}
+These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix).
+With each new input character taking the derivative against the intermediate result, more and more such distinct
+terms will accumulate,
+until the length reaches $L.C.M.(1, \ldots, n)$.
+$\textit{distinctBy}$ will not be able to de-duplicate any two of these terms
+$(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
+$(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
+where $m' \neq m''$ \\
+as they are slightly different.
+This means that with our current simplification methods,
+we will not be able to control the derivative so that
+$\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$
+as there are already exponentially many terms.
+These terms are similar in the sense that the head of those terms
+are all consisted of sub-terms of the form:
+$(\underbrace{a \ldots a}_{\text{j a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* $.
+For  $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
+$n * (n + 1) / 2$ such terms.
+For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives
+can be described by 6 terms:
+$a^*$, $a\cdot (aa)^*$, $ (aa)^*$,
+$aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
+The total number of different "head terms",  $n * (n + 1) / 2$,
+is proportional to the number of characters in the regex
+$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
+This suggests a slightly different notion of size, which we call the
+alphabetic width:
+%TODO:
+(TODO: Alphabetic width def.)
+Antimirov\parencite{Antimirov95} has proven that
+$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$.
+where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms
+created by doing derivatives of $r$ against all possible strings.
+If we can make sure that at any moment in our lexing algorithm our
+intermediate result hold at most one copy of each of the
+subterms then we can get the same bound as Antimirov's.
+This leads to the algorithm in the next section.
 %----------------------------------------------------------------------------------------
 %	SECTION 5
 %----------------------------------------------------------------------------------------
-\section{A Stronger Version of Simplification}
+\section{A Stronger Version of Simplification Inspired by Antimirov}
 %TODO: search for isabelle proofs of algorithms that check equivalence

changeset 518	ff7945a988a3
parent 516	6fecb7fe8cd0
child 519	856d025dbc15