--- a/ChengsongTanPhdThesis/Chapters/ChapterFinite.tex Wed Jul 13 08:27:28 2022 +0100
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,877 +0,0 @@
-% Chapter Template
-
-\chapter{Finiteness Bound} % Main chapter title
-
-\label{ChapterFinite}
-% In Chapter 4 \ref{Chapter4} we give the second guarantee
-%of our bitcoded algorithm, that is a finite bound on the size of any
-%regex's derivatives.
-
-In this chapter we give a guarantee in terms of time complexity:
-given a regular expression $r$, for any string $s$
-our algorithm's internal data structure is finitely bounded.
-To obtain such a proof, we need to
-\begin{itemize}
-\item
-Define an new datatype for regular expressions that makes it easy
-to reason about the size of an annotated regular expression.
-\item
-A set of equalities for this new datatype that enables one to
-rewrite $\bderssimp{r_1 \cdot r_2}{s}$ and $\bderssimp{r^*}{s}$ etc.
-by their children regexes $r_1$, $r_2$, and $r$.
-\item
-Using those equalities to actually get those rewriting equations, which we call
-"closed forms".
-\item
-Bound the closed forms, thereby bounding the original
-$\blexersimp$'s internal data structures.
-\end{itemize}
-
-\section{the $\mathbf{r}$-rexp datatype and the size functions}
-
-We have a size function for bitcoded regular expressions, written
-$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
-
-\begin{center}
-\begin{tabular}{ccc}
-$\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\
-\end{tabular}
-\end{center}
-(TODO: COMPLETE this defn and for $rs$)
-
-The size is based on a recursive function on the structure of the regex,
-not the bitcodes.
-Therefore we may as well talk about size of an annotated regular expression
-in their un-annotated form:
-\begin{center}
-$\asize(a) = \size(\erase(a))$.
-\end{center}
-(TODO: turn equals to roughly equals)
-
-But there is a minor nuisance here:
-the erase function actually messes with the structure of the regular expression:
-\begin{center}
-\begin{tabular}{ccc}
-$\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\
-\end{tabular}
-\end{center}
-(TODO: complete this definition with singleton r in alts)
-
-An alternative regular expression with an empty list of children
- is turned into an $\ZERO$ during the
-$\erase$ function, thereby changing the size and structure of the regex.
-These will likely be fixable if we really want to use plain $\rexp$s for dealing
-with size, but we choose a more straightforward (or stupid) method by
-defining a new datatype that is similar to plain $\rexp$s but can take
-non-binary arguments for its alternative constructor,
- which we call $\rrexp$ to denote
-the difference between it and plain regular expressions.
-\[ \rrexp ::= \RZERO \mid \RONE
- \mid \RCHAR{c}
- \mid \RSEQ{r_1}{r_2}
- \mid \RALTS{rs}
- \mid \RSTAR{r}
-\]
-For $\rrexp$ we throw away the bitcodes on the annotated regular expressions,
-but keep everything else intact.
-It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
-(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
-$\ALTS$).
-We denote the operation of erasing the bits and turning an annotated regular expression
-into an $\rrexp{}$ as $\rerase{}$.
-\begin{center}
-\begin{tabular}{lcl}
-$\rerase{\AZERO}$ & $=$ & $\RZERO$\\
-$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
-$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
-\end{tabular}
-\end{center}
-%TODO: FINISH definition of rerase
-Similarly we could define the derivative and simplification on
-$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1,
-except that now they can operate on alternatives taking multiple arguments.
-
-\begin{center}
-\begin{tabular}{lcr}
-$\RALTS{rs} \backslash c$ & $=$ & $\RALTS{map\; (\_ \backslash c) \;rs}$\\
-(other clauses omitted)
-\end{tabular}
-\end{center}
-
-Now that $\rrexp$s do not have bitcodes on them, we can do the
-duplicate removal without an equivalence relation:
-\begin{center}
-\begin{tabular}{lcl}
-$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
- & & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
-\end{tabular}
-\end{center}
-%TODO: definition of rsimp (maybe only the alternative clause)
-
-
-The reason why these definitions mirror precisely the corresponding operations
-on annotated regualar expressions is that we can calculate sizes more easily:
-
-\begin{lemma}
-$\rsize{\rerase a} = \asize a$
-\end{lemma}
-This lemma says that the size of an r-erased regex is the same as the annotated regex.
-this does not hold for plain $\rexp$s due to their ways of how alternatives are represented.
-\begin{lemma}
-$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
-\end{lemma}
-Putting these two together we get a property that allows us to estimate the
-size of an annotated regular expression derivative using its un-annotated counterpart:
-\begin{lemma}
-$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$
-\end{lemma}
-Unless stated otherwise in this chapter all $\textit{rexp}$s without
- bitcodes are seen as $\rrexp$s.
- We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
- as the former suits people's intuitive way of stating a binary alternative
- regular expression.
-
-
-
-%-----------------------------------
-% SECTION ?
-%-----------------------------------
-\section{preparatory properties for getting a finiteness bound}
-Before we get to the proof that says the intermediate result of our lexer will
-remain finitely bounded, which is an important efficiency/liveness guarantee,
-we shall first develop a few preparatory properties and definitions to
-make the process of proving that a breeze.
-
-We define rewriting relations for $\rrexp$s, which allows us to do the
-same trick as we did for the correctness proof,
-but this time we will have stronger equalities established.
-\subsection{"hrewrite" relation}
-List of 1-step rewrite rules for regular expressions simplification without bitcodes:
-\begin{figure}
-\caption{the "h-rewrite" rules}
-\[
-r_1 \cdot \ZERO \rightarrow_h \ZERO \]
-
-\[
-\ZERO \cdot r_2 \rightarrow \ZERO
-\]
-\end{figure}
-And we define an "grewrite" relation that works on lists:
-\begin{center}
-\begin{tabular}{lcl}
-$ \ZERO :: rs$ & $\rightarrow_g$ & $rs$
-\end{tabular}
-\end{center}
-
-
-
-With these relations we prove
-\begin{lemma}
-$rs \rightarrow rs' \implies \RALTS{rs} \rightarrow \RALTS{rs'}$
-\end{lemma}
-which enables us to prove "closed forms" equalities such as
-\begin{lemma}
-$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\suffix \; s \; r_1 ))}$
-\end{lemma}
-
-But the most involved part of the above lemma is proving the following:
-\begin{lemma}
-$\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $
-\end{lemma}
-which is needed in proving things like
-\begin{lemma}
-$r \rightarrow_f r' \implies \rsimp{r} \rightarrow \rsimp{r'}$
-\end{lemma}
-
-Fortunately this is provable by induction on the list $rs$
-
-
-
-%-----------------------------------
-% SECTION 2
-%-----------------------------------
-
-\begin{theorem}
-For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$
-\end{theorem}
-
-\noindent
-\begin{proof}
-We prove this by induction on $r$. The base cases for $\AZERO$,
-$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
-for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
-hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
-$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
-%
-\begin{center}
-\begin{tabular}{lcll}
-& & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
-& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
- [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
-& $\leq$ &
- $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
- [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
-& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
- \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
-& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
- \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
-& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
-\end{tabular}
-\end{center}
-
-
-\noindent
-where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
-The reason why we could write the derivatives of a sequence this way is described in section 2.
-The term (2) is used to control (1).
-That is because one can obtain an overall
-smaller regex list
-by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
-Section 3 is dedicated to its proof.
-In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is
-bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
-than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
-for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
-We reason similarly for $\STAR$.\medskip
-\end{proof}
-
-What guarantee does this bound give us?
-
-Whatever the regex is, it will not grow indefinitely.
-Take our previous example $(a + aa)^*$ as an example:
-\begin{center}
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}
-\begin{axis}[
- xlabel={number of $a$'s},
- x label style={at={(1.05,-0.05)}},
- ylabel={regex size},
- enlargelimits=false,
- xtick={0,5,...,30},
- xmax=33,
- ymax= 40,
- ytick={0,10,...,40},
- scaled ticks=false,
- axis lines=left,
- width=5cm,
- height=4cm,
- legend entries={$(a + aa)^*$},
- legend pos=north west,
- legend cell align=left]
-\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
-\end{axis}
-\end{tikzpicture}
-\end{tabular}
-\end{center}
-We are able to limit the size of the regex $(a + aa)^*$'s derivatives
- with our simplification
-rules very effectively.
-
-
-In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound
-is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.
-Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
-inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function
-$f(x) = x * 2^x$.
-This means the bound we have will surge up at least
-tower-exponentially with a linear increase of the depth.
-For a regex of depth $n$, the bound
-would be approximately $4^n$.
-
-Test data in the graphs from randomly generated regular expressions
-shows that the giant bounds are far from being hit.
-%a few sample regular experessions' derivatives
-%size change
-%TODO: giving regex1_size_change.data showing a few regexes' size changes
-%w;r;t the input characters number, where the size is usually cubic in terms of original size
-%a*, aa*, aaa*, .....
-%randomly generated regexes
-\begin{center}
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}
-\begin{axis}[
- xlabel={number of $a$'s},
- x label style={at={(1.05,-0.05)}},
- ylabel={regex size},
- enlargelimits=false,
- xtick={0,5,...,30},
- xmax=33,
- ymax=1000,
- ytick={0,100,...,1000},
- scaled ticks=false,
- axis lines=left,
- width=5cm,
- height=4cm,
- legend entries={regex1},
- legend pos=north west,
- legend cell align=left]
-\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
-\end{axis}
-\end{tikzpicture}
- &
-\begin{tikzpicture}
-\begin{axis}[
- xlabel={$n$},
- x label style={at={(1.05,-0.05)}},
- %ylabel={time in secs},
- enlargelimits=false,
- xtick={0,5,...,30},
- xmax=33,
- ymax=1000,
- ytick={0,100,...,1000},
- scaled ticks=false,
- axis lines=left,
- width=5cm,
- height=4cm,
- legend entries={regex2},
- legend pos=north west,
- legend cell align=left]
-\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
-\end{axis}
-\end{tikzpicture}
- &
-\begin{tikzpicture}
-\begin{axis}[
- xlabel={$n$},
- x label style={at={(1.05,-0.05)}},
- %ylabel={time in secs},
- enlargelimits=false,
- xtick={0,5,...,30},
- xmax=33,
- ymax=1000,
- ytick={0,100,...,1000},
- scaled ticks=false,
- axis lines=left,
- width=5cm,
- height=4cm,
- legend entries={regex3},
- legend pos=north west,
- legend cell align=left]
-\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
-\end{axis}
-\end{tikzpicture}\\
-\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.}
-\end{tabular}
-\end{center}
-
-
-
-
-
-\noindent
-Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the
-original size.
-This suggests a link towrads "partial derivatives"
- introduced by Antimirov \cite{Antimirov95}.
-
- \section{Antimirov's partial derivatives}
- The idea behind Antimirov's partial derivatives
-is to do derivatives in a similar way as suggested by Brzozowski,
-but maintain a set of regular expressions instead of a single one:
-
-%TODO: antimirov proposition 3.1, needs completion
- \begin{center}
- \begin{tabular}{ccc}
- $\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\
-$\partial_x(\ONE)$ & $=$ & $\phi$
-\end{tabular}
-\end{center}
-
-Rather than joining the calculated derivatives $\partial_x a$ and $\partial_x b$ together
-using the alternatives constructor, Antimirov cleverly chose to put them into
-a set instead. This breaks the terms in a derivative regular expression up,
-allowing us to understand what are the "atomic" components of it.
-For example, To compute what regular expression $x^*(xx + y)^*$'s
-derivative against $x$ is made of, one can do a partial derivative
-of it and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$
-from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$.
-To get all the "atomic" components of a regular expression's possible derivatives,
-there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes
-whatever character is available at the head of the string inside the language of a
-regular expression, and gives back the character and the derivative regular expression
-as a pair (which he called "monomial"):
- \begin{center}
- \begin{tabular}{ccc}
- $\lf(\ONE)$ & $=$ & $\phi$\\
-$\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\
- $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
- $\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\
-\end{tabular}
-\end{center}
-%TODO: completion
-
-There is a slight difference in the last three clauses compared
-with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular
-expression $r$ with every element inside $\textit{rset}$ to create a set of
-sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates
-on a set of monomials (which Antimirov called "linear form") and a regular
-expression, and returns a linear form:
- \begin{center}
- \begin{tabular}{ccc}
- $l \bigodot (\ZERO)$ & $=$ & $\phi$\\
- $l \bigodot (\ONE)$ & $=$ & $l$\\
- $\phi \bigodot t$ & $=$ & $\phi$\\
- $\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\
- $\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\
- $\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\
- $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
- $\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\
-\end{tabular}
-\end{center}
-%TODO: completion
-
- Some degree of simplification is applied when doing $\bigodot$, for example,
- $l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$,
- and $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and
- $\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$,
- and so on.
-
- With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regex $r$ with
- an iterative procedure:
- \begin{center}
- \begin{tabular}{llll}
-$\textit{while}$ & $(\Delta_i \neq \phi)$ & & \\
- & $\Delta_{i+1}$ & $ =$ & $\lf(\Delta_i) - \PD_i$ \\
- & $\PD_{i+1}$ & $ =$ & $\Delta_{i+1} \cup \PD_i$ \\
-$\partial_{UNIV}(r)$ & $=$ & $\PD$ &
-\end{tabular}
-\end{center}
-
-
- $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
-
-
-However, if we introduce them in our
-setting we would lose the POSIX property of our calculated values.
-A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$.
-If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer
-would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of
-what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information
-in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$
-occurs, and apply them in the right order once we get
-a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value.
-This is unlike the simplification we had before, where the rewriting rules
-such as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value.
-We will discuss better
-bounds in the last section of this chapter.\\[-6.5mm]
-
-
-
-
-%----------------------------------------------------------------------------------------
-% SECTION ??
-%----------------------------------------------------------------------------------------
-
-\section{"Closed Forms" of regular expressions' derivatives w.r.t strings}
-To embark on getting the "closed forms" of regexes, we first
-define a few auxiliary definitions to make expressing them smoothly.
-
- \begin{center}
- \begin{tabular}{ccc}
- $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
-$\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
-$\sflataux r$ & $=$ & $ [r]$
-\end{tabular}
-\end{center}
-The intuition behind $\sflataux{\_}$ is to break up nested regexes
-of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
-into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
-It will return the singleton list $[r]$ otherwise.
-
-$\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps
-the output type a regular expression, not a list:
- \begin{center}
- \begin{tabular}{ccc}
- $\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
-$\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\
-$\sflat r$ & $=$ & $ [r]$
-\end{tabular}
-\end{center}
-$\sflataux{\_}$ and $\sflat{\_}$ is only recursive in terms of the
- first element of the list of children of
-an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for dealing with the regular expression
-$r_1 \cdot r_2 \backslash s$.
-
-With $\sflat{\_}$ and $\sflataux{\_}$ we make
-precise what "closed forms" we have for the sequence derivatives and their simplifications,
-in other words, how can we construct $(r_1 \cdot r_2) \backslash s$
-and $\bderssimp{(r_1\cdot r_2)}{s}$,
-if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)
-and $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$ ranges over
-the substring of $s$?
-First let's look at a series of derivatives steps on a sequence
-regular expression, (assuming) that each time the first
-component of the sequence is always nullable):
-\begin{center}
-
-$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
-$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad
- \ldots$
-
-\end{center}
-%TODO: cite indian paper
-Indianpaper have come up with a slightly more formal way of putting the above process:
-\begin{center}
-$r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) +
-\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots
-+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$
-\end{center}
-where $\delta(b, r)$ will produce $r$
-if $b$ evaluates to true,
-and $\ZERO$ otherwise.
-
- But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical
- equivalence. To make this intuition useful
- for a formal proof, we need something
-stronger than language equivalence.
-With the help of $\sflat{\_}$ we can state the equation in Indianpaper
-more rigorously:
-\begin{lemma}
-$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
-\end{lemma}
-
-The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
-
-\begin{center}
-\begin{tabular}{lcl}
-$\vsuf{[]}{\_} $ & $=$ & $[]$\\
-$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
- && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $
-\end{tabular}
-\end{center}
-It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable,
-and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in
-the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.
-In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing
-the entire result of $(r_1 \cdot r_2) \backslash s$,
-it only stores all the $s''$ such that $r_2 \backslash s''$ are occurring terms in $(r_1\cdot r_2)\backslash s$.
-
-With this we can also add simplifications to both sides and get
-\begin{lemma}
-$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
-\end{lemma}
-Together with the idempotency property of $\rsimp$,
-%TODO: state the idempotency property of rsimp
-\begin{lemma}
-$\rsimp{r} = \rsimp{\rsimp{r}}$
-\end{lemma}
-it is possible to convert the above lemma to obtain a "closed form"
-for derivatives nested with simplification:
-\begin{lemma}
-$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
-\end{lemma}
-And now the reason we have (1) in section 1 is clear:
-$\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$,
-is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$
- as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$.
-
-%----------------------------------------------------------------------------------------
-% SECTION 3
-%----------------------------------------------------------------------------------------
-
-\section{interaction between $\distinctWith$ and $\flts$}
-Note that it is not immediately obvious that
-\begin{center}
-$\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket $.
-\end{center}
-The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of
-duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$.
-
-
-%-----------------------------------
-% SECTION syntactic equivalence under simp
-%-----------------------------------
-\section{Syntactic Equivalence Under $\simp$}
-We prove that minor differences can be annhilated
-by $\simp$.
-For example,
-\begin{center}
-$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) =
- \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
-\end{center}
-
-
-%----------------------------------------------------------------------------------------
-% SECTION ALTS CLOSED FORM
-%----------------------------------------------------------------------------------------
-\section{A Closed Form for \textit{ALTS}}
-Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
-
-
-There are a few key steps, one of these steps is
-\[
-rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {})))
-= rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {}))
-\]
-
-
-One might want to prove this by something a simple statement like:
-$map (rder x) (rdistinct rs rset) = rdistinct (map (rder x) rs) (rder x) ` rset)$.
-
-For this to hold we want the $\textit{distinct}$ function to pick up
-the elements before and after derivatives correctly:
-$r \in rset \equiv (rder x r) \in (rder x rset)$.
-which essentially requires that the function $\backslash$ is an injective mapping.
-
-Unfortunately the function $\backslash c$ is not an injective mapping.
-
-\subsection{function $\backslash c$ is not injective (1-to-1)}
-\begin{center}
-The derivative $w.r.t$ character $c$ is not one-to-one.
-Formally,
- $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
-\end{center}
-This property is trivially true for the
-character regex example:
-\begin{center}
- $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
-\end{center}
-But apart from the cases where the derivative
-output is $\ZERO$, are there non-trivial results
-of derivatives which contain strings?
-The answer is yes.
-For example,
-\begin{center}
- Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
- where $a$ is not nullable.\\
- $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
- $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
-\end{center}
-We start with two syntactically different regexes,
-and end up with the same derivative result.
-This is not surprising as we have such
-equality as below in the style of Arden's lemma:\\
-\begin{center}
- $L(A^*B) = L(A\cdot A^* \cdot B + B)$
-\end{center}
-
-%----------------------------------------------------------------------------------------
-% SECTION 4
-%----------------------------------------------------------------------------------------
-\section{A Bound for the Star Regular Expression}
-We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
-the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then
-the property of the $\distinct$ function.
-Now we try to get a bound on $r^* \backslash s$ as well.
-Again, we first look at how a star's derivatives evolve, if they grow maximally:
-\begin{center}
-
-$r^* \quad \longrightarrow_{\backslash c} \quad (r\backslash c) \cdot r^* \quad \longrightarrow_{\backslash c'} \quad
-r \backslash cc' \cdot r^* + r \backslash c' \cdot r^* \quad \longrightarrow_{\backslash c''} \quad
-(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) \quad \longrightarrow_{\backslash c'''}
-\quad \ldots$
-
-\end{center}
-When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$,
-$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
-the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
-of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not
-count the possible size explosions of $r \backslash c$ themselves.
-
-Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like
-$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $
-into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
-and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
-For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$:
-%TODO: definitions of and \hflataux \hflat
- \begin{center}
- \begin{tabular}{ccc}
- $\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
-$\hflataux r$ & $=$ & $ [r]$
-\end{tabular}
-\end{center}
-
- \begin{center}
- \begin{tabular}{ccc}
- $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\
-$\hflat r$ & $=$ & $ r$
-\end{tabular}
-\end{center}s
-Again these definitions are tailor-made for dealing with alternatives that have
-originated from a star's derivatives, so we don't attempt to open up all possible
-regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
-elements.
-We give a predicate for such "star-created" regular expressions:
-\begin{center}
-\begin{tabular}{lcr}
- & & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
- $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$
- \end{tabular}
- \end{center}
-
- These definitions allows us the flexibility to talk about
- regular expressions in their most convenient format,
- for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $
- instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.
- These definitions help express that certain classes of syntatically
- distinct regular expressions are actually the same under simplification.
- This is not entirely true for annotated regular expressions:
- %TODO: bsimp bders \neq bderssimp
- \begin{center}
- $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
- \end{center}
- For bit-codes, the order in which simplification is applied
- might cause a difference in the location they are placed.
- If we want something like
- \begin{center}
- $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
- \end{center}
- Some "canonicalization" procedure is required,
- which either pushes all the common bitcodes to nodes
- as senior as possible:
- \begin{center}
- $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
- \end{center}
- or does the reverse. However bitcodes are not of interest if we are talking about
- the $\llbracket r \rrbracket$ size of a regex.
- Therefore for the ease and simplicity of producing a
- proof for a size bound, we are happy to restrict ourselves to
- unannotated regular expressions, and obtain such equalities as
- \begin{lemma}
- $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
- \end{lemma}
-
- \begin{proof}
- By using the rewriting relation $\rightsquigarrow$
- \end{proof}
- %TODO: rsimp sflat
-And from this we obtain a proof that a star's derivative will be the same
-as if it had all its nested alternatives created during deriving being flattened out:
- For example,
- \begin{lemma}
- $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
- \end{lemma}
- \begin{proof}
- By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$.
- \end{proof}
-% The simplification of a flattened out regular expression, provided it comes
-%from the derivative of a star, is the same as the one nested.
-
-
-
-
-
-
-
-
-
-One might wonder the actual bound rather than the loose bound we gave
-for the convenience of an easier proof.
-How much can the regex $r^* \backslash s$ grow?
-As earlier graphs have shown,
-%TODO: reference that graph where size grows quickly
- they can grow at a maximum speed
- exponential $w.r.t$ the number of characters,
-but will eventually level off when the string $s$ is long enough.
-If they grow to a size exponential $w.r.t$ the original regex, our algorithm
-would still be slow.
-And unfortunately, we have concrete examples
-where such regexes grew exponentially large before levelling off:
-$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
-(\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
- size that is exponential on the number $n$
-under our current simplification rules:
-%TODO: graph of a regex whose size increases exponentially.
-\begin{center}
-\begin{tikzpicture}
- \begin{axis}[
- height=0.5\textwidth,
- width=\textwidth,
- xlabel=number of a's,
- xtick={0,...,9},
- ylabel=maximum size,
- ymode=log,
- log basis y={2}
-]
- \addplot[mark=*,blue] table {re-chengsong.data};
- \end{axis}
-\end{tikzpicture}
-\end{center}
-
-For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
-to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
-(\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.
-The exponential size is triggered by that the regex
-$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
-inside the $(\ldots) ^*$ having exponentially many
-different derivatives, despite those difference being minor.
-$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
-will therefore contain the following terms (after flattening out all nested
-alternatives):
-\begin{center}
-$(\oplus_{i = 1]{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
-$(1 \leq m' \leq m )$
-\end{center}
-These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix).
- With each new input character taking the derivative against the intermediate result, more and more such distinct
- terms will accumulate,
-until the length reaches $L.C.M.(1, \ldots, n)$.
-$\textit{distinctBy}$ will not be able to de-duplicate any two of these terms
-$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
-
-$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
- where $m' \neq m''$ \\
- as they are slightly different.
- This means that with our current simplification methods,
- we will not be able to control the derivative so that
- $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$
- as there are already exponentially many terms.
- These terms are similar in the sense that the head of those terms
- are all consisted of sub-terms of the form:
- $(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $.
- For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
- $n * (n + 1) / 2$ such terms.
- For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives
- can be described by 6 terms:
- $a^*$, $a\cdot (aa)^*$, $ (aa)^*$,
- $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
-The total number of different "head terms", $n * (n + 1) / 2$,
- is proportional to the number of characters in the regex
-$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
-This suggests a slightly different notion of size, which we call the
-alphabetic width:
-%TODO:
-(TODO: Alphabetic width def.)
-
-
-Antimirov\parencite{Antimirov95} has proven that
-$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$.
-where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms
-created by doing derivatives of $r$ against all possible strings.
-If we can make sure that at any moment in our lexing algorithm our
-intermediate result hold at most one copy of each of the
-subterms then we can get the same bound as Antimirov's.
-This leads to the algorithm in the next chapter.
-
-
-
-
-
-%----------------------------------------------------------------------------------------
-% SECTION 1
-%----------------------------------------------------------------------------------------
-
-\section{Idempotency of $\simp$}
-
-\begin{equation}
- \simp \;r = \simp\; \simp \; r
-\end{equation}
-This property means we do not have to repeatedly
-apply simplification in each step, which justifies
-our definition of $\blexersimp$.
-It will also be useful in future proofs where properties such as
-closed forms are needed.
-The proof is by structural induction on $r$.
-
-%-----------------------------------
-% SUBSECTION 1
-%-----------------------------------
-\subsection{Syntactic Equivalence Under $\simp$}
-We prove that minor differences can be annhilated
-by $\simp$.
-For example,
-\begin{center}
-$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) =
- \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
-\end{center}
-