diff -r cb702fb4227f -r 2c907b118f78 ChengsongTanPhdThesis/Chapters/ChapterFinite.tex --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/ChengsongTanPhdThesis/Chapters/ChapterFinite.tex Sat May 28 17:17:18 2022 +0100 @@ -0,0 +1,791 @@ +% Chapter Template + +\chapter{Finiteness Bound} % Main chapter title + +\label{ChapterFinite} +% In Chapter 4 \ref{Chapter4} we give the second guarantee +%of our bitcoded algorithm, that is a finite bound on the size of any +%regex's derivatives. + + +%----------------------------------- +% SECTION ? +%----------------------------------- +\section{preparatory properties for getting a finiteness bound} +Before we get to the proof that says the intermediate result of our lexer will +remain finitely bounded, which is an important efficiency/liveness guarantee, +we shall first develop a few preparatory properties and definitions to +make the process of proving that a breeze. + +We define rewriting relations for $\rrexp$s, which allows us to do the +same trick as we did for the correctness proof, +but this time we will have stronger equalities established. +\subsection{"hrewrite" relation} +List of 1-step rewrite rules for regular expressions simplification without bitcodes: +\begin{center} +\begin{tabular}{lcl} +$r_1 \cdot \ZERO$ & $\rightarrow$ & $\ZERO$ +\end{tabular} +\end{center} + +And we define an "grewrite" relation that works on lists: +\begin{center} +\begin{tabular}{lcl} +$ \ZERO :: rs$ & $\rightarrow_g$ & $rs$ +\end{tabular} +\end{center} + + + +With these relations we prove +\begin{lemma} +$rs \rightarrow rs' \implies \RALTS{rs} \rightarrow \RALTS{rs'}$ +\end{lemma} +which enables us to prove "closed forms" equalities such as +\begin{lemma} +$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$ +\end{lemma} + +But the most involved part of the above lemma is proving the following: +\begin{lemma} +$\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ +\end{lemma} +which is needed in proving things like +\begin{lemma} +$r \rightarrow_f r' \implies \rsimp{r} \rightarrow \rsimp{r'}$ +\end{lemma} + +Fortunately this is provable by induction on the list $rs$ + + + +%----------------------------------- +% SECTION 2 +%----------------------------------- + +\section{Finiteness Property} +In this section let us describe our argument for why the size of the simplified +derivatives with the aggressive simplification function is finitely bounded. + Suppose +we have a size function for bitcoded regular expressions, written +$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree + +\begin{center} +\begin{tabular}{ccc} +$\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\ +\end{tabular} +\end{center} +(TODO: COMPLETE this defn and for $rs$) + +The size is based on a recursive function on the structure of the regex, +not the bitcodes. +Therefore we may as well talk about size of an annotated regular expression +in their un-annotated form: +\begin{center} +$\asize(a) = \size(\erase(a))$. +\end{center} +(TODO: turn equals to roughly equals) + +But there is a minor nuisance here: +the erase function actually messes with the structure of the regular expression: +\begin{center} +\begin{tabular}{ccc} +$\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\ +\end{tabular} +\end{center} +(TODO: complete this definition with singleton r in alts) + +An alternative regular expression with an empty list of children + is turned into an $\ZERO$ during the +$\erase$ function, thereby changing the size and structure of the regex. +These will likely be fixable if we really want to use plain $\rexp$s for dealing +with size, but we choose a more straightforward (or stupid) method by +defining a new datatype that is similar to plain $\rexp$s but can take +non-binary arguments for its alternative constructor, + which we call $\rrexp$ to denote +the difference between it and plain regular expressions. +\[ \rrexp ::= \RZERO \mid \RONE + \mid \RCHAR{c} + \mid \RSEQ{r_1}{r_2} + \mid \RALTS{rs} + \mid \RSTAR{r} +\] +For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, +but keep everything else intact. +It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved +(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton +$\ALTS$). +We denote the operation of erasing the bits and turning an annotated regular expression +into an $\rrexp{}$ as $\rerase{}$. +\begin{center} +\begin{tabular}{lcl} +$\rerase{\AZERO}$ & $=$ & $\RZERO$\\ +$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\ +$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$ +\end{tabular} +\end{center} +%TODO: FINISH definition of rerase +Similarly we could define the derivative and simplification on +$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, +except that now they can operate on alternatives taking multiple arguments. + +\begin{center} +\begin{tabular}{lcr} +$\RALTS{rs} \backslash c$ & $=$ & $\RALTS{map\; (\_ \backslash c) \;rs}$\\ +(other clauses omitted) +\end{tabular} +\end{center} + +Now that $\rrexp$s do not have bitcodes on them, we can do the +duplicate removal without an equivalence relation: +\begin{center} +\begin{tabular}{lcl} +$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\ + & & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$ +\end{tabular} +\end{center} +%TODO: definition of rsimp (maybe only the alternative clause) + + +The reason why these definitions mirror precisely the corresponding operations +on annotated regualar expressions is that we can calculate sizes more easily: + +\begin{lemma} +$\rsize{\rerase a} = \asize a$ +\end{lemma} +This lemma says that the size of an r-erased regex is the same as the annotated regex. +this does not hold for plain $\rexp$s due to their ways of how alternatives are represented. +\begin{lemma} +$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$ +\end{lemma} +Putting these two together we get a property that allows us to estimate the +size of an annotated regular expression derivative using its un-annotated counterpart: +\begin{lemma} +$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$ +\end{lemma} +Unless stated otherwise in this chapter all $\textit{rexp}$s without + bitcodes are seen as $\rrexp$s. + We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably + as the former suits people's intuitive way of stating a binary alternative + regular expression. + + +\begin{theorem} +For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$ +\end{theorem} + +\noindent +\begin{proof} +We prove this by induction on $r$. The base cases for $\AZERO$, +$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is +for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction +hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and +$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows +% +\begin{center} +\begin{tabular}{lcll} +& & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\ +& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} :: + [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\ +& $\leq$ & + $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) :: + [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\ +& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket + + \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\ +& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + + \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\ +& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5) +\end{tabular} +\end{center} + + +\noindent +where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$). +The reason why we could write the derivatives of a sequence this way is described in section 2. +The term (2) is used to control (1). +That is because one can obtain an overall +smaller regex list +by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it. +Section 3 is dedicated to its proof. +In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is +bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller +than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands +for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). +We reason similarly for $\STAR$.\medskip +\end{proof} + +What guarantee does this bound give us? + +Whatever the regex is, it will not grow indefinitely. +Take our previous example $(a + aa)^*$ as an example: +\begin{center} +\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} +\begin{tikzpicture} +\begin{axis}[ + xlabel={number of $a$'s}, + x label style={at={(1.05,-0.05)}}, + ylabel={regex size}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax= 40, + ytick={0,10,...,40}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={$(a + aa)^*$}, + legend pos=north west, + legend cell align=left] +\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data}; +\end{axis} +\end{tikzpicture} +\end{tabular} +\end{center} +We are able to limit the size of the regex $(a + aa)^*$'s derivatives + with our simplification +rules very effectively. + + +As the graphs of some more randomly generated regexes show, the size of +the derivative might grow quickly at the start of the input, + but after a sufficiently long input string, the trend will stop. + + +%a few sample regular experessions' derivatives +%size change +%TODO: giving graphs showing a few regexes' size changes +%w;r;t the input characters number +%a*, aa*, aaa*, ..... +%randomly generated regexes +\begin{center} +\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} +\begin{tikzpicture} +\begin{axis}[ + xlabel={number of $a$'s}, + x label style={at={(1.05,-0.05)}}, + ylabel={regex size}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=1000, + ytick={0,100,...,1000}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={regex1}, + legend pos=north west, + legend cell align=left] +\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data}; +\end{axis} +\end{tikzpicture} + & +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + x label style={at={(1.05,-0.05)}}, + %ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=1000, + ytick={0,100,...,1000}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={regex2}, + legend pos=north west, + legend cell align=left] +\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data}; +\end{axis} +\end{tikzpicture} + & +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + x label style={at={(1.05,-0.05)}}, + %ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=1000, + ytick={0,100,...,1000}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={regex3}, + legend pos=north west, + legend cell align=left] +\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data}; +\end{axis} +\end{tikzpicture}\\ +\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.} +\end{tabular} +\end{center} + + + + + +\noindent +Clearly we give in this finiteness argument (Step (5)) a very loose bound that is +far from the actual bound we can expect. +In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound +is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$. +Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$ +inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function +$f(x) = x * 2^x$. +This means the bound we have will surge up at least +tower-exponentially with a linear increase of the depth. +For a regex of depth $n$, the bound +would be approximately $4^n$. +%TODO: change this to tower exponential! + +We can do better than this, but this does not improve +the finiteness property we are proving. If we are interested in a polynomial bound, +one would hope to obtain a similar tight bound as for partial +derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with + $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in +partial derivatives). +To obtain the exact same bound would mean +we need to introduce simplifications, such as + $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$, +which exist for partial derivatives. + +However, if we introduce them in our +setting we would lose the POSIX property of our calculated values. +A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$. +If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer +would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of +what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information +in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$ +occurs, and apply them in the right order once we get +a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value. +This is unlike the simplification we had before, where the rewriting rules +such as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value. +We will discuss better +bounds in the last section of this chapter.\\[-6.5mm] + + + + +%---------------------------------------------------------------------------------------- +% SECTION ?? +%---------------------------------------------------------------------------------------- + +\section{"Closed Forms" of regular expressions' derivatives w.r.t strings} +To embark on getting the "closed forms" of regexes, we first +define a few auxiliary definitions to make expressing them smoothly. + + \begin{center} + \begin{tabular}{ccc} + $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ +$\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\ +$\sflataux r$ & $=$ & $ [r]$ +\end{tabular} +\end{center} +The intuition behind $\sflataux{\_}$ is to break up nested regexes +of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape +into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$. +It will return the singleton list $[r]$ otherwise. + +$\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps +the output type a regular expression, not a list: + \begin{center} + \begin{tabular}{ccc} + $\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ +$\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\ +$\sflat r$ & $=$ & $ [r]$ +\end{tabular} +\end{center} +$\sflataux{\_}$ and $\sflat{\_}$ is only recursive in terms of the + first element of the list of children of +an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for dealing with the regular expression +$r_1 \cdot r_2 \backslash s$. + +With $\sflat{\_}$ and $\sflataux{\_}$ we make +precise what "closed forms" we have for the sequence derivatives and their simplifications, +in other words, how can we construct $(r_1 \cdot r_2) \backslash s$ +and $\bderssimp{(r_1\cdot r_2)}{s}$, +if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$) +and $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$ ranges over +the substring of $s$? +First let's look at a series of derivatives steps on a sequence +regular expression, (assuming) that each time the first +component of the sequence is always nullable): +\begin{center} + +$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\ +$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad + \ldots$ + +\end{center} +%TODO: cite indian paper +Indianpaper have come up with a slightly more formal way of putting the above process: +\begin{center} +$r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) + +\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots ++ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$ +\end{center} +where $\delta(b, r)$ will produce $r$ +if $b$ evaluates to true, +and $\ZERO$ otherwise. + + But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical + equivalence. To make this intuition useful + for a formal proof, we need something +stronger than language equivalence. +With the help of $\sflat{\_}$ we can state the equation in Indianpaper +more rigorously: +\begin{lemma} +$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$ +\end{lemma} + +The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string: + +\begin{center} +\begin{tabular}{lcl} +$\vsuf{[]}{\_} $ & $=$ & $[]$\\ +$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ + && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $ +\end{tabular} +\end{center} +It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable, +and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in +the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$. +In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing +the entire result of $(r_1 \cdot r_2) \backslash s$, +it only stores all the $s''$ such that $r_2 \backslash s''$ are occurring terms in $(r_1\cdot r_2)\backslash s$. + +With this we can also add simplifications to both sides and get +\begin{lemma} +$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ +\end{lemma} +Together with the idempotency property of $\rsimp$, +%TODO: state the idempotency property of rsimp +\begin{lemma} +$\rsimp{r} = \rsimp{\rsimp{r}}$ +\end{lemma} +it is possible to convert the above lemma to obtain a "closed form" +for derivatives nested with simplification: +\begin{lemma} +$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ +\end{lemma} +And now the reason we have (1) in section 1 is clear: +$\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$, +is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$ + as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$. + +%---------------------------------------------------------------------------------------- +% SECTION 3 +%---------------------------------------------------------------------------------------- + +\section{interaction between $\distinctWith$ and $\flts$} +Note that it is not immediately obvious that +\begin{center} +$\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket $. +\end{center} +The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of +duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. + + +%----------------------------------- +% SECTION syntactic equivalence under simp +%----------------------------------- +\section{Syntactic Equivalence Under $\simp$} +We prove that minor differences can be annhilated +by $\simp$. +For example, +\begin{center} +$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = + \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$ +\end{center} + + +%---------------------------------------------------------------------------------------- +% SECTION ALTS CLOSED FORM +%---------------------------------------------------------------------------------------- +\section{A Closed Form for \textit{ALTS}} +Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. + + +There are a few key steps, one of these steps is +\[ +rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {}))) += rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {})) +\] + + +One might want to prove this by something a simple statement like: +$map (rder x) (rdistinct rs rset) = rdistinct (map (rder x) rs) (rder x) ` rset)$. + +For this to hold we want the $\textit{distinct}$ function to pick up +the elements before and after derivatives correctly: +$r \in rset \equiv (rder x r) \in (rder x rset)$. +which essentially requires that the function $\backslash$ is an injective mapping. + +Unfortunately the function $\backslash c$ is not an injective mapping. + +\subsection{function $\backslash c$ is not injective (1-to-1)} +\begin{center} +The derivative $w.r.t$ character $c$ is not one-to-one. +Formally, + $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ +\end{center} +This property is trivially true for the +character regex example: +\begin{center} + $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ +\end{center} +But apart from the cases where the derivative +output is $\ZERO$, are there non-trivial results +of derivatives which contain strings? +The answer is yes. +For example, +\begin{center} + Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ + where $a$ is not nullable.\\ + $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ + $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ +\end{center} +We start with two syntactically different regexes, +and end up with the same derivative result. +This is not surprising as we have such +equality as below in the style of Arden's lemma:\\ +\begin{center} + $L(A^*B) = L(A\cdot A^* \cdot B + B)$ +\end{center} + +%---------------------------------------------------------------------------------------- +% SECTION 4 +%---------------------------------------------------------------------------------------- +\section{A Bound for the Star Regular Expression} +We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using +the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then +the property of the $\distinct$ function. +Now we try to get a bound on $r^* \backslash s$ as well. +Again, we first look at how a star's derivatives evolve, if they grow maximally: +\begin{center} + +$r^* \quad \longrightarrow_{\backslash c} \quad (r\backslash c) \cdot r^* \quad \longrightarrow_{\backslash c'} \quad +r \backslash cc' \cdot r^* + r \backslash c' \cdot r^* \quad \longrightarrow_{\backslash c''} \quad +(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) \quad \longrightarrow_{\backslash c'''} +\quad \ldots$ + +\end{center} +When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, +$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable, +the number of terms in $r^* \backslash s$ will grow exponentially, causing the size +of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not +count the possible size explosions of $r \backslash c$ themselves. + +Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like +$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ +into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$ +and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$). +For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$: +%TODO: definitions of and \hflataux \hflat + \begin{center} + \begin{tabular}{ccc} + $\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\ +$\hflataux r$ & $=$ & $ [r]$ +\end{tabular} +\end{center} + + \begin{center} + \begin{tabular}{ccc} + $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\ +$\hflat r$ & $=$ & $ r$ +\end{tabular} +\end{center}s +Again these definitions are tailor-made for dealing with alternatives that have +originated from a star's derivatives, so we don't attempt to open up all possible +regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2 +elements. +We give a predicate for such "star-created" regular expressions: +\begin{center} +\begin{tabular}{lcr} + & & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\ + $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$ + \end{tabular} + \end{center} + + These definitions allows us the flexibility to talk about + regular expressions in their most convenient format, + for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $ + instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$. + These definitions help express that certain classes of syntatically + distinct regular expressions are actually the same under simplification. + This is not entirely true for annotated regular expressions: + %TODO: bsimp bders \neq bderssimp + \begin{center} + $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$ + \end{center} + For bit-codes, the order in which simplification is applied + might cause a difference in the location they are placed. + If we want something like + \begin{center} + $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$ + \end{center} + Some "canonicalization" procedure is required, + which either pushes all the common bitcodes to nodes + as senior as possible: + \begin{center} + $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ + \end{center} + or does the reverse. However bitcodes are not of interest if we are talking about + the $\llbracket r \rrbracket$ size of a regex. + Therefore for the ease and simplicity of producing a + proof for a size bound, we are happy to restrict ourselves to + unannotated regular expressions, and obtain such equalities as + \begin{lemma} + $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$ + \end{lemma} + + \begin{proof} + By using the rewriting relation $\rightsquigarrow$ + \end{proof} + %TODO: rsimp sflat +And from this we obtain a proof that a star's derivative will be the same +as if it had all its nested alternatives created during deriving being flattened out: + For example, + \begin{lemma} + $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ + \end{lemma} + \begin{proof} + By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$. + \end{proof} +% The simplification of a flattened out regular expression, provided it comes +%from the derivative of a star, is the same as the one nested. + + + + + + + + + +One might wonder the actual bound rather than the loose bound we gave +for the convenience of an easier proof. +How much can the regex $r^* \backslash s$ grow? +As earlier graphs have shown, +%TODO: reference that graph where size grows quickly + they can grow at a maximum speed + exponential $w.r.t$ the number of characters, +but will eventually level off when the string $s$ is long enough. +If they grow to a size exponential $w.r.t$ the original regex, our algorithm +would still be slow. +And unfortunately, we have concrete examples +where such regexes grew exponentially large before levelling off: +$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + +(\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum + size that is exponential on the number $n$ +under our current simplification rules: +%TODO: graph of a regex whose size increases exponentially. +\begin{center} +\begin{tikzpicture} + \begin{axis}[ + height=0.5\textwidth, + width=\textwidth, + xlabel=number of a's, + xtick={0,...,9}, + ylabel=maximum size, + ymode=log, + log basis y={2} +] + \addplot[mark=*,blue] table {re-chengsong.data}; + \end{axis} +\end{tikzpicture} +\end{center} + +For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$ +to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + +(\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion. +The exponential size is triggered by that the regex +$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$ +inside the $(\ldots) ^*$ having exponentially many +different derivatives, despite those difference being minor. +$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$ +will therefore contain the following terms (after flattening out all nested +alternatives): +\begin{center} +$(\oplus_{i = 1]{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\ +$(1 \leq m' \leq m )$ +\end{center} +These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix). + With each new input character taking the derivative against the intermediate result, more and more such distinct + terms will accumulate, +until the length reaches $L.C.M.(1, \ldots, n)$. +$\textit{distinctBy}$ will not be able to de-duplicate any two of these terms +$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ + +$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ + where $m' \neq m''$ \\ + as they are slightly different. + This means that with our current simplification methods, + we will not be able to control the derivative so that + $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$ + as there are already exponentially many terms. + These terms are similar in the sense that the head of those terms + are all consisted of sub-terms of the form: + $(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $. + For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most + $n * (n + 1) / 2$ such terms. + For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives + can be described by 6 terms: + $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, + $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$. +The total number of different "head terms", $n * (n + 1) / 2$, + is proportional to the number of characters in the regex +$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$. +This suggests a slightly different notion of size, which we call the +alphabetic width: +%TODO: +(TODO: Alphabetic width def.) + + +Antimirov\parencite{Antimirov95} has proven that +$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$. +where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms +created by doing derivatives of $r$ against all possible strings. +If we can make sure that at any moment in our lexing algorithm our +intermediate result hold at most one copy of each of the +subterms then we can get the same bound as Antimirov's. +This leads to the algorithm in the next chapter. + + + + + +%---------------------------------------------------------------------------------------- +% SECTION 1 +%---------------------------------------------------------------------------------------- + +\section{Idempotency of $\simp$} + +\begin{equation} + \simp \;r = \simp\; \simp \; r +\end{equation} +This property means we do not have to repeatedly +apply simplification in each step, which justifies +our definition of $\blexersimp$. +It will also be useful in future proofs where properties such as +closed forms are needed. +The proof is by structural induction on $r$. + +%----------------------------------- +% SUBSECTION 1 +%----------------------------------- +\subsection{Syntactic Equivalence Under $\simp$} +We prove that minor differences can be annhilated +by $\simp$. +For example, +\begin{center} +$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = + \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$ +\end{center} +