diff -r 3c5e58d08939 -r ff7945a988a3 ChengsongTanPhdThesis/Chapters/Chapter2.tex --- a/ChengsongTanPhdThesis/Chapters/Chapter2.tex Fri May 20 18:52:03 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Chapter2.tex Thu May 26 20:51:40 2022 +0100 @@ -7,200 +7,16 @@ -%---------------------------------------------------------------------------------------- -% SECTION 1 -%---------------------------------------------------------------------------------------- -\section{Properties of $\backslash c$} - -To have a clear idea of what we can and -need to prove about the algorithms involving -Brzozowski's derivatives, there are a few -properties we need to be clear about . -\subsection{function $\backslash c$ is not 1-to-1} -\begin{center} -The derivative $w.r.t$ character $c$ is not one-to-one. -Formally, - $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ -\end{center} -This property is trivially true for the -character regex example: -\begin{center} - $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ -\end{center} -But apart from the cases where the derivative -output is $\ZERO$, are there non-trivial results -of derivatives which contain strings? -The answer is yes. -For example, -\begin{center} - Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ - where $a$ is not nullable.\\ - $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ - $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ -\end{center} -We start with two syntactically different regexes, -and end up with the same derivative result, which is -a "meaningful" regex because it contains strings. -We have rediscovered Arden's lemma:\\ -\begin{center} - $A^*B = A\cdot A^* \cdot B + B$ -\end{center} - - %----------------------------------- % SUBSECTION 1 %----------------------------------- -\subsection{Subsection 1} +\section{Specifications of Certain Functions to be Used} To be completed. - - - -%----------------------------------- -% SECTION 2 -%----------------------------------- - -\section{Finiteness Property} -In this section let us sketch our argument for why the size of the simplified -derivatives with the aggressive simplification function can be finitely bounded. - -For convenience, we use a new datatype which we call $\rrexp$ to denote -the difference between it and annotated regular expressions. -\[ \rrexp ::= \RZERO \mid \RONE - \mid \RCHAR{c} - \mid \RSEQ{r_1}{r_2} - \mid \RALTS{rs} - \mid \RSTAR{r} -\] -For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, -but keep everything else intact. -It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved -(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton -$\ALTS$). -We denote the operation of erasing the bits and turning an annotated regular expression -into an $\rrexp{}$ as $\rerase{}$. -\begin{center} -\begin{tabular}{lcr} -$\rerase{\AZERO}$ & $=$ & $\RZERO$\\ -$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\ -$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$ -\end{tabular} -\end{center} -%TODO: FINISH definition of rerase -Similarly we could define the derivative and simplification on -$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, -except that now they can operate on alternatives taking multiple arguments. - -\begin{center} -\begin{tabular}{lcr} -$\RALTS{rs} \backslash c$ & $=$ & $\RALTS{map\; (\_ \backslash c) \;rs}$\\ -(other clauses omitted) -\end{tabular} -\end{center} - -Now that $\rrexp$s do not have bitcodes on them, we can do the -duplicate removal without an equivalence relation: -\begin{center} -\begin{tabular}{lcl} -$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\ - & & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$ -\end{tabular} -\end{center} -%TODO: definition of rsimp (maybe only the alternative clause) - - -The reason why these definitions mirror precisely the corresponding operations -on annotated regualar expressions is that we can calculate sizes more easily: - -\begin{lemma} -$\rsize{\rerase a} = \asize a$ -\end{lemma} -A similar equation holds for annotated regular expressions' simplification -and the plain regular expressions' simplification: -\begin{lemma} -$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$ -\end{lemma} -Putting these two together we get a property that allows us to estimate the -size of an annotated regular expression derivative using its un-annotated counterpart: -\begin{lemma} -$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$ -\end{lemma} -Unless stated otherwise in this chapter all $\textit{rexp}$s without - bitcodes are seen as $\rrexp$s. - We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably - as the former suits people's intuitive way of stating a binary alternative - regular expression. - - Suppose -we have a size function for bitcoded regular expressions, written -$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree -(we omit the precise definition; ditto for lists $\llbracket r\!s\rrbracket$). For this we show that for every $r$ -there exists a bound $N$ -such that - -\begin{center} -$\forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N$ -\end{center} - -\noindent -We prove this by induction on $r$. The base cases for $\AZERO$, -$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is -for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction -hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and -$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows -% -\begin{center} -\begin{tabular}{lcll} -& & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\ -& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} :: - [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\ -& $\leq$ & - $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) :: - [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\ -& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket + - \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\ -& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + - \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\ -& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5) -\end{tabular} -\end{center} - -% tell Chengsong about Indian paper of closed forms of derivatives - -\noindent -where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$). -The reason why we could write the derivatives of a sequence this way is described in section 2. -The term (2) is used to control (1). -That is because one can obtain an overall -smaller regex list -by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it. -Section 3 is dedicated to its proof. -In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is -bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller -than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands -for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). -We reason similarly for $\STAR$.\medskip - -\noindent -Clearly we give in this finiteness argument (Step (5)) a very loose bound that is -far from the actual bound we can expect. We can do better than this, but this does not improve -the finiteness property we are proving. If we are interested in a polynomial bound, -one would hope to obtain a similar tight bound as for partial -derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with - $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in -partial derivatives). Unfortunately to obtain the exact same bound would mean -we need to introduce simplifications, such as - $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$, -which exist for partial derivatives. However, if we introduce them in our -setting we would lose the POSIX property of our calculated values. We leave better -bounds for future work.\\[-6.5mm] - - - %----------------------------------- % SECTION ? %----------------------------------- @@ -247,9 +63,324 @@ \begin{lemma} $r \rightarrow_f r' \implies \rsimp{r} \rightarrow \rsimp{r'}$ \end{lemma} -. + Fortunately this is provable by induction on the list $rs$ + + +%----------------------------------- +% SECTION 2 +%----------------------------------- + +\section{Finiteness Property} +In this section let us describe our argument for why the size of the simplified +derivatives with the aggressive simplification function is finitely bounded. + Suppose +we have a size function for bitcoded regular expressions, written +$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree + +\begin{center} +\begin{tabular}{ccc} +$\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\ +\end{tabular} +\end{center} +(TODO: COMPLETE this defn and for $rs$) + +The size is based on a recursive function on the structure of the regex, +not the bitcodes. +Therefore we may as well talk about size of an annotated regular expression +in their un-annotated form: +\begin{center} +$\asize(a) = \size(\erase(a))$. +\end{center} +(TODO: turn equals to roughly equals) + +But there is a minor nuisance here: +the erase function actually messes with the structure of the regular expression: +\begin{center} +\begin{tabular}{ccc} +$\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\ +\end{tabular} +\end{center} +(TODO: complete this definition with singleton r in alts) + +An alternative regular expression with an empty list of children + is turned into an $\ZERO$ during the +$\erase$ function, thereby changing the size and structure of the regex. +These will likely be fixable if we really want to use plain $\rexp$s for dealing +with size, but we choose a more straightforward (or stupid) method by +defining a new datatype that is similar to plain $\rexp$s but can take +non-binary arguments for its alternative constructor, + which we call $\rrexp$ to denote +the difference between it and plain regular expressions. +\[ \rrexp ::= \RZERO \mid \RONE + \mid \RCHAR{c} + \mid \RSEQ{r_1}{r_2} + \mid \RALTS{rs} + \mid \RSTAR{r} +\] +For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, +but keep everything else intact. +It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved +(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton +$\ALTS$). +We denote the operation of erasing the bits and turning an annotated regular expression +into an $\rrexp{}$ as $\rerase{}$. +\begin{center} +\begin{tabular}{lcl} +$\rerase{\AZERO}$ & $=$ & $\RZERO$\\ +$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\ +$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$ +\end{tabular} +\end{center} +%TODO: FINISH definition of rerase +Similarly we could define the derivative and simplification on +$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, +except that now they can operate on alternatives taking multiple arguments. + +\begin{center} +\begin{tabular}{lcr} +$\RALTS{rs} \backslash c$ & $=$ & $\RALTS{map\; (\_ \backslash c) \;rs}$\\ +(other clauses omitted) +\end{tabular} +\end{center} + +Now that $\rrexp$s do not have bitcodes on them, we can do the +duplicate removal without an equivalence relation: +\begin{center} +\begin{tabular}{lcl} +$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\ + & & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$ +\end{tabular} +\end{center} +%TODO: definition of rsimp (maybe only the alternative clause) + + +The reason why these definitions mirror precisely the corresponding operations +on annotated regualar expressions is that we can calculate sizes more easily: + +\begin{lemma} +$\rsize{\rerase a} = \asize a$ +\end{lemma} +This lemma says that the size of an r-erased regex is the same as the annotated regex. +this does not hold for plain $\rexp$s due to their ways of how alternatives are represented. +\begin{lemma} +$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$ +\end{lemma} +Putting these two together we get a property that allows us to estimate the +size of an annotated regular expression derivative using its un-annotated counterpart: +\begin{lemma} +$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$ +\end{lemma} +Unless stated otherwise in this chapter all $\textit{rexp}$s without + bitcodes are seen as $\rrexp$s. + We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably + as the former suits people's intuitive way of stating a binary alternative + regular expression. + + +\begin{theorem} +For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$ +\end{theorem} + +\noindent +\begin{proof} +We prove this by induction on $r$. The base cases for $\AZERO$, +$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is +for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction +hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and +$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows +% +\begin{center} +\begin{tabular}{lcll} +& & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\ +& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} :: + [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\ +& $\leq$ & + $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) :: + [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\ +& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket + + \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\ +& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + + \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\ +& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5) +\end{tabular} +\end{center} + + +\noindent +where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$). +The reason why we could write the derivatives of a sequence this way is described in section 2. +The term (2) is used to control (1). +That is because one can obtain an overall +smaller regex list +by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it. +Section 3 is dedicated to its proof. +In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is +bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller +than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands +for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). +We reason similarly for $\STAR$.\medskip +\end{proof} + +What guarantee does this bound give us? + +Whatever the regex is, it will not grow indefinitely. +Take our previous example $(a + aa)^*$ as an example: +\begin{center} +\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} +\begin{tikzpicture} +\begin{axis}[ + xlabel={number of $a$'s}, + x label style={at={(1.05,-0.05)}}, + ylabel={regex size}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax= 40, + ytick={0,10,...,40}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={$(a + aa)^*$}, + legend pos=north west, + legend cell align=left] +\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data}; +\end{axis} +\end{tikzpicture} +\end{tabular} +\end{center} +We are able to limit the size of the regex $(a + aa)^*$'s derivatives + with our simplification +rules very effectively. + + +As the graphs of some more randomly generated regexes show, the size of +the derivative might grow quickly at the start of the input, + but after a sufficiently long input string, the trend will stop. + + +%a few sample regular experessions' derivatives +%size change +%TODO: giving graphs showing a few regexes' size changes +%w;r;t the input characters number +%a*, aa*, aaa*, ..... +%randomly generated regexes +\begin{center} +\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} +\begin{tikzpicture} +\begin{axis}[ + xlabel={number of $a$'s}, + x label style={at={(1.05,-0.05)}}, + ylabel={regex size}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=1000, + ytick={0,100,...,1000}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={regex1}, + legend pos=north west, + legend cell align=left] +\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data}; +\end{axis} +\end{tikzpicture} + & +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + x label style={at={(1.05,-0.05)}}, + %ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=1000, + ytick={0,100,...,1000}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={regex2}, + legend pos=north west, + legend cell align=left] +\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data}; +\end{axis} +\end{tikzpicture} + & +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + x label style={at={(1.05,-0.05)}}, + %ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=1000, + ytick={0,100,...,1000}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={regex3}, + legend pos=north west, + legend cell align=left] +\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data}; +\end{axis} +\end{tikzpicture}\\ +\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.} +\end{tabular} +\end{center} + + + + + +\noindent +Clearly we give in this finiteness argument (Step (5)) a very loose bound that is +far from the actual bound we can expect. +In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound +is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$. +Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$ +inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function +$f(x) = x * 2^x$. +This means the bound we have will surge up at least +tower-exponentially with a linear increase of the depth. +For a regex of depth $n$, the bound +would be approximately $4^n$. +%TODO: change this to tower exponential! + +We can do better than this, but this does not improve +the finiteness property we are proving. If we are interested in a polynomial bound, +one would hope to obtain a similar tight bound as for partial +derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with + $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in +partial derivatives). +To obtain the exact same bound would mean +we need to introduce simplifications, such as + $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$, +which exist for partial derivatives. + +However, if we introduce them in our +setting we would lose the POSIX property of our calculated values. +A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$. +If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer +would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of +what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information +in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$ +occurs, and apply them in the right order once we get +a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value. +This is unlike the simplification we had before, where the rewriting rules +such as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value. +We will discuss better +bounds in the last section of this chapter.\\[-6.5mm] + + + + %---------------------------------------------------------------------------------------- % SECTION ?? %---------------------------------------------------------------------------------------- @@ -369,10 +500,62 @@ The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. + +%---------------------------------------------------------------------------------------- +% SECTION ALTS CLOSED FORM +%---------------------------------------------------------------------------------------- +\section{A Closed Form for \textit{ALTS}} +Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. + + +There are a few key steps, one of these steps is +$rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {}))) += rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {}))$ + +One might want to prove this by something a simple statement like: +$map (rder x) (rdistinct rs rset) = rdistinct (map (rder x) rs) (rder x) ` rset)$. + +For this to hold we want the $\textit{distinct}$ function to pick up +the elements before and after derivatives correctly: +$r \in rset \equiv (rder x r) \in (rder x rset)$. +which essentially requires that the function $\backslash$ is an injective mapping. + +Unfortunately the function $\backslash c$ is not an injective mapping. + +\subsection{function $\backslash c$ is not injective (1-to-1)} +\begin{center} +The derivative $w.r.t$ character $c$ is not one-to-one. +Formally, + $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ +\end{center} +This property is trivially true for the +character regex example: +\begin{center} + $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ +\end{center} +But apart from the cases where the derivative +output is $\ZERO$, are there non-trivial results +of derivatives which contain strings? +The answer is yes. +For example, +\begin{center} + Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ + where $a$ is not nullable.\\ + $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ + $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ +\end{center} +We start with two syntactically different regexes, +and end up with the same derivative result. +This is not surprising as we have such +equality as below in the style of Arden's lemma:\\ +\begin{center} + $L(A^*B) = L(A\cdot A^* \cdot B + B)$ +\end{center} + %---------------------------------------------------------------------------------------- % SECTION 4 %---------------------------------------------------------------------------------------- -\section{a bound for the star regular expression} +\section{A Bound for the Star Regular Expression} We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then the property of the $\distinct$ function. @@ -410,7 +593,7 @@ $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\ $\hflat r$ & $=$ & $ r$ \end{tabular} -\end{center} +\end{center}s Again these definitions are tailor-made for dealing with alternatives that have originated from a star's derivatives, so we don't attempt to open up all possible regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2 @@ -452,7 +635,7 @@ proof for a size bound, we are happy to restrict ourselves to unannotated regular expressions, and obtain such equalities as \begin{lemma} - $\rsimp{(r_1 + r_2)} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$ + $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$ \end{lemma} \begin{proof} @@ -480,30 +663,104 @@ One might wonder the actual bound rather than the loose bound we gave -for the convenience of a easier proof. -How much can the regex $r^* \backslash s$ grow? As hinted earlier, +for the convenience of an easier proof. +How much can the regex $r^* \backslash s$ grow? +As earlier graphs have shown, +%TODO: reference that graph where size grows quickly they can grow at a maximum speed exponential $w.r.t$ the number of characters, -but will eventually level off when the string $s$ is long enough, -as suggested by the finiteness bound proof. -And unfortunately we have concrete examples +but will eventually level off when the string $s$ is long enough. +If they grow to a size exponential $w.r.t$ the original regex, our algorithm +would still be slow. +And unfortunately, we have concrete examples where such regexes grew exponentially large before levelling off: $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + (\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum -size that is exponential on the number $n$. -%TODO: give out a graph showing how the size of the regex grows and levels off at a size exponential to the original regex - + size that is exponential on the number $n$ +under our current simplification rules: +%TODO: graph of a regex whose size increases exponentially. +\begin{center} +\begin{tikzpicture} + \begin{axis}[ + height=0.5\textwidth, + width=\textwidth, + xlabel=number of a's, + xtick={0,...,9}, + ylabel=maximum size, + ymode=log, + log basis y={2} +] + \addplot[mark=*,blue] table {re-chengsong.data}; + \end{axis} +\end{tikzpicture} +\end{center} -While the tight upper bound will definitely be a lot lower than -the one we gave for the finiteness proof, -it will still be at least expoential, under our current simplification rules. +For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$ +to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + +(\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion. +The exponential size is triggered by that the regex +$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$ +inside the $(\ldots) ^*$ having exponentially many +different derivatives, despite those difference being minor. +$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$ +will therefore contain the following terms (after flattening out all nested +alternatives): +\begin{center} +$(\oplus_{i = 1]{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\ +$(1 \leq m' \leq m )$ +\end{center} +These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix). + With each new input character taking the derivative against the intermediate result, more and more such distinct + terms will accumulate, +until the length reaches $L.C.M.(1, \ldots, n)$. +$\textit{distinctBy}$ will not be able to de-duplicate any two of these terms +$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ -This suggests stronger simplifications that keep the tight bound polynomial. +$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ + where $m' \neq m''$ \\ + as they are slightly different. + This means that with our current simplification methods, + we will not be able to control the derivative so that + $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$ + as there are already exponentially many terms. + These terms are similar in the sense that the head of those terms + are all consisted of sub-terms of the form: + $(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $. + For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most + $n * (n + 1) / 2$ such terms. + For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives + can be described by 6 terms: + $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, + $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$. +The total number of different "head terms", $n * (n + 1) / 2$, + is proportional to the number of characters in the regex +$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$. +This suggests a slightly different notion of size, which we call the +alphabetic width: +%TODO: +(TODO: Alphabetic width def.) + + +Antimirov\parencite{Antimirov95} has proven that +$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$. +where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms +created by doing derivatives of $r$ against all possible strings. +If we can make sure that at any moment in our lexing algorithm our +intermediate result hold at most one copy of each of the +subterms then we can get the same bound as Antimirov's. +This leads to the algorithm in the next section. %---------------------------------------------------------------------------------------- % SECTION 5 %---------------------------------------------------------------------------------------- -\section{A Stronger Version of Simplification} +\section{A Stronger Version of Simplification Inspired by Antimirov} %TODO: search for isabelle proofs of algorithms that check equivalence + + + + + + +