diff -r 3cbcd7cda0a9 -r 0497408a3598 ChengsongTanPhdThesis/Chapters/ChapterFinite.tex --- a/ChengsongTanPhdThesis/Chapters/ChapterFinite.tex Wed Jul 13 08:27:28 2022 +0100 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,877 +0,0 @@ -% Chapter Template - -\chapter{Finiteness Bound} % Main chapter title - -\label{ChapterFinite} -% In Chapter 4 \ref{Chapter4} we give the second guarantee -%of our bitcoded algorithm, that is a finite bound on the size of any -%regex's derivatives. - -In this chapter we give a guarantee in terms of time complexity: -given a regular expression $r$, for any string $s$ -our algorithm's internal data structure is finitely bounded. -To obtain such a proof, we need to -\begin{itemize} -\item -Define an new datatype for regular expressions that makes it easy -to reason about the size of an annotated regular expression. -\item -A set of equalities for this new datatype that enables one to -rewrite $\bderssimp{r_1 \cdot r_2}{s}$ and $\bderssimp{r^*}{s}$ etc. -by their children regexes $r_1$, $r_2$, and $r$. -\item -Using those equalities to actually get those rewriting equations, which we call -"closed forms". -\item -Bound the closed forms, thereby bounding the original -$\blexersimp$'s internal data structures. -\end{itemize} - -\section{the $\mathbf{r}$-rexp datatype and the size functions} - -We have a size function for bitcoded regular expressions, written -$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree - -\begin{center} -\begin{tabular}{ccc} -$\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\ -\end{tabular} -\end{center} -(TODO: COMPLETE this defn and for $rs$) - -The size is based on a recursive function on the structure of the regex, -not the bitcodes. -Therefore we may as well talk about size of an annotated regular expression -in their un-annotated form: -\begin{center} -$\asize(a) = \size(\erase(a))$. -\end{center} -(TODO: turn equals to roughly equals) - -But there is a minor nuisance here: -the erase function actually messes with the structure of the regular expression: -\begin{center} -\begin{tabular}{ccc} -$\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\ -\end{tabular} -\end{center} -(TODO: complete this definition with singleton r in alts) - -An alternative regular expression with an empty list of children - is turned into an $\ZERO$ during the -$\erase$ function, thereby changing the size and structure of the regex. -These will likely be fixable if we really want to use plain $\rexp$s for dealing -with size, but we choose a more straightforward (or stupid) method by -defining a new datatype that is similar to plain $\rexp$s but can take -non-binary arguments for its alternative constructor, - which we call $\rrexp$ to denote -the difference between it and plain regular expressions. -\[ \rrexp ::= \RZERO \mid \RONE - \mid \RCHAR{c} - \mid \RSEQ{r_1}{r_2} - \mid \RALTS{rs} - \mid \RSTAR{r} -\] -For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, -but keep everything else intact. -It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved -(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton -$\ALTS$). -We denote the operation of erasing the bits and turning an annotated regular expression -into an $\rrexp{}$ as $\rerase{}$. -\begin{center} -\begin{tabular}{lcl} -$\rerase{\AZERO}$ & $=$ & $\RZERO$\\ -$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\ -$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$ -\end{tabular} -\end{center} -%TODO: FINISH definition of rerase -Similarly we could define the derivative and simplification on -$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, -except that now they can operate on alternatives taking multiple arguments. - -\begin{center} -\begin{tabular}{lcr} -$\RALTS{rs} \backslash c$ & $=$ & $\RALTS{map\; (\_ \backslash c) \;rs}$\\ -(other clauses omitted) -\end{tabular} -\end{center} - -Now that $\rrexp$s do not have bitcodes on them, we can do the -duplicate removal without an equivalence relation: -\begin{center} -\begin{tabular}{lcl} -$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\ - & & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$ -\end{tabular} -\end{center} -%TODO: definition of rsimp (maybe only the alternative clause) - - -The reason why these definitions mirror precisely the corresponding operations -on annotated regualar expressions is that we can calculate sizes more easily: - -\begin{lemma} -$\rsize{\rerase a} = \asize a$ -\end{lemma} -This lemma says that the size of an r-erased regex is the same as the annotated regex. -this does not hold for plain $\rexp$s due to their ways of how alternatives are represented. -\begin{lemma} -$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$ -\end{lemma} -Putting these two together we get a property that allows us to estimate the -size of an annotated regular expression derivative using its un-annotated counterpart: -\begin{lemma} -$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$ -\end{lemma} -Unless stated otherwise in this chapter all $\textit{rexp}$s without - bitcodes are seen as $\rrexp$s. - We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably - as the former suits people's intuitive way of stating a binary alternative - regular expression. - - - -%----------------------------------- -% SECTION ? -%----------------------------------- -\section{preparatory properties for getting a finiteness bound} -Before we get to the proof that says the intermediate result of our lexer will -remain finitely bounded, which is an important efficiency/liveness guarantee, -we shall first develop a few preparatory properties and definitions to -make the process of proving that a breeze. - -We define rewriting relations for $\rrexp$s, which allows us to do the -same trick as we did for the correctness proof, -but this time we will have stronger equalities established. -\subsection{"hrewrite" relation} -List of 1-step rewrite rules for regular expressions simplification without bitcodes: -\begin{figure} -\caption{the "h-rewrite" rules} -\[ -r_1 \cdot \ZERO \rightarrow_h \ZERO \] - -\[ -\ZERO \cdot r_2 \rightarrow \ZERO -\] -\end{figure} -And we define an "grewrite" relation that works on lists: -\begin{center} -\begin{tabular}{lcl} -$ \ZERO :: rs$ & $\rightarrow_g$ & $rs$ -\end{tabular} -\end{center} - - - -With these relations we prove -\begin{lemma} -$rs \rightarrow rs' \implies \RALTS{rs} \rightarrow \RALTS{rs'}$ -\end{lemma} -which enables us to prove "closed forms" equalities such as -\begin{lemma} -$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\suffix \; s \; r_1 ))}$ -\end{lemma} - -But the most involved part of the above lemma is proving the following: -\begin{lemma} -$\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ -\end{lemma} -which is needed in proving things like -\begin{lemma} -$r \rightarrow_f r' \implies \rsimp{r} \rightarrow \rsimp{r'}$ -\end{lemma} - -Fortunately this is provable by induction on the list $rs$ - - - -%----------------------------------- -% SECTION 2 -%----------------------------------- - -\begin{theorem} -For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$ -\end{theorem} - -\noindent -\begin{proof} -We prove this by induction on $r$. The base cases for $\AZERO$, -$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is -for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction -hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and -$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows -% -\begin{center} -\begin{tabular}{lcll} -& & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\ -& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} :: - [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\ -& $\leq$ & - $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) :: - [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\ -& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket + - \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\ -& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + - \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\ -& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5) -\end{tabular} -\end{center} - - -\noindent -where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$). -The reason why we could write the derivatives of a sequence this way is described in section 2. -The term (2) is used to control (1). -That is because one can obtain an overall -smaller regex list -by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it. -Section 3 is dedicated to its proof. -In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is -bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller -than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands -for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). -We reason similarly for $\STAR$.\medskip -\end{proof} - -What guarantee does this bound give us? - -Whatever the regex is, it will not grow indefinitely. -Take our previous example $(a + aa)^*$ as an example: -\begin{center} -\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} -\begin{tikzpicture} -\begin{axis}[ - xlabel={number of $a$'s}, - x label style={at={(1.05,-0.05)}}, - ylabel={regex size}, - enlargelimits=false, - xtick={0,5,...,30}, - xmax=33, - ymax= 40, - ytick={0,10,...,40}, - scaled ticks=false, - axis lines=left, - width=5cm, - height=4cm, - legend entries={$(a + aa)^*$}, - legend pos=north west, - legend cell align=left] -\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data}; -\end{axis} -\end{tikzpicture} -\end{tabular} -\end{center} -We are able to limit the size of the regex $(a + aa)^*$'s derivatives - with our simplification -rules very effectively. - - -In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound -is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$. -Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$ -inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function -$f(x) = x * 2^x$. -This means the bound we have will surge up at least -tower-exponentially with a linear increase of the depth. -For a regex of depth $n$, the bound -would be approximately $4^n$. - -Test data in the graphs from randomly generated regular expressions -shows that the giant bounds are far from being hit. -%a few sample regular experessions' derivatives -%size change -%TODO: giving regex1_size_change.data showing a few regexes' size changes -%w;r;t the input characters number, where the size is usually cubic in terms of original size -%a*, aa*, aaa*, ..... -%randomly generated regexes -\begin{center} -\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} -\begin{tikzpicture} -\begin{axis}[ - xlabel={number of $a$'s}, - x label style={at={(1.05,-0.05)}}, - ylabel={regex size}, - enlargelimits=false, - xtick={0,5,...,30}, - xmax=33, - ymax=1000, - ytick={0,100,...,1000}, - scaled ticks=false, - axis lines=left, - width=5cm, - height=4cm, - legend entries={regex1}, - legend pos=north west, - legend cell align=left] -\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data}; -\end{axis} -\end{tikzpicture} - & -\begin{tikzpicture} -\begin{axis}[ - xlabel={$n$}, - x label style={at={(1.05,-0.05)}}, - %ylabel={time in secs}, - enlargelimits=false, - xtick={0,5,...,30}, - xmax=33, - ymax=1000, - ytick={0,100,...,1000}, - scaled ticks=false, - axis lines=left, - width=5cm, - height=4cm, - legend entries={regex2}, - legend pos=north west, - legend cell align=left] -\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data}; -\end{axis} -\end{tikzpicture} - & -\begin{tikzpicture} -\begin{axis}[ - xlabel={$n$}, - x label style={at={(1.05,-0.05)}}, - %ylabel={time in secs}, - enlargelimits=false, - xtick={0,5,...,30}, - xmax=33, - ymax=1000, - ytick={0,100,...,1000}, - scaled ticks=false, - axis lines=left, - width=5cm, - height=4cm, - legend entries={regex3}, - legend pos=north west, - legend cell align=left] -\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data}; -\end{axis} -\end{tikzpicture}\\ -\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.} -\end{tabular} -\end{center} - - - - - -\noindent -Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the -original size. -This suggests a link towrads "partial derivatives" - introduced by Antimirov \cite{Antimirov95}. - - \section{Antimirov's partial derivatives} - The idea behind Antimirov's partial derivatives -is to do derivatives in a similar way as suggested by Brzozowski, -but maintain a set of regular expressions instead of a single one: - -%TODO: antimirov proposition 3.1, needs completion - \begin{center} - \begin{tabular}{ccc} - $\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\ -$\partial_x(\ONE)$ & $=$ & $\phi$ -\end{tabular} -\end{center} - -Rather than joining the calculated derivatives $\partial_x a$ and $\partial_x b$ together -using the alternatives constructor, Antimirov cleverly chose to put them into -a set instead. This breaks the terms in a derivative regular expression up, -allowing us to understand what are the "atomic" components of it. -For example, To compute what regular expression $x^*(xx + y)^*$'s -derivative against $x$ is made of, one can do a partial derivative -of it and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$ -from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$. -To get all the "atomic" components of a regular expression's possible derivatives, -there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes -whatever character is available at the head of the string inside the language of a -regular expression, and gives back the character and the derivative regular expression -as a pair (which he called "monomial"): - \begin{center} - \begin{tabular}{ccc} - $\lf(\ONE)$ & $=$ & $\phi$\\ -$\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\ - $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\ - $\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\ -\end{tabular} -\end{center} -%TODO: completion - -There is a slight difference in the last three clauses compared -with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular -expression $r$ with every element inside $\textit{rset}$ to create a set of -sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates -on a set of monomials (which Antimirov called "linear form") and a regular -expression, and returns a linear form: - \begin{center} - \begin{tabular}{ccc} - $l \bigodot (\ZERO)$ & $=$ & $\phi$\\ - $l \bigodot (\ONE)$ & $=$ & $l$\\ - $\phi \bigodot t$ & $=$ & $\phi$\\ - $\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\ - $\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\ - $\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\ - $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\ - $\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\ -\end{tabular} -\end{center} -%TODO: completion - - Some degree of simplification is applied when doing $\bigodot$, for example, - $l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$, - and $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and - $\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$, - and so on. - - With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regex $r$ with - an iterative procedure: - \begin{center} - \begin{tabular}{llll} -$\textit{while}$ & $(\Delta_i \neq \phi)$ & & \\ - & $\Delta_{i+1}$ & $ =$ & $\lf(\Delta_i) - \PD_i$ \\ - & $\PD_{i+1}$ & $ =$ & $\Delta_{i+1} \cup \PD_i$ \\ -$\partial_{UNIV}(r)$ & $=$ & $\PD$ & -\end{tabular} -\end{center} - - - $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$, - - -However, if we introduce them in our -setting we would lose the POSIX property of our calculated values. -A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$. -If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer -would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of -what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information -in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$ -occurs, and apply them in the right order once we get -a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value. -This is unlike the simplification we had before, where the rewriting rules -such as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value. -We will discuss better -bounds in the last section of this chapter.\\[-6.5mm] - - - - -%---------------------------------------------------------------------------------------- -% SECTION ?? -%---------------------------------------------------------------------------------------- - -\section{"Closed Forms" of regular expressions' derivatives w.r.t strings} -To embark on getting the "closed forms" of regexes, we first -define a few auxiliary definitions to make expressing them smoothly. - - \begin{center} - \begin{tabular}{ccc} - $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ -$\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\ -$\sflataux r$ & $=$ & $ [r]$ -\end{tabular} -\end{center} -The intuition behind $\sflataux{\_}$ is to break up nested regexes -of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape -into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$. -It will return the singleton list $[r]$ otherwise. - -$\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps -the output type a regular expression, not a list: - \begin{center} - \begin{tabular}{ccc} - $\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ -$\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\ -$\sflat r$ & $=$ & $ [r]$ -\end{tabular} -\end{center} -$\sflataux{\_}$ and $\sflat{\_}$ is only recursive in terms of the - first element of the list of children of -an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for dealing with the regular expression -$r_1 \cdot r_2 \backslash s$. - -With $\sflat{\_}$ and $\sflataux{\_}$ we make -precise what "closed forms" we have for the sequence derivatives and their simplifications, -in other words, how can we construct $(r_1 \cdot r_2) \backslash s$ -and $\bderssimp{(r_1\cdot r_2)}{s}$, -if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$) -and $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$ ranges over -the substring of $s$? -First let's look at a series of derivatives steps on a sequence -regular expression, (assuming) that each time the first -component of the sequence is always nullable): -\begin{center} - -$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\ -$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad - \ldots$ - -\end{center} -%TODO: cite indian paper -Indianpaper have come up with a slightly more formal way of putting the above process: -\begin{center} -$r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) + -\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots -+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$ -\end{center} -where $\delta(b, r)$ will produce $r$ -if $b$ evaluates to true, -and $\ZERO$ otherwise. - - But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical - equivalence. To make this intuition useful - for a formal proof, we need something -stronger than language equivalence. -With the help of $\sflat{\_}$ we can state the equation in Indianpaper -more rigorously: -\begin{lemma} -$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$ -\end{lemma} - -The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string: - -\begin{center} -\begin{tabular}{lcl} -$\vsuf{[]}{\_} $ & $=$ & $[]$\\ -$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ - && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $ -\end{tabular} -\end{center} -It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable, -and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in -the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$. -In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing -the entire result of $(r_1 \cdot r_2) \backslash s$, -it only stores all the $s''$ such that $r_2 \backslash s''$ are occurring terms in $(r_1\cdot r_2)\backslash s$. - -With this we can also add simplifications to both sides and get -\begin{lemma} -$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ -\end{lemma} -Together with the idempotency property of $\rsimp$, -%TODO: state the idempotency property of rsimp -\begin{lemma} -$\rsimp{r} = \rsimp{\rsimp{r}}$ -\end{lemma} -it is possible to convert the above lemma to obtain a "closed form" -for derivatives nested with simplification: -\begin{lemma} -$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ -\end{lemma} -And now the reason we have (1) in section 1 is clear: -$\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$, -is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$ - as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$. - -%---------------------------------------------------------------------------------------- -% SECTION 3 -%---------------------------------------------------------------------------------------- - -\section{interaction between $\distinctWith$ and $\flts$} -Note that it is not immediately obvious that -\begin{center} -$\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket $. -\end{center} -The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of -duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. - - -%----------------------------------- -% SECTION syntactic equivalence under simp -%----------------------------------- -\section{Syntactic Equivalence Under $\simp$} -We prove that minor differences can be annhilated -by $\simp$. -For example, -\begin{center} -$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = - \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$ -\end{center} - - -%---------------------------------------------------------------------------------------- -% SECTION ALTS CLOSED FORM -%---------------------------------------------------------------------------------------- -\section{A Closed Form for \textit{ALTS}} -Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. - - -There are a few key steps, one of these steps is -\[ -rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {}))) -= rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {})) -\] - - -One might want to prove this by something a simple statement like: -$map (rder x) (rdistinct rs rset) = rdistinct (map (rder x) rs) (rder x) ` rset)$. - -For this to hold we want the $\textit{distinct}$ function to pick up -the elements before and after derivatives correctly: -$r \in rset \equiv (rder x r) \in (rder x rset)$. -which essentially requires that the function $\backslash$ is an injective mapping. - -Unfortunately the function $\backslash c$ is not an injective mapping. - -\subsection{function $\backslash c$ is not injective (1-to-1)} -\begin{center} -The derivative $w.r.t$ character $c$ is not one-to-one. -Formally, - $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ -\end{center} -This property is trivially true for the -character regex example: -\begin{center} - $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ -\end{center} -But apart from the cases where the derivative -output is $\ZERO$, are there non-trivial results -of derivatives which contain strings? -The answer is yes. -For example, -\begin{center} - Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ - where $a$ is not nullable.\\ - $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ - $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ -\end{center} -We start with two syntactically different regexes, -and end up with the same derivative result. -This is not surprising as we have such -equality as below in the style of Arden's lemma:\\ -\begin{center} - $L(A^*B) = L(A\cdot A^* \cdot B + B)$ -\end{center} - -%---------------------------------------------------------------------------------------- -% SECTION 4 -%---------------------------------------------------------------------------------------- -\section{A Bound for the Star Regular Expression} -We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using -the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then -the property of the $\distinct$ function. -Now we try to get a bound on $r^* \backslash s$ as well. -Again, we first look at how a star's derivatives evolve, if they grow maximally: -\begin{center} - -$r^* \quad \longrightarrow_{\backslash c} \quad (r\backslash c) \cdot r^* \quad \longrightarrow_{\backslash c'} \quad -r \backslash cc' \cdot r^* + r \backslash c' \cdot r^* \quad \longrightarrow_{\backslash c''} \quad -(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) \quad \longrightarrow_{\backslash c'''} -\quad \ldots$ - -\end{center} -When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, -$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable, -the number of terms in $r^* \backslash s$ will grow exponentially, causing the size -of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not -count the possible size explosions of $r \backslash c$ themselves. - -Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like -$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ -into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$ -and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$). -For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$: -%TODO: definitions of and \hflataux \hflat - \begin{center} - \begin{tabular}{ccc} - $\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\ -$\hflataux r$ & $=$ & $ [r]$ -\end{tabular} -\end{center} - - \begin{center} - \begin{tabular}{ccc} - $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\ -$\hflat r$ & $=$ & $ r$ -\end{tabular} -\end{center}s -Again these definitions are tailor-made for dealing with alternatives that have -originated from a star's derivatives, so we don't attempt to open up all possible -regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2 -elements. -We give a predicate for such "star-created" regular expressions: -\begin{center} -\begin{tabular}{lcr} - & & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\ - $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$ - \end{tabular} - \end{center} - - These definitions allows us the flexibility to talk about - regular expressions in their most convenient format, - for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $ - instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$. - These definitions help express that certain classes of syntatically - distinct regular expressions are actually the same under simplification. - This is not entirely true for annotated regular expressions: - %TODO: bsimp bders \neq bderssimp - \begin{center} - $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$ - \end{center} - For bit-codes, the order in which simplification is applied - might cause a difference in the location they are placed. - If we want something like - \begin{center} - $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$ - \end{center} - Some "canonicalization" procedure is required, - which either pushes all the common bitcodes to nodes - as senior as possible: - \begin{center} - $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ - \end{center} - or does the reverse. However bitcodes are not of interest if we are talking about - the $\llbracket r \rrbracket$ size of a regex. - Therefore for the ease and simplicity of producing a - proof for a size bound, we are happy to restrict ourselves to - unannotated regular expressions, and obtain such equalities as - \begin{lemma} - $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$ - \end{lemma} - - \begin{proof} - By using the rewriting relation $\rightsquigarrow$ - \end{proof} - %TODO: rsimp sflat -And from this we obtain a proof that a star's derivative will be the same -as if it had all its nested alternatives created during deriving being flattened out: - For example, - \begin{lemma} - $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ - \end{lemma} - \begin{proof} - By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$. - \end{proof} -% The simplification of a flattened out regular expression, provided it comes -%from the derivative of a star, is the same as the one nested. - - - - - - - - - -One might wonder the actual bound rather than the loose bound we gave -for the convenience of an easier proof. -How much can the regex $r^* \backslash s$ grow? -As earlier graphs have shown, -%TODO: reference that graph where size grows quickly - they can grow at a maximum speed - exponential $w.r.t$ the number of characters, -but will eventually level off when the string $s$ is long enough. -If they grow to a size exponential $w.r.t$ the original regex, our algorithm -would still be slow. -And unfortunately, we have concrete examples -where such regexes grew exponentially large before levelling off: -$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + -(\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum - size that is exponential on the number $n$ -under our current simplification rules: -%TODO: graph of a regex whose size increases exponentially. -\begin{center} -\begin{tikzpicture} - \begin{axis}[ - height=0.5\textwidth, - width=\textwidth, - xlabel=number of a's, - xtick={0,...,9}, - ylabel=maximum size, - ymode=log, - log basis y={2} -] - \addplot[mark=*,blue] table {re-chengsong.data}; - \end{axis} -\end{tikzpicture} -\end{center} - -For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$ -to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + -(\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion. -The exponential size is triggered by that the regex -$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$ -inside the $(\ldots) ^*$ having exponentially many -different derivatives, despite those difference being minor. -$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$ -will therefore contain the following terms (after flattening out all nested -alternatives): -\begin{center} -$(\oplus_{i = 1]{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\ -$(1 \leq m' \leq m )$ -\end{center} -These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix). - With each new input character taking the derivative against the intermediate result, more and more such distinct - terms will accumulate, -until the length reaches $L.C.M.(1, \ldots, n)$. -$\textit{distinctBy}$ will not be able to de-duplicate any two of these terms -$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ - -$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ - where $m' \neq m''$ \\ - as they are slightly different. - This means that with our current simplification methods, - we will not be able to control the derivative so that - $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$ - as there are already exponentially many terms. - These terms are similar in the sense that the head of those terms - are all consisted of sub-terms of the form: - $(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $. - For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most - $n * (n + 1) / 2$ such terms. - For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives - can be described by 6 terms: - $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, - $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$. -The total number of different "head terms", $n * (n + 1) / 2$, - is proportional to the number of characters in the regex -$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$. -This suggests a slightly different notion of size, which we call the -alphabetic width: -%TODO: -(TODO: Alphabetic width def.) - - -Antimirov\parencite{Antimirov95} has proven that -$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$. -where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms -created by doing derivatives of $r$ against all possible strings. -If we can make sure that at any moment in our lexing algorithm our -intermediate result hold at most one copy of each of the -subterms then we can get the same bound as Antimirov's. -This leads to the algorithm in the next chapter. - - - - - -%---------------------------------------------------------------------------------------- -% SECTION 1 -%---------------------------------------------------------------------------------------- - -\section{Idempotency of $\simp$} - -\begin{equation} - \simp \;r = \simp\; \simp \; r -\end{equation} -This property means we do not have to repeatedly -apply simplification in each step, which justifies -our definition of $\blexersimp$. -It will also be useful in future proofs where properties such as -closed forms are needed. -The proof is by structural induction on $r$. - -%----------------------------------- -% SUBSECTION 1 -%----------------------------------- -\subsection{Syntactic Equivalence Under $\simp$} -We prove that minor differences can be annhilated -by $\simp$. -For example, -\begin{center} -$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = - \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$ -\end{center} -