% Chapter Template+ −
+ −
\chapter{Finiteness Bound} % Main chapter title+ −
+ −
\label{ChapterFinite} + −
% In Chapter 4 \ref{Chapter4} we give the second guarantee+ −
%of our bitcoded algorithm, that is a finite bound on the size of any + −
%regex's derivatives. + −
+ −
In this chapter we give a guarantee in terms of time complexity:+ −
given a regular expression $r$, for any string $s$ + −
our algorithm's internal data structure is finitely bounded.+ −
To obtain such a proof, we need to + −
\begin{itemize}+ −
\item+ −
Define an new datatype for regular expressions that makes it easy+ −
to reason about the size of an annotated regular expression.+ −
\item+ −
A set of equalities for this new datatype that enables one to+ −
rewrite $\bderssimp{r_1 \cdot r_2}{s}$ and $\bderssimp{r^*}{s}$ etc.+ −
by their children regexes $r_1$, $r_2$, and $r$.+ −
\item+ −
Using those equalities to actually get those rewriting equations, which we call+ −
"closed forms".+ −
\item+ −
Bound the closed forms, thereby bounding the original+ −
$\blexersimp$'s internal data structures.+ −
\end{itemize}+ −
+ −
\section{the $\mathbf{r}$-rexp datatype and the size functions}+ −
+ −
We have a size function for bitcoded regular expressions, written+ −
$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree+ −
+ −
\begin{center}+ −
\begin{tabular}{ccc}+ −
$\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\+ −
\end{tabular}+ −
\end{center}+ −
(TODO: COMPLETE this defn and for $rs$)+ −
+ −
The size is based on a recursive function on the structure of the regex,+ −
not the bitcodes.+ −
Therefore we may as well talk about size of an annotated regular expression + −
in their un-annotated form:+ −
\begin{center}+ −
$\asize(a) = \size(\erase(a))$. + −
\end{center}+ −
(TODO: turn equals to roughly equals)+ −
+ −
But there is a minor nuisance here:+ −
the erase function actually messes with the structure of the regular expression:+ −
\begin{center}+ −
\begin{tabular}{ccc}+ −
$\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\+ −
\end{tabular}+ −
\end{center}+ −
(TODO: complete this definition with singleton r in alts)+ −
+ −
An alternative regular expression with an empty list of children+ −
is turned into an $\ZERO$ during the+ −
$\erase$ function, thereby changing the size and structure of the regex.+ −
These will likely be fixable if we really want to use plain $\rexp$s for dealing+ −
with size, but we choose a more straightforward (or stupid) method by + −
defining a new datatype that is similar to plain $\rexp$s but can take+ −
non-binary arguments for its alternative constructor,+ −
which we call $\rrexp$ to denote+ −
the difference between it and plain regular expressions. + −
\[ \rrexp ::= \RZERO \mid \RONE+ −
\mid \RCHAR{c} + −
\mid \RSEQ{r_1}{r_2}+ −
\mid \RALTS{rs}+ −
\mid \RSTAR{r} + −
\]+ −
For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, + −
but keep everything else intact.+ −
It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved+ −
(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton+ −
$\ALTS$).+ −
We denote the operation of erasing the bits and turning an annotated regular expression + −
into an $\rrexp{}$ as $\rerase{}$.+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\rerase{\AZERO}$ & $=$ & $\RZERO$\\+ −
$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\+ −
$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$+ −
\end{tabular}+ −
\end{center}+ −
%TODO: FINISH definition of rerase+ −
Similarly we could define the derivative and simplification on + −
$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, + −
except that now they can operate on alternatives taking multiple arguments.+ −
+ −
\begin{center}+ −
\begin{tabular}{lcr}+ −
$\RALTS{rs} \backslash c$ & $=$ & $\RALTS{map\; (\_ \backslash c) \;rs}$\\+ −
(other clauses omitted)+ −
\end{tabular}+ −
\end{center}+ −
+ −
Now that $\rrexp$s do not have bitcodes on them, we can do the + −
duplicate removal without an equivalence relation:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\+ −
& & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$+ −
\end{tabular}+ −
\end{center}+ −
%TODO: definition of rsimp (maybe only the alternative clause)+ −
+ −
+ −
The reason why these definitions mirror precisely the corresponding operations+ −
on annotated regualar expressions is that we can calculate sizes more easily:+ −
+ −
\begin{lemma}+ −
$\rsize{\rerase a} = \asize a$+ −
\end{lemma}+ −
This lemma says that the size of an r-erased regex is the same as the annotated regex.+ −
this does not hold for plain $\rexp$s due to their ways of how alternatives are represented.+ −
\begin{lemma}+ −
$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$+ −
\end{lemma}+ −
Putting these two together we get a property that allows us to estimate the + −
size of an annotated regular expression derivative using its un-annotated counterpart:+ −
\begin{lemma}+ −
$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$+ −
\end{lemma}+ −
Unless stated otherwise in this chapter all $\textit{rexp}$s without+ −
bitcodes are seen as $\rrexp$s.+ −
We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably+ −
as the former suits people's intuitive way of stating a binary alternative+ −
regular expression.+ −
+ −
+ −
+ −
%-----------------------------------+ −
% SECTION ?+ −
%-----------------------------------+ −
\section{preparatory properties for getting a finiteness bound}+ −
Before we get to the proof that says the intermediate result of our lexer will+ −
remain finitely bounded, which is an important efficiency/liveness guarantee,+ −
we shall first develop a few preparatory properties and definitions to + −
make the process of proving that a breeze.+ −
+ −
We define rewriting relations for $\rrexp$s, which allows us to do the + −
same trick as we did for the correctness proof,+ −
but this time we will have stronger equalities established.+ −
\subsection{"hrewrite" relation}+ −
List of 1-step rewrite rules for regular expressions simplification without bitcodes:+ −
\begin{figure}+ −
\caption{the "h-rewrite" rules}+ −
\[+ −
r_1 \cdot \ZERO \rightarrow_h \ZERO \]+ −
+ −
\[+ −
\ZERO \cdot r_2 \rightarrow \ZERO + −
\]+ −
\end{figure}+ −
And we define an "grewrite" relation that works on lists:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$ \ZERO :: rs$ & $\rightarrow_g$ & $rs$+ −
\end{tabular}+ −
\end{center}+ −
+ −
+ −
+ −
With these relations we prove+ −
\begin{lemma}+ −
$rs \rightarrow rs' \implies \RALTS{rs} \rightarrow \RALTS{rs'}$+ −
\end{lemma}+ −
which enables us to prove "closed forms" equalities such as+ −
\begin{lemma}+ −
$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\suffix \; s \; r_1 ))}$+ −
\end{lemma}+ −
+ −
But the most involved part of the above lemma is proving the following:+ −
\begin{lemma}+ −
$\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ + −
\end{lemma}+ −
which is needed in proving things like + −
\begin{lemma}+ −
$r \rightarrow_f r' \implies \rsimp{r} \rightarrow \rsimp{r'}$+ −
\end{lemma}+ −
+ −
Fortunately this is provable by induction on the list $rs$+ −
+ −
+ −
+ −
%-----------------------------------+ −
% SECTION 2+ −
%-----------------------------------+ −
+ −
\begin{theorem}+ −
For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$+ −
\end{theorem}+ −
+ −
\noindent+ −
\begin{proof}+ −
We prove this by induction on $r$. The base cases for $\AZERO$,+ −
$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is+ −
for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction+ −
hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and+ −
$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows+ −
%+ −
\begin{center}+ −
\begin{tabular}{lcll}+ −
& & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\+ −
& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::+ −
[\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\+ −
& $\leq$ &+ −
$\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::+ −
[$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\+ −
& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket ++ −
\llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\+ −
& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 ++ −
\llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\+ −
& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)+ −
\end{tabular}+ −
\end{center}+ −
+ −
+ −
\noindent+ −
where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).+ −
The reason why we could write the derivatives of a sequence this way is described in section 2.+ −
The term (2) is used to control (1). + −
That is because one can obtain an overall+ −
smaller regex list+ −
by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.+ −
Section 3 is dedicated to its proof.+ −
In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is + −
bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller+ −
than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands+ −
for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).+ −
We reason similarly for $\STAR$.\medskip+ −
\end{proof}+ −
+ −
What guarantee does this bound give us?+ −
+ −
Whatever the regex is, it will not grow indefinitely.+ −
Take our previous example $(a + aa)^*$ as an example:+ −
\begin{center}+ −
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
xlabel={number of $a$'s},+ −
x label style={at={(1.05,-0.05)}},+ −
ylabel={regex size},+ −
enlargelimits=false,+ −
xtick={0,5,...,30},+ −
xmax=33,+ −
ymax= 40,+ −
ytick={0,10,...,40},+ −
scaled ticks=false,+ −
axis lines=left,+ −
width=5cm,+ −
height=4cm, + −
legend entries={$(a + aa)^*$}, + −
legend pos=north west,+ −
legend cell align=left]+ −
\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};+ −
\end{axis}+ −
\end{tikzpicture}+ −
\end{tabular}+ −
\end{center}+ −
We are able to limit the size of the regex $(a + aa)^*$'s derivatives+ −
with our simplification+ −
rules very effectively.+ −
+ −
+ −
In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound+ −
is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.+ −
Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$+ −
inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function+ −
$f(x) = x * 2^x$.+ −
This means the bound we have will surge up at least+ −
tower-exponentially with a linear increase of the depth.+ −
For a regex of depth $n$, the bound+ −
would be approximately $4^n$.+ −
+ −
Test data in the graphs from randomly generated regular expressions+ −
shows that the giant bounds are far from being hit.+ −
%a few sample regular experessions' derivatives+ −
%size change+ −
%TODO: giving regex1_size_change.data showing a few regexes' size changes + −
%w;r;t the input characters number, where the size is usually cubic in terms of original size+ −
%a*, aa*, aaa*, .....+ −
%randomly generated regexes+ −
\begin{center}+ −
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
xlabel={number of $a$'s},+ −
x label style={at={(1.05,-0.05)}},+ −
ylabel={regex size},+ −
enlargelimits=false,+ −
xtick={0,5,...,30},+ −
xmax=33,+ −
ymax=1000,+ −
ytick={0,100,...,1000},+ −
scaled ticks=false,+ −
axis lines=left,+ −
width=5cm,+ −
height=4cm, + −
legend entries={regex1}, + −
legend pos=north west,+ −
legend cell align=left]+ −
\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};+ −
\end{axis}+ −
\end{tikzpicture}+ −
&+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
xlabel={$n$},+ −
x label style={at={(1.05,-0.05)}},+ −
%ylabel={time in secs},+ −
enlargelimits=false,+ −
xtick={0,5,...,30},+ −
xmax=33,+ −
ymax=1000,+ −
ytick={0,100,...,1000},+ −
scaled ticks=false,+ −
axis lines=left,+ −
width=5cm,+ −
height=4cm, + −
legend entries={regex2}, + −
legend pos=north west,+ −
legend cell align=left]+ −
\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};+ −
\end{axis}+ −
\end{tikzpicture}+ −
&+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
xlabel={$n$},+ −
x label style={at={(1.05,-0.05)}},+ −
%ylabel={time in secs},+ −
enlargelimits=false,+ −
xtick={0,5,...,30},+ −
xmax=33,+ −
ymax=1000,+ −
ytick={0,100,...,1000},+ −
scaled ticks=false,+ −
axis lines=left,+ −
width=5cm,+ −
height=4cm, + −
legend entries={regex3}, + −
legend pos=north west,+ −
legend cell align=left]+ −
\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};+ −
\end{axis}+ −
\end{tikzpicture}\\+ −
\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.}+ −
\end{tabular} + −
\end{center} + −
+ −
+ −
+ −
+ −
+ −
\noindent+ −
Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the + −
original size.+ −
This suggests a link towrads "partial derivatives"+ −
introduced by Antimirov \cite{Antimirov95}.+ −
+ −
\section{Antimirov's partial derivatives}+ −
The idea behind Antimirov's partial derivatives+ −
is to do derivatives in a similar way as suggested by Brzozowski, + −
but maintain a set of regular expressions instead of a single one:+ −
+ −
%TODO: antimirov proposition 3.1, needs completion+ −
\begin{center} + −
\begin{tabular}{ccc}+ −
$\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\+ −
$\partial_x(\ONE)$ & $=$ & $\phi$+ −
\end{tabular}+ −
\end{center}+ −
+ −
Rather than joining the calculated derivatives $\partial_x a$ and $\partial_x b$ together+ −
using the alternatives constructor, Antimirov cleverly chose to put them into+ −
a set instead. This breaks the terms in a derivative regular expression up, + −
allowing us to understand what are the "atomic" components of it.+ −
For example, To compute what regular expression $x^*(xx + y)^*$'s + −
derivative against $x$ is made of, one can do a partial derivative+ −
of it and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$+ −
from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$.+ −
To get all the "atomic" components of a regular expression's possible derivatives,+ −
there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes+ −
whatever character is available at the head of the string inside the language of a+ −
regular expression, and gives back the character and the derivative regular expression+ −
as a pair (which he called "monomial"):+ −
\begin{center} + −
\begin{tabular}{ccc}+ −
$\lf(\ONE)$ & $=$ & $\phi$\\+ −
$\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\+ −
$\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\+ −
$\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\+ −
\end{tabular}+ −
\end{center}+ −
%TODO: completion+ −
+ −
There is a slight difference in the last three clauses compared+ −
with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular + −
expression $r$ with every element inside $\textit{rset}$ to create a set of + −
sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates + −
on a set of monomials (which Antimirov called "linear form") and a regular + −
expression, and returns a linear form:+ −
\begin{center} + −
\begin{tabular}{ccc}+ −
$l \bigodot (\ZERO)$ & $=$ & $\phi$\\+ −
$l \bigodot (\ONE)$ & $=$ & $l$\\+ −
$\phi \bigodot t$ & $=$ & $\phi$\\+ −
$\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\+ −
$\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\+ −
$\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\+ −
$\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\+ −
$\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\+ −
\end{tabular}+ −
\end{center}+ −
%TODO: completion+ −
+ −
Some degree of simplification is applied when doing $\bigodot$, for example,+ −
$l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$,+ −
and $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and+ −
$\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$,+ −
and so on.+ −
+ −
With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regex $r$ with + −
an iterative procedure:+ −
\begin{center} + −
\begin{tabular}{llll}+ −
$\textit{while}$ & $(\Delta_i \neq \phi)$ & & \\+ −
& $\Delta_{i+1}$ & $ =$ & $\lf(\Delta_i) - \PD_i$ \\+ −
& $\PD_{i+1}$ & $ =$ & $\Delta_{i+1} \cup \PD_i$ \\+ −
$\partial_{UNIV}(r)$ & $=$ & $\PD$ & + −
\end{tabular}+ −
\end{center}+ −
+ −
+ −
$(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,+ −
+ −
+ −
However, if we introduce them in our+ −
setting we would lose the POSIX property of our calculated values. + −
A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$.+ −
If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer + −
would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of+ −
what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information+ −
in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$+ −
occurs, and apply them in the right order once we get + −
a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value.+ −
This is unlike the simplification we had before, where the rewriting rules + −
such as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value.+ −
We will discuss better+ −
bounds in the last section of this chapter.\\[-6.5mm]+ −
+ −
+ −
+ −
+ −
%----------------------------------------------------------------------------------------+ −
% SECTION ??+ −
%----------------------------------------------------------------------------------------+ −
+ −
\section{"Closed Forms" of regular expressions' derivatives w.r.t strings}+ −
To embark on getting the "closed forms" of regexes, we first+ −
define a few auxiliary definitions to make expressing them smoothly.+ −
+ −
\begin{center} + −
\begin{tabular}{ccc}+ −
$\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\+ −
$\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\+ −
$\sflataux r$ & $=$ & $ [r]$+ −
\end{tabular}+ −
\end{center}+ −
The intuition behind $\sflataux{\_}$ is to break up nested regexes + −
of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape+ −
into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.+ −
It will return the singleton list $[r]$ otherwise.+ −
+ −
$\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps+ −
the output type a regular expression, not a list:+ −
\begin{center} + −
\begin{tabular}{ccc}+ −
$\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\+ −
$\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\+ −
$\sflat r$ & $=$ & $ [r]$+ −
\end{tabular}+ −
\end{center}+ −
$\sflataux{\_}$ and $\sflat{\_}$ is only recursive in terms of the+ −
first element of the list of children of+ −
an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for dealing with the regular expression+ −
$r_1 \cdot r_2 \backslash s$.+ −
+ −
With $\sflat{\_}$ and $\sflataux{\_}$ we make + −
precise what "closed forms" we have for the sequence derivatives and their simplifications,+ −
in other words, how can we construct $(r_1 \cdot r_2) \backslash s$+ −
and $\bderssimp{(r_1\cdot r_2)}{s}$,+ −
if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)+ −
and $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$ ranges over + −
the substring of $s$?+ −
First let's look at a series of derivatives steps on a sequence + −
regular expression, (assuming) that each time the first+ −
component of the sequence is always nullable):+ −
\begin{center}+ −
+ −
$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\+ −
$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad+ −
\ldots$+ −
+ −
\end{center}+ −
%TODO: cite indian paper+ −
Indianpaper have come up with a slightly more formal way of putting the above process:+ −
\begin{center}+ −
$r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) ++ −
\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots+ −
+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$+ −
\end{center}+ −
where $\delta(b, r)$ will produce $r$+ −
if $b$ evaluates to true, + −
and $\ZERO$ otherwise.+ −
+ −
But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical+ −
equivalence. To make this intuition useful + −
for a formal proof, we need something+ −
stronger than language equivalence.+ −
With the help of $\sflat{\_}$ we can state the equation in Indianpaper+ −
more rigorously:+ −
\begin{lemma}+ −
$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$+ −
\end{lemma}+ −
+ −
The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\vsuf{[]}{\_} $ & $=$ & $[]$\\+ −
$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\+ −
&& $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $+ −
\end{tabular}+ −
\end{center} + −
It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable,+ −
and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in+ −
the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.+ −
In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing + −
the entire result of $(r_1 \cdot r_2) \backslash s$, + −
it only stores all the $s''$ such that $r_2 \backslash s''$ are occurring terms in $(r_1\cdot r_2)\backslash s$.+ −
+ −
With this we can also add simplifications to both sides and get+ −
\begin{lemma}+ −
$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$+ −
\end{lemma}+ −
Together with the idempotency property of $\rsimp$,+ −
%TODO: state the idempotency property of rsimp+ −
\begin{lemma}+ −
$\rsimp{r} = \rsimp{\rsimp{r}}$+ −
\end{lemma}+ −
it is possible to convert the above lemma to obtain a "closed form"+ −
for derivatives nested with simplification:+ −
\begin{lemma}+ −
$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$+ −
\end{lemma}+ −
And now the reason we have (1) in section 1 is clear:+ −
$\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$, + −
is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$+ −
as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$.+ −
+ −
%----------------------------------------------------------------------------------------+ −
% SECTION 3+ −
%----------------------------------------------------------------------------------------+ −
+ −
\section{interaction between $\distinctWith$ and $\flts$}+ −
Note that it is not immediately obvious that + −
\begin{center}+ −
$\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket $.+ −
\end{center}+ −
The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of + −
duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. + −
+ −
+ −
%-----------------------------------+ −
% SECTION syntactic equivalence under simp+ −
%-----------------------------------+ −
\section{Syntactic Equivalence Under $\simp$}+ −
We prove that minor differences can be annhilated+ −
by $\simp$.+ −
For example,+ −
\begin{center}+ −
$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = + −
\simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$+ −
\end{center}+ −
+ −
+ −
%----------------------------------------------------------------------------------------+ −
% SECTION ALTS CLOSED FORM+ −
%----------------------------------------------------------------------------------------+ −
\section{A Closed Form for \textit{ALTS}}+ −
Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.+ −
+ −
+ −
There are a few key steps, one of these steps is+ −
\[+ −
rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {})))+ −
= rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {}))+ −
\]+ −
+ −
+ −
One might want to prove this by something a simple statement like: + −
$map (rder x) (rdistinct rs rset) = rdistinct (map (rder x) rs) (rder x) ` rset)$.+ −
+ −
For this to hold we want the $\textit{distinct}$ function to pick up+ −
the elements before and after derivatives correctly:+ −
$r \in rset \equiv (rder x r) \in (rder x rset)$.+ −
which essentially requires that the function $\backslash$ is an injective mapping.+ −
+ −
Unfortunately the function $\backslash c$ is not an injective mapping.+ −
+ −
\subsection{function $\backslash c$ is not injective (1-to-1)}+ −
\begin{center}+ −
The derivative $w.r.t$ character $c$ is not one-to-one.+ −
Formally,+ −
$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$+ −
\end{center}+ −
This property is trivially true for the+ −
character regex example:+ −
\begin{center}+ −
$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$+ −
\end{center}+ −
But apart from the cases where the derivative+ −
output is $\ZERO$, are there non-trivial results+ −
of derivatives which contain strings?+ −
The answer is yes.+ −
For example,+ −
\begin{center}+ −
Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\+ −
where $a$ is not nullable.\\+ −
$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\+ −
$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$+ −
\end{center}+ −
We start with two syntactically different regexes,+ −
and end up with the same derivative result.+ −
This is not surprising as we have such + −
equality as below in the style of Arden's lemma:\\+ −
\begin{center}+ −
$L(A^*B) = L(A\cdot A^* \cdot B + B)$+ −
\end{center}+ −
+ −
%----------------------------------------------------------------------------------------+ −
% SECTION 4+ −
%----------------------------------------------------------------------------------------+ −
\section{A Bound for the Star Regular Expression}+ −
We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using+ −
the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then + −
the property of the $\distinct$ function.+ −
Now we try to get a bound on $r^* \backslash s$ as well.+ −
Again, we first look at how a star's derivatives evolve, if they grow maximally: + −
\begin{center}+ −
+ −
$r^* \quad \longrightarrow_{\backslash c} \quad (r\backslash c) \cdot r^* \quad \longrightarrow_{\backslash c'} \quad+ −
r \backslash cc' \cdot r^* + r \backslash c' \cdot r^* \quad \longrightarrow_{\backslash c''} \quad + −
(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) \quad \longrightarrow_{\backslash c'''}+ −
\quad \ldots$+ −
+ −
\end{center}+ −
When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, + −
$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,+ −
the number of terms in $r^* \backslash s$ will grow exponentially, causing the size+ −
of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not + −
count the possible size explosions of $r \backslash c$ themselves.+ −
+ −
Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like+ −
$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ + −
into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$+ −
and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).+ −
For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$:+ −
%TODO: definitions of and \hflataux \hflat+ −
\begin{center} + −
\begin{tabular}{ccc}+ −
$\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\+ −
$\hflataux r$ & $=$ & $ [r]$+ −
\end{tabular}+ −
\end{center}+ −
+ −
\begin{center} + −
\begin{tabular}{ccc}+ −
$\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\+ −
$\hflat r$ & $=$ & $ r$+ −
\end{tabular}+ −
\end{center}s+ −
Again these definitions are tailor-made for dealing with alternatives that have+ −
originated from a star's derivatives, so we don't attempt to open up all possible + −
regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2+ −
elements.+ −
We give a predicate for such "star-created" regular expressions:+ −
\begin{center}+ −
\begin{tabular}{lcr}+ −
& & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\+ −
$\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$+ −
\end{tabular}+ −
\end{center}+ −
+ −
These definitions allows us the flexibility to talk about + −
regular expressions in their most convenient format,+ −
for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $+ −
instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.+ −
These definitions help express that certain classes of syntatically + −
distinct regular expressions are actually the same under simplification.+ −
This is not entirely true for annotated regular expressions: + −
%TODO: bsimp bders \neq bderssimp+ −
\begin{center}+ −
$(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$+ −
\end{center}+ −
For bit-codes, the order in which simplification is applied+ −
might cause a difference in the location they are placed.+ −
If we want something like+ −
\begin{center}+ −
$\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$+ −
\end{center}+ −
Some "canonicalization" procedure is required,+ −
which either pushes all the common bitcodes to nodes+ −
as senior as possible:+ −
\begin{center}+ −
$_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $+ −
\end{center}+ −
or does the reverse. However bitcodes are not of interest if we are talking about+ −
the $\llbracket r \rrbracket$ size of a regex.+ −
Therefore for the ease and simplicity of producing a+ −
proof for a size bound, we are happy to restrict ourselves to + −
unannotated regular expressions, and obtain such equalities as+ −
\begin{lemma}+ −
$\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$+ −
\end{lemma}+ −
+ −
\begin{proof}+ −
By using the rewriting relation $\rightsquigarrow$+ −
\end{proof}+ −
%TODO: rsimp sflat+ −
And from this we obtain a proof that a star's derivative will be the same+ −
as if it had all its nested alternatives created during deriving being flattened out:+ −
For example,+ −
\begin{lemma}+ −
$\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$+ −
\end{lemma}+ −
\begin{proof}+ −
By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$.+ −
\end{proof}+ −
% The simplification of a flattened out regular expression, provided it comes+ −
%from the derivative of a star, is the same as the one nested.+ −
+ −
+ −
+ −
+ −
+ −
+ −
+ −
+ −
+ −
One might wonder the actual bound rather than the loose bound we gave+ −
for the convenience of an easier proof.+ −
How much can the regex $r^* \backslash s$ grow? + −
As earlier graphs have shown,+ −
%TODO: reference that graph where size grows quickly+ −
they can grow at a maximum speed+ −
exponential $w.r.t$ the number of characters, + −
but will eventually level off when the string $s$ is long enough.+ −
If they grow to a size exponential $w.r.t$ the original regex, our algorithm+ −
would still be slow.+ −
And unfortunately, we have concrete examples+ −
where such regexes grew exponentially large before levelling off:+ −
$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + + −
(\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum+ −
size that is exponential on the number $n$ + −
under our current simplification rules:+ −
%TODO: graph of a regex whose size increases exponentially.+ −
\begin{center}+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
height=0.5\textwidth,+ −
width=\textwidth,+ −
xlabel=number of a's,+ −
xtick={0,...,9},+ −
ylabel=maximum size,+ −
ymode=log,+ −
log basis y={2}+ −
]+ −
\addplot[mark=*,blue] table {re-chengsong.data};+ −
\end{axis}+ −
\end{tikzpicture}+ −
\end{center}+ −
+ −
For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$+ −
to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + + −
(\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.+ −
The exponential size is triggered by that the regex+ −
$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$+ −
inside the $(\ldots) ^*$ having exponentially many+ −
different derivatives, despite those difference being minor.+ −
$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$+ −
will therefore contain the following terms (after flattening out all nested + −
alternatives):+ −
\begin{center}+ −
$(\oplus_{i = 1]{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\+ −
$(1 \leq m' \leq m )$+ −
\end{center}+ −
These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix).+ −
With each new input character taking the derivative against the intermediate result, more and more such distinct+ −
terms will accumulate, + −
until the length reaches $L.C.M.(1, \ldots, n)$.+ −
$\textit{distinctBy}$ will not be able to de-duplicate any two of these terms + −
$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\+ −
+ −
$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\+ −
where $m' \neq m''$ \\+ −
as they are slightly different.+ −
This means that with our current simplification methods,+ −
we will not be able to control the derivative so that+ −
$\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$+ −
as there are already exponentially many terms.+ −
These terms are similar in the sense that the head of those terms+ −
are all consisted of sub-terms of the form: + −
$(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $.+ −
For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most+ −
$n * (n + 1) / 2$ such terms. + −
For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives+ −
can be described by 6 terms:+ −
$a^*$, $a\cdot (aa)^*$, $ (aa)^*$, + −
$aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.+ −
The total number of different "head terms", $n * (n + 1) / 2$,+ −
is proportional to the number of characters in the regex + −
$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.+ −
This suggests a slightly different notion of size, which we call the + −
alphabetic width:+ −
%TODO:+ −
(TODO: Alphabetic width def.)+ −
+ −
+ −
Antimirov\parencite{Antimirov95} has proven that + −
$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$.+ −
where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms+ −
created by doing derivatives of $r$ against all possible strings.+ −
If we can make sure that at any moment in our lexing algorithm our + −
intermediate result hold at most one copy of each of the + −
subterms then we can get the same bound as Antimirov's.+ −
This leads to the algorithm in the next chapter.+ −
+ −
+ −
+ −
+ −
+ −
%----------------------------------------------------------------------------------------+ −
% SECTION 1+ −
%----------------------------------------------------------------------------------------+ −
+ −
\section{Idempotency of $\simp$}+ −
+ −
\begin{equation}+ −
\simp \;r = \simp\; \simp \; r + −
\end{equation}+ −
This property means we do not have to repeatedly+ −
apply simplification in each step, which justifies+ −
our definition of $\blexersimp$.+ −
It will also be useful in future proofs where properties such as + −
closed forms are needed.+ −
The proof is by structural induction on $r$.+ −
+ −
%-----------------------------------+ −
% SUBSECTION 1+ −
%-----------------------------------+ −
\subsection{Syntactic Equivalence Under $\simp$}+ −
We prove that minor differences can be annhilated+ −
by $\simp$.+ −
For example,+ −
\begin{center}+ −
$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = + −
\simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$+ −
\end{center}+ −
+ −