% Chapter Template
\chapter{Finiteness Bound} % Main chapter title
\label{Finite}
% In Chapter 4 \ref{Chapter4} we give the second guarantee
%of our bitcoded algorithm, that is a finite bound on the size of any
%regex's derivatives.
%(this is cahpter 5 now)
In this chapter we give a bound in terms of the size of
the calculated derivatives:
given an annotated regular expression $a$, for any string $s$
our algorithm $\blexersimp$'s derivatives
are finitely bounded
by a constant that only depends on $a$.
Formally we show that there exists an integer $N_a$ such that
\begin{center}
$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
\end{center}
\noindent
where the size ($\llbracket \_ \rrbracket$) of
an annotated regular expression is defined
in terms of the number of nodes in its
tree structure (its recursive definition is given in the next page).
We believe this size bound
is important in the context of POSIX lexing because
\marginpar{Addressing Gerog comment: "how does this relate to backtracking?"}
\begin{itemize}
\item
It is a stepping stone towards the goal
of eliminating ``catastrophic backtracking''.
The derivative-based lexing algorithm avoids backtracking
by a trade-off between space and time.
Backtracking algorithms
save other possibilities on a stack when exploring one possible
path of matching. Catastrophic backtracking typically occurs
when the number of steps increase exponentially with respect
to input. In other words, the runtime is $O((c_r)^n)$ of the input
string length $n$, where the base of the exponent is determined by the
regular expression $r$.
%so that they
%can be traversed in the future in a DFS manner,
%different matchings are stored as sub-expressions
%in a regular expression derivative.
Derivatives saves these possibilities as sub-expressions
and traverse those during future derivatives. If we denote the size
of intermediate derivatives as $S_{r,n}$ (where the subscripts
$r,n$ indicate that $S$ depends on them), then the runtime of
derivative-based approaches would be $O(S_{r,n} * n)$.
We observe that if $S_{r,n}$ continously grows with $n$ (for example
growing exponentially fast), then this
is equally bad as catastrophic backtracking.
Our finiteness bound seeks to find a constant upper bound $C$ of $\S_{r,n}$,
so that the complexity of the algorithm can be seen as linear ($O(C * n)$).
Even if $C$ is still large in our current work, it is still a constant
rather than ever-increasing number with respect to input length $n$.
More importantly this $C$ constant can potentially
be shrunken as we optimize our simplification procedure.
%and showing the potential
%improvements can be by the notion of partial derivatives.
%If the internal data structures used by our algorithm
%grows beyond a finite bound, then clearly
%the algorithm (which traverses these structures) will
%be slow.
%The next step is to refine the bound $N_a$ so that it
%is not just finite but polynomial in $\llbracket a\rrbracket$.
\item
Having the finite bound formalised
gives us higher confidence that
our simplification algorithm $\simp$ does not ``misbehave''
like $\textit{simpSL}$ does.
The bound is universal for a given regular expression,
which is an advantage over work which
only gives empirical evidence on
some test cases (see for example Verbatim work \cite{Verbatimpp}).
\end{itemize}
\noindent
We then extend our $\blexersimp$
to support bounded repetitions ($r^{\{n\}}$).
We update our formalisation of
the correctness and finiteness properties to
include this new construct.
We show that we can out-compete other verified lexers such as
Verbatim++ on bounded regular expressions.
In the next section we describe in more detail
what the finite bound means in our algorithm
and why the size of the internal data structures of
a typical derivative-based lexer such as
Sulzmann and Lu's needs formal treatment.
\section{Formalising Size Bound of Derivatives}
\noindent
In our lexer ($\blexersimp$),
we take an annotated regular expression as input,
and repeately take derivative of and simplify it.
\begin{figure}
\begin{center}
\begin{tabular}{lcl}
$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\
$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$.
\end{tabular}
\end{center}
\caption{The size function of bitcoded regular expressions}\label{brexpSize}
\end{figure}
\begin{figure}
\begin{tikzpicture}[scale=2,
every node/.style={minimum size=11mm},
->,>=stealth',shorten >=1pt,auto,thick
]
\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$};
\node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$};
\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};
\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=6mm]{$a_{1s}$};
\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp$};
\node (r2) [rectangle, draw=black, thick, right=of r1s, minimum size = 12mm]{$a_2$};
\draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};
\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=6mm]{$a_{2s}$};
\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp$};
\node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=6mm]{$a_{ns}$};
\draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$};
\node (v) [circle, thick, draw, right=of rns, minimum size=6mm, right=1.7cm]{$v$};
\draw[->, line width=0.2mm](rns)--(v) node[above, midway] {\bmkeps} node [below, midway] {\decode};
\end{tikzpicture}
\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{simpShrinks}
\end{figure}
\noindent
Each time
a derivative is taken, the regular expression might grow.
However, the simplification that is immediately afterwards will often shrink it so that
the overall size of the derivatives stays relatively small.
This intuition is depicted by the relative size
change between the black and blue nodes:
After $\simp$ the node shrinks.
Our proof states that all the blue nodes
stay below a size bound $N_a$ determined by the input $a$.
\noindent
Sulzmann and Lu's assumed a similar picture of their algorithm,
though in fact their algorithm's size might be better depicted by the following graph:
\begin{figure}[H]
\begin{tikzpicture}[scale=2,
every node/.style={minimum size=11mm},
->,>=stealth',shorten >=1pt,auto,thick
]
\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$};
\node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$};
\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};
\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$a_{1s}$};
\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp'$};
\node (r2) [rectangle, draw=black, thick, right=of r1s, minimum size = 17mm]{$a_2$};
\draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};
\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=14mm]{$a_{2s}$};
\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp'$};
\node (r3) [rectangle, draw = black, thick, right= of r2s, minimum size = 22mm]{$a_3$};
\draw[->,line width=0.2mm](r2s)--(r3) node[above,midway] {$\backslash c_3$};
\node (rns) [right = of r3, draw=blue, minimum size = 20mm]{$a_{3s}$};
\draw[->,line width=0.2mm] (r3)--(rns) node [above, midway] {$\simp'$};
\node (rnn) [right = of rns, minimum size = 1mm]{};
\draw[->, dashed] (rns)--(rnn) node [above, midway] {$\ldots$};
\end{tikzpicture}
\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{sulzShrinks}
\end{figure}
\noindent
The picture means that in some cases their lexer (where they use $\simpsulz$
as the simplification function)
will have a size explosion, causing the running time
of each derivative step to grow continuously (for example
in \ref{SulzmannLuLexerTime}).
They tested out the run time of their
lexer on particular examples such as $(a+b+ab)^*$
and claimed that their algorithm is linear w.r.t to the input.
With our mechanised proof, we avoid this type of unintentional
generalisation.
Before delving into the details of the formalisation,
we are going to provide an overview of it in the following subsection.
\subsection{Overview of the Proof}
A high-level overview of the main components of the finiteness proof
is as follows:
\begin{figure}[H]
\begin{tikzpicture}[scale=1,font=\bf,
node/.style={
rectangle,rounded corners=3mm,
ultra thick,draw=black!50,minimum height=18mm,
minimum width=20mm,
top color=white,bottom color=black!20}]
\node (0) at (-5,0)
[node, text width=1.8cm, text centered]
{$\llbracket \bderssimp{a}{s} \rrbracket$};
\node (A) at (0,0)
[node,text width=1.6cm, text centered]
{$\llbracket \rderssimp{r}{s} \rrbracket_r$};
\node (B) at (3,0)
[node,text width=3.0cm, anchor=west, minimum width = 40mm]
{$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$};
\node (C) at (9.5,0) [node, minimum width=10mm] {$N_r$};
\draw [->,line width=0.5mm] (0) --
node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$(r = a \downarrow_r)$} (A);
\draw [->,line width=0.5mm] (A) --
node [above,pos=0.35] {$\quad =\ldots=$} (B);
\draw [->,line width=0.5mm] (B) --
node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C);
\end{tikzpicture}
%\caption{
\end{figure}
\noindent
We explain the steps one by one:
\begin{itemize}
\item
We first introduce the operations such as
derivatives, simplification, size calculation, etc.
associated with $\rrexp$s, which we have introduced
in chapter \ref{Bitcoded2}. As promised we will discuss
why they are needed in \ref{whyRerase}.
The operations on $\rrexp$s are identical to those on
annotated regular expressions except that they dispense with
bitcodes. This means that all proofs about size of $\rrexp$s will apply to
annotated regular expressions, because the size of a regular
expression is independent of the bitcodes.
\item
We prove that $\rderssimp{r}{s} = \textit{ClosedForm}(r, s)$,
where $\textit{ClosedForm}(r, s)$ is entirely
given as the derivatives of their children regular
expressions.
We call the right-hand-side the \emph{Closed Form}
of the derivative $\rderssimp{r}{s}$.
\item
Formally we give an estimate of
$\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$.
The key observation is that $\distinctBy$'s output is
a list with a constant length bound.
\end{itemize}
We will expand on these steps in the next sections.\\
\section{The $\textit{Rrexp}$ Datatype}
The first step is to define
$\textit{rrexp}$s.
They are annotated regular expressions without bitcodes,
allowing a more convenient size bound proof.
%Of course, the bits which encode the lexing information
%would grow linearly with respect
%to the input, which should be taken into accounte when we wish to tackle the runtime complexity.
%But for the sake of the structural size
%we can safely ignore them.\\
The datatype
definition of the $\rrexp$, called
\emph{r-regular expressions},
was initially defined in \ref{rrexpDef}.
The reason for the prefix $r$ is
to make a distinction
with basic regular expressions.
We give here again the definition of $\rrexp$.
\[ \rrexp ::= \RZERO \mid \RONE
\mid \RCHAR{c}
\mid \RSEQ{r_1}{r_2}
\mid \RALTS{rs}
\mid \RSTAR{r}
\]
The size of an r-regular expression is
written $\llbracket r\rrbracket_r$,
whose definition mirrors that of an annotated regular expression.
\begin{center}
\begin{tabular}{lcl}
$\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\
$\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\
$\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\
$\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\
$\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as + 1$\\
$\llbracket _{bs} a^* \rrbracket_r $ & $\dn$ & $\llbracket a \rrbracket_r + 1$.
\end{tabular}
\end{center}
\noindent
The $r$ in the subscript of $\llbracket \rrbracket_r$ is to
differentiate with the same operation for annotated regular expressions.
Similar subscripts will be added for operations like $\rerase{}$:
\begin{center}
\begin{tabular}{lcl}
$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a} ^*$
\end{tabular}
\end{center}
\subsection{Why a New Datatype?}\label{whyRerase}
\marginpar{\em added label so this section can be referenced by other parts of the thesis
so that interested readers can jump to/be reassured that there will explanations.}
Originally the erase operation $(\_)_\downarrow$ was
used by Ausaf et al. in their proofs related to $\blexer$.
This function was not part of the lexing algorithm, and the sole purpose was to
bridge the gap between the $r$
%$\textit{rexp}$
(un-annotated) and $\textit{arexp}$ (annotated)
regular expression datatypes so as to leverage the correctness
theorem of $\lexer$.%to establish the correctness of $\blexer$.
For example, lemma \ref{retrieveStepwise} %and \ref{bmkepsRetrieve}
uses $\erase$ to convert an annotated regular expression $a$ into
a plain one so that it can be used by $\inj$ to create the desired value
$\inj\; (a)_\downarrow \; c \; v$.
Ideally $\erase$ should only remove the auxiliary information not related to the
structure--the
bitcodes. However there exists a complication
where the alternative constructors have different arity for $\textit{arexp}$
and $\textit{r}$:
\begin{center}
\begin{tabular}{lcl}
$\textit{r}$ & $::=$ & $\ldots \;|\; (\_ + \_) \; ::\; "\textit{r} \Rightarrow \textit{r} \Rightarrow \textit{r}" | \ldots$\\
$\textit{arexp}$ & $::=$ & $\ldots\; |\; (\Sigma \_ ) \; ::\; "\textit{arexp} \; list \Rightarrow \textit{arexp}" | \ldots$
\end{tabular}
\end{center}
\noindent
To convert between the two
$\erase$ has to recursively disassemble a list into nested binary applications of the
$(\_ + \_)$ operator,
handling corner cases like empty or
singleton alternative lists:
%becomes $r$ during the
%$\erase$ function.
%The annotated regular expression $\sum[a, b, c]$ would turn into
%$(a+(b+c))$.
\begin{center}
\begin{tabular}{lcl}
$ (_{bs}\sum [])_\downarrow $ & $\dn$ & $\ZERO$\\
$ (_{bs}\sum [a])_\downarrow$ & $\dn$ & $a$\\
$ (_{bs}\sum a_1 :: a_2)_\downarrow$ & $\dn$ & $(a_1)_\downarrow + (a_2)_\downarrow)$\\
$ (_{bs}\sum a :: as)_\downarrow$ & $\dn$ & $a_\downarrow + (\erase \; _{[]} \sum as)$
\end{tabular}
\end{center}
\noindent
These operations inevitably change the structure and size of
an annotated regular expression. For example,
$a_1 = \sum _{Z}[x]$ has size 2, but $(a_1)_\downarrow = x$
only has size 1.
%adding unnecessary
%complexities to the size bound proof.
%The reason we
%define a new datatype is that
%the $\erase$ function
%does not preserve the structure of annotated
%regular expressions.
%We initially started by using
%plain regular expressions and tried to prove
%lemma \ref{rsizeAsize},
%however the $\erase$ function messes with the structure of the
%annotated regular expression.
%The $+$ constructor
%of basic regular expressions is only binary, whereas $\sum$
%takes a list. Therefore we need to convert between
%annotated and normal regular expressions as follows:
For example, if we define the size of a basic plain regular expression
in the usual way,
\begin{center}
\begin{tabular}{lcl}
$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
$\llbracket r_1 + r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\
$\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\
$\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\
$\llbracket a^* \rrbracket_p $ & $\dn$ & $\llbracket a \rrbracket_p + 1$
\end{tabular}
\end{center}
\noindent
Then the property
\begin{center}
$\llbracket a \rrbracket \stackrel{?}{=} \llbracket a_\downarrow \rrbracket_p$
\end{center}
does not hold.
%With $\textit{rerase}$, however,
%only the bitcodes are thrown away.
That leads to us defining the new regular expression datatype without
bitcodes but with a list alternative constructor, and defining a new erase function
in a strictly structure-preserving manner:
\begin{center}
\begin{tabular}{lcl}
$\textit{rrexp}$ & $::=$ & $\ldots\; |\; (\sum \_ ) \; ::\; "\textit{rrexp} \; list \Rightarrow \textit{rrexp}" | \ldots$\\
$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
\end{tabular}
\end{center}
\noindent
%But
%Everything about the structure remains intact.
%Therefore it does not change the size
%of an annotated regular expression and we have:
\noindent
One might be able to prove an inequality such as
$\llbracket a \rrbracket \leq \llbracket a_\downarrow \rrbracket_p $
and then estimate $\llbracket a_\downarrow \rrbracket_p$,
but we found our approach more straightforward.\\
\subsection{Functions for R-regular Expressions}
The downside of our approach is that we need to redefine
several functions for $\rrexp$.
In this section we shall define the r-regular expression version
of $\bder$, and $\textit{bsimp}$ related functions.
We use $r$ as the prefix or subscript to differentiate
with the bitcoded version.
%For example,$\backslash_r$, $\rdistincts$, and $\rsimp$
%as opposed to $\backslash$, $\distinctBy$, and $\bsimp$.
%As promised, they are much simpler than their bitcoded counterparts.
%The operations on r-regular expressions are
%almost identical to those of the annotated regular expressions,
%except that no bitcodes are used. For example,
The derivative operation for an r-regular expression is\\
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\
$(\ONE)\,\backslash_r c$ & $\dn$ &
$\textit{if}\;c=d\; \;\textit{then}\;
\ONE\;\textit{else}\;\ZERO$\\
$(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ &
$\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\
$(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ &
$\textit{if}\;(\textit{rnullable}\,r_1)$\\
& &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\
& &$\phantom{\textit{then},\;\sum\,}((r_2\,\backslash_r c))]$\\
& &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\
$(r^*)\,\backslash_r c$ & $\dn$ &
$( r\,\backslash_r c)\cdot
(_{[]}r^*))$
\end{tabular}
\end{center}
\noindent
where we omit the definition of $\textit{rnullable}$.
The generalisation from the derivatives w.r.t a character to
derivatives w.r.t strings is given as
\begin{center}
\begin{tabular}{lcl}
$r \backslash_{rs} []$ & $\dn$ & $r$\\
$r \backslash_{rs} c::s$ & $\dn$ & $(r\backslash_r c) \backslash_{rs} s$
\end{tabular}
\end{center}
The function $\distinctBy$ for r-regular expressions does not need
a function checking equivalence because
there are no bit annotations.
Therefore we have
\begin{center}
\begin{tabular}{lcl}
$\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\
$\rdistinct{r :: rs}{rset}$ & $\dn$ &
$\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
& & $\textit{else}\; \;
r::\rdistinct{rs}{(rset \cup \{r\})}$
\end{tabular}
\end{center}
%TODO: definition of rsimp (maybe only the alternative clause)
\noindent
%We would like to make clear
%a difference between our $\rdistincts$ and
%the Isabelle $\textit {distinct}$ predicate.
%In Isabelle $\textit{distinct}$ is a function that returns a boolean
%rather than a list.
%It tests if all the elements of a list are unique.\\
With $\textit{rdistinct}$ in place,
the flatten function for $\rrexp$ is as follows:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{rflts} \; (\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $as \; @ \; \textit{rflts} \; as' $ \\
$\textit{rflts} \; \ZERO :: as'$ & $\dn$ & $ \textit{rflts} \; \textit{as'} $ \\
$\textit{rflts} \; a :: as'$ & $\dn$ & $a :: \textit{rflts} \; \textit{as'}$ \quad(otherwise)
\end{tabular}
\end{center}
\noindent
The function
$\rsimpalts$ corresponds to $\textit{bsimp}_{ALTS}$:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\rsimpalts \;\; nil$ & $\dn$ & $\RZERO$\\
$\rsimpalts \;\; r::nil$ & $\dn$ & $r$\\
$\rsimpalts \;\; rs$ & $\dn$ & $\sum rs$\\
\end{tabular}
\end{center}
\noindent
Similarly, we have $\rsimpseq$ which corresponds to $\textit{bsimp}_{SEQ}$:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\rsimpseq \;\; \RZERO \; \_ $ & $=$ & $\RZERO$\\
$\rsimpseq \;\; \_ \; \RZERO $ & $=$ & $\RZERO$\\
$\rsimpseq \;\; \RONE \cdot r_2$ & $\dn$ & $r_2$\\
$\rsimpseq \;\; r_1 r_2$ & $\dn$ & $r_1 \cdot r_2$\\
\end{tabular}
\end{center}
and get $\textit{rsimp}$ and $\rderssimp{\_}{\_}$:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{rsimp} \; (r_1\cdot r_2)$ & $\dn$ & $ \textit{rsimp}_{SEQ} \; bs \;(\textit{rsimp} \; r_1) \; (\textit{rsimp} \; r_2) $ \\
$\textit{rsimp} \; (_{bs}\sum \textit{rs})$ & $\dn$ & $\textit{rsimp}_{ALTS} \; \textit{bs} \; (\textit{rdistinct} \; ( \textit{rflts} ( \textit{map} \; rsimp \; rs)) \; \rerases \; \varnothing) $ \\
$\textit{rsimp} \; r$ & $\dn$ & $\textit{r} \qquad \textit{otherwise}$
\end{tabular}
\end{center}
\begin{center}
\begin{tabular}{@{}lcl@{}}
$r\backslash_{rsimp} \, c$ & $\dn$ & $\rsimp \; (r\backslash_r \, c)$
\end{tabular}
\end{center}
\begin{center}
\begin{tabular}{@{}lcl@{}}
$r \backslash_{rsimps} \; \; c\!::\!s $ & $\dn$ & $(r \backslash_{rsimp}\, c) \backslash_{rsimps}\, s$ \\
$r \backslash_{rsimps} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}
\noindent
We do not define an r-regular expression version of $\blexersimp$,
as our proof does not depend on it.
Now we are ready to introduce how r-regular expressions allow
us to prove the size bound on bitcoded regular expressions.
\subsection{Using R-regular Expressions to Bound Bit-coded Regular Expressions}
Everything about the size of annotated regular expressions after the application
of function $\bsimp$ and $\backslash_{simps}$
can be calculated via the size of r-regular expressions after the application
of $\rsimp$ and $\backslash_{rsimps}$:
\begin{lemma}\label{sizeRelations}
The following equalities hold:
\begin{itemize}
\item
$\rsize{\rerase a} = \asize a$
\item
$\asize{\bsimps \; a} = \rsize{\rsimp{ \rerase{a}}}$
\item
$\asize{\bderssimp{a}{s}} = \rsize{\rderssimp{\rerase{a}}{s}}$
\end{itemize}
\end{lemma}
\begin{proof}
First part follows from the definition of $(\_)_{\downarrow_r}$.
The second part is by induction on the inductive cases
of $\textit{bsimp}$.
The third part is by induction on the string $s$,
where the inductive step follows from part one.
\end{proof}
\noindent
With lemma \ref{sizeRelations},
we will be able to focus on
estimating only
$\rsize{\rderssimp{\rerase{a}}{s}}$
in later parts because
\begin{center}
$\rsize{\rderssimp{\rerase{a}}{s}} \leq N_r \quad$
implies
$\quad \llbracket a \backslash_{bsimps} s \rrbracket \leq N_r$.
\end{center}
%From now on we
%Unless stated otherwise in the rest of this
%chapter all regular expressions without
%bitcodes are seen as r-regular expressions ($\rrexp$s).
%For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$,
%we use the notation $r_1 + r_2$
%for brevity.
%-----------------------------------
% SUB SECTION ROADMAP RREXP BOUND
%-----------------------------------
%\subsection{Roadmap to a Bound for $\textit{Rrexp}$}
%The way we obtain the bound for $\rrexp$s is by two steps:
%\begin{itemize}
% \item
% First, we rewrite $r\backslash s$ into something else that is easier
% to bound. This step is crucial for the inductive case
% $r_1 \cdot r_2$ and $r^*$, where the derivative can grow and bloat in a wild way,
% but after simplification, they will always be equal or smaller to a form consisting of an alternative
% list of regular expressions $f \; (g\; (\sum rs))$ with some functions applied to it, where each element will be distinct after the function application.
% \item
% Then, for such a sum list of regular expressions $f\; (g\; (\sum rs))$, we can control its size
% by estimation, since $\distinctBy$ and $\flts$ are well-behaved and working together would only
% reduce the size of a regular expression, not adding to it.
%\end{itemize}
%
%\section{Step One: Closed Forms}
%We transform the function application $\rderssimp{r}{s}$
%into an equivalent
%form $f\; (g \; (\sum rs))$.
%The functions $f$ and $g$ can be anything from $\flts$, $\distinctBy$ and other helper functions from $\bsimp{\_}$.
%This way we get a different but equivalent way of expressing : $r\backslash s = f \; (g\; (\sum rs))$, we call the
%right hand side the "closed form" of $r\backslash s$.
%
%\begin{quote}\it
% Claim: For regular expressions $r_1 \cdot r_2$, we claim that
%\end{quote}
%\noindent
%We explain in detail how we reached those claims.
If we attempt to prove
\begin{center}
$\forall r. \; \exists N_r.\;\; s.t. \llbracket r\backslash_{rsimps} s \rrbracket_r \leq N_r$
\end{center}
using a naive induction on the structure of $r$,
then we are stuck at the inductive cases such as
$r_1\cdot r_2$.
The inductive hypotheses are:
\begin{center}
1: $\text{for } r_1, \text{there exists } N_{r_1}.\;\; s.t.
\;\;\forall s. \llbracket r_1 \backslash_{rsimps} s \rrbracket_r \leq N_{r_1}. $\\
2: $\text{for } r_2, \text{there exists } N_{r_2}.\;\; s.t.
\;\; \forall s. \llbracket r_2 \backslash_{rsimps} s \rrbracket_r \leq N_{r_2}. $
\end{center}
The inductive step to prove would be
\begin{center}
$\text{there exists } N_{r_1\cdot r_2}. \;\; s.t. \forall s.
\llbracket (r_1 \cdot r_2) \backslash_{rsimps} s \rrbracket_r \leq N_{r_1\cdot r_2}.$
\end{center}
The problem is that it is not clear what
$(r_1\cdot r_2) \backslash_{rsimps} s$ looks like,
and therefore $N_{r_1}$ and $N_{r_2}$ in the
inductive hypotheses cannot be directly used.
%We have already seen that $(r_1 \cdot r_2)\backslash s$
%and $(r^*)\backslash s$ can grow in a wild way.
The point however, is that they will be equivalent to a list of
terms $\sum rs$, where each term in $rs$ will
be made of $r_1 \backslash s' $, $r_2\backslash s'$,
and $r \backslash s'$ with $s' \in \textit{SubString} \; s$ (which stands
for the set of substrings of $s$).
The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$
in the simplification, which prevents the $rs$ from growing indefinitely.
Based on this idea, we develop a proof in two steps.
First, we show the below equality (where
$f$ and $g$ are functions that do not increase the size of the input)
\begin{center}
$r\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$,
\end{center}
where $r = r_1 \cdot r_2$ or $r = r_0^*$ and so on.
For example, for $r_1 \cdot r_2$ we have the equality as
\begin{center}
$ \rderssimp{r_1 \cdot r_2}{s} =
\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r_2}{\_} \;(\vsuf{s}{r_1})))}$
\end{center}
We call the right-hand-side the
\emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$.
Second, we will bound the closed form of r-regular expressions
using some estimation techniques
and then apply
lemma \ref{sizeRelations} to show that the bitcoded regular expressions
in our $\blexersimp$ are finitely bounded.
We will describe in detail the first step of the proof
in the next section.
\section{Closed Forms}
In this section we introduce in detail
how to express the string derivatives
of regular expressions (i.e. $r \backslash_r s$ where $s$ is a string
rather than a single character) in a different way than
our previous definition.
In previous chapters, the derivative of a
regular expression $r$ w.r.t a string $s$
was recursively defined on the string:
\begin{center}
$r \backslash_s (c::s) \dn (r \backslash c) \backslash_s s$
\end{center}
The problem is that
this definition does not provide much information
on what $r \backslash_s s$ looks like.
If we are interested in the size of a derivative like
$(r_1 \cdot r_2)\backslash s$,
we have to somehow get a more concrete form to begin.
We call such more concrete representations the ``closed forms'' of
string derivatives as opposed to their original definitions.
The terminology ``closed form'' is borrowed from mathematics,
which usually describe expressions that are solely comprised of finitely many
well-known and easy-to-compute operations such as
additions, multiplications, and exponential functions.
We start by proving some basic identities
involving the simplification functions for r-regular expressions.
After that we introduce the rewrite relations
$\rightsquigarrow_h$, $\rightsquigarrow^*_{scf}$
$\rightsquigarrow_f$ and $\rightsquigarrow_g$.
These relations involve similar techniques as in chapter \ref{Bitcoded2}
for annotated regular expressions.
Finally, we use these identities to establish the
closed forms of the alternative regular expression,
the sequence regular expression, and the star regular expression.
%$r_1\cdot r_2$, $r^*$ and $\sum rs$.
\subsection{Some Basic Identities}
In what follows we will often convert between lists
and sets.
We use Isabelle's $set$ to refer to the
function that converts a list $rs$ to the set
containing all the elements in $rs$.
\subsubsection{$\textit{rdistinct}$'s Does the Job of De-duplication}
The $\textit{rdistinct}$ function, as its name suggests, will
de-duplicate an r-regular expression list.
It will also remove any elements that
are already in the accumulator set.
\begin{lemma}\label{rdistinctDoesTheJob}
%The function $\textit{rdistinct}$ satisfies the following
%properties:
Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its
recursive definition here. Its Isabelle counterpart would be $\textit{distinct}$.}
for testing
whether a list's elements are unique. Then the following
properties about $\textit{rdistinct}$ hold:
\begin{itemize}
\item
If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$.
\item
%If list $rs'$ is the result of $\rdistinct{rs}{acc}$,
$\textit{isDistinct} \;\;\; (\rdistinct{rs}{acc})$.
\item
$\textit{set} \; (\rdistinct{rs}{acc})
= (\textit{set} \; rs) - acc$
\end{itemize}
\end{lemma}
\noindent
\begin{proof}
The first part is by an induction on $rs$.
The second and third parts can be proven by using the
inductive cases of $\textit{rdistinct}$.
\end{proof}
\noindent
%$\textit{rdistinct}$ will out all regular expression terms
%that are in the accumulator, therefore
Concatenating a list $rs_a$ at the front of another
list $rs$ whose elements are all from the accumulator, and then calling $\textit{rdistinct}$
on the merged list, the output will be as if we had called $\textit{rdistinct}$
without the prepending of $rs$:
\begin{lemma}\label{rdistinctConcat}
The elements appearing in the accumulator will always be removed.
More precisely,
\begin{itemize}
\item
If $rs \subseteq rset$, then
$\rdistinct{rs@rsa }{acc} = \rdistinct{rsa }{acc}$.
\item
More generally, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$,
then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{rs'}{rset}$
\end{itemize}
\end{lemma}
\begin{proof}
By induction on $rs$ and using \ref{rdistinctDoesTheJob}.
\end{proof}
\noindent
On the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated,
then expanding the accumulator to include that element will not cause the output list to change:
\begin{lemma}\label{rdistinctOnDistinct}
The accumulator can be augmented to include elements not appearing in the input list,
and the output will not change.
\begin{itemize}
\item
If $r \notin rs$, then $\rdistinct{rs}{acc} = \rdistinct{rs}{(\{r\} \cup acc)}$.
\item
Particularly, if $\;\;\textit{isDistinct} \; rs$, then we have\\
\[ \rdistinct{rs}{\varnothing} = rs \]
\end{itemize}
\end{lemma}
\begin{proof}
The first half is by induction on $rs$. The second half is a corollary of the first.
\end{proof}
\noindent
The function $\textit{rdistinct}$ removes duplicates from anywhere in a list.
Despite being seemingly obvious,
the induction technique is not as straightforward.
\begin{lemma}\label{distinctRemovesMiddle}
The two properties hold if $r \in rs$:
\begin{itemize}
\item
$\rdistinct{rs}{rset} = \rdistinct{(rs @ [r])}{rset}$\\
and\\
$\rdistinct{(ab :: rs @ [ab])}{rset'} = \rdistinct{(ab :: rs)}{rset'}$
\item
$\rdistinct{ (rs @ rs') }{rset} = \rdistinct{rs @ [r] @ rs'}{rset}$\\
and\\
$\rdistinct{(ab :: rs @ [ab] @ rs'')}{rset'} =
\rdistinct{(ab :: rs @ rs'')}{rset'}$
\end{itemize}
\end{lemma}
\noindent
\begin{proof}
By induction on $rs$. All other variables are allowed to be arbitrary.
The second part of the lemma requires the first.
Note that for each part, the two sub-propositions need to be proven
at the same time,
so that the induction goes through.
\end{proof}
\noindent
This allows us to prove a few more equivalence relations involving
$\textit{rdistinct}$ (they will be useful later):
\begin{lemma}\label{rdistinctConcatGeneral}
\mbox{}
\begin{itemize}
\item
$\rdistinct{(rs @ rs')}{\varnothing} = \rdistinct{((\rdistinct{rs}{\varnothing})@ rs')}{\varnothing}$
\item
$\rdistinct{(rs @ rs')}{\varnothing} = \rdistinct{(\rdistinct{rs}{\varnothing} @ rs')}{\varnothing}$
\item
If $rset' \subseteq rset$, then $\rdistinct{rs}{rset} =
\rdistinct{(\rdistinct{rs}{rset'})}{rset}$. As a corollary
of this,
\item
$\rdistinct{(rs @ rs')}{rset} = \rdistinct{
(\rdistinct{rs}{\varnothing}) @ rs')}{rset}$. This
gives another corollary use later:
\item
If $a \in rset$, then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{
(\rdistinct{(a :: rs)}{\varnothing} @ rs')}{rset} $,
\end{itemize}
\end{lemma}
\begin{proof}
By \ref{rdistinctDoesTheJob} and \ref{distinctRemovesMiddle}.
\end{proof}
\noindent
The next lemma is a more general form of \ref{rdistinctConcat};
It says that
$\textit{rdistinct}$ is composable w.r.t list concatenation:
\begin{lemma}\label{distinctRdistinctAppend}
If $\;\; \textit{isDistinct} \; rs_1$,
and $(set \; rs_1) \cap acc = \varnothing$,
then applying $\textit{rdistinct}$ on $rs_1 @ rs_a$ does not
have an effect on $rs_1$:
\[\textit{rdistinct}\; (rs_1 @ rsa)\;\, acc
= rs_1@(\textit{rdistinct} rsa \; (acc \cup rs_1))\]
\end{lemma}
\begin{proof}
By an induction on
$rs_1$, where $rsa$ and $acc$ are allowed to be arbitrary.
\end{proof}
\noindent
$\textit{rdistinct}$ needs to be applied only once, and
applying it multiple times does not make any difference:
\begin{corollary}\label{distinctOnceEnough}
$\textit{rdistinct} \; (rs @ rsa) {} = \textit{rdistinct} \; ( (rdistinct \;
rs \; \{ \}) @ (\textit{rdistinct} \; rs_a \; (set \; rs)))$
\end{corollary}
\begin{proof}
By lemma \ref{distinctRdistinctAppend}.
\end{proof}
\subsubsection{The Properties of $\textit{Rflts}$}
We give in this subsection some properties
involving $\backslash_r$, $\backslash_{rsimps}$, $\textit{rflts}$ and
$\textit{rsimp}_{ALTS} $, together with any non-trivial lemmas that lead to them.
These will be helpful in later closed-form proofs, when
we want to transform derivative terms which have
%the ways in which multiple functions involving
%those are composed together
interleaving derivatives and simplifications applied to them.
\noindent
%When the function $\textit{Rflts}$
%is applied to the concatenation of two lists; the output can be calculated by first applying the
%functions on two lists separately and then concatenating them together.
$\textit{Rflts}$ is composable in terms of concatenation:
\begin{lemma}\label{rfltsProps}
The function $\rflts$ has the properties below:\\
\begin{itemize}
\item
$\rflts \; (rs_1 @ rs_2) = \rflts \; rs_1 @ \rflts \; rs_2$
\item
If $r \neq \RZERO$ and $\nexists rs_1. r = \RALTS{rs}_1$, then $\rflts \; (r::rs) = r :: \rflts \; rs$
\item
$\rflts \; (rs @ [\RZERO]) = \rflts \; rs$
\item
$\rflts \; (rs' @ [\RALTS{rs}]) = \rflts \; rs'@rs$
\item
$\rflts \; (rs @ [\RONE]) = \rflts \; rs @ [\RONE]$
\item
If $r \neq \RZERO$ and $\nexists rs'. r = \RALTS{rs'}$ then $\rflts \; (rs @ [r])
= (\rflts \; rs) @ [r]$
\item
If $r = \RALTS{rs}$ and $r \in rs'$ then for all $r_1 \in rs.
r_1 \in \rflts \; rs'$.
\item
$\rflts \; (rs_a @ \RZERO :: rs_b) = \rflts \; (rs_a @ rs_b)$
\end{itemize}
\end{lemma}
\noindent
\begin{proof}
By induction on $rs_1$ in the first sub-lemma, and induction on $r$ in the second part,
and induction on $rs$, $rs'$, $rs$, $rs'$, $rs_a$ in the third, fourth, fifth, sixth and
last sub-lemma.
\end{proof}
\noindent
Now we introduce the property that the operations
derivative and $\rsimpalts$
commute, this will be used later on when deriving the closed form for
the alternative regular expression:
\begin{lemma}\label{rderRsimpAltsCommute}
$\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$
\end{lemma}
\begin{proof}
By induction on $rs$.
\end{proof}
\noindent
\subsubsection{The $RL$ Function: Language Interpretation for $\textit{Rrexp}$s}
Much like the definition of $L$ on plain regular expressions, one can also
define the language interpretation for $\rrexp$s.
\begin{center}
\begin{tabular}{lcl}
$RL \; (\ZERO_r)$ & $\dn$ & $\phi$\\
$RL \; (\ONE_r)$ & $\dn$ & $\{[]\}$\\
$RL \; (c)$ & $\dn$ & $\{[c]\}$\\
$RL \; \sum rs$ & $\dn$ & $ \bigcup_{r \in rs} (RL \; r)$\\
$RL \; (r_1 \cdot r_2)$ & $\dn$ & $ RL \; (r_1) @ RL \; (r_2)$\\
$RL \; (r^*)$ & $\dn$ & $ (RL(r))^*$
\end{tabular}
\end{center}
\noindent
The main use of $RL$ is to establish some connections between $\rsimp{}$
and $\rnullable{}$:
\begin{lemma}
The following properties hold:
\begin{itemize}
\item
If $\rnullable{r}$, then $\rsimp{r} \neq \RZERO$.
\item
$\rnullable{r \backslash s} \quad $ if and only if $\quad \rnullable{\rderssimp{r}{s}}$.
\end{itemize}
\end{lemma}
\begin{proof}
The first part is by induction on $r$.
The second part is true because property
\[ RL \; r = RL \; (\rsimp{r})\] holds.
\end{proof}
\subsubsection{Simplified $\textit{Rrexp}$s are Good}
We formalise the notion of ``good" regular expressions,
which means regular expressions that
are fully simplified in terms of our $\textit{rsimp}$ function.
For alternative regular expressions that means they
do not contain any nested alternatives, un-eliminated $\RZERO$s
or duplicate elements (for example,
$r_1 + (r_2 + r_3)$, $\RZERO + r$ and $ \sum [r, r, \ldots]$).
The clauses for $\good$ are:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\good\; \RZERO$ & $\dn$ & $\textit{false}$\\
$\good\; \RONE$ & $\dn$ & $\textit{true}$\\
$\good\; \RCHAR{c}$ & $\dn$ & $\btrue$\\
$\good\; \RALTS{[]}$ & $\dn$ & $\bfalse$\\
$\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\
$\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ &
$\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\
& & $\land \; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \, \land \; \, \textit{nonAlt}\; r')$\\
$\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\
$\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\
$\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\
$\good \; \RSEQ{r_1}{r_2}$ & $\dn$ & $\good \; r_1 \;\, \textit{and} \;\, \good \; r_2$\\
$\good \; \RSTAR{r}$ & $\dn$ & $\btrue$\\
\end{tabular}
\end{center}
\noindent
We omit the recursive definition of the predicate $\textit{nonAlt}$,
which evaluates to true when the regular expression is not an
alternative, and false otherwise.
The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that
its non-empty argument list of expressions are all good themselves, and $\textit{nonAlt}$,
and unique:
\begin{lemma}\label{rsimpaltsGood}
If $rs \neq []$ and for all $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$,
then $\good \; (\rsimpalts \; rs)$ if and only if forall $r \in rs. \; \good \; r$.
\end{lemma}
\noindent
We also note that
if a regular expression $r$ is good, then $\rflts$ on the singleton
list $[r]$ will not break goodness:
\begin{lemma}\label{flts2}
If $\good \; r$, then forall $r' \in \rflts \; [r]. \; \good \; r'$ and $\textit{nonAlt} \; r'$.
\end{lemma}
\begin{proof}
By an induction on $r$.
\end{proof}
\noindent
The other observation we make about $\rsimp{r}$ is that it never
comes with nested alternatives, which we describe as the $\nonnested$
property:
\begin{center}
\begin{tabular}{lcl}
$\nonnested \; \, \sum []$ & $\dn$ & $\btrue$\\
$\nonnested \; \, \sum ((\sum rs_1) :: rs_2)$ & $\dn$ & $\bfalse$\\
$\nonnested \; \, \sum (r :: rs)$ & $\dn$ & $\nonnested (\sum rs)$\\
$\nonnested \; \, r $ & $\dn$ & $\btrue$
\end{tabular}
\end{center}
\noindent
The $\rflts$ function
always opens up nested alternatives,
which enables $\rsimp$ to be non-nested:
\begin{lemma}\label{nonnestedRsimp}
It is always the case that
\begin{center}
$\nonnested \; (\rsimp{r})$
\end{center}
\end{lemma}
\begin{proof}
By induction on $r$.
\end{proof}
\noindent
With this we can prove that a regular expression
after simplification and flattening and de-duplication,
will not contain any alternative regular expression directly:
\begin{lemma}\label{nonaltFltsRd}
If $x \in \rdistinct{\rflts\; (\map \; \rsimp{} \; rs)}{\varnothing}$
then $\textit{nonAlt} \; x$.
\end{lemma}
\begin{proof}
By \ref{nonnestedRsimp}.
\end{proof}
\noindent
The other fact we know is that once $\rsimp{}$ has finished
processing an alternative regular expression, it will not
contain any $\RZERO$s. This is because all the recursive
calls to the simplification on the children regular expressions
make the children good, and $\rflts$ will not delete
any $\RZERO$s out of a good regular expression list,
and $\rdistinct{}$ will not ``mess'' with the result.
\begin{lemma}\label{flts3Obv}
The following are true:
\begin{itemize}
\item
If for all $r \in rs. \, \good \; r $ or $r = \RZERO$,
then for all $r \in \rflts\; rs. \, \good \; r$.
\item
If $x \in \rdistinct{\rflts\; (\map \; rsimp{}\; rs)}{\varnothing}$
and for all $y$ such that $\llbracket y \rrbracket_r$ less than
$\llbracket rs \rrbracket_r + 1$, either
$\good \; (\rsimp{y})$ or $\rsimp{y} = \RZERO$,
then $\good \; x$.
\end{itemize}
\end{lemma}
\begin{proof}
The first part is by induction, where the inductive cases
are the inductive cases of $\rflts$.
The second part is a corollary from the first part.
\end{proof}
This leads to good structural property of $\rsimp{}$,
that after simplification, a regular expression is
either good or $\RZERO$:
\begin{lemma}\label{good1}
For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$.
\end{lemma}
\begin{proof}
By an induction on $r$. The inductive measure is the size $\llbracket \rrbracket_r$.
Lemma \ref{rsimpMono} says that
$\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to
$\llbracket r \rrbracket_r$.
Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case,
The inductive hypothesis applies to the children regular expressions
$r_1$, $r_2$, etc. The lemma \ref{flts3Obv}'s precondition is satisfied
by that as well.
The lemmas \ref{nonnestedRsimp} and \ref{nonaltFltsRd} are used
to ensure that goodness is preserved at the topmost level.
\end{proof}
We shall prove that any good regular expression is
a fixed-point for $\textit{rsimp}$.
First we prove an auxiliary lemma:
\begin{lemma}\label{goodaltsNonalt}
If $\good \; \sum rs$, then $\rflts\; rs = rs$.
\end{lemma}
\begin{proof}
By an induction on $\sum rs$. The inductive rules are the cases
for $\good$.
\end{proof}
\noindent
Now we are ready to prove that good regular expressions are invariant
with respect to $\rsimp{}$:
\begin{lemma}\label{test}
If $\good \;r$ then $\rsimp{r} = r$.
\end{lemma}
\begin{proof}
By an induction on the inductive cases of $\good$, using lemmas
\ref{goodaltsNonalt} and \ref{rdistinctOnDistinct}.
The lemma \ref{goodaltsNonalt} is used in the alternative
case where 2 or more elements are present in the list.
\end{proof}
\noindent
Below we show a property involving $\rflts$, $\textit{rdistinct}$,
$\rsimp{}$ and $\rsimp_{ALTS}$,
which requires $\ref{good1}$ to go through smoothly:
\begin{lemma}\label{flattenRsimpalts}
An application of $\rsimp_{ALTS}$ can be ``absorbed'',
if its output is concatenated with a list and then applied to $\rflts$.
\begin{center}
$\rflts \; ( (\rsimp_{ALTS} \;
(\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing})) ::
\map \; \rsimp{} \; rs' ) =
\rflts \; ( (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing}) @ (
\map \; \rsimp{rs'}))$
\end{center}
\end{lemma}
\begin{proof}
By \ref{good1}.
\end{proof}
\noindent
We are also ready to prove that $\textit{rsimp}$ is idempotent.
\subsubsection{$\rsimp$ is Idempotent}
The idempotency of $\rsimp$ is very useful in
manipulating regular expression terms into desired
forms so that key steps allowing further rewriting to closed forms
are possible.
\begin{lemma}\label{rsimpIdem}
$\rsimp{r} = \rsimp{(\rsimp{r})}$
\end{lemma}
\begin{proof}
By \ref{test} and \ref{good1}.
\end{proof}
\noindent
This property means we do not have to repeatedly
apply simplification in each step, which justifies
our definition of $\blexersimp$.
This is in contrast to the work of Sulzmann and Lu where
the simplification is applied in a fixpoint manner.
On the other hand, we can repeat the same $\rsimp{}$ applications
on regular expressions as many times as we want, if we have at least
one simplification applied to it, and apply it wherever we need to:
\begin{corollary}\label{headOneMoreSimp}
The following properties hold, directly from \ref{rsimpIdem}:
\begin{itemize}
\item
$\map \; \rsimp{(r :: rs)} = \map \; \rsimp{} \; (\rsimp{r} :: rs)$
\item
$\rsimp{(\RALTS{rs})} = \rsimp{(\RALTS{\map \; \rsimp{} \; rs})}$
\end{itemize}
\end{corollary}
\noindent
This will be useful in the later closed-form proof's rewriting steps.
Similarly, we state the following useful facts below:
\begin{lemma}
The following equalities hold if $r = \rsimp{r'}$ for some $r'$:
\begin{itemize}
\item
If $r = \sum rs$ then $\rsimpalts \; rs = \sum rs$.
\item
If $r = \sum rs$ then $\rdistinct{rs}{\varnothing} = rs$.
\item
$\rsimpalts \; (\rdistinct{\rflts \; [r]}{\varnothing}) = r$.
\end{itemize}
\end{lemma}
\begin{proof}
By application of lemmas \ref{rsimpIdem} and \ref{good1}.
\end{proof}
\noindent
With the idempotency of $\textit{rsimp}$ and its corollaries,
we can start proving some key equalities leading to the
closed forms.
Next we present a few equivalent terms under $\textit{rsimp}$.
To make the notation more concise
We use $r_1 \sequal r_2 $ to denote that $\rsimp{r_1} = \rsimp{r_2}$.
%\begin{center}
%\begin{tabular}{lcl}
% $a \sequal b$ & $ \dn$ & $ \textit{rsimp} \; a = \textit{rsimp} \; b$
%\end{tabular}
%\end{center}
%\noindent
%\vspace{0em}
\begin{lemma}
The following equivalence hold:
\begin{itemize}
\item
$\rsimpalts \; (\RZERO :: rs) \sequal \rsimpalts\; rs$
\item
$\rsimpalts \; rs \sequal \rsimpalts (\map \; \rsimp{} \; rs)$
\item
$\RALTS{\RALTS{rs}} \sequal \RALTS{rs}$
\item
$\sum ((\sum rs_a) :: rs_b) \sequal \sum rs_a @ rs_b$
\item
$\RALTS{rs} \sequal \RALTS{\map \; \rsimp{} \; rs}$
\end{itemize}
\end{lemma}
\begin{proof}
By induction on the lists involved.
\end{proof}
\noindent
The above allows us to prove
two similar equalities (which are a bit more involved).
It says that we could flatten the elements
before simplification and still get the same result.
\begin{lemma}\label{simpFlatten3}
One can flatten the inside $\sum$ of a $\sum$ if it is being
simplified. Concretely,
\begin{itemize}
\item
If for all $r \in rs, rs', rs''$, we have $\good \; r $
or $r = \RZERO$, then $\sum (rs' @ rs @ rs'') \sequal
\sum (rs' @ [\sum rs] @ rs'')$ holds. As a corollary,
\item
$\sum (rs' @ [\sum rs] @ rs'') \sequal \sum (rs' @ rs @ rs'')$
\end{itemize}
\end{lemma}
\begin{proof}
By rewriting steps involving the use of \ref{test} and \ref{rdistinctConcatGeneral}.
The second sub-lemma is a corollary of the previous.
\end{proof}
%Rewriting steps not put in--too long and complicated-------------------------------
\begin{comment}
\begin{center}
$\rsimp{\sum (rs' @ rs @ rs'')} \stackrel{def of bsimp}{=}$ \\
$\rsimpalts \; (\rdistinct{\rflts \; ((\map \; \rsimp{}\; rs') @ (\map \; \rsimp{} \; rs ) @ (\map \; \rsimp{} \; rs''))}{\varnothing})$ \\
$\stackrel{by \ref{test}}{=}
\rsimpalts \; (\rdistinct{(\rflts \; rs' @ \rflts \; rs @ \rflts \; rs'')}{
\varnothing})$\\
$\stackrel{by \ref{rdistinctConcatGeneral}}{=}
\rsimpalts \; (\rdistinct{\rflts \; rs'}{\varnothing} @ \rdistinct{(
\rflts\; rs @ \rflts \; rs'')}{\rflts \; rs'})$\\
\end{center}
\end{comment}
%Rewriting steps not put in--too long and complicated-------------------------------
\noindent
We need more equalities like the above to enable a closed form lemma,
for which we need to introduce a few rewrite relations
to help
us obtain them.
\subsection{The rewrite relation $\hrewrite$ , $\scfrewrites$ , $\frewrite$ and $\grewrite$}
Inspired by the success we had in the correctness proof
in \ref{Bitcoded2},
we follow suit here, defining atomic simplification
steps as ``small-step'' rewriting steps. This allows capturing
similarities between terms that would be otherwise
hard to express.
We use $\hrewrite$ for one-step atomic rewrite of
regular expression simplification,
$\frewrite$ for rewrite of list of regular expressions that
include all operations carried out in $\rflts$, and $\grewrite$ for
rewriting a list of regular expressions possible in both $\rflts$ and $\textit{rdistinct}$.
Their reflexive transitive closures are used to denote zero or many steps,
as was the case in the previous chapter.
As we have already
done something similar, the presentation about
these rewriting rules will be more concise than that in \ref{Bitcoded2}.
To differentiate between the rewriting steps for annotated regular expressions
and $\rrexp$s, we add characters $h$ and $g$ below the squig arrow symbol
to mean atomic simplification transitions
of $\rrexp$s and $\rrexp$ lists, respectively.
\begin{figure}[H]
\begin{center}
\begin{mathpar}
\inferrule[RSEQ0L]{}{\RZERO \cdot r_2 \hrewrite \RZERO\\}
\inferrule[RSEQ0R]{}{r_1 \cdot \RZERO \hrewrite \RZERO\\}
\inferrule[RSEQ1]{}{(\RONE \cdot r) \hrewrite r\\}\\
\inferrule[RSEQL]{ r_1 \hrewrite r_2}{r_1 \cdot r_3 \hrewrite r_2 \cdot r_3\\}
\inferrule[RSEQR]{ r_3 \hrewrite r_4}{r_1 \cdot r_3 \hrewrite r_1 \cdot r_4\\}\\
\inferrule[RALTSChild]{r \hrewrite r'}{\sum (rs_1 @ [r] @ rs_2) \hrewrite \sum (rs_1 @ [r'] @ rs_2)\\}
\inferrule[RALTS0]{}{\sum (rs_a @ [\RZERO] @ rs_b) \hrewrite \sum (rs_a @ rs_b)}
\inferrule[RALTSNested]{}{\sum (rs_a @ [\sum rs_1] @ rs_b) \hrewrite \sum (rs_a @ rs_1 @ rs_b)}
\inferrule[RALTSNil]{}{ \sum [] \hrewrite \RZERO\\}
\inferrule[RALTSSingle]{}{ \sum [r] \hrewrite r\\}
\inferrule[RALTSDelete]{\\ r_1 = r_2}{\sum rs_a @ [r_1] @ rs_b @ [r_2] @ rsc \hrewrite \sum rs_a @ [r_1] @ rs_b @ rs_c}
\end{mathpar}
\end{center}
\caption{List of one-step rewrite rules for r-regular expressions ($\hrewrite$)}\label{hRewrite}
\end{figure}
Like $\rightsquigarrow_s$, it is
convenient to define rewrite rules for a list of regular expressions,
where each element can rewrite in many steps to the other (scf stands for
li\emph{s}t \emph{c}losed \emph{f}orm). This relation is similar to the
$\stackrel{s*}{\rightsquigarrow}$ for annotated regular expressions.
\begin{figure}[H]
\begin{center}
\begin{mathpar}
\inferrule{}{[] \scfrewrites [] }
\inferrule{r \hrewrites r' \\ rs \scfrewrites rs'}{r :: rs \scfrewrites r' :: rs'}
\end{mathpar}
\end{center}
\caption{List of one-step rewrite rules for a list of r-regular expressions}\label{scfRewrite}
\end{figure}
%frewrite
List of one-step rewrite rules for flattening
a list of regular expressions($\frewrite$):
\begin{figure}[H]
\begin{center}
\begin{mathpar}
\inferrule{}{\RZERO :: rs \frewrite rs \\}
\inferrule{}{(\sum rs) :: rs_a \frewrite rs @ rs_a \\}
\inferrule{rs_1 \frewrite rs_2}{r :: rs_1 \frewrite r :: rs_2}
\end{mathpar}
\end{center}
\caption{List of one-step rewrite rules characterising the $\rflts$ operation on a list}\label{fRewrites}
\end{figure}
Lists of one-step rewrite rules for flattening and de-duplicating
a list of regular expressions ($\grewrite$):
\begin{figure}[H]
\begin{center}
\begin{mathpar}
\inferrule{}{\RZERO :: rs \grewrite rs \\}
\inferrule{}{(\sum rs) :: rs_a \grewrite rs @ rs_a \\}
\inferrule{rs_1 \grewrite rs_2}{r :: rs_1 \grewrite r :: rs_2}
\inferrule[dB]{}{rs_a @ [a] @ rs_b @[a] @ rs_c \grewrite rs_a @ [a] @ rsb @ rsc}
\end{mathpar}
\end{center}
\caption{List of one-step rewrite rules characterising the $\rflts$ and $\textit{rdistinct}$
operations}\label{gRewrite}
\end{figure}
\noindent
We define
two separate list rewriting relations $\frewrite$ and $\grewrite$.
The rewriting steps that take place during
flattening are characterised by $\frewrite$.
The rewrite relation $\grewrite$ characterises both flattening and de-duplicating.
Sometimes $\grewrites$ is slightly too powerful
so we would rather use $\frewrites$ to prove
%because we only
equalities related to $\rflts$.
%certain equivalence under the rewriting steps of $\frewrites$.
For example, when proving the closed-form for the alternative regular expression,
one of the equalities needed is:
\begin{center}
$\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal
\sum (\rDistinct \;\; (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing)
$
\end{center}
\noindent
Proving this is by first showing
\begin{lemma}\label{earlyLaterDerFrewrites}
$\map \; (\_ \backslash x) \; (\rflts \; rs) \frewrites
\rflts \; (\map \; (\_ \backslash x) \; rs)$
\end{lemma}
\noindent
and then the equivalence between two terms
that can reduce in many steps to each other:
\begin{lemma}\label{frewritesSimpeq}
If $rs_1 \frewrites rs_2 $, then $\sum (\rDistinct \; rs_1 \; \varnothing) \sequal
\sum (\rDistinct \; rs_2 \; \varnothing)$.
\end{lemma}
\noindent
These two lemmas can both be proven using a straightforward induction (and
the proofs for them are therefore omitted).
Now the above equalities can be derived with ease:
\begin{corollary}
$\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal
\sum (\rDistinct \;\; (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing)
$
\end{corollary}
\begin{proof}
By lemmas \ref{earlyLaterDerFrewrites} and \ref{frewritesSimpeq}.
\end{proof}
But this trick will not work for $\grewrites$.
For example, a rewriting step in proving
closed forms is:
\begin{center}
$\rsimp{(\rsimpalts \; (\map \; (\_ \backslash x) \; (\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs}))))}{\varnothing})))}$\\
$=$ \\
$\rsimp{(\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; (\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}))} $
\noindent
\end{center}
For this, one would hope to have a rewriting relation between the two lists involved,
similar to \ref{earlyLaterDerFrewrites}. However, it turns out that
\begin{center}
$\map \; (\_ \backslash x) \; (\rDistinct \; rs \; rset) \grewrites \rDistinct \; (\map \;
(\_ \backslash x) \; rs) \; ( rset \backslash x)$
\end{center}
\noindent
does $\mathbf{not}$ hold in general.
For this rewriting step we will introduce some slightly more cumbersome
proof technique later.
The point is that $\frewrite$
allows us to prove equivalence in a straightforward way that is
not possible for $\grewrite$.
\subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$}
In this part, we present lemmas stating
pairs of r-regular expressions and r-regular expression lists
where one can rewrite from one in many steps to the other.
Most of the proofs to these lemmas are straightforward, using
an induction on the corresponding rewriting relations.
These proofs will therefore be omitted when this is the case.
We present in the following lemma a few pairs of terms that are rewritable via
$\grewrites$:
\begin{lemma}\label{gstarRdistinctGeneral}
\mbox{}
\begin{itemize}
\item
$rs_1 @ rs \grewrites rs_1 @ (\rDistinct \; rs \; rs_1)$
\item
$rs \grewrites \rDistinct \; rs \; \varnothing$
\item
$rs_a @ (\rDistinct \; rs \; rs_a) \grewrites rs_a @ (\rDistinct \;
rs \; (\{\RZERO\} \cup rs_a))$
\item
$rs \;\; @ \;\; \rDistinct \; rs_a \; rset \grewrites rs @ \rDistinct \; rs_a \;
(rest \cup rs)$
\end{itemize}
\end{lemma}
\noindent
If a pair of terms $rs_1, rs_2$ are rewritable via $\grewrites$ to each other,
then they are equivalent under $\rsimp{}$:
\begin{lemma}\label{grewritesSimpalts}
\mbox{}
If $rs_1 \grewrites rs_2$, then
we have the following equivalence:
\begin{itemize}
\item
$\sum rs_1 \sequal \sum rs_2$
\item
$\rsimpalts \; rs_1 \sequal \rsimpalts \; rs_2$
\end{itemize}
\end{lemma}
\noindent
Here are a few connecting lemmas showing that
if a list of regular expressions can be rewritten using $\grewrites$ or $\frewrites $ or
$\scfrewrites$,
then an alternative constructor taking the list can also be rewritten using $\hrewrites$:
\begin{lemma}
\begin{itemize}
\item
If $rs \grewrites rs'$ then $\sum rs \hrewrites \sum rs'$.
\item
If $rs \grewrites rs'$ then $\sum rs \hrewrites \rsimpalts \; rs'$
\item
If $rs_1 \scfrewrites rs_2$ then $\sum (rs @ rs_1) \hrewrites \sum (rs @ rs_2)$
\item
If $rs_1 \scfrewrites rs_2$ then $\sum rs_1 \hrewrites \sum rs_2$
\end{itemize}
\end{lemma}
\noindent
Now comes the core of the proof,
which says that once two lists are rewritable to each other,
then they are equivalent under $\textit{rsimp}$:
\begin{lemma}
If $r_1 \hrewrites r_2$ then $r_1 \sequal r_2$.
\end{lemma}
\noindent
Similar to what we did in chapter \ref{Bitcoded2},
we prove that if one can rewrite from one r-regular expression ($r$)
to the other ($r'$), after taking derivatives one can still rewrite
the first ($r\backslash c$) to the other ($r'\backslash c$).
\begin{lemma}\label{interleave}
If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$
\end{lemma}
\noindent
This allows us to prove more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$.
\begin{lemma}\label{insideSimpRemoval}
$\rsimp{(\rder{c}{(\rsimp{r})})} = \rsimp{(\rder{c}{r})} $
\end{lemma}
\noindent
\begin{proof}
By \ref{interleave} and \ref{rsimpIdem}.
\end{proof}
\noindent
And this unlocks more equivalent terms:
\begin{lemma}\label{Simpders}
As corollaries of \ref{insideSimpRemoval}, we have
\begin{itemize}
\item
If $s \neq []$ then $\rderssimp{r}{s} = \rsimp{( r \backslash_{rs} s)}$.
\item
$\rsimpalts \; (\map \; (\_ \backslash_r x) \;
(\rdistinct{rs}{\varnothing})) \sequal
\rsimpalts \; (\rDistinct \;
(\map \; (\_ \backslash_r x) rs) \;\varnothing )$
\end{itemize}
\end{lemma}
\begin{proof}
Part 1 is by lemma \ref{insideSimpRemoval},
part 2 is by lemma \ref{insideSimpRemoval} .%and \ref{distinctDer}.
\end{proof}
\noindent
\subsection{Closed Forms for $\sum rs$, $r_1\cdot r_2$ and $r^*$}
Lemma \ref{Simpders} leads to our first closed form,
which is for the alternative regular expression:
\begin{theorem}\label{altsClosedForm}
\mbox{}
\begin{center}
$\rderssimp{(\sum rs)}{s} \sequal
\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
\end{center}
\end{theorem}
\noindent
\begin{proof}
By a reverse induction on the string $s$.
One rewriting step, as we mentioned earlier,
involves
\begin{center}
$\rsimpalts \; (\map \; (\_ \backslash x) \;
(\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \;
(\lambda r. \rderssimp{r}{xs}))))}{\varnothing}))
\sequal
\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \;
(\rflts \; (\map \; (\rsimp{} \; \circ \;
(\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}) $.
\end{center}
This can be proven by a combination of
\ref{grewritesSimpalts}, \ref{gstarRdistinctGeneral}, \ref{rderRsimpAltsCommute}, and
\ref{insideSimpRemoval}.
\end{proof}
\noindent
This closed form has a variant which can be more convenient in later proofs:
\begin{corollary}\label{altsClosedForm1}
If $s \neq []$ then
$\rderssimp \; (\sum \; rs) \; s =
\rsimp{(\sum \; (\map \; \rderssimp{\_}{s} \; rs))}$.
\end{corollary}
\noindent
The harder closed forms are the sequence and star ones.
Before we obtain them, some preliminary definitions
are needed to make proof statements concise.
\subsubsection{Closed Form for Sequence Regular Expressions}
For the sequence regular expression,
let's first look at a series of derivative steps on it
(assuming that each time when a derivative is taken,
the head of the sequence is always nullable):
\begin{center}
\begin{tabular}{llll}
$r_1 \cdot r_2$ &
$\longrightarrow_{\backslash c}$ &
$r_1\backslash c \cdot r_2 + r_2 \backslash c$ &
$ \longrightarrow_{\backslash c'} $ \\
\\
$(r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc'$ &
$\longrightarrow_{\backslash c''} $ &
$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'')
+ r_2 \backslash cc'c''$ &
$ \longrightarrow_{\backslash c''} \quad \ldots$\\
\end{tabular}
\end{center}
Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expressed as
a giant alternative taking a list of terms
$[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$,
where the head of the list is always the term
representing a match involving only $r_1$, and the tail of the list consisting of
terms of the shape $r_2 \backslash_r s''$, $s''$ being a suffix of $s$.
This intuition is also echoed by Murugesan and Sundaram \cite{Murugesan2014},
where they gave
a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$:
\begin{center}
\begin{tabular}{lc}
$L \; [ (r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) ]$ & $ =$\\
\\
\rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1) \cdot r_2 +
(\delta\; (\nullable \; r_1) \; (r_2 \backslash_r c_1) )) \backslash_r
(c_2 :: \ldots c_n) ]$ &
$=$\\
\\
\rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1c_2 \cdot r_2 +
(\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2)))
$ & \\
\\
$\quad + (\delta \ (\nullable \; r_1 \backslash_r c)\; (r_2 \backslash_r c_2) ))
\backslash_r (c_3 \ldots c_n) ]$ & $\ldots$ \\
\end{tabular}
\end{center}
\noindent
The $\delta$ function
returns $r$ when the boolean condition
$b$ evaluates to true and
$\ZERO_r$ otherwise:
\begin{center}
\begin{tabular}{lcl}
$\delta \; b\; r$ & $\dn$ & $r \quad \textit{if} \; b \; is \;\textit{true}$\\
& $\dn$ & $\ZERO_r \quad otherwise$
\end{tabular}
\end{center}
\noindent
Note that the term
\begin{center}
\begin{tabular}{lc}
\rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 +
(\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2)))
$ & \\
\\
$\quad + (\delta \ (\nullable \; r_1 \backslash_r c)\; (r_2 \backslash_r c_2) ))
\backslash_r (c_3 \ldots c_n)$ &\\
\end{tabular}
\end{center}
\noindent
does not faithfully
represent what the intermediate derivatives would actually look like
when one or more intermediate results $r_1 \backslash s' \cdot r_2$ are not
nullable in the head of the sequence.
For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable,
the regular expression would not look like
\[
r_1 \backslash_r c_1c_2
\]
instead of
\[
(r_1 \backslash_r c_1c_2 + \ZERO_r ) + \ZERO_r.
\]
The redundant $\ZERO_r$s will not be created in the
first place.
In a closed-form one needs to take into account this (because
closed forms require exact equality rather than language equivalence)
and only generate the
$r_2 \backslash_r s''$ terms satisfying the property
\begin{center}
$\exists s'. such \; that \; s'@s'' = s \;\; \land \;\;
r_1 \backslash s' \; is \; nullable$.
\end{center}
Given the arguments $s$ and $r_1$, we denote the list of strings
$s''$ satisfying the above property as $\vsuf{s}{r_1}$.
The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string\footnote{
Perhaps a better name for it would be ``NullablePrefixSuffix''
to differentiate with the list of \emph{all} prefixes of $s$, but
that is a bit too long for a function name and we are yet to find
a more concise and easy-to-understand name.}
\begin{center}
\begin{tabular}{lcl}
$\vsuf{[]}{\_} $ & $=$ & $[]$\\
$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} \; (\rnullable{r_1}) \; \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
&& $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $
\end{tabular}
\end{center}
\noindent
The list starts with shorter suffixes
and ends with longer ones (in other words, the string elements $s''$
in the list $\vsuf{s}{r_1}$ are sorted
in the same order as that of the terms $r_2\backslash s''$
appearing in $(r_1\cdot r_2)\backslash s$).
In essence, $\vsuf{\_}{\_}$ is doing a
"virtual derivative" of $r_1 \cdot r_2$, but instead of producing
the entire result $(r_1 \cdot r_2) \backslash s$,
it only stores strings,
with each string $s''$ representing a term such that $r_2 \backslash s''$
is occurring in $(r_1\cdot r_2)\backslash s$.
With $\textit{Suffix}$ we are ready to express the
sequence regular expression's closed form,
but before doing so
more definitions are needed.
The first thing is the flattening function $\sflat{\_}$,
which takes an alternative regular expression and produces a flattened version
of that alternative regular expression.
It is needed to convert
a left-associative nested sequence of alternatives into
a flattened list:
\[
\sum(\ldots ((r_1 + r_2) + r_3) + \ldots)
\stackrel{\sflat{\_}}{\rightarrow}
\sum[r_1, r_2, r_3, \ldots]
\]
\noindent
The definitions of $\sflat{\_}$ and helper functions
$\sflataux{\_}$ and $\llparenthesis \_ \rrparenthesis''$ are given below.
\begin{center}
\begin{tabular}{lcl}
$\sflataux{\sum r :: rs}$ & $\dn$ & $\sflataux{r} @ rs$\\
$\sflataux{\sum []}$ & $ \dn $ & $ []$\\
$\sflataux r$ & $\dn$ & $ [r]$
\end{tabular}
\end{center}
\begin{center}
\begin{tabular}{lcl}
$\sflat{(\sum r :: rs)}$ & $\dn$ & $\sum (\sflataux{r} @ rs)$\\
$\sflat{\sum []}$ & $ \dn $ & $ \sum []$\\
$\sflat r$ & $\dn$ & $ r$
\end{tabular}
\end{center}
\begin{center}
\begin{tabular}{lcl}
$\sflataux{[]}'$ & $ \dn $ & $ []$\\
$\sflataux{ (r_1 + r_2) :: rs }'$ & $\dn$ & $r_1 :: r_2 :: rs$\\
$\sflataux{r :: rs}'$ & $\dn$ & $ r::rs$
\end{tabular}
\end{center}
\noindent
$\sflataux{\_}$ breaks up nested alternative regular expressions
of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
into a "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
It will return the singleton list $[r]$ otherwise.
$\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps
the output type a regular expression, not a list.
$\sflataux{\_}$ and $\sflat{\_}$ are only recursive on the
first element of the list.
$\sflataux{\_}'$ takes a list of regular expressions as input, and outputs
a list of regular expressions.
The use of $\sflataux{\_}$ and $\sflataux{\_}'$ is clear once we have
$\textit{createdBySequence}$ defined:
\begin{center}
\begin{mathpar}
\inferrule{\mbox{}}{\textit{createdBySequence}\; (r_1 \cdot r_2)}
\inferrule{\textit{createdBySequence} \; r_1}{\textit{createdBySequence} \;
(r_1 + r_2)}
\end{mathpar}
\end{center}
\noindent
The predicate $\textit{createdBySequence}$ is used to describe the shape of
the derivative regular expressions $(r_1\cdot r_2) \backslash s$:
\begin{lemma}\label{recursivelyDerseq}
It is always the case that
\begin{center}
$\textit{createdBySequence} \; ( (r_1\cdot r_2) \backslash_r s) $
\end{center}
holds.
\end{lemma}
\begin{proof}
By a reverse induction on the string $s$, where the inductive cases are $[]$
and $xs @ [x]$.
\end{proof}
\noindent
If we have a regular expression $r$ whose shape
fits into those described by $\textit{createdBySequence}$,
then we can convert between
$r \backslash_r c$ and $(\sflataux{r}) \backslash_r c$ with
$\sflataux{\_}'$:
\begin{lemma}\label{sfauIdemDer}
If $\textit{createdBySequence} \; r$, then
\begin{center}
$\sflataux{ r \backslash_r c} =
\llparenthesis (\map \; (\_ \backslash_r c) \; (\sflataux{r}) ) \rrparenthesis''$
\end{center}
holds.
\end{lemma}
\begin{proof}
By a simple induction on the inductive cases of $\textit{createdBySequence}.
$
\end{proof}
Now we are ready to express
the shape of $r_1 \cdot r_2 \backslash s$
\begin{lemma}\label{seqSfau0}
$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2
:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r_1))$
\end{lemma}
\begin{proof}
By a reverse induction on the string $s$, where the inductive cases
are $[]$ and $xs @ [x]$.
For the inductive case, we know that $\textit{createdBySequence} \; ((r_1 \cdot r_2)
\backslash_r xs)$ holds from lemma \ref{recursivelyDerseq},
which can be used to prove
\[
\map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\;
\map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash \_) \; (\vsuf{xs}{r_1}))
\]
=
\[
\map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1})
\]
using lemma \ref{sfauIdemDer}.
This equality enables the inductive case to go through.
\end{proof}
\noindent
This lemma says that $(r_1\cdot r_2)\backslash s$
can be flattened into a list whose head and tail meet the description
we gave earlier.
%Note that this lemma does $\mathbf{not}$ depend on any
%specific definitions we used,
%allowing people investigating derivatives to get an alternative
%view of what $r_1 \cdot r_2$ is.
We now use $\textit{createdBySequence}$ and
$\sflataux{\_}$ to describe an intuition
behind the sequence closed form.
If two regular expressions only differ in the way their
alternatives are nested, then we should be able to get the same result
once we apply simplification to both of them:
\begin{lemma}\label{sflatRsimpeq}
If $r$ is created from a sequence through
a series of derivatives
(i.e. if $\textit{createdBySequence} \; r$ holds),
and that $\sflataux{r} = rs$,
then we have
that
\begin{center}
$\textit{rsimp} \; r = \textit{rsimp} \; (\sum \; rs)$
\end{center}
holds.
\end{lemma}
\begin{proof}
By an induction on the inductive cases of $\textit{createdBySequence}$.
\end{proof}
Now we are ready for the closed form
for the sequence regular expressions (without the inner applications
of simplifications):
\begin{lemma}\label{seqClosedFormGeneral}
$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }
=\rsimp{(\sum ( (r_1 \backslash s) \cdot r_2 ::
\map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$
\end{lemma}
\begin{proof}
We know that
$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2
:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r_1))$
holds
by lemma \ref{seqSfau0}.
This allows the theorem to go through because of lemma \ref{sflatRsimpeq}.
\end{proof}
Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}),
it is possible to convert the above lemma to obtain the
proper closed form for $\backslash_{rsimps}$ rather than $\backslash_r$:
for derivatives nested with simplification:
\begin{theorem}\label{seqClosedForm}
$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 )
:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$
\end{theorem}
\begin{proof}
By a case analysis of the string $s$.
When $s$ is an empty list, the rewrite is straightforward.
When $s$ is a non-empty list, the
lemmas \ref{seqClosedFormGeneral} and \ref{Simpders} apply,
making the proof go through.
\end{proof}
\subsubsection{Closed Forms for Star Regular Expressions}
The closed form for the star regular expression involves similar tricks
for the sequence regular expression.
The $\textit{Suffix}$ function is now replaced by something
slightly more complex, because the growth pattern of star
regular expressions' derivatives is a bit different:
\begin{center}
\begin{tabular}{lclc}
$r^* $ & $\longrightarrow_{\backslash c}$ &
$(r\backslash c) \cdot r^*$ & $\longrightarrow_{\backslash c'}$\\
\\
$r \backslash cc' \cdot r^* + r \backslash c' \cdot r^*$ &
$\longrightarrow_{\backslash c''}$ &
$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') +
(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)$ &
$\longrightarrow_{\backslash c'''}$ \\
\\
$\ldots$\\
\end{tabular}
\end{center}
When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$,
$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
the number of terms in $r^* \backslash s$ will grow exponentially rather than linearly
in the sequence case.
The good news is that the function $\textit{rsimp}$ will again
ignore the difference between different nesting patterns of alternatives,
and the exponentially growing star derivative like
\begin{center}
$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') +
(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $
\end{center}
can be treated as
\begin{center}
$\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'',
r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
\end{center}
which can be de-duplicated by $\rDistinct$
and therefore bounded finitely.
%and then de-duplicate terms of the form ($s'$ being a substring of $s$).
%This allows us to use a similar technique as $r_1 \cdot r_2$ case,
Now the crux of this section is finding a suitable description
for $rs$ where
\begin{center}
$\rderssimp{r^*}{s} = \rsimp{\sum rs}$.
\end{center}
holds.
In addition, the list $rs$
shall be in the form of
$\map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss$.
The $Ss$ is a list of strings, and for example in the sequence
closed form it is specified as $\textit{Suffix} \; s \; r_1$.
To get $Ss$ for the star regular expression,
we need to introduce $\starupdate$ and $\starupdates$:
\begin{center}
\begin{tabular}{lcl}
$\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
$\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\
& & $\textit{if} \;
(\rnullable \; (r \backslash_{rs} s))$ \\
& & $\textit{then} \;\; (s @ [c]) :: [c] :: (
\starupdate \; c \; r \; Ss)$ \\
& & $\textit{else} \;\; (s @ [c]) :: (
\starupdate \; c \; r \; Ss)$
\end{tabular}
\end{center}
\begin{center}
\begin{tabular}{lcl}
$\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\
$\starupdates \; (c :: cs) \; r \; Ss$ & $=$ & $\starupdates \; cs \; r \; (
\starupdate \; c \; r \; Ss)$
\end{tabular}
\end{center}
\noindent
Assuming we have that
\begin{center}
$\rderssimp{r^*}{s} = \rsimp{(\sum \map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss)}$.
\end{center}
holds.
The idea of $\starupdate$ and $\starupdates$
is to update $Ss$ when another
derivative is taken on $\rderssimp{r^*}{s}$
w.r.t a character $c$ and a string $s'$
respectively.
Both $\starupdate$ and $\starupdates$ take three arguments as input:
the new character $c$ or string $s$ to take derivative with,
the regular expression
$r$ under the star $r^*$, and the
list of strings $Ss$ for the derivative $r^* \backslash s$
up until this point
such that
\begin{center}
$(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$
\end{center}
is satisfied.
Functions $\starupdate$ and $\starupdates$ characterise what the
star derivatives will look like once ``straightened out'' into lists.
The helper functions for such operations will be similar to
$\sflat{\_}$, $\sflataux{\_}$ and $\sflataux{\_}$, which we defined for sequence.
We use similar symbols to denote them, with a $*$ subscript to mark the difference.
\begin{center}
\begin{tabular}{lcl}
$\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
$\hflataux{r}$ & $\dn$ & $[r]$
\end{tabular}
\end{center}
\begin{center}
\begin{tabular}{lcl}
$\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\
$\hflat{r}$ & $\dn$ & $r$
\end{tabular}
\end{center}
\noindent
These definitions are tailor-made for dealing with alternatives that have
originated from a star's derivatives.
A typical star derivative always has the structure of a balanced binary tree:
\begin{center}
$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') +
(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $
\end{center}
All of the nested structures of alternatives
generated from derivatives are binary, and therefore
$\hflat{\_}$ and $\hflataux{\_}$ only deal with binary alternatives.
$\hflat{\_}$ ``untangles'' like the following:
\[
\sum ((r_1 + r_2) + (r_3 + r_4)) + \ldots \;
\stackrel{\hflat{\_}}{\longrightarrow} \;
\RALTS{[r_1, r_2, \ldots, r_n]}
\]
Here is a lemma stating the recursive property of $\starupdate$ and $\starupdates$,
with the helpers $\hflat{\_}$ and $\hflataux{\_}$\footnote{The function $\textit{concat}$ takes a list of lists
and merges each of the element lists to form a flattened list.}:
\begin{lemma}\label{stupdateInduct1}
\mbox
For a list of strings $Ss$, the following hold.
\begin{itemize}
\item
If we do a derivative on the terms
$r\backslash_r s \cdot r^*$ (where $s$ is taken from the list $Ss$),
the result will be the same as if we apply $\starupdate$ to $Ss$.
\begin{center}
\begin{tabular}{c}
$\textit{concat} \; (\map \; (\hflataux{\_} \circ ( (\_\backslash_r x)
\circ (\lambda s.\;\; (r \backslash_r s) \cdot r^*)))\; Ss )\;
$\\
\\
$=$ \\
\\
$\map \; (\lambda s. (r \backslash_r s) \cdot (r^*)) \;
(\starupdate \; x \; r \; Ss)$
\end{tabular}
\end{center}
\item
$\starupdates$ is ``composable'' w.r.t a derivative.
It piggybacks the character $x$ to the tail of the string
$xs$.
\begin{center}
\begin{tabular}{c}
$\textit{concat} \; (\map \; \hflataux{\_} \;
(\map \; (\_\backslash_r x) \;
(\map \; (\lambda s.\;\; (r \backslash_r s) \cdot
(r^*) ) \; (\starupdates \; xs \; r \; Ss))))$\\
\\
$=$\\
\\
$\map \; (\lambda s.\;\; (r\backslash_r s) \cdot (r^*)) \;
(\starupdates \; (xs @ [x]) \; r \; Ss)$
\end{tabular}
\end{center}
\end{itemize}
\end{lemma}
\begin{proof}
Part 1 is by induction on $Ss$.
Part 2 is by induction on $xs$, where $Ss$ is left to take arbitrary values.
\end{proof}
Like $\textit{createdBySequence}$, we need
a predicate for ``star-created'' regular expressions:
\begin{center}
\begin{mathpar}
\inferrule{\mbox{}}{ \textit{createdByStar}\; \RSEQ{ra}{\RSTAR{rb}} }
\inferrule{ \textit{createdByStar} \; r_1\; \land \; \textit{createdByStar} \; r_2 }{\textit{createdByStar} \; (r_1 + r_2) }
\end{mathpar}
\end{center}
\noindent
All regular expressions created by taking derivatives of
$r_1 \cdot (r_2)^*$ satisfy the $\textit{createdByStar}$ predicate:
\begin{lemma}\label{starDersCbs}
$\textit{createdByStar} \; ((r_1 \cdot r_2^*) \backslash_r s) $ holds.
\end{lemma}
\begin{proof}
By a reverse induction on $s$.
\end{proof}
If a regular expression conforms to the shape of a star's derivative,
then we can push an application of $\hflataux{\_}$ inside a derivative of it:
\begin{lemma}\label{hfauPushin}
If $\textit{createdByStar} \; r$ holds, then
\begin{center}
$\hflataux{r \backslash_r c} = \textit{concat} \; (
\map \; \hflataux{\_} (\map \; (\_\backslash_r c) \;(\hflataux{r})))$
\end{center}
holds.
\end{lemma}
\begin{proof}
By an induction on the inductive cases of $\textit{createdByStar}$.
\end{proof}
%This is not entirely true for annotated regular expressions:
%%TODO: bsimp bders \neq bderssimp
%\begin{center}
% $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
%\end{center}
%For bit-codes, the order in which simplification is applied
%might cause a difference in the location they are placed.
%If we want something like
%\begin{center}
% $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
%\end{center}
%Some "canonicalization" procedure is required,
%which either pushes all the common bitcodes to nodes
%as senior as possible:
%\begin{center}
% $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
%\end{center}
%or does the reverse. However bitcodes are not of interest if we are talking about
%the $\llbracket r \rrbracket$ size of a regex.
%Therefore for the ease and simplicity of producing a
%proof for a size bound, we are happy to restrict ourselves to
%unannotated regular expressions, and obtain such equalities as
%TODO: rsimp sflat
% The simplification of a flattened out regular expression, provided it comes
%from the derivative of a star, is the same as the one nested.
Now we introduce an inductive property
for $\starupdate$ and $\hflataux{\_}$.
\begin{lemma}\label{starHfauInduct}
If we do derivatives of $r^*$
with a string that starts with $c$,
then flatten it out,
we obtain a list
of the shape $\sum_{s' \in sS} (r\backslash_r s') \cdot r^*$,
where $sS = \starupdates \; s \; r \; [[c]]$. Namely,
\begin{center}
$\hflataux{(( (\rder{c}{r_0})\cdot(r_0^*))\backslash_{rs} s)} =
\map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \;
(\starupdates \; s \; r_0 \; [[c]])$
\end{center}
holds.
\end{lemma}
\begin{proof}
By an induction on $s$, the inductive cases
being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used.
\end{proof}
\noindent
The function $\hflataux{\_}$ has a similar effect as $\textit{flatten}$:
\begin{lemma}\label{hflatauxGrewrites}
$a :: rs \grewrites \hflataux{a} @ rs$
\end{lemma}
\begin{proof}
By induction on $a$. $rs$ is set to take arbitrary values.
\end{proof}
It is also not surprising that
two regular expressions differing only in terms
of the
nesting of parentheses are equivalent w.r.t. $\textit{rsimp}$:
\begin{lemma}\label{cbsHfauRsimpeq1}
$\rsimp{(r_1 + r_2)} = \rsimp{(\RALTS{\hflataux{r_1} @ \hflataux{r_2}})}$
\end{lemma}
\begin{proof}
By using the rewriting relation $\rightsquigarrow$
\end{proof}
And from this we obtain the following fact: a
regular expression created by star
is the same as its flattened version, up to equivalence under $\textit{bsimp}$.
For example,
\begin{lemma}\label{hfauRsimpeq2}
$\textit{createdByStar} \; r \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
\end{lemma}
\begin{proof}
By structural induction on $r$, where the induction rules
are these of $\createdByStar{\_}$.
Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case.
\end{proof}
%Here is a corollary that states the lemma in
%a more intuitive way:
%\begin{corollary}
% $\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot
% (r^*))\; (\starupdates \; c\; r\; [[c]])$
%\end{corollary}
%\noindent
%Note that this is also agnostic of the simplification
%function we defined, and is therefore of more general interest.
Together with the rewriting relation
\begin{lemma}\label{starClosedForm6Hrewrites}
We have the following set of rewriting relations or equalities:
\begin{itemize}
\item
$\textit{rsimp} \; (r^* \backslash_r (c::s))
\sequal
\sum \; ( ( \sum (\lambda s. (r\backslash_r s) \cdot r^*) \; (
\starupdates \; s \; r \; [ c::[]] ) ) )$
\item
$r \backslash_{rsimps} (c::s) = \textit{rsimp} \; ( (
\sum ( (\map \; (\lambda s_1. (r\backslash s_1) \; r^*) \;
(\starupdates \;s \; r \; [ c::[] ])))))$
\item
$\sum ( (\map \; (\lambda s. (r\backslash s) \; r^*) \; Ss))
\sequal
\sum ( (\map \; (\lambda s. \textit{rsimp} \; (r\backslash s) \;
r^*) \; Ss) )$
\item
$\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss
\scfrewrites
\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$
\item
$( ( \sum ( ( \map \ (\lambda s. \;\;
(\textit{rsimp} \; (r \backslash_r s)) \cdot r^*) \; (\starupdates \;
s \; r \; [ c::[] ])))))$\\
$\sequal$\\
$( ( \sum ( ( \map \ (\lambda s. \;\;
( r \backslash_{rsimps} s)) \cdot r^*) \; (\starupdates \;
s \; r \; [ c::[] ]))))$\\
\end{itemize}
\end{lemma}
\begin{proof}
Part 1 leads to part 2.
The rest of them are routine.
\end{proof}
\noindent
Next the closed form for star regular expressions can be derived:
\begin{theorem}\label{starClosedForm}
$\rderssimp{r^*}{c::s} =
\rsimp{
(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \;
(\starupdates \; s\; r \; [[c]])
)
)
}
$
\end{theorem}
\begin{proof}
By an induction on $s$.
The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, \ref{starClosedForm6Hrewrites}
and \ref{hfauRsimpeq2}
are used.
In \ref{starClosedForm6Hrewrites}, the equalities are
used to link the LHS and RHS.
\end{proof}
%----------------------------------------------------------------------------------------
% SECTION ??
%----------------------------------------------------------------------------------------
%-----------------------------------
% SECTION syntactic equivalence under simp
%-----------------------------------
%----------------------------------------------------------------------------------------
% SECTION ALTS CLOSED FORM
%----------------------------------------------------------------------------------------
%\section{A Closed Form for \textit{ALTS}}
%Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
%
%
%There are a few key steps, one of these steps is
%
%
%
%One might want to prove this by something a simple statement like:
%
%For this to hold we want the $\textit{distinct}$ function to pick up
%the elements before and after derivatives correctly:
%$r \in rset \equiv (rder x r) \in (rder x rset)$.
%which essentially requires that the function $\backslash$ is an injective mapping.
%
%Unfortunately the function $\backslash c$ is not an injective mapping.
%
%\subsection{function $\backslash c$ is not injective (1-to-1)}
%\begin{center}
% The derivative $w.r.t$ character $c$ is not one-to-one.
% Formally,
% $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
%\end{center}
%This property is trivially true for the
%character regex example:
%\begin{center}
% $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
%\end{center}
%But apart from the cases where the derivative
%output is $\ZERO$, are there non-trivial results
%of derivatives which contain strings?
%The answer is yes.
%For example,
%\begin{center}
% Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
% where $a$ is not nullable.\\
% $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
% $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
%\end{center}
%We start with two syntactically different regular expressions,
%and end up with the same derivative result.
%This is not surprising as we have such
%equality as below in the style of Arden's lemma:\\
%\begin{center}
% $L(A^*B) = L(A\cdot A^* \cdot B + B)$
%\end{center}
\section{Bounding Closed Forms}
In this section, we introduce how we formalised the bound
on closed forms.
We first show that in general the number of regular expressions up to a certain
size is finite.
Then we prove that functions such as $\rflts$
will not cause the size of r-regular expressions to grow.
Putting this together with a general bound
on the finiteness of distinct regular expressions
up to a specific size, we obtain a bound on
the closed forms.
\subsection{Finiteness of Distinct Regular Expressions}
We define the set of regular expressions whose size is no more than
a certain size $N$ as $\textit{sizeNregex} \; N$:
\[
\textit{sizeNregex} \; N \dn \{r\; \mid \; \llbracket r \rrbracket_r \leq N \}
\]
We have that $\textit{sizeNregex} \; N$ is always a finite set:
\begin{lemma}\label{finiteSizeN}
$\textit{finite} \; (\textit{sizeNregex} \; N)$ holds.
\end{lemma}
\begin{proof}
By splitting the set $\textit{sizeNregex} \; (N + 1)$ into
subsets by their categories:
$\{\ZERO_r, \ONE_r, c\}$, $\{r^* \mid r \in \textit{sizeNregex} \; N\}$,
and so on. Each of these subsets is finitely bounded.
\end{proof}
\noindent
From this we get a corollary that
if forall $r \in rs$, $\rsize{r} \leq N$, then the output of
$\rdistinct{rs}{\varnothing}$ is a list of regular
expressions of finite size depending on $N$ only.
\begin{corollary}\label{finiteSizeNCorollary}
$\rsize{\rdistinct{rs}{\varnothing}} \leq c_N * N$ holds,
where the constant $c_N$ is equal to $\textit{card} \; (\textit{sizeNregex} \;
N)$.
\end{corollary}
\begin{proof}
For all $r$ in
$\textit{set} \; (\rdistinct{rs}{\varnothing})$,
it is always the case that $\rsize{r} \leq N$.
In addition, the list length is bounded by
$c_N$, yielding the desired bound.
\end{proof}
\noindent
This fact will be handy in estimating the closed form sizes.
%We have proven that the size of the
%output of $\textit{rdistinct} \; rs' \; \varnothing$
%is bounded by a constant $N * c_N$ depending only on $N$,
%provided that each of $rs'$'s element
%is bounded by $N$.
\subsection{$\textit{rsimp}$ Does Not Increase the Size}
Although it seems evident, we need a series
of non-trivial lemmas to establish that functions such as $\rflts$
do not cause the regular expressions to grow.
\begin{lemma}\label{rsimpMonoLemmas}
\mbox{}
\begin{itemize}
\item
\[
\llbracket \rsimpalts \; rs \rrbracket_r \leq
\llbracket \sum \; rs \rrbracket_r
\]
\item
\[
\llbracket \rsimpseq \; r_1 \; r_2 \rrbracket_r \leq
\llbracket r_1 \cdot r_2 \rrbracket_r
\]
\item
\[
\llbracket \rflts \; rs \rrbracket_r \leq
\llbracket rs \rrbracket_r
\]
\item
\[
\llbracket \rDistinct \; rs \; ss \rrbracket_r \leq
\llbracket rs \rrbracket_r
\]
\item
If all elements $a$ in the set $as$ satisfy the property
that $\llbracket \textit{rsimp} \; a \rrbracket_r \leq
\llbracket a \rrbracket_r$, then we have
\[
\llbracket \; \rsimpalts \; (\textit{rdistinct} \;
(\textit{rflts} \; (\textit{map}\;\textit{rsimp} as)) \{\})
\rrbracket \leq
\llbracket \; \sum \; (\rDistinct \; (\rflts \;(\map \;
\textit{rsimp} \; x))\; \{ \} ) \rrbracket_r
\]
\end{itemize}
\end{lemma}
\begin{proof}
Points 1, 3, and 4 can be proven by an induction on $rs$.
Point 2 is by case analysis on $r_1$ and $r_2$.
The last part is a corollary of the previous ones.
\end{proof}
\noindent
With the lemmas for each inductive case in place, we are ready to get
the non-increasing property as a corollary:
\begin{corollary}\label{rsimpMono}
$\llbracket \textit{rsimp} \; r \rrbracket_r \leq \llbracket r \rrbracket_r$
\end{corollary}
\begin{proof}
By \ref{rsimpMonoLemmas}.
\end{proof}
\subsection{Estimating the Closed Forms' sizes}
We recap the closed forms we obtained
earlier:
\begin{itemize}
\item
$\rderssimp{(\sum rs)}{s} \sequal
\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
\item
$\rderssimp{(r_1 \cdot r_2)}{s} \sequal \sum ((r_1 \backslash s) \cdot r_2 )
:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1}))$
\item
$\rderssimp{r^*}{c::s} =
\rsimp{
(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \;
(\starupdates \; s\; r \; [[c]])
)
)
}
$
\end{itemize}
\noindent
The closed forms on the left-hand-side
are all of the same shape: $\rsimp{ (\sum rs)} $.
Such regular expression will be bounded by the size of $\sum rs'$,
where every element in $rs'$ is distinct, and each element
can be described by some inductive sub-structures
(for example when $r = r_1 \cdot r_2$ then $rs'$
will be solely comprised of $r_1 \backslash s'$
and $r_2 \backslash s''$, $s'$ and $s''$ being
sub-strings of $s$).
which will each have a size upper bound
according to the inductive hypothesis, which controls $r \backslash s$.
We elaborate the above reasoning by a series of lemmas
below, where straightforward proofs are omitted.
%We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.
We show that $\textit{rdistinct}$ and $\rflts$
working together is at least as
good as $\textit{rdistinct}$ alone, which can be written as
\begin{center}
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r
\leq
\llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $.
\end{center}
We need this so that we know the outcome of our real
simplification is better than or equal to a rough estimate,
and therefore can be bounded by that estimate.
This is a bit harder to establish compared to proving
$\textit{flts}$ does not make a list larger (which can
be proven using routine induction):
\begin{center}
$\llbracket \textit{rflts}\; rs \rrbracket_r \leq
\llbracket \textit{rs} \rrbracket_r$
\end{center}
We cannot simply prove how each helper function
reduces the size and then put them together:
From
\begin{center}
$\llbracket \textit{rflts}\; rs \rrbracket_r \leq
\llbracket \textit{rs} \rrbracket_r$
\end{center}
and
\begin{center}
$\llbracket \textit{rdistinct} \; rs \; \varnothing \leq
\llbracket rs \rrbracket_r$
\end{center}
one cannot infer
\begin{center}
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r
\leq
\llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $.
\end{center}
What we can infer is that
\begin{center}
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r
\leq
\llbracket rs \rrbracket_r$
\end{center}
but this estimate is too rough and $\llbracket rs \rrbracket_r$ is unbounded.
The way we
get around this is by first proving a more general lemma
(so that the inductive case goes through):
\begin{lemma}\label{fltsSizeReductionAlts}
If we have three accumulator sets:
$noalts\_set$, $alts\_set$ and $corr\_set$,
satisfying:
\begin{itemize}
\item
$\forall r \in noalts\_set. \; \nexists xs.\; r = \sum xs$
\item
$\forall r \in alts\_set. \; \exists xs. \; r = \sum xs
\; \textit{and} \; set \; xs \subseteq corr\_set$
\end{itemize}
then we have that
\begin{center}
\begin{tabular}{lcl}
$\llbracket (\textit{rdistinct} \; (\textit{rflts} \; as) \;
(noalts\_set \cup corr\_set)) \rrbracket_r$ & $\leq$ &\\
$\llbracket (\textit{rdistinct} \; as \; (noalts\_set \cup alts\_set \cup
\{ \ZERO_r \} )) \rrbracket_r$ & & \\
\end{tabular}
\end{center}
holds.
\end{lemma}
\noindent
We split the accumulator into two parts: the part
which contains alternative regular expressions ($alts\_set$), and
the part without any of them($noalts\_set$).
This is because $\rflts$ opens up the alternatives in $as$,
causing the accumulators on both sides of the inequality
to diverge slightly.
If we want to compare the accumulators that are not
perfectly in sync, we need to consider the alternatives and non-alternatives
separately.
The set $corr\_set$ is the corresponding set
of $alts\_set$ with all elements under the alternative constructor
spilled out.
\begin{proof}
By induction on the list $as$. We make use of lemma \ref{rdistinctConcat}.
\end{proof}
By setting all three sets to the empty set, one gets the desired size estimate:
\begin{corollary}\label{interactionFltsDB}
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r
\leq
\llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $.
\end{corollary}
\begin{proof}
By using the lemma \ref{fltsSizeReductionAlts}.
\end{proof}
\noindent
The intuition for why this is true
is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of
duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$.
Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$:
\begin{lemma}\label{altsSimpControl}
$\rsize{\rsimp{\sum rs}} \leq \rsize{\rdistinct{rs}{\varnothing}}+ 1$
\end{lemma}
\begin{proof}
By using corollary \ref{interactionFltsDB}.
\end{proof}
\noindent
This is a key lemma in establishing the bounds of all the
closed forms.
With this we are now ready to control the sizes of
$(r_1 \cdot r_2 )\backslash s$ and $r^* \backslash s$.
\begin{theorem}\label{rBound}
For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$
\end{theorem}
\noindent
\begin{proof}
We prove this by induction on $r$. The base cases for $\RZERO$,
$\RONE $ and $\RCHAR{c}$ are straightforward.
In the sequence $r_1 \cdot r_2$ case,
the inductive hypotheses state
$\exists N_1. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N_1$ and
$\exists N_2. \forall s. \; \llbracket \rderssimp{r_2}{s} \rrbracket \leq N_2$.
When the string $s$ is not empty, we can reason as follows
%
\begin{center}
\begin{tabular}{lcll}
& & $ \llbracket \rderssimp{r_1\cdot r_2 }{s} \rrbracket_r $\\
& $ = $ & $\llbracket \rsimp{(\sum(r_1 \backslash_{rsimps} s \cdot r_2 \; \; :: \; \;
\map \; (r_2\backslash_{rsimps} \_)\; (\vsuf{s}{r})))} \rrbracket_r $ & (1) \\
& $\leq$ & $\llbracket \rdistinct{(r_1 \backslash_{rsimps} s \cdot r_2 \; \; :: \; \;
\map \; (r_2\backslash_{rsimps} \_)\; (\vsuf{s}{r}))}{\varnothing} \rrbracket_r + 1$ & (2) \\
& $\leq$ & $2 + N_1 + \rsize{r_2} + (N_2 * (card\;(\sizeNregex \; N_2)))$ & (3)\\
\end{tabular}
\end{center}
\noindent
(1) is by theorem \ref{seqClosedForm}.
(2) is by \ref{altsSimpControl}.
(3) is by \ref{finiteSizeNCorollary}.
Combining the cases when $s = []$ and $s \neq []$, we get (4):
\begin{center}
\begin{tabular}{lcll}
$\rsize{(r_1 \cdot r_2) \backslash_r s}$ & $\leq$ &
$max \; (2 + N_1 +
\llbracket r_2 \rrbracket_r +
N_2 * (card\; (\sizeNregex \; N_2))) \; \rsize{r_1\cdot r_2}$ & (4)
\end{tabular}
\end{center}
We reason similarly for $\STAR$.
The inductive hypothesis is
$\exists N. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N$.
Let $n_r = \llbracket r^* \rrbracket_r$.
When $s = c :: cs$ is not empty,
\begin{center}
\begin{tabular}{lcll}
& & $ \llbracket \rderssimp{r^* }{c::cs} \rrbracket_r $\\
& $ = $ & $\llbracket \rsimp{(\sum (\map \; (\lambda s. (r \backslash_{rsimps} s) \cdot r^*) \; (\starupdates\;
cs \; r \; [[c]] )) )} \rrbracket_r $ & (5) \\
& $\leq$ & $\llbracket
\rdistinct{
(\map \;
(\lambda s. (r \backslash_{rsimps} s) \cdot r^*) \;
(\starupdates\; cs \; r \; [[c]] )
)}
{\varnothing} \rrbracket_r + 1$ & (6) \\
& $\leq$ & $1 + (\textit{card} (\sizeNregex \; (N + n_r)))
* (1 + (N + n_r)) $ & (7)\\
\end{tabular}
\end{center}
\noindent
(5) is by theorem \ref{starClosedForm}.
(6) is by \ref{altsSimpControl}.
(7) is by corollary \ref{finiteSizeNCorollary}.
Combining with the case when $s = []$, one obtains
\begin{center}
\begin{tabular}{lcll}
$\rsize{r^* \backslash_r s}$ & $\leq$ & $max \; n_r \; 1 + (\textit{card} (\sizeNregex \; (N + n_r)))
* (1 + (N + n_r)) $ & (8)\\
\end{tabular}
\end{center}
\noindent
The alternative case is slightly less involved.
The inductive hypothesis
is equivalent to $\exists N. \forall r \in (\map \; (\_ \backslash_r s) \; rs). \rsize{r} \leq N$.
In the case when $s = c::cs$, we have
\begin{center}
\begin{tabular}{lcll}
& & $ \llbracket \rderssimp{\sum rs }{c::cs} \rrbracket_r $\\
& $ = $ & $\llbracket \rsimp{(\sum (\map \; (\_ \backslash_{rsimps} s) \; rs) )} \rrbracket_r $ & (9) \\
& $\leq$ & $\llbracket (\sum (\map \; (\_ \backslash_{rsimps} s) \; rs) ) \rrbracket_r $ & (10) \\
& $\leq$ & $1 + N * (length \; rs) $ & (11)\\
\end{tabular}
\end{center}
\noindent
(9) is by theorem \ref{altsClosedForm}, (10) by lemma \ref{rsimpMono} and (11) by inductive hypothesis.
Combining with the case when $s = []$, we obtain
\begin{center}
\begin{tabular}{lcll}
$\rsize{\sum rs \backslash_r s}$ & $\leq$ & $max \; \rsize{\sum rs} \; 1+N*(length \; rs)$
& (12)\\
\end{tabular}
\end{center}
We have all the inductive cases proven.
\end{proof}
This leads to our main result on the size bound:
\begin{corollary}
For any annotated regular expression $a$, $\exists N_r. \forall s. \; \rsize{\bderssimp{a}{s}} \leq N_r$
\end{corollary}
\begin{proof}
By lemma \ref{sizeRelations} and theorem \ref{rBound}.
\end{proof}
\noindent
%-----------------------------------
% SECTION 2
%-----------------------------------
\section{Bounded Repetitions}
We have promised in chapter \ref{Introduction}
that our lexing algorithm can potentially be extended
to handle bounded repetitions
in natural and elegant ways.
Now we fulfill our promise by adding support for
the ``exactly-$n$-times'' bounded regular expression $r^{\{n\}}$.
We add clauses in our derivatives-based lexing algorithms (with simplifications)
introduced in chapter \ref{Bitcoded2}.
\subsection{Augmented Definitions}
There are a number of definitions that need to be augmented.
The most notable one would be the POSIX rules for $r^{\{n\}}$:
\begin{center}
\begin{mathpar}
\inferrule{\forall v \in vs_1. \vdash v:r \land
|v| \neq []\\ \forall v \in vs_2. \vdash v:r \land |v| = []\\
\textit{length} \; (vs_1 @ vs_2) = n}{\textit{Stars} \;
(vs_1 @ vs_2) : r^{\{n\}} }
\end{mathpar}
\end{center}
As Ausaf had pointed out \cite{Ausaf},
sometimes empty iterations have to be taken to get
a match with exactly $n$ repetitions,
and hence the $vs_2$ part.
Another important definition would be the size:
\begin{center}
\begin{tabular}{lcl}
$\llbracket r^{\{n\}} \rrbracket_r$ & $\dn$ &
$\llbracket r \rrbracket_r + n$\\
\end{tabular}
\end{center}
\noindent
Arguably we should use $\log \; n$ for the size because
the number of digits increases logarithmically w.r.t $n$.
For simplicity we choose to add the counter directly to the size.
The derivative w.r.t a bounded regular expression
is given as
\begin{center}
\begin{tabular}{lcl}
$r^{\{n\}} \backslash_r c$ & $\dn$ &
$r\backslash_r c \cdot r^{\{n-1\}} \;\; \textit{if} \; n \geq 1$\\
& & $\RZERO \;\quad \quad\quad \quad
\textit{otherwise}$\\
\end{tabular}
\end{center}
\noindent
For brevity, we sometimes use NTIMES to refer to bounded
regular expressions.
The $\mkeps$ function clause for NTIMES would be
\begin{center}
\begin{tabular}{lcl}
$\mkeps \; r^{\{n\}} $ & $\dn$ & $\Stars \;
(\textit{replicate} \; n\; (\mkeps \; r))$
\end{tabular}
\end{center}
\noindent
The injection looks like
\begin{center}
\begin{tabular}{lcl}
$\inj \; r^{\{n\}} \; c\; (\Seq \;v \; (\Stars \; vs)) $ &
$\dn$ & $\Stars \;
((\inj \; r \;c \;v ) :: vs)$
\end{tabular}
\end{center}
\noindent
\subsection{Proofs for the Augmented Lexing Algorithm}
We need to maintain two proofs with the additional $r^{\{n\}}$
construct: the
correctness proof in chapter \ref{Bitcoded2},
and the finiteness proof in chapter \ref{Finite}.
\subsubsection{Correctness Proof Augmentation}
The correctness of $\textit{lexer}$ and $\textit{blexer}$ with bounded repetitions
have been proven by Ausaf and Urban\cite{AusafDyckhoffUrban2016}.
As they have commented, once the definitions are in place,
the proofs given for the basic regular expressions will extend to
bounded regular expressions, and there are no ``surprises''.
We confirm this point because the correctness theorem would also
extend without surprise to $\blexersimp$.
The rewrite rules such as $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$ and so on
do not need to be changed,
and only a few lemmas such as lemma \ref{fltsPreserves} need to be adjusted to
add one more line which can be solved by the Sledgehammer tool
to solve the $r^{\{n\}}$ inductive case.
\subsubsection{Finiteness Proof Augmentation}
The bounded repetitions are
very similar to stars, and therefore the treatment
is similar, with minor changes to handle some slight complications
when the counter reaches 0.
The exponential growth is similar:
\begin{center}
\begin{tabular}{ll}
$r^{\{n\}} $ & $\longrightarrow_{\backslash c}$\\
$(r\backslash c) \cdot
r^{\{n - 1\}}*$ & $\longrightarrow_{\backslash c'}$\\
\\
$r \backslash cc' \cdot r^{\{n - 2\}}* +
r \backslash c' \cdot r^{\{n - 1\}}*$ &
$\longrightarrow_{\backslash c''}$\\
\\
$(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* +
r \backslash c''\cdot r^{\{n-1\}}) +
(r \backslash c'c'' \cdot r^{\{n-2\}}* +
r \backslash c'' \cdot r^{\{n-1\}}*)$ &
$\longrightarrow_{\backslash c'''}$ \\
\\
$\ldots$\\
\end{tabular}
\end{center}
Again, we assume that $r\backslash c$, $r \backslash cc'$ and so on
are all nullable.
The flattened list of terms for $r^{\{n\}} \backslash_{rs} s$
\begin{center}
$[r_1 \backslash cc'c'' \cdot r^{\{n-3\}}*,\;
r \backslash c''\cdot r^{\{n-1\}}, \;
r \backslash c'c'' \cdot r^{\{n-2\}}*, \;
r \backslash c'' \cdot r^{\{n-1\}}*,\; \ldots ]$
\end{center}
that comes from
\begin{center}
$(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* +
r \backslash c''\cdot r^{\{n-1\}}) +
(r \backslash c'c'' \cdot r^{\{n-2\}}* +
r \backslash c'' \cdot r^{\{n-1\}}*)+ \ldots$
\end{center}
are made of sequences with different tails, where the counters
might differ.
The observation for maintaining the bound is that
these counters never exceed $n$, the original
counter. With the number of counters staying finite,
$\rDistinct$ will deduplicate and keep the list finite.
We introduce this idea as a lemma once we describe all
the necessary helper functions.
Similar to the star case, we want
\begin{center}
$\rderssimp{r^{\{n\}}}{s} = \rsimp{\sum rs}$.
\end{center}
where $rs$
shall be in the form of
$\map \; f \; Ss$, where $f$ is a function and
$Ss$ a list of objects to act on.
For star, the object's datatype is string.
The list of strings $Ss$
is generated using functions
$\starupdate$ and $\starupdates$.
The function that takes a string and returns a regular expression
is the anonymous function $
(\lambda s'. \; r\backslash s' \cdot r^{\{m\}})$.
In the NTIMES setting,
the $\starupdate$ and $\starupdates$ functions are replaced by
$\textit{nupdate}$ and $\textit{nupdates}$:
\begin{center}
\begin{tabular}{lcl}
$\nupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
$\nupdate \; c \; r \;
(\Some \; (s, \; n + 1) \; :: \; Ss)$ & $\dn$ & %\\
$\textit{if} \;
(\rnullable \; (r \backslash_{rs} s))$ \\
& & $\;\;\textit{then}
\;\; \Some \; (s @ [c], n + 1) :: \Some \; ([c], n) :: (
\nupdate \; c \; r \; Ss)$ \\
& & $\textit{else} \;\; \Some \; (s @ [c], n+1) :: (
\nupdate \; c \; r \; Ss)$\\
$\nupdate \; c \; r \; (\textit{None} :: Ss)$ & $\dn$ &
$(\None :: \nupdate \; c \; r \; Ss)$\\
& & \\
%\end{tabular}
%\end{center}
%\begin{center}
%\begin{tabular}{lcl}
$\nupdates \; [] \; r \; Ss$ & $\dn$ & $Ss$\\
$\nupdates \; (c :: cs) \; r \; Ss$ & $\dn$ & $\nupdates \; cs \; r \; (
\nupdate \; c \; r \; Ss)$
\end{tabular}
\end{center}
\noindent
which take into account when a subterm
\begin{center}
$r \backslash_s s \cdot r^{\{n\}}$
\end{center}
counter $n$
is 0, and therefore expands to
\begin{center}
$r \backslash_s (s@[c]) \cdot r^{\{n\}} \;+
\; \ZERO$
\end{center}
after taking a derivative.
The object now has type
\begin{center}
$\textit{option} \;(\textit{string}, \textit{nat})$
\end{center}
and therefore the function for converting such an option into
a regular expression term is called $\opterm$:
\begin{center}
\begin{tabular}{lcl}
$\opterm \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
& & $\;\;\Some \; (s, n) \Rightarrow
(r\backslash_{rs} s)\cdot r^{\{n\}}$\\
& & $\;\;\None \Rightarrow
\ZERO$\\
\end{tabular}
\end{center}
\noindent
Put together, the list $\map \; f \; Ss$ is instantiated as
\begin{center}
$\map \; (\opterm \; r) \; (\nupdates \; s \; r \;
[\Some \; ([c], n)])$.
\end{center}
For the closed form to be bounded, we would like
simplification to be applied to each term in the list.
Therefore we introduce some variants of $\opterm$,
which help conveniently express the rewriting steps
needed in the closed form proof.
We have $\optermOsimp$, $\optermosimp$ and $\optermsimp$
with slightly different spellings because they help the proof to go through:
\begin{center}
\begin{tabular}{lcl}
$\optermOsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
& & $\;\;\Some \; (s, n) \Rightarrow
\textit{rsimp} \; ((r\backslash_{rs} s)\cdot r^{\{n\}})$\\
& & $\;\;\None \Rightarrow
\ZERO$\\
\\
$\optermosimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
& & $\;\;\Some \; (s, n) \Rightarrow
(\textit{rsimp} \; (r\backslash_{rs} s))
\cdot r^{\{n\}}$\\
& & $\;\;\None \Rightarrow
\ZERO$\\
\\
$\optermsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
& & $\;\;\Some \; (s, n) \Rightarrow
(r\backslash_{rsimps} s)\cdot r^{\{n\}}$\\
& & $\;\;\None \Rightarrow
\ZERO$\\
\end{tabular}
\end{center}
For a list of
$\textit{option} \;(\textit{string}, \textit{nat})$ elements,
we define the highest power for it recursively:
\begin{center}
\begin{tabular}{lcl}
$\hpa \; [] \; n $ & $\dn$ & $n$\\
$\hpa \; (\None :: os) \; n $ & $\dn$ & $\hpa \; os \; n$\\
$\hpa \; (\Some \; (s, n) :: os) \; m$ & $\dn$ &
$\hpa \;os \; (\textit{max} \; n\; m)$\\
\\
$\hpower \; rs $ & $\dn$ & $\hpa \; rs \; 0$\\
\end{tabular}
\end{center}
Now the intuition that an NTIMES regular expression's power
does not increase can be easily expressed as
\begin{lemma}\label{nupdatesMono2}
$\hpower \; (\nupdates \;s \; r \; [\Some \; ([c], n)]) \leq n$
\end{lemma}
\begin{proof}
Note that the power is non-increasing after a $\nupdate$ application:
\begin{center}
$\hpa \;\; (\nupdate \; c \; r \; Ss)\;\; m \leq
\hpa\; \; Ss \; m$.
\end{center}
This is also the case for $\nupdates$:
\begin{center}
$\hpa \;\; (\nupdates \; s \; r \; Ss)\;\; m \leq
\hpa\; \; Ss \; m$.
\end{center}
Therefore we have that
\begin{center}
$\hpower \;\; (\nupdates \; s \; r \; Ss) \leq
\hpower \;\; Ss$
\end{center}
which leads to the lemma being proven.
\end{proof}
We also define the inductive rules for
the shape of derivatives of the NTIMES regular expressions:\\[-3em]
\begin{center}
\begin{mathpar}
\inferrule{\mbox{}}{\cbn \;\ZERO}
\inferrule{\mbox{}}{\cbn \; \; r_a \cdot (r^{\{n\}})}
\inferrule{\cbn \; r_1 \;\; \; \cbn \; r_2}{\cbn \; r_1 + r_2}
\inferrule{\cbn \; r}{\cbn \; r + \ZERO}
\end{mathpar}
\end{center}
\noindent
A derivative of NTIMES fits into the shape described by $\cbn$:
\begin{lemma}\label{ntimesDersCbn}
$\cbn \; ((r' \cdot r^{\{n\}}) \backslash_{rs} s)$ holds.
\end{lemma}
\begin{proof}
By a reverse induction on $s$.
For the inductive case, note that if $\cbn \; r$ holds,
then $\cbn \; (r\backslash_r c)$ holds.
\end{proof}
\noindent
In addition, for $\cbn$-shaped regular expressions, one can flatten
them:
\begin{lemma}\label{ntimesHfauPushin}
If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} =
\textit{concat} \; (\map \; \hflataux{\map \; (\_\backslash_r c) \;
(\hflataux{r})})$
\end{lemma}
\begin{proof}
By an induction on the inductive cases of $\cbn$.
\end{proof}
\noindent
This time we do not need to define the flattening functions for NTIMES only,
because $\hflat{\_}$ and $\hflataux{\_}$ work on NTIMES already.
\begin{lemma}\label{ntimesHfauInduct}
$\hflataux{( (r\backslash_r c) \cdot r^{\{n\}}) \backslash_{rsimps} s} =
\map \; (\opterm \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$
\end{lemma}
\begin{proof}
By a reverse induction on $s$.
The lemmas \ref{ntimesHfauPushin} and \ref{ntimesDersCbn} are used.
\end{proof}
\noindent
We have a recursive property for NTIMES with $\nupdate$
similar to that for STAR,
and one for $\nupdates $ as well:
\begin{lemma}\label{nupdateInduct1}
\mbox{}
\begin{itemize}
\item
\begin{center}
$\textit{concat} \; (\map \; (\hflataux{\_} \circ (
\opterm \; r)) \; Ss) = \map \; (\opterm \; r) \; (\nupdate \;
c \; r \; Ss)$\\
\end{center}
holds.
\item
\begin{center}
$\textit{concat} \; (\map \; \hflataux{\_}\;
\map \; (\_\backslash_r x) \;
(\map \; (\opterm \; r) \; (\nupdates \; xs \; r \; Ss)))$\\
$=$\\
$\map \; (\opterm \; r) \; (\nupdates \;(xs@[x]) \; r\;Ss)$
\end{center}
holds.
\end{itemize}
\end{lemma}
\begin{proof}
(i) is by an induction on $Ss$.
(ii) is by an induction on $xs$.
\end{proof}
\noindent
The $\nString$ predicate is defined for conveniently
expressing that there are no empty strings in the
$\Some \;(s, n)$ elements generated by $\nupdate$:
\begin{center}
\begin{tabular}{lcl}
$\nString \; \None$ & $\dn$ & $ \textit{true}$\\
$\nString \; (\Some \; ([], n))$ & $\dn$ & $ \textit{false}$\\
$\nString \; (\Some \; (c::s, n))$ & $\dn$ & $ \textit{true}$\\
\end{tabular}
\end{center}
\begin{lemma}\label{nupdatesNonempty}
If for all elements $o \in \textit{set} \; Ss$,
$\nString \; o$ holds, then we have that
for all elements $o' \in \textit{set} \; (\nupdates \; s \; r \; Ss)$,
$\nString \; o'$ holds.
\end{lemma}
\begin{proof}
By an induction on $s$, where $Ss$ is set to vary over all possible values.
\end{proof}
\noindent
\begin{lemma}\label{ntimesClosedFormsSteps}
The following list of equalities or rewriting relations hold:\\
(i) $r^{\{n+1\}} \backslash_{rsimps} (c::s) =
\textit{rsimp} \; (\sum (\map \; (\opterm \;r \;\_) \; (\nupdates \;
s \; r \; [\Some \; ([c], n)])))$\\
(ii)
\begin{center}
$\sum (\map \; (\opterm \; r) \; (\nupdates \; s \; r \; [
\Some \; ([c], n)]))$ \\ $ \sequal$\\
$\sum (\map \; (\textit{rsimp} \circ (\opterm \; r))\; (\nupdates \;
s\;r \; [\Some \; ([c], n)]))$\\
\end{center}
(iii)
\begin{center}
$\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \;
([c], n)]))$\\
$\sequal$\\
$\sum \;(\map \; (\optermsimp r) \; (\nupdates \; s \; r \; [\Some \;
([c], n)])) $\\
\end{center}
(iv)
\begin{center}
$\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \;
([c], n)])) $ \\ $\sequal$\\
$\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \;
([c], n)])) $\\
\end{center}
(v)
\begin{center}
$\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \;
([c], n)])) $ \\ $\sequal$\\
$\sum \; (\map \; (\textit{rsimp} \circ (\opterm \; r)) \;
(\nupdates \; s \; r \; [\Some \; ([c], n)]))$
\end{center}
\end{lemma}
\begin{proof}
Routine.
(iii) and (iv) make use of the fact that all the strings $s$
inside $\Some \; (s, m)$ which are elements of the list
$\nupdates \; s\;r\;[\Some\; ([c], n)]$ are non-empty,
which is from lemma \ref{nupdatesNonempty}.
Once the string in $o = \Some \; (s, n)$ is
nonempty, $\optermsimp \; r \;o$,
$\optermosimp \; r \; o$ and $\optermosimp \; \; o$ are guaranteed
to be equal.
(v) uses \ref{nupdateInduct1}.
\end{proof}
\noindent
Now we are ready to present the closed form for NTIMES:
\begin{theorem}\label{ntimesClosedForm}
The derivative of $r^{\{n+1\}}$ can be described as an alternative
containing a list
of terms:\\
$r^{\{n+1\}} \backslash_{rsimps} (c::s) = \textit{rsimp} \; (
\sum (\map \; (\optermsimp \; r) \; (\nupdates \; s \; r \;
[\Some \; ([c], n)])))$
\end{theorem}
\begin{proof}
By the rewriting steps described in lemma \ref{ntimesClosedFormsSteps}.
\end{proof}
\noindent
The key observation for bounding this closed form
is that the counter on $r^{\{n\}}$ will
only decrement during derivatives:
\begin{lemma}\label{nupdatesNLeqN}
For an element $o$ in $\textit{set} \; (\nupdates \; s \; r \;
[\Some \; ([c], n)])$, either $o = \None$, or $o = \Some
\; (s', m)$ for some string $s'$ and number $m \leq n$.
\end{lemma}
\noindent
The proof is routine and therefore omitted.
This allows us to say what kind of terms
are in the list $\textit{set} \; (\map \; (\optermsimp \; r) \; (
\nupdates \; s \; r \; [\Some \; ([c], n)]))$:
only $\ZERO_r$s or a sequence with the tail an $r^{\{m\}}$
with a small $m$:
\begin{lemma}\label{ntimesClosedFormListElemShape}
For any element $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; (
\nupdates \; s \; r \; [\Some \; ([c], n)]))$,
we have that $r'$ is either $\ZERO$ or $r \backslash_{rsimps} s' \cdot
r^{\{m\}}$ for some string $s'$ and number $m \leq n$.
\end{lemma}
\begin{proof}
Using lemma \ref{nupdatesNLeqN}.
\end{proof}
\begin{theorem}\label{ntimesClosedFormBounded}
Assuming that for any string $s$, $\llbracket r \backslash_{rsimps} s
\rrbracket_r \leq N$ holds, then we have that\\
$\llbracket r^{\{n+1\}} \backslash_{rsimps} s \rrbracket_r \leq
\textit{max} \; (c_N+1)* (N + \llbracket r^{\{n\}} \rrbracket+1)$,
where $c_N = \textit{card} \; (\textit{sizeNregex} \; (
N + \llbracket r^{\{n\}} \rrbracket_r+1))$.
\end{theorem}
\begin{proof}
We have that for all regular expressions $r'$ in
\begin{center}
$\textit{set} \; (\map \; (\optermsimp \; r) \; (
\nupdates \; s \; r \; [\Some \; ([c], n)]))$,
\end{center}
$r'$'s size is less than or equal to $N + \llbracket r^{\{n\}}
\rrbracket_r + 1$
because $r'$ can only be either a $\ZERO$ or $r \backslash_{rsimps} s' \cdot
r^{\{m\}}$ for some string $s'$ and number
$m \leq n$ (lemma \ref{ntimesClosedFormListElemShape}).
In addition, we know that the list
$\map \; (\optermsimp \; r) \; (
\nupdates \; s \; r \; [\Some \; ([c], n)])$'s size is at most
$c_N = \textit{card} \;
(\sizeNregex \; ((N + \llbracket r^{\{n\}} \rrbracket) + 1))$.
This gives us $\llbracket r \backslash_{rsimps} \;s \rrbracket_r
\leq N * c_N$.
\end{proof}
We aim to formalise the correctness and size bound
for constructs like $r^{\{\ldots n\}}$, $r^{\{n \ldots\}}$
and so on, which is still work in progress.
They should more or less follow the same recipe described in this section.
Once we know how to deal with them recursively using suitable auxiliary
definitions, we can routinely establish the proofs.
%----------------------------------------------------------------------------------------
% SECTION 3
%----------------------------------------------------------------------------------------
\section{Comments and Future Improvements}
\subsection{Some Experimental Results}
What guarantee does this bound give us?
It states that whatever the regex is, it will not grow indefinitely.
Take our previous example $(a + aa)^*$ as an example:
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
xlabel={number of $a$'s},
x label style={at={(1.05,-0.05)}},
ylabel={regex size},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax= 40,
ytick={0,10,...,40},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={$(a + aa)^*$},
legend pos=south east,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}
We are able to limit the size of the regex $(a + aa)^*$'s derivatives
with our simplification
rules very effectively.
In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound
is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.
Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function
$f(x) = x * 2^x$.
This means the bound we have will surge up at least
tower-exponentially with a linear increase of the depth.
One might be pretty skepticafl about what this non-elementary
bound can bring us.
It turns out that the giant bounds are far from being hit.
Here we have some test data from randomly generated regular expressions:
\begin{figure}[H]
\begin{tabular}{@{}c@{\hspace{2mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
ylabel={regex size},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
%ymax=1000,
%ytick={0,100,...,1000},
scaled ticks=false,
axis lines=left,
width=4.75cm,
height=3.8cm,
legend entries={regex1},
legend pos=north east,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
%ymax=1000,
%ytick={0,100,...,1000},
scaled ticks=false,
axis lines=left,
width=4.75cm,
height=3.8cm,
legend entries={regex2},
legend pos=south east,
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
%ymax=1000,
%ytick={0,100,...,1000},
scaled ticks=false,
axis lines=left,
width=4.75cm,
height=3.8cm,
legend entries={regex3},
legend pos=south east,
legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{}
\end{tabular}
\caption{Graphs: size change of 3 randomly generated
regular expressions $w.r.t.$ input string length.
The x-axis represents the length of the input.}
\end{figure}
\noindent
Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the
original size.
We will discuss improvements to this bound in the next chapter.
\subsection{Possible Further Improvements}
There are two problems with this finiteness result, though:\\
(i)
First, it is not yet a direct formalisation of our lexer's complexity,
as a complexity proof would require looking into
the time it takes to execute {\bf all} the operations
involved in the lexer (simp, collect, decode), not just the derivative.\\
(ii)
Second, the bound is not yet tight, and we seek to improve $N_a$ so that
it is polynomial on $\llbracket a \rrbracket$.\\
Still, we believe this contribution is useful,
because
\begin{itemize}
\item
The size proof can serve as a starting point for a complexity
formalisation.
Derivatives are the most important phases of our lexer algorithm.
Size properties about derivatives cover the majority of the algorithm
and is therefore a good indication of the complexity of the entire program.
\item
The bound is already a strong indication that catastrophic
backtracking is much less likely to occur in our $\blexersimp$
algorithm.
We refine $\blexersimp$ with $\blexerStrong$ in the next chapter
so that we conjecture the bound becomes polynomial.
\end{itemize}
%----------------------------------------------------------------------------------------
% SECTION 4
%----------------------------------------------------------------------------------------
One might wonder about the actual bound rather than the loose bound we gave
for the convenience of a more straightforward proof.
How much can the regex $r^* \backslash s$ grow?
As earlier graphs have shown,
%TODO: reference that graph where size grows quickly
they can grow at a maximum speed
exponential $w.r.t$ the number of characters,
but will eventually level off when the string $s$ is long enough.
If they grow to a size exponential $w.r.t$ the original regex, our algorithm
would still be slow.
And unfortunately, we have concrete examples
where such regular expressions grew exponentially large before levelling off:
\begin{center}
$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
(\underbrace{a \ldots a}_{\text{n a's}})^*)^*$
\end{center}
will already have a maximum
size that is exponential on the number $n$
under our current simplification rules:
%TODO: graph of a regex whose size increases exponentially.
\begin{center}
\begin{tikzpicture}
\begin{axis}[
height=0.5\textwidth,
width=\textwidth,
xlabel=number of a's,
xtick={0,...,9},
ylabel=maximum size,
ymode=log,
log basis y={2}
]
\addplot[mark=*,blue] table {re-chengsong.data};
\end{axis}
\end{tikzpicture}
\end{center}
For convenience we use $(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
(\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.
The exponential size is triggered by that the regex
$\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
inside the $(\ldots) ^*$ having exponentially many
different derivatives, despite those differences being minor.
$(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
will therefore contain the following terms (after flattening out all nested
alternatives):
\begin{center}
$(\sum_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
[1mm]
$(1 \leq m' \leq m )$
\end{center}
There are at least exponentially
many such terms.\footnote{To be exact, these terms are
distinct for $m' \leq L.C.M.(1, \ldots, n)$, the details are omitted,
but the point is that the number is exponential.
}
With each new input character taking the derivative against the intermediate result, more and more such distinct
terms will accumulate.
The function $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms
\begin{center}
$(\sum_{i = 1}^{n}
(\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot
(\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot
(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
$(\sum_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot
(\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot
(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
\end{center}
\noindent
where $m' \neq m''$
as they are slightly different.
This means that with our current simplification methods,
we will not be able to control the derivative so that
$\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial. %\leq O((\llbracket r\rrbacket)^c)$
These terms are similar in the sense that the head of those terms
are all consisted of sub-terms of the form:
$(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $.
For $\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
$n * (n + 1) / 2$ such terms.
For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives
can be described by 6 terms:
$a^*$, $a\cdot (aa)^*$, $ (aa)^*$,
$aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
The total number of different "head terms", $n * (n + 1) / 2$,
is proportional to the number of characters in the regex
$(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
If we can improve our deduplication process so that it becomes smarter
and only keep track of these $n * (n+1) /2$ terms, then we can keep
the size growth polynomial again.
This example also suggests a slightly different notion of size, which we call the
alphabetic width:
\begin{center}
\begin{tabular}{lcl}
$\textit{awidth} \; \ZERO$ & $\dn$ & $0$\\
$\textit{awidth} \; \ONE$ & $\dn$ & $0$\\
$\textit{awidth} \; c$ & $\dn$ & $1$\\
$\textit{awidth} \; r_1 + r_2$ & $\dn$ & $\textit{awidth} \;
r_1 + \textit{awidth} \; r_2$\\
$\textit{awidth} \; r_1 \cdot r_2$ & $\dn$ & $\textit{awidth} \;
r_1 + \textit{awidth} \; r_2$\\
$\textit{awidth} \; r^*$ & $\dn$ & $\textit{awidth} \; r$\\
\end{tabular}
\end{center}
Antimirov\parencite{Antimirov95} has proven that
$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$,
where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms
created by doing derivatives of $r$ against all possible strings.
If we can make sure that at any moment in our lexing algorithm our
intermediate result hold at most one copy of each of the
subterms then we can get the same bound as Antimirov's.
This leads to the algorithm in the next chapter.
%----------------------------------------------------------------------------------------
% SECTION 1
%----------------------------------------------------------------------------------------
%-----------------------------------
% SUBSECTION 1
%-----------------------------------
%\subsection{Syntactic Equivalence Under $\simp$}
%We prove that minor differences can be annihilated
%by $\simp$.
%For example,
%\begin{center}
% $\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) =
% \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
%\end{center}