% Chapter Template
\chapter{Finiteness Bound} % Main chapter title
\label{Finite}
% In Chapter 4 \ref{Chapter4} we give the second guarantee
%of our bitcoded algorithm, that is a finite bound on the size of any
%regex's derivatives.
In this chapter we give a guarantee in terms of time complexity:
given a regular expression $r$, for any string $s$
our algorithm's internal data structure is finitely bounded.
Note that it is not immediately obvious that $\llbracket \bderssimp{r}{s} \rrbracket$ (the internal
data structure used in our $\blexer$)
is bounded by a constant $N_r$, where $N$ only depends on the regular expression
$r$, not the string $s$.
When doing a time complexity analysis of any
lexer/parser based on Brzozowski derivatives, one needs to take into account that
not only the "derivative steps".
%TODO: get a grpah for internal data structure growing arbitrary large
To obtain such a proof, we need to
\begin{itemize}
\item
Define an new datatype for regular expressions that makes it easy
to reason about the size of an annotated regular expression.
\item
A set of equalities for this new datatype that enables one to
rewrite $\bderssimp{r_1 \cdot r_2}{s}$ and $\bderssimp{r^*}{s}$ etc.
by their children regexes $r_1$, $r_2$, and $r$.
\item
Using those equalities to actually get those rewriting equations, which we call
"closed forms".
\item
Bound the closed forms, thereby bounding the original
$\blexersimp$'s internal data structures.
\end{itemize}
\section{the $\mathbf{r}$-rexp datatype and the size functions}
We have a size function for bitcoded regular expressions, written
$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
\begin{center}
\begin{tabular}{ccc}
$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\
$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
\end{tabular}
\end{center}
\noindent
Similarly there is a size function for plain regular expressions:
\begin{center}
\begin{tabular}{ccc}
$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
$\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\
$\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\
$\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\
$\llbracket a^* \rrbracket_p $ & $\dn$ & $\llbracket a \rrbracket_p + 1$
\end{tabular}
\end{center}
\noindent
The idea of obatining a bound for $\llbracket \bderssimp{a}{s} \rrbracket$
is to get an equivalent form
of something like $\llbracket \bderssimp{a}{s}\rrbracket = f(a, s)$, where $f(a, s)$
is easier to estimate than $\llbracket \bderssimp{a}{s}\rrbracket$.
We notice that while it is not so clear how to obtain
a metamorphic representation of $\bderssimp{a}{s}$ (as we argued in chapter \ref{Bitcoded2},
not interleaving the application of the functions $\backslash$ and $\bsimp{\_}$
in the order as our lexer will result in the bit-codes dispensed differently),
it is possible to get an slightly different representation of the unlifted versions:
$ (\bderssimp{a}{s})_\downarrow = (\erase \; \bsimp{a \backslash s})_\downarrow$.
This suggest setting the bounding function $f(a, s)$ as
$\llbracket (a \backslash s)_\downarrow \rrbracket_p$, the plain size
of the erased annotated regular expression.
This requires the the regular expression accompanied by bitcodes
to have the same size as its plain counterpart after erasure:
\begin{center}
$\asize{a} \stackrel{?}{=} \llbracket \erase(a)\rrbracket_p$.
\end{center}
\noindent
But there is a minor nuisance:
the erase function unavoidbly messes with the structure of the regular expression,
due to the discrepancy between annotated regular expression's $\sum$ constructor
and plain regular expression's $+$ constructor having different arity.
\begin{center}
\begin{tabular}{ccc}
$\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\
$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\
$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$
\end{tabular}
\end{center}
\noindent
An alternative regular expression with an empty list of children
is turned into a $\ZERO$ during the
$\erase$ function, thereby changing the size and structure of the regex.
Therefore the equality in question does not hold.
These will likely be fixable if we really want to use plain $\rexp$s for dealing
with size, but we choose a more straightforward (or stupid) method by
defining a new datatype that is similar to plain $\rexp$s but can take
non-binary arguments for its alternative constructor,
which we call $\rrexp$ to denote
the difference between it and plain regular expressions.
\[ \rrexp ::= \RZERO \mid \RONE
\mid \RCHAR{c}
\mid \RSEQ{r_1}{r_2}
\mid \RALTS{rs}
\mid \RSTAR{r}
\]
For $\rrexp$ we throw away the bitcodes on the annotated regular expressions,
but keep everything else intact.
It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
We denote the operation of erasing the bits and turning an annotated regular expression
into an $\rrexp{}$ as $\rerase{}$.
\begin{center}
\begin{tabular}{lcl}
$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$
\end{tabular}
\end{center}
\noindent
$\rrexp$ give the exact correspondence between an annotated regular expression
and its (r-)erased version:
\begin{lemma}
$\rsize{\rerase a} = \asize a$
\end{lemma}
\noindent
This does not hold for plain $\rexp$s.
Similarly we could define the derivative and simplification on
$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1,
except that now they can operate on alternatives taking multiple arguments.
\begin{center}
\begin{tabular}{lcr}
$(\RALTS{rs})\; \backslash c$ & $\dn$ & $\RALTS{\map\; (\_ \backslash c) \;rs}$\\
(other clauses omitted)
With the new $\rrexp$ datatype in place, one can define its size function,
which precisely mirrors that of the annotated regular expressions:
\end{tabular}
\end{center}
\noindent
\begin{center}
\begin{tabular}{ccc}
$\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\
$\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\
$\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\
$\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\
$\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as + 1$\\
$\llbracket _{bs} a^* \rrbracket_r $ & $\dn$ & $\llbracket a \rrbracket_r + 1$
\end{tabular}
\end{center}
\noindent
\subsection{Lexing Related Functions for $\rrexp$}
Everything else for $\rrexp$ will be precisely the same for annotated expressions,
except that they do not involve rectifying and augmenting bit-encoded tokenization information.
As expected, most functions are simpler, such as the derivative:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\
$(\ONE)\,\backslash_r c$ & $\dn$ &
$\textit{if}\;c=d\; \;\textit{then}\;
\ONE\;\textit{else}\;\ZERO$\\
$(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ &
$\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\
$(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ &
$\textit{if}\;\textit{rnullable}\,r_1$\\
& &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\
& &$\phantom{\textit{then},\;\sum\,}((r_2\,\backslash_r c))]$\\
& &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\
$(r^*)\,\backslash_r c$ & $\dn$ &
$( r\,\backslash_r c)\cdot
(_{[]}r^*))$
\end{tabular}
\end{center}
\noindent
The simplification function is simplified without annotation causing superficial differences.
Duplicate removal without an equivalence relation:
\begin{center}
\begin{tabular}{lcl}
$\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\
$\rdistinct{r :: rs}{rset}$ & $\dn$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
& & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
\end{tabular}
\end{center}
%TODO: definition of rsimp (maybe only the alternative clause)
\noindent
The prefix $r$ in front of $\rdistinct{}{}$ is used mainly to
differentiate with $\textit{distinct}$, which is a built-in predicate
in Isabelle that says all the elements of a list are unique.
With $\textit{rdistinct}$ one can chain together all the other modules
of $\bsimp{\_}$ (removing the functionalities related to bit-sequences)
and get $\textit{rsimp}$ and $\rderssimp{\_}{\_}$.
We omit these functions, as they are routine. Please refer to the formalisation
(in file BasicIdentities.thy) for the exact definition.
With $\rrexp$ the size caclulation of annotated regular expressions'
simplification and derivatives can be done by the size of their unlifted
counterpart with the unlifted version of simplification and derivatives applied.
\begin{lemma}
The following equalities hold:
\begin{itemize}
\item
$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
\item
$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$
\end{itemize}
\end{lemma}
\noindent
In the following content, we will focus on $\rrexp$'s size bound.
We will piece together this bound and show the same bound for annotated regular
expressions in the end.
Unless stated otherwise in this chapter all $\textit{rexp}$s without
bitcodes are seen as $\rrexp$s.
We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
as the former suits people's intuitive way of stating a binary alternative
regular expression.
%-----------------------------------
% SECTION ?
%-----------------------------------
\subsection{Finiteness Proof Using $\rrexp$s}
Now that we have defined the $\rrexp$ datatype, and proven that its size changes
w.r.t derivatives and simplifications mirrors precisely those of annotated regular expressions,
we aim to bound the size of $r \backslash s$ for any $\rrexp$ $r$.
Once we have a bound like:
\[
\llbracket r \backslash_{rsimp} s \rrbracket_r \leq N_r
\]
\noindent
we could easily extend that to
\[
\llbracket a \backslash_{bsimps} s \rrbracket \leq N_r.
\]
\subsection{Roadmap to a Bound for $\textit{Rrexp}$}
The way we obtain the bound for $\rrexp$s is by two steps:
\begin{itemize}
\item
First, we rewrite $r\backslash s$ into something else that is easier
to bound. This step is especially important for the inductive case
$r_1 \cdot r_2$ and $r^*$, where the derivative can grow and bloat in a wild way,
but after simplification they will always be equal or smaller to a form consisting of an alternative
list of regular expressions $f \; (g\; (\sum rs))$ with some functions applied to it, where each element will be distinct after the function application.
\item
Then, for such a sum list of regular expressions $f\; (g\; (\sum rs))$, we can control its size
by estimation, since $\distinctBy$ and $\flts$ are well-behaved and working together would only
reduce the size of a regular expression, not adding to it.
\end{itemize}
\section{Step One: Closed Forms}
We transform the function application $\rderssimp{r}{s}$
into an equivalent form $f\; (g \; (\sum rs))$.
The functions $f$ and $g$ can be anything from $\flts$, $\distinctBy$ and other helper functions from $\bsimp{\_}$.
This way we get a different but equivalent way of expressing : $r\backslash s = f \; (g\; (\sum rs))$, we call the
right hand side the "closed form" of $r\backslash s$.
\subsection{Basic Properties needed for Closed Forms}
\subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully}
The $\textit{rdistinct}$ function, as its name suggests, will
remove duplicates in an \emph{r}$\textit{rexp}$ list,
according to the accumulator
and leave only one of each different element in a list:
\begin{lemma}\label{rdistinctDoesTheJob}
The function $\textit{rdistinct}$ satisfies the following
properties:
\begin{itemize}
\item
If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$.
\item
If list $rs'$ is the result of $\rdistinct{rs}{acc}$,
then $\textit{isDistinct} \; rs'$.
\item
$\rdistinct{rs}{acc} = rs - acc$
\end{itemize}
\end{lemma}
\noindent
The predicate $\textit{isDistinct}$ is for testing
whether a list's elements are all unique. It is defined
recursively on the structure of a regular expression,
and we omit the precise definition here.
\begin{proof}
The first part is by an induction on $rs$.
The second and third part can be proven by using the
induction rules of $\rdistinct{\_}{\_}$.
\end{proof}
\noindent
$\rdistinct{\_}{\_}$ will cancel out all regular expression terms
that are in the accumulator, therefore prepending a list $rs_a$ with an arbitrary
list $rs$ whose elements are all from the accumulator, and then call $\rdistinct{\_}{\_}$
on the resulting list, the output will be as if we had called $\rdistinct{\_}{\_}$
without the prepending of $rs$:
\begin{lemma}
The elements appearing in the accumulator will always be removed.
More precisely,
\begin{itemize}
\item
If $rs \subseteq rset$, then
$\rdistinct{rs@rsa }{acc} = \rdistinct{rsa }{acc}$.
\item
Furthermore, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$,
then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{rs'}{rset}$
\end{itemize}
\end{lemma}
\begin{proof}
By induction on $rs$.
\end{proof}
\noindent
On the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated,
then expanding the accumulator to include that element will not cause the output list to change:
\begin{lemma}
The accumulator can be augmented to include elements not appearing in the input list,
and the output will not change.
\begin{itemize}
\item
If $r \notin rs$, then $\rdistinct{rs}{acc} = \rdistinct{rs}{\{r\} \cup acc}$.
\item
Particularly, when $acc = \varnothing$ and $rs$ de-duplicated, we have\\
\[ \rdistinct{rs}{\varnothing} = rs \]
\end{itemize}
\end{lemma}
\begin{proof}
The first half is by induction on $rs$. The second half is a corollary of the first.
\end{proof}
\noindent
The next property gives the condition for
when $\rdistinct{\_}{\_}$ becomes an identical mapping
for any prefix of an input list, in other words, when can
we ``push out" the arguments of $\rdistinct{\_}{\_}$:
\begin{lemma}\label{distinctRdistinctAppend}
If $\textit{isDistinct} \; rs_1$, and $rs_1 \cap acc = \varnothing$,
then
\[\textit{rdistinct}\; (rs_1 @ rsa)\;\, acc
= rs_1@(\textit{rdistinct} rsa \; (acc \cup rs_1))\]
\end{lemma}
\noindent
In other words, it can be taken out and left untouched in the output.
\begin{proof}
By an induction on $rs_1$, where $rsa$ and $acc$ are allowed to be arbitrary.
\end{proof}
\noindent
$\rdistinct{}{}$ removes any element in anywhere of a list, if it
had appeared previously:
\begin{lemma}\label{distinctRemovesMiddle}
The two properties hold if $r \in rs$:
\begin{itemize}
\item
$\rdistinct{rs}{rset} = \rdistinct{(rs @ [r])}{rset}$\\
and\\
$\rdistinct{(ab :: rs @ [ab])}{rset'} = \rdistinct{(ab :: rs)}{rset'}$
\item
$\rdistinct{ (rs @ rs') }{rset} = \rdistinct{rs @ [r] @ rs'}{rset}$\\
and\\
$\rdistinct{(ab :: rs @ [ab] @ rs'')}{rset'} =
\rdistinct{(ab :: rs @ rs'')}{rset'}$
\end{itemize}
\end{lemma}
\noindent
\begin{proof}
By induction on $rs$. All other variables are allowed to be arbitrary.
The second half of the lemma requires the first half.
Note that for each half's two sub-propositions need to be proven concurrently,
so that the induction goes through.
\end{proof}
\noindent
This allows us to prove ``Idempotency" of $\rdistinct{}{}$ of some kind:
\begin{lemma}\label{rdistinctConcatGeneral}
The following equalities involving multiple applications of $\rdistinct{}{}$ hold:
\begin{itemize}
\item
$\rdistinct{(rs @ rs')}{\varnothing} = \rdistinct{((\rdistinct{rs}{\varnothing})@ rs')}{\varnothing}$
\item
$\rdistinct{(rs @ rs')}{\varnothing} = \rdistinct{(\rdistinct{rs}{\varnothing} @ rs')}{\varnothing}$
\item
If $rset' \subseteq rset$, then $\rdistinct{rs}{rset} =
\rdistinct{(\rdistinct{rs}{rset'})}{rset}$. As a corollary
of this,
\item
$\rdistinct{(rs @ rs')}{rset} = \rdistinct{
(\rdistinct{rs}{\varnothing}) @ rs')}{rset}$. This
gives another corollary use later:
\item
If $a \in rset$, then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{
(\rdistinct{(a :: rs)}{\varnothing} @ rs')}{rset} $,
\end{itemize}
\end{lemma}
\begin{proof}
By \ref{rdistinctDoesTheJob} and \ref{distinctRemovesMiddle}.
\end{proof}
\subsubsection{The Properties of $\backslash_r$, $\backslash_{rsimp}$, $\textit{Rflts}$ and $\textit{Rsimp}_{ALTS}$}
We give in this subsection some properties of how $\backslash_r$, $\backslash_{rsimp}$, $\textit{Rflts}$ and $\textit{Rsimp}_{ALTS} $ interact with each other and with $@$, the concatenation operator.
These will be helpful in later closed form proofs, when
we want to transform the ways in which multiple functions involving
those are composed together
in interleaving derivative and simplification steps.
When the function $\textit{Rflts}$
is applied to the concatenation of two lists, the output can be calculated by first applying the
functions on two lists separately, and then concatenating them together.
\begin{lemma}\label{rfltsProps}
The function $\rflts$ has the below properties:\\
\begin{itemize}
\item
$\rflts \; (rs_1 @ rs_2) = \rflts \; rs_1 @ \rflts \; rs_2$
\item
If $r \neq \RZERO$ and $\nexists rs_1. r = \RALTS{rs}_1$, then $\rflts \; (r::rs) = r :: \rflts \; rs$
\item
$\rflts \; (rs @ [\RZERO]) = \rflts \; rs$
\item
$\rflts \; (rs' @ [\RALTS{rs}]) = \rflts \; rs'@rs$
\item
$\rflts \; (rs @ [\RONE]) = \rflts \; rs @ [\RONE]$
\item
If $r \neq \RZERO$ and $\nexists rs'. r = \RALTS{rs'}$ then $\rflts \; (rs @ [r])
= (\rflts \; rs) @ [r]$
\item
If $r = \RALTS{rs}$ and $r \in rs'$ then for all $r_1 \in rs.
r_1 \in \rflts \; rs'$.
\item
$\rflts \; (rs_a @ \RZERO :: rs_b) = \rflts \; (rs_a @ rs_b)$
\end{itemize}
\end{lemma}
\noindent
\begin{proof}
By induction on $rs_1$ in the first sub-lemma, and induction on $r$ in the second part,
and induction on $rs$, $rs'$, $rs$, $rs'$, $rs_a$ in the third, fourth, fifth, sixth and
last sub-lemma.
\end{proof}
\subsubsection{The $RL$ Function: Language Interpretation of $\textit{Rrexp}$s}
Much like the definition of $L$ on plain regular expressions, one could also
define the language interpretation of $\rrexp$s.
\begin{center}
\begin{tabular}{lcl}
$RL \; (\ZERO)$ & $\dn$ & $\phi$\\
$RL \; (\ONE)$ & $\dn$ & $\{[]\}$\\
$RL \; (c)$ & $\dn$ & $\{[c]\}$\\
$RL \; \sum rs$ & $\dn$ & $ \bigcup_{r \in rs} (RL \; r)$\\
$RL \; (r_1 \cdot r_2)$ & $\dn$ & $ RL \; (r_1) @ RL \; (r_2)$\\
$RL \; (r^*)$ & $\dn$ & $ (RL(r))^*$
\end{tabular}
\end{center}
\noindent
The main use of $RL$ is to establish some connections between $\rsimp{}$
and $\rnullable{}$:
\begin{lemma}
The following properties hold:
\begin{itemize}
\item
If $\rnullable{r}$, then $\rsimp{r} \neq \RZERO$.
\item
$\rnullable{r \backslash s} \quad $ if and only if $\quad \rnullable{\rderssimp{r}{s}}$.
\end{itemize}
\end{lemma}
\begin{proof}
The first part is by induction on $r$.
The second part is true because property
\[ RL \; r = RL \; (\rsimp{r})\] holds.
\end{proof}
\subsubsection{$\rsimp{}$ is Non-Increasing}
In this subsection, we prove that the function $\rsimp{}$ does not
make the $\llbracket \rrbracket_r$ size increase.
\begin{lemma}\label{rsimpSize}
$\llbracket \rsimp{r} \rrbracket_r \leq \llbracket r \rrbracket_r$
\end{lemma}
\subsubsection{Simplified $\textit{Rrexp}$s are Good}
We formalise the notion of ``good" regular expressions,
which means regular expressions that
are not fully simplified. For alternative regular expressions that means they
do not contain any nested alternatives like
\[ r_1 + (r_2 + r_3) \], un-removed $\RZERO$s like \[\RZERO + r\]
or duplicate elements in a children regular expression list like \[ \sum [r, r, \ldots]\]:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\good\; \RZERO$ & $\dn$ & $\textit{false}$\\
$\good\; \RONE$ & $\dn$ & $\textit{true}$\\
$\good\; \RCHAR{c}$ & $\dn$ & $\btrue$\\
$\good\; \RALTS{[]}$ & $\dn$ & $\bfalse$\\
$\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\
$\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ &
$\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\
& & $\textit{and}\; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \, \textit{and}\; \, \textit{nonAlt}\; r')$\\
$\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\
$\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\
$\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\
$\good \; \RSEQ{r_1}{r_2}$ & $\dn$ & $\good \; r_1 \;\, \textit{and} \;\, \good \; r_2$\\
$\good \; \RSTAR{r}$ & $\dn$ & $\btrue$\\
\end{tabular}
\end{center}
\noindent
The predicate $\textit{nonAlt}$ evaluates to true when the regular expression is not an
alternative, and false otherwise.
The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that
its non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$,
and unique:
\begin{lemma}\label{rsimpaltsGood}
If $rs \neq []$ and forall $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$,
then $\good \; (\rsimpalts \; rs)$ if and only if forall $r \in rs. \; \good \; r$.
\end{lemma}
\noindent
We also note that
if a regular expression $r$ is good, then $\rflts$ on the singleton
list $[r]$ will not break goodness:
\begin{lemma}\label{flts2}
If $\good \; r$, then forall $r' \in \rflts \; [r]. \; \good \; r'$ and $\textit{nonAlt} \; r'$.
\end{lemma}
\begin{proof}
By an induction on $r$.
\end{proof}
\noindent
The other observation we make about $\rsimp{r}$ is that it never
comes with nested alternatives, which we describe as the $\nonnested$
property:
\begin{center}
\begin{tabular}{lcl}
$\nonnested \; \, \sum []$ & $\dn$ & $\btrue$\\
$\nonnested \; \, \sum ((\sum rs_1) :: rs_2)$ & $\dn$ & $\bfalse$\\
$\nonnested \; \, \sum (r :: rs)$ & $\dn$ & $\nonnested (\sum rs)$\\
$\nonnested \; \, r $ & $\dn$ & $\btrue$
\end{tabular}
\end{center}
\noindent
The $\rflts$ function
always opens up nested alternatives,
which enables $\rsimp$ to be non-nested:
\begin{lemma}\label{nonnestedRsimp}
$\nonnested \; (\rsimp{r})$
\end{lemma}
\begin{proof}
By an induction on $r$.
\end{proof}
\noindent
With this we could prove that a regular expressions
after simplification and flattening and de-duplication,
will not contain any alternative regular expression directly:
\begin{lemma}\label{nonaltFltsRd}
If $x \in \rdistinct{\rflts\; (\map \; \rsimp{} \; rs)}{\varnothing}$
then $\textit{nonAlt} \; x$.
\end{lemma}
\begin{proof}
By \ref{nonnestedRsimp}.
\end{proof}
\noindent
The other thing we know is that once $\rsimp{}$ had finished
processing an alternative regular expression, it will not
contain any $\RZERO$s, this is because all the recursive
calls to the simplification on the children regular expressions
make the children good, and $\rflts$ will not take out
any $\RZERO$s out of a good regular expression list,
and $\rdistinct{}$ will not mess with the result.
\begin{lemma}\label{flts3Obv}
The following are true:
\begin{itemize}
\item
If for all $r \in rs. \, \good \; r $ or $r = \RZERO$,
then for all $r \in \rflts\; rs. \, \good \; r$.
\item
If $x \in \rdistinct{\rflts\; (\map \; rsimp{}\; rs)}{\varnothing}$
and for all $y$ such that $\llbracket y \rrbracket_r$ less than
$\llbracket rs \rrbracket_r + 1$, either
$\good \; (\rsimp{y})$ or $\rsimp{y} = \RZERO$,
then $\good \; x$.
\end{itemize}
\end{lemma}
\begin{proof}
The first part is by induction on $rs$, where the induction
rule is the inductive cases for $\rflts$.
The second part is a corollary from the first part.
\end{proof}
And this leads to good structural property of $\rsimp{}$,
that after simplification, a regular expression is
either good or $\RZERO$:
\begin{lemma}\label{good1}
For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$.
\end{lemma}
\begin{proof}
By an induction on $r$. The inductive measure is the size $\llbracket \rrbracket_r$.
Lemma \ref{rsimpSize} says that
$\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to
$\llbracket r \rrbracket_r$.
Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case,
Inductive hypothesis applies to the children regular expressions
$r_1$, $r_2$, etc. The lemma \ref{flts3Obv}'s precondition is satisfied
by that as well.
The lemmas \ref{nonnestedRsimp} and \ref{nonaltFltsRd} are used
to ensure that goodness is preserved at the topmost level.
\end{proof}
We shall prove that any good regular expression is
a fixed-point for $\rsimp{}$.
First we prove an auxiliary lemma:
\begin{lemma}\label{goodaltsNonalt}
If $\good \; \sum rs$, then $\rflts\; rs = rs$.
\end{lemma}
\begin{proof}
By an induction on $\sum rs$. The inductive rules are the cases
for $\good$.
\end{proof}
\noindent
Now we are ready to prove that good regular expressions are invariant
of $\rsimp{}$ application:
\begin{lemma}\label{test}
If $\good \;r$ then $\rsimp{r} = r$.
\end{lemma}
\begin{proof}
By an induction on the inductive cases of $\good$.
The lemma \ref{goodaltsNonalt} is used in the alternative
case where 2 or more elements are present in the list.
\end{proof}
\noindent
Given below is a property involving $\rflts$, $\rdistinct{}{}$, $\rsimp{}$ and $\rsimp_{ALTS}$,
which requires $\ref{good1}$ to go through smoothly.
It says that an application of $\rsimp_{ALTS}$ can be "absorbed",
if it its output is concatenated with a list and then applied to $\rflts$.
\begin{lemma}\label{flattenRsimpalts}
$\rflts \; ( (\rsimp_{ALTS} \;
(\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing})) ::
\map \; \rsimp{} \; rs' ) =
\rflts \; ( (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing}) @ (
\map \; \rsimp{rs'}))$
\end{lemma}
\begin{proof}
By \ref{good1}.
\end{proof}
\noindent
We are also
\subsubsection{$\rsimp$ is Idempotent}
The idempotency of $\rsimp$ is very useful in
manipulating regular expression terms into desired
forms so that key steps allowing further rewriting to closed forms
are possible.
\begin{lemma}\label{rsimpIdem}
$\rsimp{r} = \rsimp{\rsimp{r}}$
\end{lemma}
\begin{proof}
By \ref{test} and \ref{good1}.
\end{proof}
\noindent
This property means we do not have to repeatedly
apply simplification in each step, which justifies
our definition of $\blexersimp$.
On the other hand, we could repeat the same $\rsimp{}$ applications
on regular expressions as many times as we want, if we have at least
one simplification applied to it, and apply it wherever we would like to:
\begin{corollary}\label{headOneMoreSimp}
The following properties hold, directly from \ref{rsimpIdem}:
\begin{itemize}
\item
$\map \; \rsimp{(r :: rs)} = \map \; \rsimp{} \; (\rsimp{r} :: rs)$
\item
$\rsimp{(\RALTS{rs})} = \rsimp{(\RALTS{\map \; \rsimp{} \; rs})}$
\end{itemize}
\end{corollary}
\noindent
This will be useful in later closed form proof's rewriting steps.
Similarly, we point out the following useful facts below:
\begin{lemma}
The following equalities hold if $r = \rsimp{r'}$ for some $r'$:
\begin{itemize}
\item
If $r = \sum rs$ then $\rsimpalts \; rs = \sum rs$.
\item
If $r = \sum rs$ then $\rdistinct{rs}{\varnothing} = rs$.
\item
$\rsimpalts \; (\rdistinct{\rflts \; [r]}{\varnothing}) = r$.
\end{itemize}
\end{lemma}
\begin{proof}
By application of \ref{rsimpIdem} and \ref{good1}.
\end{proof}
\noindent
With the idempotency of $\rsimp{}$ and its corollaries,
we can start proving some key equalities leading to the
closed forms.
Now presented are a few equivalent terms under $\rsimp{}$.
We use $r_1 \sequal r_2 $ here to denote $\rsimp{r_1} = \rsimp{r_2}$.
\begin{lemma}
\begin{itemize}
The following equivalence hold:
\item
$\rsimpalts \; (\RZERO :: rs) \sequal \rsimpalts\; rs$
\item
$\rsimpalts \; rs \sequal \rsimpalts (\map \; \rsimp{} \; rs)$
\item
$\RALTS{\RALTS{rs}} \sequal \RALTS{rs}$
\item
$\sum ((\sum rs_a) :: rs_b) \sequal \sum rs_a @ rs_b$
\item
$\RALTS{rs} = \RALTS{\map \; \rsimp{} \; rs}$
\end{itemize}
\end{lemma}
\begin{proof}
By induction on the lists involved.
\end{proof}
\noindent
Similarly,
we introduce the equality for $\sum$ when certain child regular expressions
are $\sum$ themselves:
\begin{lemma}\label{simpFlatten3}
One can flatten the inside $\sum$ of a $\sum$ if it is being
simplified. Concretely,
\begin{itemize}
\item
If for all $r \in rs, rs', rs''$, we have $\good \; r $
or $r = \RZERO$, then $\sum (rs' @ rs @ rs'') \sequal
\sum (rs' @ [\sum rs] @ rs'')$ holds. As a corollary,
\item
$\sum (rs' @ [\sum rs] @ rs'') \sequal \sum (rs' @ rs @ rs'')$
\end{itemize}
\end{lemma}
\begin{proof}
By rewriting steps involving the use of \ref{test} and \ref{rdistinctConcatGeneral}.
The second sub-lemma is a corollary of the previous.
\end{proof}
%Rewriting steps not put in--too long and complicated-------------------------------
\begin{comment}
\begin{center}
$\rsimp{\sum (rs' @ rs @ rs'')} \stackrel{def of bsimp}{=}$ \\
$\rsimpalts \; (\rdistinct{\rflts \; ((\map \; \rsimp{}\; rs') @ (\map \; \rsimp{} \; rs ) @ (\map \; \rsimp{} \; rs''))}{\varnothing})$ \\
$\stackrel{by \ref{test}}{=}
\rsimpalts \; (\rdistinct{(\rflts \; rs' @ \rflts \; rs @ \rflts \; rs'')}{
\varnothing})$\\
$\stackrel{by \ref{rdistinctConcatGeneral}}{=}
\rsimpalts \; (\rdistinct{\rflts \; rs'}{\varnothing} @ \rdistinct{(
\rflts\; rs @ \rflts \; rs'')}{\rflts \; rs'})$\\
\end{center}
\end{comment}
%Rewriting steps not put in--too long and complicated-------------------------------
\noindent
We need more equalities like the above to enable a closed form,
but to proceed we need to introduce two rewrite relations,
to make things smoother.
\subsubsection{The rewrite relation $\hrewrite$ , $\scfrewrites$ , $\frewrite$ and $\grewrite$}
Insired by the success we had in the correctness proof
in \ref{Bitcoded2}, where we invented
a term rewriting system to capture the similarity between terms,
we follow suit here defining simplification
steps as rewriting steps. This allows capturing
similarities between terms that would be otherwise
hard to express.
We use $\hrewrite$ for one-step atomic rewrite of
regular expression simplification,
$\frewrite$ for rewrite of list of regular expressions that
include all operations carried out in $\rflts$, and $\grewrite$ for
rewriting a list of regular expressions possible in both $\rflts$ and $\rdistinct{}{}$.
Their reflexive transitive closures are used to denote zero or many steps,
as was the case in the previous chapter.
The presentation will be more concise than that in \ref{Bitcoded2}.
To differentiate between the rewriting steps for annotated regular expressions
and $\rrexp$s, we add characters $h$ and $g$ below the squig arrow symbol
to mean atomic simplification transitions
of $\rrexp$s and $\rrexp$ lists, respectively.
List of one-step rewrite rules for $\rrexp$ ($\hrewrite$):
\begin{center}
\begin{mathpar}
\inferrule[RSEQ0L]{}{\RZERO \cdot r_2 \hrewrite \RZERO\\}
\inferrule[RSEQ0R]{}{r_1 \cdot \RZERO \hrewrite \RZERO\\}
\inferrule[RSEQ1]{}{(\RONE \cdot r) \hrewrite r\\}\\
\inferrule[RSEQL]{ r_1 \hrewrite r_2}{r_1 \cdot r_3 \hrewrite r_2 \cdot r_3\\}
\inferrule[RSEQR]{ r_3 \hrewrite r_4}{r_1 \cdot r_3 \hrewrite r_1 \cdot r_4\\}\\
\inferrule[RALTSChild]{r \hrewrite r'}{\sum (rs_1 @ [r] @ rs_2) \hrewrite \sum (rs_1 @ [r'] @ rs_2)\\}
\inferrule[RALTS0]{}{\sum (rs_a @ [\RZERO] @ rs_b) \hrewrite \sum (rs_a @ rs_b)}
\inferrule[RALTSNested]{}{\sum (rs_a @ [\sum rs_1] @ rs_b) \hrewrite \sum (rs_a @ rs_1 @ rs_b)}
\inferrule[RALTSNil]{}{ \sum [] \hrewrite \RZERO\\}
\inferrule[RALTSSingle]{}{ \sum [r] \hrewrite r\\}
\inferrule[RALTSDelete]{\\ r_1 = r_2}{\sum rs_a @ [r_1] @ rs_b @ [r_2] @ rsc \hrewrite \sum rs_a @ [r_1] @ rs_b @ rs_c}
\end{mathpar}
\end{center}
List of rewrite rules for a list of regular expressions,
where each element can rewrite in many steps to the other (scf stands for
li\emph{s}t \emph{c}losed \emph{f}orm). This relation is similar to the
$\stackrel{s*}{\rightsquigarrow}$ for annotated regular expressions.
\begin{center}
\begin{mathpar}
\inferrule{}{[] \scfrewrites [] }
\inferrule{r \hrewrites r' \\ rs \scfrewrites rs'}{r :: rs \scfrewrites r' :: rs'}
\end{mathpar}
\end{center}
%frewrite
List of one-step rewrite rules for flattening
a list of regular expressions($\frewrite$):
\begin{center}
\begin{mathpar}
\inferrule{}{\RZERO :: rs \frewrite rs \\}
\inferrule{}{(\sum rs) :: rs_a \frewrite rs @ rs_a \\}
\inferrule{rs_1 \frewrite rs_2}{r :: rs_1 \frewrite r :: rs_2}
\end{mathpar}
\end{center}
Lists of one-step rewrite rules for flattening and de-duplicating
a list of regular expressions ($\grewrite$):
\begin{center}
\begin{mathpar}
\inferrule{}{\RZERO :: rs \grewrite rs \\}
\inferrule{}{(\sum rs) :: rs_a \grewrite rs @ rs_a \\}
\inferrule{rs_1 \grewrite rs_2}{r :: rs_1 \grewrite r :: rs_2}
\inferrule[dB]{}{rs_a @ [a] @ rs_b @[a] @ rs_c \grewrite rs_a @ [a] @ rsb @ rsc}
\end{mathpar}
\end{center}
\noindent
The reason why we take the trouble of defining
two separate list rewriting definitions $\frewrite$ and $\grewrite$
is to separate the two stages of simplification: flattening and de-duplicating.
Sometimes $\grewrites$ is slightly too powerful
so we would rather use $\frewrites$ which makes certain rewriting steps
more straightforward to prove.
For example, when proving the closed-form for the alternative regular expression,
one of the rewriting steps would be:
\begin{lemma}
$\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal
\sum (\rDistinct \;\; (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing)
$
\end{lemma}
\noindent
Proving this is by first showing
\begin{lemma}\label{earlyLaterDerFrewrites}
$\map \; (\_ \backslash x) \; (\rflts \; rs) \frewrites
\rflts \; (\map \; (\_ \backslash x) \; rs)$
\end{lemma}
\noindent
and then using lemma
\begin{lemma}\label{frewritesSimpeq}
If $rs_1 \frewrites rs_2 $, then $\sum (\rDistinct \; rs_1 \; \varnothing) \sequal
\sum (\rDistinct \; rs_2 \; \varnothing)$.
\end{lemma}
\noindent
is a piece of cake.
But this trick will not work for $\grewrites$.
For example, a rewriting step in proving
closed forms is:
\begin{center}
$\rsimp{(\rsimpalts \; (\map \; (\_ \backslash x) \; (\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs}))))}{\varnothing})))}$\\
$=$ \\
$\rsimp{(\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; (\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}))} $
\noindent
\end{center}
For this one would hope to have a rewriting relation between the two lists involved,
similar to \ref{earlyLaterDerFrewrites}. However, it turns out that
\begin{center}
$\map \; (\_ \backslash x) \; (\rDistinct \; rs \; rset) \grewrites \rDistinct \; (\map \;
(\_ \backslash x) \; rs) \; ( rset \backslash x)$
\end{center}
\noindent
does $\mathbf{not}$ hold in general.
For this rewriting step we will introduce some slightly more cumbersome
proof technique in later sections.
The point is that $\frewrite$
allows us to prove equivalence in a straightforward two-step method that is
not possible for $\grewrite$, thereby reducing the complexity of the entire proof.
\subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$}
We present in the below lemma a few pairs of terms that are rewritable via
$\grewrites$:
\begin{lemma}\label{gstarRdistinctGeneral}
\begin{itemize}
\item
$rs_1 @ rs \grewrites rs_1 @ (\rDistinct \; rs \; rs_1)$
\item
$rs \grewrites \rDistinct \; rs \; \varnothing$
\item
$rs_a @ (\rDistinct \; rs \; rs_a) \grewrites rs_a @ (\rDistinct \;
rs \; (\{\RZERO\} \cup rs_a))$
\item
$rs \;\; @ \;\; \rDistinct \; rs_a \; rset \grewrites rs @ \rDistinct \; rs_a \;
(rest \cup rs)$
\end{itemize}
\end{lemma}
\noindent
If a pair of terms $rs_1, rs_2$ are rewritable via $\grewrites$ to each other,
then they are equivalent under $\rsimp{}$:
\begin{lemma}\label{grewritesSimpalts}
If $rs_1 \grewrites rs_2$, then
we have the following equivalence hold:
\begin{itemize}
\item
$\sum rs_1 \sequal \sum rs_2$
\item
$\rsimpalts \; rs_1 \sequal \rsimpalts \; rs_2$
\end{itemize}
\end{lemma}
\noindent
Here are a few connecting lemmas showing that
if a list of regular expressions can be rewritten using $\grewrites$ or $\frewrites $ or
$\scfrewrites$,
then an alternative constructor taking the list can also be rewritten using $\hrewrites$:
\begin{lemma}
\begin{itemize}
\item
If $rs \grewrites rs'$ then $\sum rs \hrewrites \sum rs'$.
\item
If $rs \grewrites rs'$ then $\sum rs \hrewrites \rsimpalts \; rs'$
\item
If $rs_1 \scfrewrites rs_2$ then $\sum (rs @ rs_1) \hrewrites \sum (rs @ rs_2)$
\item
If $rs_1 \scfrewrites rs_2$ then $\sum rs_1 \hrewrites \sum rs_2$
\end{itemize}
\end{lemma}
\noindent
Here comes the meat of the proof,
which says that once two lists are rewritable to each other,
then they are equivalent under $\rsimp{}$:
\begin{lemma}
If $r_1 \hrewrites r_2$ then $r_1 \sequal r_2$.
\end{lemma}
\noindent
And similar to \ref{Bitcoded2} one can preserve rewritability after taking derivative
of two regular expressions on both sides:
\begin{lemma}\label{interleave}
If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$
\end{lemma}
\noindent
This allows proving more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$ now.
\begin{lemma}\label{insideSimpRemoval}
$\rsimp{\rder{c}{\rsimp{r}}} = \rsimp{\rder{c}{r}} $
\end{lemma}
\noindent
\begin{proof}
By \ref{interleave} and \ref{rsimpIdem}.
\end{proof}
\noindent
And this unlocks more equivalent terms:
\begin{lemma}\label{Simpders}
As corollaries of \ref{insideSimpRemoval}, we have
\begin{itemize}
\item
If $s \neq []$ then $\rderssimp{r}{s} = \rsimp{(\rders \; r \; s)}$.
\item
$\rsimpalts \; (\map \; (\_ \backslash_r x) \;
(\rdistinct{rs}{\varnothing})) \sequal
\rsimpalts \; (\rDistinct \;
(\map \; (\_ \backslash_r x) rs) \;\varnothing )$
\end{itemize}
\end{lemma}
\noindent
Finally,
together with
\begin{lemma}\label{rderRsimpAltsCommute}
$\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$
\end{lemma}
\noindent
this leads to the first closed form--
\begin{lemma}\label{altsClosedForm}
\begin{center}
$\rderssimp{(\sum rs)}{s} \sequal
\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
\end{center}
\end{lemma}
\noindent
\begin{proof}
By a reverse induction on the string $s$.
One rewriting step, as we mentioned earlier,
involves
\begin{center}
$\rsimpalts \; (\map \; (\_ \backslash x) \;
(\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \;
(\lambda r. \rderssimp{r}{xs}))))}{\varnothing}))
\sequal
\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \;
(\rflts \; (\map \; (\rsimp{} \; \circ \;
(\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}) $.
\end{center}
This can be proven by a combination of
\ref{grewritesSimpalts}, \ref{gstarRdistinctGeneral}, \ref{rderRsimpAltsCommute}, and
\ref{insideSimpRemoval}.
\end{proof}
\noindent
This closed form has a variant which can be more convenient in later proofs:
\begin{corollary}
If $s \neq []$ then
$\rderssimp \; (\sum \; rs) \; s =
\rsimp{(\sum \; (\map \; \rderssimp{\_}{s} \; rs))}$.
\end{corollary}
\noindent
The harder closed forms are the sequence and star ones.
Before we go on to obtain them, some preliminary definitions
are needed to make proof statements concise.
\section{"Closed Forms" of regular expressions' derivatives w.r.t strings}
We note that the different ways in which regular expressions are
nested do not matter under $\rsimp{}$:
\begin{center}
To be proven:\\
$\rsimp{r} \stackrel{?}{=} \rsimp{r'}$ if $r = \sum [r_1, r_2, r_3, \ldots]$
and $r' =(\ldots ((r_1 + r_2) + r_3) + \ldots)$
\end{center}
\noindent
This intuition is also echoed by IndianPaper, where they gave
a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$:
\begin{center}
$((r_1 \cdot r_2) \backslash_r c_1) \stackrel{\backslash_r c_2}{=}
((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\rnullable \; r_1),\; r_2 \backslash_r c_1))
\stackrel{\backslash_r c_2}{=}
((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2)
+ (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)
$
\end{center}
\noindent
The $\delta \; b \; r$ function above turns the entire term into $\RZERO$
if the boolean condition $b$ evaluates to false, and outputs $r$ otherwise.
The equality in their derivation steps should be interpretated
as language equivalence. To pin down the intuition into rigorous terms,
$\sflat{}$ is used to enable the transformation from
a left-associative nested sequence of alternatives into
a flattened list:
$r = \sum [r_1, r_2, r_3, \ldots] \stackrel{\sflat{}}{\rightarrow}
(\ldots ((r_1 + r_2) + r_3) + \ldots)$
The definitions $\sflat{}$, $\sflataux{}$ are given below.
\begin{center}
\begin{tabular}{ccc}
$\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
$\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
$\sflataux r$ & $=$ & $ [r]$
\end{tabular}
\end{center}
\begin{center}
\begin{tabular}{ccc}
$\sflat{(\sum r :: rs)}$ & $=$ & $\sum (\sflataux{r} @ rs)$\\
$\sflat{\sum []}$ & $ = $ & $ \sum []$\\
$\sflat r$ & $=$ & $ r$
\end{tabular}
\end{center}
The intuition behind $\sflataux{\_}$ is to break up nested regexes
of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
It will return the singleton list $[r]$ otherwise.
$\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps
the output type a regular expression, not a list.
$\sflataux{\_}$ and $\sflat{\_}$ is only recursive in terms of the
first element of the list of children of
an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for dealing with the regular expression
$r_1 \cdot r_2 \backslash s$.
\subsection{The $\textit{vsuf}$ Function}
Note that $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2)
+ (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)$ does not faithfully
represent what would the intermediate derivatives would actually look like.
For example, when $r_1$ and $r_1 \backslash_r c_1$ are nullable,
the regular expression would not look like $(r_1 \backslash_r c_1c_2 + \RZERO ) + \RZERO$,
but actually $r_1\backslash_r c_1c_2$, the redundant $\RZERO$s will not be created in the
first place.
In a closed-form one would want to take into account this
and generate the list of string pairs $(s', s'')$ where $s'@s'' = s$ such that
only $r_1 \backslash s'$ nullable.
With $\sflat{\_}$ and $\sflataux{\_}$ we make
precise what "closed forms" we have for the sequence derivatives and their simplifications,
in other words, how can we construct $(r_1 \cdot r_2) \backslash s$
and $\bderssimp{(r_1\cdot r_2)}{s}$,
if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)
and $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$ ranges over
the substring of $s$?
First let's look at a series of derivatives steps on a sequence
regular expression, (assuming) that each time the first
component of the sequence is always nullable):
\begin{center}
$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad
\ldots$
\end{center}
%TODO: cite indian paper
Indianpaper have come up with a slightly more formal way of putting the above process:
\begin{center}
$r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) +
\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots
+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$
\end{center}
where $\delta(b, r)$ will produce $r$
if $b$ evaluates to true,
and $\ZERO$ otherwise.
But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical
equivalence. To make this intuition useful
for a formal proof, we need something
stronger than language equivalence.
With the help of $\sflat{\_}$ we can state the equation in Indianpaper
more rigorously:
\begin{lemma}
$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
\end{lemma}
The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
\begin{center}
\begin{tabular}{lcl}
$\vsuf{[]}{\_} $ & $=$ & $[]$\\
$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
&& $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $
\end{tabular}
\end{center}
It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable,
and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in
the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.
In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing
the entire result of $(r_1 \cdot r_2) \backslash s$,
it only stores all the $s''$ such that $r_2 \backslash s''$ are occurring terms in $(r_1\cdot r_2)\backslash s$.
With this we can also add simplifications to both sides and get
\begin{lemma}
$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
\end{lemma}
Together with the idempotency property of $\rsimp{}$\ref{rsimpIdem},
%TODO: state the idempotency property of rsimp
it is possible to convert the above lemma to obtain a "closed form"
for derivatives nested with simplification:
\begin{lemma}
$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
\end{lemma}
And now the reason we have (1) in section 1 is clear:
$\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$,
is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$
as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$.
%----------------------------------------------------------------------------------------
% SECTION 3
%----------------------------------------------------------------------------------------
\section{interaction between $\distinctWith$ and $\flts$}
Note that it is not immediately obvious that
\begin{center}
$\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket $.
\end{center}
The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of
duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$.
\subsection{A Closed Form for the Sequence Regular Expression}
\begin{quote}\it
Claim: For regular expressions $r_1 \cdot r_2$, we claim that
\begin{center}
$ \rderssimp{r_1 \cdot r_2}{s} =
\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$
\end{center}
\end{quote}
\noindent
Before we get to the proof that says the intermediate result of our lexer will
remain finitely bounded, which is an important efficiency/liveness guarantee,
we shall first develop a few preparatory properties and definitions to
make the process of proving that a breeze.
We define rewriting relations for $\rrexp$s, which allows us to do the
same trick as we did for the correctness proof,
but this time we will have stronger equalities established.
\section{Estimating the Closed Forms' sizes}
The closed form $f\; (g\; (\sum rs)) $ is controlled
by the size of $rs'$, where every element in $rs'$ is distinct, and each element can be described by som e inductive sub-structures (for example when $r = r_1 \cdot r_2$ then $rs'$ will be solely comprised of $r_1 \backslash s'$ and $r_2 \backslash s''$, $s'$ and $s''$ being sub-strings of $s$), which will have a numeric uppder bound according to inductive hypothesis, which controls $r \backslash s$.
%-----------------------------------
% SECTION 2
%-----------------------------------
\begin{theorem}
For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$
\end{theorem}
\noindent
\begin{proof}
We prove this by induction on $r$. The base cases for $\AZERO$,
$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
%
\begin{center}
\begin{tabular}{lcll}
& & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
[\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
& $\leq$ &
$\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
[$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
\llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
\llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
\end{tabular}
\end{center}
\noindent
where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
The reason why we could write the derivatives of a sequence this way is described in section 2.
The term (2) is used to control (1).
That is because one can obtain an overall
smaller regex list
by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
Section 3 is dedicated to its proof.
In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is
bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
We reason similarly for $\STAR$.\medskip
\end{proof}
What guarantee does this bound give us?
Whatever the regex is, it will not grow indefinitely.
Take our previous example $(a + aa)^*$ as an example:
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
xlabel={number of $a$'s},
x label style={at={(1.05,-0.05)}},
ylabel={regex size},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax= 40,
ytick={0,10,...,40},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={$(a + aa)^*$},
legend pos=north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}
We are able to limit the size of the regex $(a + aa)^*$'s derivatives
with our simplification
rules very effectively.
In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound
is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.
Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function
$f(x) = x * 2^x$.
This means the bound we have will surge up at least
tower-exponentially with a linear increase of the depth.
For a regex of depth $n$, the bound
would be approximately $4^n$.
Test data in the graphs from randomly generated regular expressions
shows that the giant bounds are far from being hit.
%a few sample regular experessions' derivatives
%size change
%TODO: giving regex1_size_change.data showing a few regexes' size changes
%w;r;t the input characters number, where the size is usually cubic in terms of original size
%a*, aa*, aaa*, .....
%randomly generated regexes
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
xlabel={number of $a$'s},
x label style={at={(1.05,-0.05)}},
ylabel={regex size},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=1000,
ytick={0,100,...,1000},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={regex1},
legend pos=north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=1000,
ytick={0,100,...,1000},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={regex2},
legend pos=north west,
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=1000,
ytick={0,100,...,1000},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={regex3},
legend pos=north west,
legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.}
\end{tabular}
\end{center}
\noindent
Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the
original size.
This suggests a link towrads "partial derivatives"
introduced by Antimirov \cite{Antimirov95}.
\section{Antimirov's partial derivatives}
The idea behind Antimirov's partial derivatives
is to do derivatives in a similar way as suggested by Brzozowski,
but maintain a set of regular expressions instead of a single one:
%TODO: antimirov proposition 3.1, needs completion
\begin{center}
\begin{tabular}{ccc}
$\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\
$\partial_x(\ONE)$ & $=$ & $\phi$
\end{tabular}
\end{center}
Rather than joining the calculated derivatives $\partial_x a$ and $\partial_x b$ together
using the alternatives constructor, Antimirov cleverly chose to put them into
a set instead. This breaks the terms in a derivative regular expression up,
allowing us to understand what are the "atomic" components of it.
For example, To compute what regular expression $x^*(xx + y)^*$'s
derivative against $x$ is made of, one can do a partial derivative
of it and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$
from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$.
To get all the "atomic" components of a regular expression's possible derivatives,
there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes
whatever character is available at the head of the string inside the language of a
regular expression, and gives back the character and the derivative regular expression
as a pair (which he called "monomial"):
\begin{center}
\begin{tabular}{ccc}
$\lf(\ONE)$ & $=$ & $\phi$\\
$\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\
$\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
$\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\
\end{tabular}
\end{center}
%TODO: completion
There is a slight difference in the last three clauses compared
with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular
expression $r$ with every element inside $\textit{rset}$ to create a set of
sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates
on a set of monomials (which Antimirov called "linear form") and a regular
expression, and returns a linear form:
\begin{center}
\begin{tabular}{ccc}
$l \bigodot (\ZERO)$ & $=$ & $\phi$\\
$l \bigodot (\ONE)$ & $=$ & $l$\\
$\phi \bigodot t$ & $=$ & $\phi$\\
$\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\
$\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\
$\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\
$\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
$\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\
\end{tabular}
\end{center}
%TODO: completion
Some degree of simplification is applied when doing $\bigodot$, for example,
$l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$,
and $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and
$\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$,
and so on.
With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regex $r$ with
an iterative procedure:
\begin{center}
\begin{tabular}{llll}
$\textit{while}$ & $(\Delta_i \neq \phi)$ & & \\
& $\Delta_{i+1}$ & $ =$ & $\lf(\Delta_i) - \PD_i$ \\
& $\PD_{i+1}$ & $ =$ & $\Delta_{i+1} \cup \PD_i$ \\
$\partial_{UNIV}(r)$ & $=$ & $\PD$ &
\end{tabular}
\end{center}
$(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
However, if we introduce them in our
setting we would lose the POSIX property of our calculated values.
A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$.
If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer
would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of
what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information
in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$
occurs, and apply them in the right order once we get
a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value.
This is unlike the simplification we had before, where the rewriting rules
such as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value.
We will discuss better
bounds in the last section of this chapter.\\[-6.5mm]
%----------------------------------------------------------------------------------------
% SECTION ??
%----------------------------------------------------------------------------------------
%-----------------------------------
% SECTION syntactic equivalence under simp
%-----------------------------------
\section{Syntactic Equivalence Under $\simp$}
We prove that minor differences can be annhilated
by $\simp$.
For example,
\begin{center}
$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) =
\simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
\end{center}
%----------------------------------------------------------------------------------------
% SECTION ALTS CLOSED FORM
%----------------------------------------------------------------------------------------
\section{A Closed Form for \textit{ALTS}}
Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
There are a few key steps, one of these steps is
One might want to prove this by something a simple statement like:
For this to hold we want the $\textit{distinct}$ function to pick up
the elements before and after derivatives correctly:
$r \in rset \equiv (rder x r) \in (rder x rset)$.
which essentially requires that the function $\backslash$ is an injective mapping.
Unfortunately the function $\backslash c$ is not an injective mapping.
\subsection{function $\backslash c$ is not injective (1-to-1)}
\begin{center}
The derivative $w.r.t$ character $c$ is not one-to-one.
Formally,
$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
\end{center}
This property is trivially true for the
character regex example:
\begin{center}
$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
\end{center}
But apart from the cases where the derivative
output is $\ZERO$, are there non-trivial results
of derivatives which contain strings?
The answer is yes.
For example,
\begin{center}
Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
where $a$ is not nullable.\\
$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
\end{center}
We start with two syntactically different regexes,
and end up with the same derivative result.
This is not surprising as we have such
equality as below in the style of Arden's lemma:\\
\begin{center}
$L(A^*B) = L(A\cdot A^* \cdot B + B)$
\end{center}
%----------------------------------------------------------------------------------------
% SECTION 4
%----------------------------------------------------------------------------------------
\section{A Bound for the Star Regular Expression}
We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then
the property of the $\distinct$ function.
Now we try to get a bound on $r^* \backslash s$ as well.
Again, we first look at how a star's derivatives evolve, if they grow maximally:
\begin{center}
$r^* \quad \longrightarrow_{\backslash c} \quad (r\backslash c) \cdot r^* \quad \longrightarrow_{\backslash c'} \quad
r \backslash cc' \cdot r^* + r \backslash c' \cdot r^* \quad \longrightarrow_{\backslash c''} \quad
(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) \quad \longrightarrow_{\backslash c'''}
\quad \ldots$
\end{center}
When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$,
$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not
count the possible size explosions of $r \backslash c$ themselves.
Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like
$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $
into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$:
%TODO: definitions of and \hflataux \hflat
\begin{center}
\begin{tabular}{ccc}
$\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
$\hflataux r$ & $=$ & $ [r]$
\end{tabular}
\end{center}
\begin{center}
\begin{tabular}{ccc}
$\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\
$\hflat r$ & $=$ & $ r$
\end{tabular}
\end{center}s
Again these definitions are tailor-made for dealing with alternatives that have
originated from a star's derivatives, so we don't attempt to open up all possible
regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
elements.
We give a predicate for such "star-created" regular expressions:
\begin{center}
\begin{tabular}{lcr}
& & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
$\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$
\end{tabular}
\end{center}
These definitions allows us the flexibility to talk about
regular expressions in their most convenient format,
for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $
instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.
These definitions help express that certain classes of syntatically
distinct regular expressions are actually the same under simplification.
This is not entirely true for annotated regular expressions:
%TODO: bsimp bders \neq bderssimp
\begin{center}
$(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
\end{center}
For bit-codes, the order in which simplification is applied
might cause a difference in the location they are placed.
If we want something like
\begin{center}
$\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
\end{center}
Some "canonicalization" procedure is required,
which either pushes all the common bitcodes to nodes
as senior as possible:
\begin{center}
$_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
\end{center}
or does the reverse. However bitcodes are not of interest if we are talking about
the $\llbracket r \rrbracket$ size of a regex.
Therefore for the ease and simplicity of producing a
proof for a size bound, we are happy to restrict ourselves to
unannotated regular expressions, and obtain such equalities as
\begin{lemma}
$\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
\end{lemma}
\begin{proof}
By using the rewriting relation $\rightsquigarrow$
\end{proof}
%TODO: rsimp sflat
And from this we obtain a proof that a star's derivative will be the same
as if it had all its nested alternatives created during deriving being flattened out:
For example,
\begin{lemma}
$\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
\end{lemma}
\begin{proof}
By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$.
\end{proof}
% The simplification of a flattened out regular expression, provided it comes
%from the derivative of a star, is the same as the one nested.
One might wonder the actual bound rather than the loose bound we gave
for the convenience of an easier proof.
How much can the regex $r^* \backslash s$ grow?
As earlier graphs have shown,
%TODO: reference that graph where size grows quickly
they can grow at a maximum speed
exponential $w.r.t$ the number of characters,
but will eventually level off when the string $s$ is long enough.
If they grow to a size exponential $w.r.t$ the original regex, our algorithm
would still be slow.
And unfortunately, we have concrete examples
where such regexes grew exponentially large before levelling off:
$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
(\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
size that is exponential on the number $n$
under our current simplification rules:
%TODO: graph of a regex whose size increases exponentially.
\begin{center}
\begin{tikzpicture}
\begin{axis}[
height=0.5\textwidth,
width=\textwidth,
xlabel=number of a's,
xtick={0,...,9},
ylabel=maximum size,
ymode=log,
log basis y={2}
]
\addplot[mark=*,blue] table {re-chengsong.data};
\end{axis}
\end{tikzpicture}
\end{center}
For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
(\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.
The exponential size is triggered by that the regex
$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
inside the $(\ldots) ^*$ having exponentially many
different derivatives, despite those difference being minor.
$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
will therefore contain the following terms (after flattening out all nested
alternatives):
\begin{center}
$(\oplus_{i = 1]{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
$(1 \leq m' \leq m )$
\end{center}
These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix).
With each new input character taking the derivative against the intermediate result, more and more such distinct
terms will accumulate,
until the length reaches $L.C.M.(1, \ldots, n)$.
$\textit{distinctBy}$ will not be able to de-duplicate any two of these terms
$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
where $m' \neq m''$ \\
as they are slightly different.
This means that with our current simplification methods,
we will not be able to control the derivative so that
$\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$
as there are already exponentially many terms.
These terms are similar in the sense that the head of those terms
are all consisted of sub-terms of the form:
$(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $.
For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
$n * (n + 1) / 2$ such terms.
For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives
can be described by 6 terms:
$a^*$, $a\cdot (aa)^*$, $ (aa)^*$,
$aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
The total number of different "head terms", $n * (n + 1) / 2$,
is proportional to the number of characters in the regex
$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
This suggests a slightly different notion of size, which we call the
alphabetic width:
%TODO:
(TODO: Alphabetic width def.)
Antimirov\parencite{Antimirov95} has proven that
$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$.
where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms
created by doing derivatives of $r$ against all possible strings.
If we can make sure that at any moment in our lexing algorithm our
intermediate result hold at most one copy of each of the
subterms then we can get the same bound as Antimirov's.
This leads to the algorithm in the next chapter.
%----------------------------------------------------------------------------------------
% SECTION 1
%----------------------------------------------------------------------------------------
%-----------------------------------
% SUBSECTION 1
%-----------------------------------
\subsection{Syntactic Equivalence Under $\simp$}
We prove that minor differences can be annhilated
by $\simp$.
For example,
\begin{center}
$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) =
\simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
\end{center}