% Chapter Template\chapter{Finiteness Bound} % Main chapter title\label{Finite} % In Chapter 4 \ref{Chapter4} we give the second guarantee%of our bitcoded algorithm, that is a finite bound on the size of any %regex's derivatives. In this chapter we give a guarantee in terms of size: given an annotated regular expression $a$, for any string $s$our algorithm $\blexersimp$'s internal annotated regular expression size is finitely boundedby a constant $N_a$ that only depends on $a$:\begin{center} $\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$\end{center}\noindentwhere the size of an annotated regular expression is definedin terms of the number of nodes in its tree structure:\begin{center} \begin{tabular}{ccc} $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$. \end{tabular}\end{center}We believe this size formalisation of the algorithm is important in our context, because \begin{itemize} \item It is a stepping stone towards an ``absence of catastrophic-backtracking'' guarantee. The next step would be to refine the bound $N_a$ so that it is polynomial on $\llbracket a\rrbracket$. \item The size bound proof gives us a higher confidence that our simplification algorithm $\simp$ does not ``mis-behave'' like $\simpsulz$ does. The bound is universal, which is an advantage over work which only gives empirical evidence on some test cases.\end{itemize}\section{Formalising About Size}\noindentIn our lexer $\blexersimp$,The regular expression is repeatedly being taken derivative ofand then simplified.\begin{figure}[H] \begin{tikzpicture}[scale=2, every node/.style={minimum size=11mm}, ->,>=stealth',shorten >=1pt,auto,thick ] \node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$}; \node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$}; \draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$}; \node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=6mm]{$a_{1s}$}; \draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp$}; \node (r2) [rectangle, draw=black, thick, right=of r1s, minimum size = 12mm]{$a_2$}; \draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$}; \node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=6mm]{$a_{2s}$}; \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp$}; \node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=6mm]{$a_{ns}$}; \draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$}; \node (v) [circle, thick, draw, right=of rns, minimum size=6mm, right=1.7cm]{$v$}; \draw[->, line width=0.2mm](rns)--(v) node[above, midway] {\bmkeps} node [below, midway] {\decode}; \end{tikzpicture} \caption{Regular expression size change during our $\blexersimp$ algorithm}\label{simpShrinks}\end{figure}\noindentEach timea derivative is taken, a regular expression might grow a bit,but simplification always takes care that it stays small.This intuition is depicted by the relative sizechange between the black and blue nodes:After $\simp$ the node always shrinks.Our proof says that all the blue nodesstay below a size bound $N_a$ determined by $a$.\noindentSulzmann and Lu's assumed something similar about their algorithm,though in fact their algorithm's size might be better depicted by the following graph:\begin{figure}[H] \begin{tikzpicture}[scale=2, every node/.style={minimum size=11mm}, ->,>=stealth',shorten >=1pt,auto,thick ] \node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$}; \node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$}; \draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$}; \node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$a_{1s}$}; \draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp'$}; \node (r2) [rectangle, draw=black, thick, right=of r1s, minimum size = 17mm]{$a_2$}; \draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$}; \node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=14mm]{$a_{2s}$}; \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp'$}; \node (r3) [rectangle, draw = black, thick, right= of r2s, minimum size = 22mm]{$a_3$}; \draw[->,line width=0.2mm](r2s)--(r3) node[above,midway] {$\backslash c_3$}; \node (rns) [right = of r3, draw=blue, minimum size = 20mm]{$a_{3s}$}; \draw[->,line width=0.2mm] (r3)--(rns) node [above, midway] {$\simp'$}; \node (rnn) [right = of rns, minimum size = 1mm]{}; \draw[->, dashed] (rns)--(rnn) node [above, midway] {$\ldots$}; \end{tikzpicture} \caption{Regular expression size change during our $\blexersimp$ algorithm}\label{sulzShrinks}\end{figure}\noindentThat is, on certain cases their lexer will have an indefinite size explosion, causing the running time of each derivative step to grow arbitrarily large (for example in \ref{SulzmannLuLexerTime}).The reason they made this mistake was thatthey tested out the run time of theirlexer on particular examples such as $(a+b+ab)^*$and generalised to all cases, whichcannot happen with our mecahnised proof.\\We give details of the proof in the next sections.\subsection{Overview of the Proof}Here is a bird's eye view of how the proof of finiteness works,which involves three steps:\begin{figure}[H] \begin{tikzpicture}[scale=1,font=\bf, node/.style={ rectangle,rounded corners=3mm, ultra thick,draw=black!50,minimum height=18mm, minimum width=20mm, top color=white,bottom color=black!20}] \node (0) at (-5,0) [node, text width=1.8cm, text centered] {$\llbracket \bderssimp{a}{s} \rrbracket$}; \node (A) at (0,0) [node,text width=1.6cm, text centered] {$\llbracket \rderssimp{r}{s} \rrbracket_r$}; \node (B) at (3,0) [node,text width=3.0cm, anchor=west, minimum width = 40mm] {$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$}; \node (C) at (9.5,0) [node, minimum width=10mm] {$N_r$}; \draw [->,line width=0.5mm] (0) -- node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$(r = a \downarrow_r)$} (A); \draw [->,line width=0.5mm] (A) -- node [above,pos=0.35] {$\quad =\ldots=$} (B); \draw [->,line width=0.5mm] (B) -- node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); \end{tikzpicture} %\caption{\end{figure}\noindentWe explain the steps one by one:\begin{itemize} \item We first introduce the operations such as derivatives, simplification, size calculation, etc. associated with $\rrexp$s, which we have given a very brief introduction to in chapter \ref{Bitcoded2}. The operations on $\rrexp$s are identical to those on annotated regular expressions except that they are unaware of bitcodes. This means that all proofs about size of $\rrexp$s will apply to annotated regular expressions. \item We prove that $\rderssimp{r}{s} = \textit{ClosedForm}(r, s)$, where $\textit{ClosedForm}(r, s)$ is entirely written in the derivatives of their children regular expressions. We call the right-hand-side the \emph{Closed Form} of the derivative $\rderssimp{r}{s}$. \item We estimate $\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$. The key observation is that $\distinctBy$'s output is a list with a constant length bound.\end{itemize}We will expand on these steps in the next sections.\\\section{The $\textit{Rrexp}$ Datatype and Its Lexing-Related Functions}The first step is to define $\textit{rrexp}$s.They are without bitcodes,allowing a much simpler size bound proof.Of course, the bits which encode the lexing information would grow linearly with respect to the input, which should be taken into account when we wish to tackle the runtime comlexity.But for the sake of the structural size we can safely ignore them.\\To recapitulate, the datatype definition of the $\rrexp$, called\emph{r-regular expressions},was initially defined in \ref{rrexpDef}.The reason for the prefix $r$ isto make a distinction with basic regular expressions.\[ \rrexp ::= \RZERO \mid \RONE \mid \RCHAR{c} \mid \RSEQ{r_1}{r_2} \mid \RALTS{rs} \mid \RSTAR{r} \]The size of an r-regular expression iswritten $\llbracket r\rrbracket_r$, whose definition mirrors that of an annotated regular expression. \begin{center} \begin{tabular}{ccc} $\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\ $\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\ $\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\ $\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\ $\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as + 1$\\ $\llbracket _{bs} a^* \rrbracket_r $ & $\dn$ & $\llbracket a \rrbracket_r + 1$. \end{tabular}\end{center}\noindentThe $r$ in the subscript of $\llbracket \rrbracket_r$ is to differentiate with the same operation for annotated regular expressions.Adding $r$ as subscript will be used in other operations as well.\\The transformation from an annotated regular expressionto an r-regular expression is straightforward.\begin{center} \begin{tabular}{lcl} $\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\ $\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\ $\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\ $\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\ $\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\ $\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a} ^*$ \end{tabular}\end{center}\noindent$\textit{Rerase}$ throws away the bitcodes on the annotated regular expressions, but keeps everything else intact.Therefore it does not change the sizeof an annotated regular expression:\begin{lemma}\label{rsizeAsize} $\rsize{\rerase a} = \asize a$\end{lemma}\begin{proof} By routine structural induction on $a$.\end{proof}\noindent\subsection{Motivation Behind a New Datatype}The reason we take all the trouble defining a new datatype is that $\erase$ makes things harder.We initially started by using plain regular expressions and tried to provethe lemma \ref{rsizeAsize},however the $\erase$ function unavoidbly messes with the structure of the annotated regular expression.The $+$ constructorof basic regular expressions is binary whereas $\sum$ takes a list, and one has to convert between them:\begin{center} \begin{tabular}{ccc} $\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\ $\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\ $\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$ \end{tabular}\end{center}\noindentAn alternative regular expression with an empty argument listwill be turned into a $\ZERO$.The singleton alternative $\sum [r]$ would have $r$ during the$\erase$ function.The annotated regular expression $\sum[a, b, c]$ would turn into$(a+(b+c))$.All these operations change the size and structure ofan annotated regular expressions, adding unnecessary complexities to the size bound proof.\\For example, if we define the size of a basic regular expression in the usual way,\begin{center} \begin{tabular}{ccc} $\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\ $\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\ $\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\ $\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\ $\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\ $\llbracket a^* \rrbracket_p $ & $\dn$ & $\llbracket a \rrbracket_p + 1$ \end{tabular}\end{center}\noindentThen the property\begin{center} $\llbracket a \rrbracket = \llbracket a_\downarrow \rrbracket_p$\end{center}does not hold.One might be able to prove an inequality such as$\llbracket a \rrbracket \leq \llbracket a_\downarrow \rrbracket_p $and then estimate $\llbracket a_\downarrow \rrbracket_p$,but we found our approach more straightforward.\\\subsection{Lexing Related Functions for $\rrexp$ such as $\backslash_r$, $\rdistincts$, and $\rsimp$}The operations on r-regular expressions are almost identical to those of the annotated regular expressions,except that no bitcodes are used. For example,the derivative operation becomes simpler:\\\begin{center} \begin{tabular}{@{}lcl@{}} $(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\ $(\ONE)\,\backslash_r c$ & $\dn$ & $\textit{if}\;c=d\; \;\textit{then}\; \ONE\;\textit{else}\;\ZERO$\\ $(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ & $\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\ $(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ & $\textit{if}\;(\textit{rnullable}\,r_1)$\\ & &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\ & &$\phantom{\textit{then},\;\sum\,}((r_2\,\backslash_r c))]$\\ & &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\ $(r^*)\,\backslash_r c$ & $\dn$ & $( r\,\backslash_r c)\cdot (_{[]}r^*))$ \end{tabular} \end{center} \noindentSimilarly, $\distinctBy$ does not need a function checking equivalence becausethere are no bit annotations causing superficial differencesbetween syntactically equal terms.\begin{center} \begin{tabular}{lcl} $\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\ $\rdistinct{r :: rs}{rset}$ & $\dn$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\ & & $\textit{else}\; \; r::\rdistinct{rs}{(rset \cup \{r\})}$ \end{tabular}\end{center}%TODO: definition of rsimp (maybe only the alternative clause)\noindentNotice there is a difference between our $\rdistincts$ andthe Isabelle $\textit {distinct}$ function.In Isabelle $\textit{distinct}$ is a predicatethat tests if all the elements of a list are unique.\\With $\textit{rdistinct}$ one can chain together all the other modulesof $\bsimp{\_}$ (removing the functionalities related to bit-sequences)and get $\textit{rsimp}$ and $\rderssimp{\_}{\_}$.We omit these functions, as they are routine. Please refer to the formalisation(in file BasicIdentities.thy) for the exact definition.With $\rrexp$ the size caclulation of annotated regular expressions'simplification and derivatives can be done by the size of their unlifted counterpart with the unlifted version of simplification and derivatives applied.\begin{lemma}\label{sizeRelations} The following equalities hold: \begin{itemize} \item $\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$ \item $\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$ \end{itemize}\end{lemma}\noindentIn the following content, we will focus on $\rrexp$'s size bound.We will piece together this bound and show the same bound for annotated regular expressions in the end.Unless stated otherwise in this chapter all $\textit{rexp}$s withoutbitcodes are seen as $\rrexp$s.We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageablyas the former suits people's intuitive way of stating a binary alternativeregular expression.%-----------------------------------% SECTION ?%-----------------------------------\subsection{Finiteness Proof Using $\rrexp$s}Now that we have defined the $\rrexp$ datatype, and proven that its size changesw.r.t derivatives and simplifications mirrors precisely those of annotated regular expressions,we aim to bound the size of $r \backslash s$ for any $\rrexp$ $r$.Once we have a bound like: \[ \llbracket r \backslash_{rsimp} s \rrbracket_r \leq N_r\]\noindentwe could easily extend that to \[ \llbracket a \backslash_{bsimps} s \rrbracket \leq N_r.\]\subsection{Roadmap to a Bound for $\textit{Rrexp}$}The way we obtain the bound for $\rrexp$s is by two steps:\begin{itemize} \item First, we rewrite $r\backslash s$ into something else that is easier to bound. This step is especially important for the inductive case $r_1 \cdot r_2$ and $r^*$, where the derivative can grow and bloat in a wild way, but after simplification they will always be equal or smaller to a form consisting of an alternative list of regular expressions $f \; (g\; (\sum rs))$ with some functions applied to it, where each element will be distinct after the function application. \item Then, for such a sum list of regular expressions $f\; (g\; (\sum rs))$, we can control its size by estimation, since $\distinctBy$ and $\flts$ are well-behaved and working together would only reduce the size of a regular expression, not adding to it.\end{itemize}\section{Step One: Closed Forms}We transform the function application $\rderssimp{r}{s}$into an equivalent form $f\; (g \; (\sum rs))$.The functions $f$ and $g$ can be anything from $\flts$, $\distinctBy$ and other helper functions from $\bsimp{\_}$.This way we get a different but equivalent way of expressing : $r\backslash s = f \; (g\; (\sum rs))$, we call theright hand side the "closed form" of $r\backslash s$.\begin{quote}\it Claim: For regular expressions $r_1 \cdot r_2$, we claim that \begin{center} $ \rderssimp{r_1 \cdot r_2}{s} = \rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$ \end{center}\end{quote}\noindentWe explain in detail how we reached those claims.\subsection{Basic Properties needed for Closed Forms}\subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully}The $\textit{rdistinct}$ function, as its name suggests, willremove duplicates in an \emph{r}$\textit{rexp}$ list, according to the accumulatorand leave only one of each different element in a list:\begin{lemma}\label{rdistinctDoesTheJob} The function $\textit{rdistinct}$ satisfies the following properties: \begin{itemize} \item If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$. \item If list $rs'$ is the result of $\rdistinct{rs}{acc}$, then $\textit{isDistinct} \; rs'$. \item $\rdistinct{rs}{acc} = rs - acc$ \end{itemize}\end{lemma}\noindentThe predicate $\textit{isDistinct}$ is for testingwhether a list's elements are all unique. It is definedrecursively on the structure of a regular expression,and we omit the precise definition here.\begin{proof} The first part is by an induction on $rs$. The second and third part can be proven by using the induction rules of $\rdistinct{\_}{\_}$.\end{proof}\noindent$\rdistinct{\_}{\_}$ will cancel out all regular expression termsthat are in the accumulator, therefore prepending a list $rs_a$ with an arbitrarylist $rs$ whose elements are all from the accumulator, and then call $\rdistinct{\_}{\_}$on the resulting list, the output will be as if we had called $\rdistinct{\_}{\_}$without the prepending of $rs$:\begin{lemma} The elements appearing in the accumulator will always be removed. More precisely, \begin{itemize} \item If $rs \subseteq rset$, then $\rdistinct{rs@rsa }{acc} = \rdistinct{rsa }{acc}$. \item Furthermore, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$, then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{rs'}{rset}$ \end{itemize}\end{lemma}\begin{proof} By induction on $rs$.\end{proof}\noindentOn the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated,then expanding the accumulator to include that element will not cause the output list to change:\begin{lemma} The accumulator can be augmented to include elements not appearing in the input list, and the output will not change. \begin{itemize} \item If $r \notin rs$, then $\rdistinct{rs}{acc} = \rdistinct{rs}{\{r\} \cup acc}$. \item Particularly, when $acc = \varnothing$ and $rs$ de-duplicated, we have\\ \[ \rdistinct{rs}{\varnothing} = rs \] \end{itemize}\end{lemma}\begin{proof} The first half is by induction on $rs$. The second half is a corollary of the first.\end{proof}\noindentThe next property gives the condition forwhen $\rdistinct{\_}{\_}$ becomes an identical mappingfor any prefix of an input list, in other words, when can we ``push out" the arguments of $\rdistinct{\_}{\_}$:\begin{lemma}\label{distinctRdistinctAppend} If $\textit{isDistinct} \; rs_1$, and $rs_1 \cap acc = \varnothing$, then \[\textit{rdistinct}\; (rs_1 @ rsa)\;\, acc = rs_1@(\textit{rdistinct} rsa \; (acc \cup rs_1))\]\end{lemma}\noindentIn other words, it can be taken out and left untouched in the output.\begin{proof} By an induction on $rs_1$, where $rsa$ and $acc$ are allowed to be arbitrary.\end{proof}\noindent$\rdistinct{}{}$ removes any element in anywhere of a list, if ithad appeared previously:\begin{lemma}\label{distinctRemovesMiddle} The two properties hold if $r \in rs$: \begin{itemize} \item $\rdistinct{rs}{rset} = \rdistinct{(rs @ [r])}{rset}$\\ and\\ $\rdistinct{(ab :: rs @ [ab])}{rset'} = \rdistinct{(ab :: rs)}{rset'}$ \item $\rdistinct{ (rs @ rs') }{rset} = \rdistinct{rs @ [r] @ rs'}{rset}$\\ and\\ $\rdistinct{(ab :: rs @ [ab] @ rs'')}{rset'} = \rdistinct{(ab :: rs @ rs'')}{rset'}$ \end{itemize}\end{lemma}\noindent\begin{proof} By induction on $rs$. All other variables are allowed to be arbitrary. The second half of the lemma requires the first half. Note that for each half's two sub-propositions need to be proven concurrently, so that the induction goes through.\end{proof}\noindentThis allows us to prove ``Idempotency" of $\rdistinct{}{}$ of some kind:\begin{lemma}\label{rdistinctConcatGeneral} The following equalities involving multiple applications of $\rdistinct{}{}$ hold: \begin{itemize} \item $\rdistinct{(rs @ rs')}{\varnothing} = \rdistinct{((\rdistinct{rs}{\varnothing})@ rs')}{\varnothing}$ \item $\rdistinct{(rs @ rs')}{\varnothing} = \rdistinct{(\rdistinct{rs}{\varnothing} @ rs')}{\varnothing}$ \item If $rset' \subseteq rset$, then $\rdistinct{rs}{rset} = \rdistinct{(\rdistinct{rs}{rset'})}{rset}$. As a corollary of this, \item $\rdistinct{(rs @ rs')}{rset} = \rdistinct{ (\rdistinct{rs}{\varnothing}) @ rs')}{rset}$. This gives another corollary use later: \item If $a \in rset$, then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{ (\rdistinct{(a :: rs)}{\varnothing} @ rs')}{rset} $, \end{itemize}\end{lemma}\begin{proof} By \ref{rdistinctDoesTheJob} and \ref{distinctRemovesMiddle}.\end{proof}\subsubsection{The Properties of $\backslash_r$, $\backslash_{rsimp}$, $\textit{Rflts}$ and $\textit{Rsimp}_{ALTS}$} We give in this subsection some properties of how $\backslash_r$, $\backslash_{rsimp}$, $\textit{Rflts}$ and $\textit{Rsimp}_{ALTS} $ interact with each other and with $@$, the concatenation operator.These will be helpful in later closed form proofs, whenwe want to transform the ways in which multiple functions involvingthose are composed togetherin interleaving derivative and simplification steps.When the function $\textit{Rflts}$ is applied to the concatenation of two lists, the output can be calculated by first applying thefunctions on two lists separately, and then concatenating them together.\begin{lemma}\label{rfltsProps} The function $\rflts$ has the below properties:\\ \begin{itemize} \item $\rflts \; (rs_1 @ rs_2) = \rflts \; rs_1 @ \rflts \; rs_2$ \item If $r \neq \RZERO$ and $\nexists rs_1. r = \RALTS{rs}_1$, then $\rflts \; (r::rs) = r :: \rflts \; rs$ \item $\rflts \; (rs @ [\RZERO]) = \rflts \; rs$ \item $\rflts \; (rs' @ [\RALTS{rs}]) = \rflts \; rs'@rs$ \item $\rflts \; (rs @ [\RONE]) = \rflts \; rs @ [\RONE]$ \item If $r \neq \RZERO$ and $\nexists rs'. r = \RALTS{rs'}$ then $\rflts \; (rs @ [r]) = (\rflts \; rs) @ [r]$ \item If $r = \RALTS{rs}$ and $r \in rs'$ then for all $r_1 \in rs. r_1 \in \rflts \; rs'$. \item $\rflts \; (rs_a @ \RZERO :: rs_b) = \rflts \; (rs_a @ rs_b)$ \end{itemize}\end{lemma}\noindent\begin{proof} By induction on $rs_1$ in the first sub-lemma, and induction on $r$ in the second part, and induction on $rs$, $rs'$, $rs$, $rs'$, $rs_a$ in the third, fourth, fifth, sixth and last sub-lemma.\end{proof}\subsubsection{The $RL$ Function: Language Interpretation of $\textit{Rrexp}$s}Much like the definition of $L$ on plain regular expressions, one could also define the language interpretation of $\rrexp$s.\begin{center} \begin{tabular}{lcl} $RL \; (\ZERO)$ & $\dn$ & $\phi$\\ $RL \; (\ONE)$ & $\dn$ & $\{[]\}$\\ $RL \; (c)$ & $\dn$ & $\{[c]\}$\\ $RL \; \sum rs$ & $\dn$ & $ \bigcup_{r \in rs} (RL \; r)$\\ $RL \; (r_1 \cdot r_2)$ & $\dn$ & $ RL \; (r_1) @ RL \; (r_2)$\\ $RL \; (r^*)$ & $\dn$ & $ (RL(r))^*$ \end{tabular}\end{center}\noindentThe main use of $RL$ is to establish some connections between $\rsimp{}$ and $\rnullable{}$:\begin{lemma} The following properties hold: \begin{itemize} \item If $\rnullable{r}$, then $\rsimp{r} \neq \RZERO$. \item $\rnullable{r \backslash s} \quad $ if and only if $\quad \rnullable{\rderssimp{r}{s}}$. \end{itemize}\end{lemma}\begin{proof} The first part is by induction on $r$. The second part is true because property \[ RL \; r = RL \; (\rsimp{r})\] holds.\end{proof}\subsubsection{$\rsimp{}$ is Non-Increasing}In this subsection, we prove that the function $\rsimp{}$ does not make the $\llbracket \rrbracket_r$ size increase.\begin{lemma}\label{rsimpSize} $\llbracket \rsimp{r} \rrbracket_r \leq \llbracket r \rrbracket_r$\end{lemma}\subsubsection{Simplified $\textit{Rrexp}$s are Good}We formalise the notion of ``good" regular expressions,which means regular expressions thatare not fully simplified. For alternative regular expressions that means theydo not contain any nested alternatives like \[ r_1 + (r_2 + r_3) \], un-removed $\RZERO$s like \[\RZERO + r\]or duplicate elements in a children regular expression list like \[ \sum [r, r, \ldots]\]:\begin{center} \begin{tabular}{@{}lcl@{}} $\good\; \RZERO$ & $\dn$ & $\textit{false}$\\ $\good\; \RONE$ & $\dn$ & $\textit{true}$\\ $\good\; \RCHAR{c}$ & $\dn$ & $\btrue$\\ $\good\; \RALTS{[]}$ & $\dn$ & $\bfalse$\\ $\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\ $\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ & $\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\ & & $\textit{and}\; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \, \textit{and}\; \, \textit{nonAlt}\; r')$\\ $\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\ $\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\ $\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\ $\good \; \RSEQ{r_1}{r_2}$ & $\dn$ & $\good \; r_1 \;\, \textit{and} \;\, \good \; r_2$\\ $\good \; \RSTAR{r}$ & $\dn$ & $\btrue$\\ \end{tabular}\end{center}\noindentThe predicate $\textit{nonAlt}$ evaluates to true when the regular expression is not analternative, and false otherwise.The $\good$ property is preserved under $\rsimp_{ALTS}$, provided thatits non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$, and unique:\begin{lemma}\label{rsimpaltsGood} If $rs \neq []$ and forall $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$, then $\good \; (\rsimpalts \; rs)$ if and only if forall $r \in rs. \; \good \; r$.\end{lemma}\noindentWe also note thatif a regular expression $r$ is good, then $\rflts$ on the singletonlist $[r]$ will not break goodness:\begin{lemma}\label{flts2} If $\good \; r$, then forall $r' \in \rflts \; [r]. \; \good \; r'$ and $\textit{nonAlt} \; r'$.\end{lemma}\begin{proof} By an induction on $r$.\end{proof}\noindentThe other observation we make about $\rsimp{r}$ is that it nevercomes with nested alternatives, which we describe as the $\nonnested$property:\begin{center} \begin{tabular}{lcl} $\nonnested \; \, \sum []$ & $\dn$ & $\btrue$\\ $\nonnested \; \, \sum ((\sum rs_1) :: rs_2)$ & $\dn$ & $\bfalse$\\ $\nonnested \; \, \sum (r :: rs)$ & $\dn$ & $\nonnested (\sum rs)$\\ $\nonnested \; \, r $ & $\dn$ & $\btrue$ \end{tabular}\end{center}\noindentThe $\rflts$ functionalways opens up nested alternatives,which enables $\rsimp$ to be non-nested:\begin{lemma}\label{nonnestedRsimp} $\nonnested \; (\rsimp{r})$\end{lemma}\begin{proof} By an induction on $r$.\end{proof}\noindentWith this we could prove that a regular expressionsafter simplification and flattening and de-duplication,will not contain any alternative regular expression directly:\begin{lemma}\label{nonaltFltsRd} If $x \in \rdistinct{\rflts\; (\map \; \rsimp{} \; rs)}{\varnothing}$ then $\textit{nonAlt} \; x$.\end{lemma}\begin{proof} By \ref{nonnestedRsimp}.\end{proof}\noindentThe other thing we know is that once $\rsimp{}$ had finishedprocessing an alternative regular expression, it will notcontain any $\RZERO$s, this is because all the recursive calls to the simplification on the children regular expressionsmake the children good, and $\rflts$ will not take outany $\RZERO$s out of a good regular expression list,and $\rdistinct{}$ will not mess with the result.\begin{lemma}\label{flts3Obv} The following are true: \begin{itemize} \item If for all $r \in rs. \, \good \; r $ or $r = \RZERO$, then for all $r \in \rflts\; rs. \, \good \; r$. \item If $x \in \rdistinct{\rflts\; (\map \; rsimp{}\; rs)}{\varnothing}$ and for all $y$ such that $\llbracket y \rrbracket_r$ less than $\llbracket rs \rrbracket_r + 1$, either $\good \; (\rsimp{y})$ or $\rsimp{y} = \RZERO$, then $\good \; x$. \end{itemize}\end{lemma}\begin{proof} The first part is by induction on $rs$, where the induction rule is the inductive cases for $\rflts$. The second part is a corollary from the first part.\end{proof}And this leads to good structural property of $\rsimp{}$,that after simplification, a regular expression iseither good or $\RZERO$:\begin{lemma}\label{good1} For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$.\end{lemma}\begin{proof} By an induction on $r$. The inductive measure is the size $\llbracket \rrbracket_r$. Lemma \ref{rsimpSize} says that $\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to $\llbracket r \rrbracket_r$. Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case, Inductive hypothesis applies to the children regular expressions $r_1$, $r_2$, etc. The lemma \ref{flts3Obv}'s precondition is satisfied by that as well. The lemmas \ref{nonnestedRsimp} and \ref{nonaltFltsRd} are used to ensure that goodness is preserved at the topmost level.\end{proof}We shall prove that any good regular expression is a fixed-point for $\rsimp{}$.First we prove an auxiliary lemma:\begin{lemma}\label{goodaltsNonalt} If $\good \; \sum rs$, then $\rflts\; rs = rs$.\end{lemma}\begin{proof} By an induction on $\sum rs$. The inductive rules are the cases for $\good$.\end{proof}\noindentNow we are ready to prove that good regular expressions are invariantof $\rsimp{}$ application:\begin{lemma}\label{test} If $\good \;r$ then $\rsimp{r} = r$.\end{lemma}\begin{proof} By an induction on the inductive cases of $\good$. The lemma \ref{goodaltsNonalt} is used in the alternative case where 2 or more elements are present in the list.\end{proof}\noindentGiven below is a property involving $\rflts$, $\rdistinct{}{}$, $\rsimp{}$ and $\rsimp_{ALTS}$,which requires $\ref{good1}$ to go through smoothly.It says that an application of $\rsimp_{ALTS}$ can be "absorbed",if it its output is concatenated with a list and then applied to $\rflts$.\begin{lemma}\label{flattenRsimpalts} $\rflts \; ( (\rsimp_{ALTS} \; (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing})) :: \map \; \rsimp{} \; rs' ) = \rflts \; ( (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing}) @ ( \map \; \rsimp{rs'}))$\end{lemma}\begin{proof} By \ref{good1}.\end{proof}\noindentWe are also \subsubsection{$\rsimp$ is Idempotent}The idempotency of $\rsimp$ is very useful in manipulating regular expression terms into desiredforms so that key steps allowing further rewriting to closed formsare possible.\begin{lemma}\label{rsimpIdem} $\rsimp{r} = \rsimp{\rsimp{r}}$\end{lemma}\begin{proof} By \ref{test} and \ref{good1}.\end{proof}\noindentThis property means we do not have to repeatedlyapply simplification in each step, which justifiesour definition of $\blexersimp$.On the other hand, we could repeat the same $\rsimp{}$ applicationson regular expressions as many times as we want, if we have at leastone simplification applied to it, and apply it wherever we would like to:\begin{corollary}\label{headOneMoreSimp} The following properties hold, directly from \ref{rsimpIdem}: \begin{itemize} \item $\map \; \rsimp{(r :: rs)} = \map \; \rsimp{} \; (\rsimp{r} :: rs)$ \item $\rsimp{(\RALTS{rs})} = \rsimp{(\RALTS{\map \; \rsimp{} \; rs})}$ \end{itemize}\end{corollary}\noindentThis will be useful in later closed form proof's rewriting steps.Similarly, we point out the following useful facts below:\begin{lemma} The following equalities hold if $r = \rsimp{r'}$ for some $r'$: \begin{itemize} \item If $r = \sum rs$ then $\rsimpalts \; rs = \sum rs$. \item If $r = \sum rs$ then $\rdistinct{rs}{\varnothing} = rs$. \item $\rsimpalts \; (\rdistinct{\rflts \; [r]}{\varnothing}) = r$. \end{itemize}\end{lemma}\begin{proof} By application of \ref{rsimpIdem} and \ref{good1}.\end{proof}\noindentWith the idempotency of $\rsimp{}$ and its corollaries, we can start proving some key equalities leading to the closed forms.Now presented are a few equivalent terms under $\rsimp{}$.We use $r_1 \sequal r_2 $ here to denote $\rsimp{r_1} = \rsimp{r_2}$.\begin{lemma} \begin{itemize} The following equivalence hold: \item $\rsimpalts \; (\RZERO :: rs) \sequal \rsimpalts\; rs$ \item $\rsimpalts \; rs \sequal \rsimpalts (\map \; \rsimp{} \; rs)$ \item $\RALTS{\RALTS{rs}} \sequal \RALTS{rs}$ \item $\sum ((\sum rs_a) :: rs_b) \sequal \sum rs_a @ rs_b$ \item $\RALTS{rs} = \RALTS{\map \; \rsimp{} \; rs}$\end{itemize}\end{lemma}\begin{proof} By induction on the lists involved.\end{proof}\noindentSimilarly,we introduce the equality for $\sum$ when certain child regular expressionsare $\sum$ themselves:\begin{lemma}\label{simpFlatten3} One can flatten the inside $\sum$ of a $\sum$ if it is being simplified. Concretely, \begin{itemize} \item If for all $r \in rs, rs', rs''$, we have $\good \; r $ or $r = \RZERO$, then $\sum (rs' @ rs @ rs'') \sequal \sum (rs' @ [\sum rs] @ rs'')$ holds. As a corollary, \item $\sum (rs' @ [\sum rs] @ rs'') \sequal \sum (rs' @ rs @ rs'')$ \end{itemize}\end{lemma}\begin{proof} By rewriting steps involving the use of \ref{test} and \ref{rdistinctConcatGeneral}. The second sub-lemma is a corollary of the previous.\end{proof}%Rewriting steps not put in--too long and complicated-------------------------------\begin{comment} \begin{center} $\rsimp{\sum (rs' @ rs @ rs'')} \stackrel{def of bsimp}{=}$ \\ $\rsimpalts \; (\rdistinct{\rflts \; ((\map \; \rsimp{}\; rs') @ (\map \; \rsimp{} \; rs ) @ (\map \; \rsimp{} \; rs''))}{\varnothing})$ \\ $\stackrel{by \ref{test}}{=} \rsimpalts \; (\rdistinct{(\rflts \; rs' @ \rflts \; rs @ \rflts \; rs'')}{ \varnothing})$\\ $\stackrel{by \ref{rdistinctConcatGeneral}}{=} \rsimpalts \; (\rdistinct{\rflts \; rs'}{\varnothing} @ \rdistinct{( \rflts\; rs @ \rflts \; rs'')}{\rflts \; rs'})$\\ \end{center}\end{comment}%Rewriting steps not put in--too long and complicated-------------------------------\noindentWe need more equalities like the above to enable a closed form,but to proceed we need to introduce two rewrite relations,to make things smoother.\subsubsection{The rewrite relation $\hrewrite$ , $\scfrewrites$ , $\frewrite$ and $\grewrite$}Insired by the success we had in the correctness proof in \ref{Bitcoded2}, where we inventeda term rewriting system to capture the similarity between terms,we follow suit here defining simplificationsteps as rewriting steps. This allows capturing similarities between terms that would be otherwisehard to express.We use $\hrewrite$ for one-step atomic rewrite of regular expression simplification, $\frewrite$ for rewrite of list of regular expressions that include all operations carried out in $\rflts$, and $\grewrite$ forrewriting a list of regular expressions possible in both $\rflts$ and $\rdistinct{}{}$.Their reflexive transitive closures are used to denote zero or many steps,as was the case in the previous chapter.The presentation will be more concise than that in \ref{Bitcoded2}.To differentiate between the rewriting steps for annotated regular expressionsand $\rrexp$s, we add characters $h$ and $g$ below the squig arrow symbolto mean atomic simplification transitions of $\rrexp$s and $\rrexp$ lists, respectively.List of one-step rewrite rules for $\rrexp$ ($\hrewrite$):\begin{center} \begin{mathpar} \inferrule[RSEQ0L]{}{\RZERO \cdot r_2 \hrewrite \RZERO\\} \inferrule[RSEQ0R]{}{r_1 \cdot \RZERO \hrewrite \RZERO\\} \inferrule[RSEQ1]{}{(\RONE \cdot r) \hrewrite r\\}\\ \inferrule[RSEQL]{ r_1 \hrewrite r_2}{r_1 \cdot r_3 \hrewrite r_2 \cdot r_3\\} \inferrule[RSEQR]{ r_3 \hrewrite r_4}{r_1 \cdot r_3 \hrewrite r_1 \cdot r_4\\}\\ \inferrule[RALTSChild]{r \hrewrite r'}{\sum (rs_1 @ [r] @ rs_2) \hrewrite \sum (rs_1 @ [r'] @ rs_2)\\} \inferrule[RALTS0]{}{\sum (rs_a @ [\RZERO] @ rs_b) \hrewrite \sum (rs_a @ rs_b)} \inferrule[RALTSNested]{}{\sum (rs_a @ [\sum rs_1] @ rs_b) \hrewrite \sum (rs_a @ rs_1 @ rs_b)} \inferrule[RALTSNil]{}{ \sum [] \hrewrite \RZERO\\} \inferrule[RALTSSingle]{}{ \sum [r] \hrewrite r\\} \inferrule[RALTSDelete]{\\ r_1 = r_2}{\sum rs_a @ [r_1] @ rs_b @ [r_2] @ rsc \hrewrite \sum rs_a @ [r_1] @ rs_b @ rs_c} \end{mathpar}\end{center}List of rewrite rules for a list of regular expressions,where each element can rewrite in many steps to the other (scf stands forli\emph{s}t \emph{c}losed \emph{f}orm). This relation is similar to the $\stackrel{s*}{\rightsquigarrow}$ for annotated regular expressions.\begin{center} \begin{mathpar} \inferrule{}{[] \scfrewrites [] } \inferrule{r \hrewrites r' \\ rs \scfrewrites rs'}{r :: rs \scfrewrites r' :: rs'} \end{mathpar}\end{center}%frewriteList of one-step rewrite rules for flattening a list of regular expressions($\frewrite$):\begin{center} \begin{mathpar} \inferrule{}{\RZERO :: rs \frewrite rs \\} \inferrule{}{(\sum rs) :: rs_a \frewrite rs @ rs_a \\} \inferrule{rs_1 \frewrite rs_2}{r :: rs_1 \frewrite r :: rs_2} \end{mathpar}\end{center}Lists of one-step rewrite rules for flattening and de-duplicatinga list of regular expressions ($\grewrite$):\begin{center} \begin{mathpar} \inferrule{}{\RZERO :: rs \grewrite rs \\} \inferrule{}{(\sum rs) :: rs_a \grewrite rs @ rs_a \\} \inferrule{rs_1 \grewrite rs_2}{r :: rs_1 \grewrite r :: rs_2} \inferrule[dB]{}{rs_a @ [a] @ rs_b @[a] @ rs_c \grewrite rs_a @ [a] @ rsb @ rsc} \end{mathpar}\end{center}\noindentThe reason why we take the trouble of defining two separate list rewriting definitions $\frewrite$ and $\grewrite$is to separate the two stages of simplification: flattening and de-duplicating.Sometimes $\grewrites$ is slightly too powerfulso we would rather use $\frewrites$ which makes certain rewriting steps more straightforward to prove.For example, when proving the closed-form for the alternative regular expression,one of the rewriting steps would be:\begin{lemma} $\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal \sum (\rDistinct \;\; (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing) $\end{lemma}\noindentProving this is by first showing \begin{lemma}\label{earlyLaterDerFrewrites} $\map \; (\_ \backslash x) \; (\rflts \; rs) \frewrites \rflts \; (\map \; (\_ \backslash x) \; rs)$\end{lemma}\noindentand then using lemma\begin{lemma}\label{frewritesSimpeq} If $rs_1 \frewrites rs_2 $, then $\sum (\rDistinct \; rs_1 \; \varnothing) \sequal \sum (\rDistinct \; rs_2 \; \varnothing)$.\end{lemma}\noindentis a piece of cake.But this trick will not work for $\grewrites$.For example, a rewriting step in provingclosed forms is:\begin{center} $\rsimp{(\rsimpalts \; (\map \; (\_ \backslash x) \; (\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs}))))}{\varnothing})))}$\\ $=$ \\ $\rsimp{(\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; (\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}))} $ \noindent\end{center}For this one would hope to have a rewriting relation between the two lists involved,similar to \ref{earlyLaterDerFrewrites}. However, it turns out that \begin{center} $\map \; (\_ \backslash x) \; (\rDistinct \; rs \; rset) \grewrites \rDistinct \; (\map \; (\_ \backslash x) \; rs) \; ( rset \backslash x)$\end{center}\noindentdoes $\mathbf{not}$ hold in general.For this rewriting step we will introduce some slightly more cumbersomeproof technique in later sections.The point is that $\frewrite$allows us to prove equivalence in a straightforward two-step method that is not possible for $\grewrite$, thereby reducing the complexity of the entire proof.\subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$}We present in the below lemma a few pairs of terms that are rewritable via $\grewrites$:\begin{lemma}\label{gstarRdistinctGeneral} \begin{itemize} \item $rs_1 @ rs \grewrites rs_1 @ (\rDistinct \; rs \; rs_1)$ \item $rs \grewrites \rDistinct \; rs \; \varnothing$ \item $rs_a @ (\rDistinct \; rs \; rs_a) \grewrites rs_a @ (\rDistinct \; rs \; (\{\RZERO\} \cup rs_a))$ \item $rs \;\; @ \;\; \rDistinct \; rs_a \; rset \grewrites rs @ \rDistinct \; rs_a \; (rest \cup rs)$ \end{itemize}\end{lemma}\noindentIf a pair of terms $rs_1, rs_2$ are rewritable via $\grewrites$ to each other,then they are equivalent under $\rsimp{}$:\begin{lemma}\label{grewritesSimpalts} If $rs_1 \grewrites rs_2$, then we have the following equivalence hold: \begin{itemize} \item $\sum rs_1 \sequal \sum rs_2$ \item $\rsimpalts \; rs_1 \sequal \rsimpalts \; rs_2$ \end{itemize}\end{lemma}\noindentHere are a few connecting lemmas showing thatif a list of regular expressions can be rewritten using $\grewrites$ or $\frewrites $ or$\scfrewrites$,then an alternative constructor taking the list can also be rewritten using $\hrewrites$:\begin{lemma} \begin{itemize} \item If $rs \grewrites rs'$ then $\sum rs \hrewrites \sum rs'$. \item If $rs \grewrites rs'$ then $\sum rs \hrewrites \rsimpalts \; rs'$ \item If $rs_1 \scfrewrites rs_2$ then $\sum (rs @ rs_1) \hrewrites \sum (rs @ rs_2)$ \item If $rs_1 \scfrewrites rs_2$ then $\sum rs_1 \hrewrites \sum rs_2$ \end{itemize}\end{lemma}\noindentHere comes the meat of the proof, which says that once two lists are rewritable to each other,then they are equivalent under $\rsimp{}$:\begin{lemma} If $r_1 \hrewrites r_2$ then $r_1 \sequal r_2$.\end{lemma}\noindentAnd similar to \ref{Bitcoded2} one can preserve rewritability after taking derivativeof two regular expressions on both sides:\begin{lemma}\label{interleave} If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$\end{lemma}\noindentThis allows proving more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$ now.\begin{lemma}\label{insideSimpRemoval} $\rsimp{\rder{c}{\rsimp{r}}} = \rsimp{\rder{c}{r}} $\end{lemma}\noindent\begin{proof} By \ref{interleave} and \ref{rsimpIdem}.\end{proof}\noindentAnd this unlocks more equivalent terms:\begin{lemma}\label{Simpders} As corollaries of \ref{insideSimpRemoval}, we have \begin{itemize} \item If $s \neq []$ then $\rderssimp{r}{s} = \rsimp{(\rders \; r \; s)}$. \item $\rsimpalts \; (\map \; (\_ \backslash_r x) \; (\rdistinct{rs}{\varnothing})) \sequal \rsimpalts \; (\rDistinct \; (\map \; (\_ \backslash_r x) rs) \;\varnothing )$ \end{itemize}\end{lemma}\noindentFinally,together with \begin{lemma}\label{rderRsimpAltsCommute} $\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$\end{lemma}\noindentthis leads to the first closed form--\begin{lemma}\label{altsClosedForm} \begin{center} $\rderssimp{(\sum rs)}{s} \sequal \sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$ \end{center}\end{lemma}\noindent\begin{proof} By a reverse induction on the string $s$. One rewriting step, as we mentioned earlier, involves \begin{center} $\rsimpalts \; (\map \; (\_ \backslash x) \; (\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs}))))}{\varnothing})) \sequal \rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; (\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}) $. \end{center} This can be proven by a combination of \ref{grewritesSimpalts}, \ref{gstarRdistinctGeneral}, \ref{rderRsimpAltsCommute}, and \ref{insideSimpRemoval}.\end{proof}\noindentThis closed form has a variant which can be more convenient in later proofs:\begin{corollary}{altsClosedForm1} If $s \neq []$ then $\rderssimp \; (\sum \; rs) \; s = \rsimp{(\sum \; (\map \; \rderssimp{\_}{s} \; rs))}$.\end{corollary}\noindentThe harder closed forms are the sequence and star ones.Before we go on to obtain them, some preliminary definitionsare needed to make proof statements concise.\section{"Closed Forms" of Sequence Regular Expressions}The problem of obataining a closed-form for sequence regular expression is constructing $(r_1 \cdot r_2) \backslash_r s$if we are only allowed to use a combination of $r_1 \backslash s''$ and $r_2 \backslash s''$ , where $s''$ is from $s$.First let's look at a series of derivatives steps on a sequence regular expression, assuming that each time the firstcomponent of the sequence is always nullable):\begin{center} $r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\ $((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad \ldots$\end{center}Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as a giant alternative taking a list of terms $[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$,where the head of the list is always the termrepresenting a match involving only $r_1$, and the tail of the list consisting ofterms of the shape $r_2 \backslash_r s''$, $s''$ being a suffix of $s$.This intuition is also echoed by IndianPaper, where they gavea pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$:\begin{center} \begin{tabular}{c} $(r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) \myequiv$\\ \rule{0pt}{3ex} $((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\rnullable \; r_1) \; r_2 \backslash_r c_1)) \backslash_r (c_2 :: \ldots c_n) \myequiv$\\ \rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable \; r_1) \; r_2 \backslash_r c_1c_2)) + (\delta \ (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)) \backslash_r (c_3 \ldots c_n) $ \end{tabular}\end{center}\noindentThe equality in above should be interpretatedas language equivalence. The $\delta$ function works similarly to that ofa Kronecker delta function:\[ \delta \; b\; r\]will produce $r$if $b$ evaluates to true, and $\RZERO$ otherwise.Note that their formulation \[ ((r_1 \backslash_r \, c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2) + (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)\]does not faithfullyrepresent what the intermediate derivatives would actually look likewhen one or more intermediate results $r_1 \backslash s' \cdot r_2$ are not nullable in the head of the sequence.For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable,the regular expression would not look like \[ (r_1 \backslash_r c_1c_2 + \RZERO ) + \RZERO,\]but actually $r_1 \backslash_r c_1c_2$, the redundant $\RZERO$s will not be created in thefirst place.In a closed-form one would want to take into account this and generate the list ofregular expressions $r_2 \backslash_r s''$ withstring pairs $(s', s'')$ where $s'@s'' = s$ and$r_1 \backslash s'$ nullable.We denote the list consisting of such strings $s''$ as $\vsuf{s}{r_1}$.The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:\begin{center} \begin{tabular}{lcl} $\vsuf{[]}{\_} $ & $=$ & $[]$\\ $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $ \end{tabular}\end{center}\noindentThe list is sorted in the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing the entire result $(r_1 \cdot r_2) \backslash s$, it only stores all the strings $s''$ such that $r_2 \backslash s''$are occurring terms in $(r_1\cdot r_2)\backslash s$.To make the closed form representation more straightforward,the flattetning function $\sflat{\_}$ is used to enable the transformation from a left-associative nested sequence of alternatives into a flattened list:\[ \sum [r_1, r_2, r_3, \ldots] \stackrel{\sflat{\_}}{\rightarrow} (\ldots ((r_1 + r_2) + r_3) + \ldots)\]\noindentThe definitions $\sflat{\_}$, $\sflataux{\_}$ are given below.\begin{center} \begin{tabular}{ccc} $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ $\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\ $\sflataux r$ & $=$ & $ [r]$ \end{tabular}\end{center}\begin{center} \begin{tabular}{ccc} $\sflat{(\sum r :: rs)}$ & $=$ & $\sum (\sflataux{r} @ rs)$\\ $\sflat{\sum []}$ & $ = $ & $ \sum []$\\ $\sflat r$ & $=$ & $ r$ \end{tabular}\end{center}\noindent$\sflataux{\_}$ breaks up nested alternative regular expressions of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shapeinto a "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.It will return the singleton list $[r]$ otherwise.$\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keepsthe output type a regular expression, not a list.$\sflataux{\_}$ and $\sflat{\_}$ are only recursive on the first element of the list.With $\sflataux{}$ a preliminary to the closed form can be stated,where the derivative of $r_1 \cdot r_2 \backslash s$ can beflattened into a list whose head and tail meet the descriptionwe gave earlier.\begin{lemma}\label{seqSfau0} $\sflataux{\rders{(r_1 \cdot r_2) \backslash s }} = (r_1 \backslash_r s) \cdot r_2 :: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$ \end{lemma}\begin{proof} By an induction on the string $s$, where the inductive cases are split as $[]$ and $xs @ [x]$. Note the key identify holds: \[ \map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\; \map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash \_) \; (\vsuf{xs}{r_1})) \] = \[ \map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1}) \] This enables the inductive case to go through.\end{proof}\noindent Note that this lemma does $\mathbf{not}$ depend on anyspecific definitions we used,allowing people investigating derivatives to get an alternativeview of what $r_1 \cdot r_2$ is.Now we are able to use this for the intuition that the different ways in which regular expressions are nested do not matter under $\rsimp{}$:\begin{center} $\rsimp{r} \stackrel{?}{\sequal} \rsimp{r'}$ if $r = \sum [r_1, r_2, r_3, \ldots]$ and $r' =(\ldots ((r_1 + r_2) + r_3) + \ldots)$\end{center}Simply wrap with $\sum$ constructor and add simplifications to both sides of \ref{seqSfau0}and one gets\begin{corollary}\label{seqClosedFormGeneral} $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} } =\rsimp{(\sum ( (r_1 \backslash s) \cdot r_2 :: \map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$\end{corollary}Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}),it is possible to convert the above lemma to obtain a "closed form"for derivatives nested with simplification:\begin{lemma}\label{seqClosedForm} $\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$\end{lemma}\begin{proof} By a case analysis of string $s$. When $s$ is empty list, the rewrite is straightforward. When $s$ is a list, one could use the corollary \ref{seqSfau0}, and lemma \ref{Simpders} to rewrite the left-hand-side.\end{proof}As a corollary for this closed form, one can estimate the size of the sequence derivative $r_1 \cdot r_2 \backslash_r s$ using an easier-to-handle expression:\begin{corollary}\label{seqEstimate1} \begin{center} $\llbracket \rderssimp{(r_1 \cdot r_2)}{s} \rrbracket_r = \llbracket \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))} \rrbracket_r$ \end{center}\end{corollary}\noindent\subsection{Closed Forms for Star Regular Expressions}We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ usingthe "closed form" of $(r_1 \cdot r_2) \backslash s$ and then the property of the $\distinct$ function.Now we try to get a bound on $r^* \backslash s$ as well.Again, we first look at how a star's derivatives evolve, if they grow maximally: \begin{center} $r^* \quad \longrightarrow_{\backslash c} \quad (r\backslash c) \cdot r^* \quad \longrightarrow_{\backslash c'} \quad r \backslash cc' \cdot r^* + r \backslash c' \cdot r^* \quad \longrightarrow_{\backslash c''} \quad (r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) \quad \longrightarrow_{\backslash c'''} \quad \ldots$\end{center}When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,the number of terms in $r^* \backslash s$ will grow exponentially, causing the sizeof the derivatives $r^* \backslash s$ to grow exponentially, even if we do not count the possible size explosions of $r \backslash c$ themselves.Thanks to $\rflts$ and $\rDistinct$, we are able to open up regular expressions like$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).This allows us to use a similar technique as $r_1 \cdot r_2$ case,where the crux is to get an equivalent form of $\rderssimp{r^*}{s}$ with shape $\rsimp{\sum rs}$.This requires generating all possible sub-strings $s'$ of $s$such that $r\backslash s' \cdot r^*$ will appear as a term in $(r^*) \backslash s$.The first function we define is a single-stepupdating function $\starupdate$, which takes three arguments as input:the new character $c$ to take derivative with, the regular expression$r$ directly under the star $r^*$, and thelist of strings $sSet$ for the derivative $r^* \backslash s$ up til this point such that $(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$ (the equality is not exact, more on this later).\begin{center} \begin{tabular}{lcl} $\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\ $\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\ & & $\textit{if} \; (\rnullable \; (\rders \; r \; s))$ \\ & & $\textit{then} \;\; (s @ [c]) :: [c] :: ( \starupdate \; c \; r \; Ss)$ \\ & & $\textit{else} \;\; (s @ [c]) :: ( \starupdate \; c \; r \; Ss)$ \end{tabular}\end{center}\noindentAs a generalisation from characters to strings,$\starupdates$ takes a string instead of a characteras the first input argument, and is otherwise the sameas $\starupdate$.\begin{center} \begin{tabular}{lcl} $\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\ $\starupdates \; (c :: cs) \; r \; Ss$ & $=$ & $\starupdates \; cs \; r \; ( \starupdate \; c \; r \; Ss)$ \end{tabular}\end{center}\noindentFor the star regular expression,its derivatives can be seen as a nested giganticalternative similar to that of sequence regular expression's derivatives, and therefore needto be ``straightened out" as well.The function for this would be $\hflat{}$ and $\hflataux{}$.\begin{center} \begin{tabular}{lcl} $\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\ $\hflataux{r}$ & $\dn$ & $[r]$ \end{tabular}\end{center}\begin{center} \begin{tabular}{lcl} $\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\ $\hflat{r}$ & $\dn$ & $r$ \end{tabular}\end{center}\noindent%MAYBE TODO: introduce createdByStarAgain these definitions are tailor-made for dealing with alternatives that haveoriginated from a star's derivatives, so we do not attempt to open up all possible regular expressions of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2elements.We give a predicate for such "star-created" regular expressions:\begin{center} \begin{tabular}{lcr} & & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\ $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$ \end{tabular}\end{center}These definitions allows us the flexibility to talk about regular expressions in their most convenient format,for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.These definitions help express that certain classes of syntatically distinct regular expressions are actually the same under simplification.This is not entirely true for annotated regular expressions: %TODO: bsimp bders \neq bderssimp\begin{center} $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$\end{center}For bit-codes, the order in which simplification is appliedmight cause a difference in the location they are placed.If we want something like\begin{center} $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$\end{center}Some "canonicalization" procedure is required,which either pushes all the common bitcodes to nodesas senior as possible:\begin{center} $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $\end{center}or does the reverse. However bitcodes are not of interest if we are talking aboutthe $\llbracket r \rrbracket$ size of a regex.Therefore for the ease and simplicity of producing aproof for a size bound, we are happy to restrict ourselves to unannotated regular expressions, and obtain such equalities as\begin{lemma} $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$\end{lemma}\begin{proof} By using the rewriting relation $\rightsquigarrow$\end{proof}%TODO: rsimp sflatAnd from this we obtain a proof that a star's derivative will be the sameas if it had all its nested alternatives created during deriving being flattened out:For example,\begin{lemma} $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$\end{lemma}\begin{proof} By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$.\end{proof}% The simplification of a flattened out regular expression, provided it comes%from the derivative of a star, is the same as the one nested.We first introduce an inductive propertyfor $\starupdate$ and $\hflataux{\_}$, it says if we do derivatives of $r^*$with a string that starts with $c$,then flatten it out,we obtain a listof the shape $\sum_{s' \in sSet} (r\backslash_r s') \cdot r^*$,where $sSet = \starupdates \; s \; r \; [[c]]$.\begin{lemma}\label{starHfauInduct} $\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} = \map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; (\starupdates \; s \; r_0 \; [[c]])$\end{lemma}\begin{proof} By an induction on $s$, the inductive cases being $[]$ and $s@[c]$.\end{proof}\noindentHere is a corollary that states the lemma ina more intuitive way:\begin{corollary} $\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot (r^*))\; (\starupdates \; c\; r\; [[c]])$\end{corollary}\noindentNote that this is also agnostic of the simplificationfunction we defined, and is therefore of more general interest.Now adding the $\rsimp{}$ bit for closed forms,we have\begin{lemma} $a :: rs \grewrites \hflataux{a} @ rs$\end{lemma}\noindentgiving us\begin{lemma}\label{cbsHfauRsimpeq1} $\rsimp{a+b} = \rsimp{(\sum \hflataux{a} @ \hflataux{b})}$.\end{lemma}\noindentThis yields\begin{lemma}\label{hfauRsimpeq2} $\rsimp{r} = \rsimp{(\sum \hflataux{r})}$\end{lemma}\noindentTogether with the rewriting relation\begin{lemma}\label{starClosedForm6Hrewrites} $\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss \scfrewrites \map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$\end{lemma}\noindentWe obtain the closed form for star regular expression:\begin{lemma}\label{starClosedForm} $\rderssimp{r^*}{c::s} = \rsimp{ (\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; (\starupdates \; s\; r \; [[c]]) ) ) } $\end{lemma}\begin{proof} By an induction on $s$. The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, and \ref{hfauRsimpeq2} are used. \end{proof}\section{Estimating the Closed Forms' sizes}We now summarize the closed forms below:\begin{itemize} \item $\rderssimp{(\sum rs)}{s} \sequal \sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$ \item $\rderssimp{(r_1 \cdot r_2)}{s} \sequal \sum ((r_1 \backslash s) \cdot r_2 ) :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1}))$ \item $\rderssimp{r^*}{c::s} = \rsimp{ (\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; (\starupdates \; s\; r \; [[c]]) ) ) } $\end{itemize} \noindent The closed forms on the left-hand-sideare all of the same shape: $\rsimp{ (\sum rs)} $.Such regular expression will be bounded by the size of $\sum rs'$, where every element in $rs'$ is distinct, and each element can be described by some inductive sub-structures (for example when $r = r_1 \cdot r_2$ then $rs'$ will be solely comprised of $r_1 \backslash s'$ and $r_2 \backslash s''$, $s'$ and $s''$ being sub-strings of $s$).which will each have a size uppder bound according to inductive hypothesis, which controls $r \backslash s$.We elaborate the above reasoning by a series of lemmasbelow, where straightforward proofs are omitted.\begin{lemma} If $\forall r \in rs. \rsize{r} $ is less than or equal to $N$, and $\textit{length} \; rs$ is less than or equal to $l$, then $\rsize{\sum rs}$ is less than or equal to $l*N + 1$.\end{lemma}\noindentIf we define all regular expressions with size nomore than $N$ as $\sizeNregex \; N$:\[ \sizeNregex \; N \dn \{r \mid \rsize{r} \leq N \}\]Then such set is finite:\begin{lemma}\label{finiteSizeN} $\textit{isFinite}\; (\sizeNregex \; N)$\end{lemma}\begin{proof} By overestimating the set $\sizeNregex \; N + 1$ using union of sets like $\{r_1 \cdot r_2 \mid r_1 \in A \text{and} r_2 \in A\} $ where $A = \sizeNregex \; N$.\end{proof}\noindentFrom this we get a corollary thatif forall $r \in rs$, $\rsize{r} \leq N$, then the output of $\rdistinct{rs}{\varnothing}$ is a list of regularexpressions of finite size depending on $N$ only. \begin{corollary}\label{finiteSizeNCorollary} Assumes that for all $r \in rs. \rsize{r} \leq N$, and the cardinality of $\sizeNregex \; N$ is $c_N$ then$\rsize{\rdistinct{rs}{\varnothing}} \leq c*N$.\end{corollary}\noindentWe have proven that the output of $\rdistinct{rs'}{\varnothing}$is bounded by a constant $c_N$ depending only on $N$,provided that each of $rs'$'s elementis bounded by $N$.We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.We show how $\rdistinct$ and $\rflts$in the simplification function together is at least as good as $\rdistinct{}{}$ alone.\begin{lemma}\label{interactionFltsDB} $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r \leq \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $.\end{lemma}\noindentThe intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$:\begin{lemma}\label{altsSimpControl} $\rsize{\rsimp{\sum rs}} \leq \rsize{\rdistinct{rs}{\varnothing}}+ 1$\end{lemma}\begin{proof} By using \ref{interactionFltsDB}.\end{proof}\noindentwhich says that the size of regular expressionis always smaller if we apply the full simplificationrather than just one component ($\rdistinct{}{}$).Now we are ready to control the sizes of$r_1 \cdot r_2 \backslash s$, $r^* \backslash s$.\begin{theorem} For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$\end{theorem}\noindent\begin{proof} We prove this by induction on $r$. The base cases for $\RZERO$, $\RONE $ and $\RCHAR{c}$ are straightforward. In the sequence $r_1 \cdot r_2$ case, the inductive hypotheses state $\exists N_1. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N_1$ and $\exists N_2. \forall s. \; \llbracket \rderssimp{r_2}{s} \rrbracket \leq N_2$. When the string $s$ is not empty, we can reason as follows % \begin{center} \begin{tabular}{lcll}& & $ \llbracket \rderssimp{r_1\cdot r_2 }{s} \rrbracket_r $\\& $ = $ & $\llbracket \rsimp{(\sum(r_1 \backslash_{rsimp} s \cdot r_2 \; \; :: \; \; \map \; (r_2\backslash_{rsimp} \_)\; (\vsuf{s}{r})))} \rrbracket_r $ & (1) \\ & $\leq$ & $\llbracket \rdistinct{(r_1 \backslash_{rsimp} s \cdot r_2 \; \; :: \; \; \map \; (r_2\backslash_{rsimp} \_)\; (\vsuf{s}{r}))}{\varnothing} \rrbracket_r + 1$ & (2) \\ & $\leq$ & $2 + N_1 + \rsize{r_2} + (N_2 * (card\;(\sizeNregex \; N_2)))$ & (3)\\\end{tabular}\end{center}\noindent(1) is by the corollary \ref{seqEstimate1}.(2) is by \ref{altsSimpControl}.(3) is by \ref{finiteSizeNCorollary}.Combining the cases when $s = []$ and $s \neq []$, we get (4):\begin{center} \begin{tabular}{lcll} $\rsize{(r_1 \cdot r_2) \backslash_r s}$ & $\leq$ & $max \; (2 + N_1 + \llbracket r_2 \rrbracket_r + N_2 * (card\; (\sizeNregex \; N_2))) \; \rsize{r_1\cdot r_2}$ & (4) \end{tabular}\end{center}We reason similarly for $\STAR$.The inductive hypothesis is$\exists N. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N$.Let $n_r = \llbracket r^* \rrbracket_r$.When $s = c :: cs$ is not empty,\begin{center} \begin{tabular}{lcll}& & $ \llbracket \rderssimp{r^* }{c::cs} \rrbracket_r $\\& $ = $ & $\llbracket \rsimp{(\sum (\map \; (\lambda s. (r \backslash_{rsimp} s) \cdot r^*) \; (\starupdates\; cs \; r \; [[c]] )) )} \rrbracket_r $ & (5) \\ & $\leq$ & $\llbracket \rdistinct{ (\map \; (\lambda s. (r \backslash_{rsimp} s) \cdot r^*) \; (\starupdates\; cs \; r \; [[c]] ) )} {\varnothing} \rrbracket_r + 1$ & (6) \\ & $\leq$ & $1 + (\textit{card} (\sizeNregex \; (N + n_r))) * (1 + (N + n_r)) $ & (7)\\\end{tabular}\end{center}\noindent(5) is by the lemma \ref{starClosedForm}.(6) is by \ref{altsSimpControl}.(7) is by \ref{finiteSizeNCorollary}.Combining with the case when $s = []$, one gets\begin{center} \begin{tabular}{lcll} $\rsize{r^* \backslash_r s}$ & $\leq$ & $max \; n_r \; 1 + (\textit{card} (\sizeNregex \; (N + n_r))) * (1 + (N + n_r)) $ & (8)\\ \end{tabular}\end{center}\noindentThe alternative case is slightly less involved.The inductive hypothesis is equivalent to $\exists N. \forall r \in (\map \; (\_ \backslash_r s) \; rs). \rsize{r} \leq N$.In the case when $s = c::cs$, we have \begin{center} \begin{tabular}{lcll}& & $ \llbracket \rderssimp{\sum rs }{c::cs} \rrbracket_r $\\& $ = $ & $\llbracket \rsimp{(\sum (\map \; (\_ \backslash_{rsimp} s) \; rs) )} \rrbracket_r $ & (9) \\ & $\leq$ & $\llbracket (\sum (\map \; (\_ \backslash_{rsimp} s) \; rs) ) \rrbracket_r $ & (10) \\& $\leq$ & $1 + N * (length \; rs) $ & (11)\\ \end{tabular}\end{center}\noindent(9) is by \ref{altsClosedForm}, (10) by \ref{rsimpSize} and (11) by inductive hypothesis.Combining with the case when $s = []$, one gets\begin{center} \begin{tabular}{lcll} $\rsize{\sum rs \backslash_r s}$ & $\leq$ & $max \; \rsize{\sum rs} \; 1+N*(length \; rs)$ & (12)\\ \end{tabular}\end{center}(4), (8), and (12) are all the inductive cases proven.\end{proof}\begin{corollary} For any regex $a$, $\exists N_r. \forall s. \; \rsize{\bderssimp{a}{s}} \leq N_r$\end{corollary}\begin{proof} By \ref{sizeRelations}.\end{proof}\noindent%-----------------------------------% SECTION 2%-----------------------------------%----------------------------------------------------------------------------------------% SECTION 3%----------------------------------------------------------------------------------------\subsection{A Closed Form for the Sequence Regular Expression}\noindentBefore we get to the proof that says the intermediate result of our lexer willremain finitely bounded, which is an important efficiency/liveness guarantee,we shall first develop a few preparatory properties and definitions to make the process of proving that a breeze.We define rewriting relations for $\rrexp$s, which allows us to do the same trick as we did for the correctness proof,but this time we will have stronger equalities established.What guarantee does this bound give us?Whatever the regex is, it will not grow indefinitely.Take our previous example $(a + aa)^*$ as an example:\begin{center} \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} \begin{tikzpicture} \begin{axis}[ xlabel={number of $a$'s}, x label style={at={(1.05,-0.05)}}, ylabel={regex size}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax= 40, ytick={0,10,...,40}, scaled ticks=false, axis lines=left, width=5cm, height=4cm, legend entries={$(a + aa)^*$}, legend pos=north west, legend cell align=left] \addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data}; \end{axis} \end{tikzpicture} \end{tabular}\end{center}We are able to limit the size of the regex $(a + aa)^*$'s derivativeswith our simplificationrules very effectively.In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the boundis $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function$f(x) = x * 2^x$.This means the bound we have will surge up at leasttower-exponentially with a linear increase of the depth.For a regex of depth $n$, the boundwould be approximately $4^n$.Test data in the graphs from randomly generated regular expressionsshows that the giant bounds are far from being hit.%a few sample regular experessions' derivatives%size change%TODO: giving regex1_size_change.data showing a few regular expressions' size changes %w;r;t the input characters number, where the size is usually cubic in terms of original size%a*, aa*, aaa*, .....%randomly generated regular expressions\begin{center} \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} \begin{tikzpicture} \begin{axis}[ xlabel={number of $a$'s}, x label style={at={(1.05,-0.05)}}, ylabel={regex size}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax=1000, ytick={0,100,...,1000}, scaled ticks=false, axis lines=left, width=5cm, height=4cm, legend entries={regex1}, legend pos=north west, legend cell align=left] \addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data}; \end{axis} \end{tikzpicture} & \begin{tikzpicture} \begin{axis}[ xlabel={$n$}, x label style={at={(1.05,-0.05)}}, %ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax=1000, ytick={0,100,...,1000}, scaled ticks=false, axis lines=left, width=5cm, height=4cm, legend entries={regex2}, legend pos=north west, legend cell align=left] \addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data}; \end{axis} \end{tikzpicture} & \begin{tikzpicture} \begin{axis}[ xlabel={$n$}, x label style={at={(1.05,-0.05)}}, %ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax=1000, ytick={0,100,...,1000}, scaled ticks=false, axis lines=left, width=5cm, height=4cm, legend entries={regex3}, legend pos=north west, legend cell align=left] \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data}; \end{axis} \end{tikzpicture}\\ \multicolumn{3}{c}{Graphs: size change of 3 randomly generated regular expressions $w.r.t.$ input string length.} \end{tabular} \end{center} \noindentMost of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the original size.We will discuss improvements to this bound in the next chapter.%----------------------------------------------------------------------------------------% SECTION ??%----------------------------------------------------------------------------------------%-----------------------------------% SECTION syntactic equivalence under simp%-----------------------------------\section{Syntactic Equivalence Under $\simp$}We prove that minor differences can be annhilatedby $\simp$.For example,\begin{center} $\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$\end{center}%----------------------------------------------------------------------------------------% SECTION ALTS CLOSED FORM%----------------------------------------------------------------------------------------\section{A Closed Form for \textit{ALTS}}Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.There are a few key steps, one of these steps isOne might want to prove this by something a simple statement like: For this to hold we want the $\textit{distinct}$ function to pick upthe elements before and after derivatives correctly:$r \in rset \equiv (rder x r) \in (rder x rset)$.which essentially requires that the function $\backslash$ is an injective mapping.Unfortunately the function $\backslash c$ is not an injective mapping.\subsection{function $\backslash c$ is not injective (1-to-1)}\begin{center} The derivative $w.r.t$ character $c$ is not one-to-one. Formally, $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$\end{center}This property is trivially true for thecharacter regex example:\begin{center} $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$\end{center}But apart from the cases where the derivativeoutput is $\ZERO$, are there non-trivial resultsof derivatives which contain strings?The answer is yes.For example,\begin{center} Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ where $a$ is not nullable.\\ $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\end{center}We start with two syntactically different regular expressions,and end up with the same derivative result.This is not surprising as we have such equality as below in the style of Arden's lemma:\\\begin{center} $L(A^*B) = L(A\cdot A^* \cdot B + B)$\end{center}There are two problems with this finiteness result, though.\begin{itemize} \item First, It is not yet a direct formalisation of our lexer's complexity, as a complexity proof would require looking into the time it takes to execute {\bf all} the operations involved in the lexer (simp, collect, decode), not just the derivative. \item Second, the bound is not yet tight, and we seek to improve $N_a$ so that it is polynomial on $\llbracket a \rrbracket$.\end{itemize}Still, we believe this contribution is fruitful,because\begin{itemize} \item The size proof can serve as a cornerstone for a complexity formalisation. Derivatives are the most important phases of our lexer algorithm. Size properties about derivatives covers the majority of the algorithm and is therefore a good indication of complexity of the entire program. \item The bound is already a strong indication that catastrophic backtracking is much less likely to occur in our $\blexersimp$ algorithm. We refine $\blexersimp$ with $\blexerStrong$ in the next chapter so that the bound becomes polynomial.\end{itemize}%----------------------------------------------------------------------------------------% SECTION 4%----------------------------------------------------------------------------------------One might wonder the actual bound rather than the loose bound we gavefor the convenience of an easier proof.How much can the regex $r^* \backslash s$ grow? As earlier graphs have shown,%TODO: reference that graph where size grows quicklythey can grow at a maximum speedexponential $w.r.t$ the number of characters, but will eventually level off when the string $s$ is long enough.If they grow to a size exponential $w.r.t$ the original regex, our algorithmwould still be slow.And unfortunately, we have concrete exampleswhere such regular expressions grew exponentially large before levelling off:$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + (\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximumsize that is exponential on the number $n$ under our current simplification rules:%TODO: graph of a regex whose size increases exponentially.\begin{center} \begin{tikzpicture} \begin{axis}[ height=0.5\textwidth, width=\textwidth, xlabel=number of a's, xtick={0,...,9}, ylabel=maximum size, ymode=log, log basis y={2} ] \addplot[mark=*,blue] table {re-chengsong.data}; \end{axis} \end{tikzpicture}\end{center}For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + (\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.The exponential size is triggered by that the regex$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$inside the $(\ldots) ^*$ having exponentially manydifferent derivatives, despite those difference being minor.$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$will therefore contain the following terms (after flattening out all nested alternatives):\begin{center} $(\oplus_{i = 1]{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\ $(1 \leq m' \leq m )$\end{center}These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix).With each new input character taking the derivative against the intermediate result, more and more such distinctterms will accumulate, until the length reaches $L.C.M.(1, \ldots, n)$.$\textit{distinctBy}$ will not be able to de-duplicate any two of these terms $(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\where $m' \neq m''$ \\as they are slightly different.This means that with our current simplification methods,we will not be able to control the derivative so that$\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$as there are already exponentially many terms.These terms are similar in the sense that the head of those termsare all consisted of sub-terms of the form: $(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $.For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most$n * (n + 1) / 2$ such terms. For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivativescan be described by 6 terms:$a^*$, $a\cdot (aa)^*$, $ (aa)^*$, $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.The total number of different "head terms", $n * (n + 1) / 2$,is proportional to the number of characters in the regex $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.This suggests a slightly different notion of size, which we call the alphabetic width:%TODO:(TODO: Alphabetic width def.)Antimirov\parencite{Antimirov95} has proven that $\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$.where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subtermscreated by doing derivatives of $r$ against all possible strings.If we can make sure that at any moment in our lexing algorithm our intermediate result hold at most one copy of each of the subterms then we can get the same bound as Antimirov's.This leads to the algorithm in the next chapter.%----------------------------------------------------------------------------------------% SECTION 1%----------------------------------------------------------------------------------------%-----------------------------------% SUBSECTION 1%-----------------------------------\subsection{Syntactic Equivalence Under $\simp$}We prove that minor differences can be annhilatedby $\simp$.For example,\begin{center} $\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$\end{center}