% Chapter Template

\chapter{Finiteness Bound} % Main chapter title

\label{Finite}
%  In Chapter 4 \ref{Chapter4} we give the second guarantee
%of our bitcoded algorithm, that is a finite bound on the size of any
%regex's derivatives.

In this chapter we give a bound in terms of the size of
the calculated derivatives:
given an annotated regular expression $a$, for any string $s$,
the derivatives computed by our algorithm $\blexersimp$
are finitely bounded
by a constant that depends only on $a$.
Formally we show that there exists an $N_a$ such that
\begin{center}
	$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
\end{center}
\noindent
where the size ($\llbracket \_ \rrbracket$) of an annotated regular expression is defined
in terms of the number of nodes in its tree structure (its recursive definition
is given on the next page).
We believe this size bound
is important in the context of POSIX lexing because
\begin{itemize}
	\item
		It is a stepping stone towards the goal
		of eliminating ``catastrophic backtracking''.
		If the internal data structures used by our algorithm
		grow beyond a finite bound, then clearly
		the algorithm (which traverses these structures) will
		be slow.
		The next step is to refine the bound $N_a$ so that it
		is not just finite but polynomial in $\llbracket a\rrbracket$.
	\item
		Having the finite bound formalised
		gives us higher confidence that
		our simplification algorithm $\simp$ does not ``misbehave''
		like $\textit{simpSL}$ does.
		The bound is universal for a given regular expression,
		which is an advantage over work which
		only gives empirical evidence on
		some test cases (see for example the Verbatim work \cite{Verbatimpp}).
\end{itemize}
\noindent
We then extend our $\blexersimp$
to support bounded repetitions ($r^{\{n\}}$).
We update our formalisation of the correctness and finiteness properties to
include this new construct.
We show that we can out-compete other verified lexers such as
Verbatim++ on bounded regular expressions.

In the next section we describe in more detail
what the finite bound means in our algorithm
and why the size of the internal data structures of
a typical derivative-based lexer such as
Sulzmann and Lu's needs formal treatment.

\section{Formalising the Size Bound of Derivatives}
\noindent
In our lexer ($\blexersimp$),
we take an annotated regular expression as input
and repeatedly take derivatives of it and simplify the results.
\begin{figure}
	\begin{center}
		\begin{tabular}{lcl}
			$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
			$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
			$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
			$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
			$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\textit{sum}\; (\map \; (\llbracket \_ \rrbracket)\; as) + 1$\\
			$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
		\end{tabular}
	\end{center}
	\caption{The size function of bitcoded regular expressions}\label{brexpSize}
\end{figure}
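\noindent
To make this definition concrete, the following is a small executable sketch in
Scala (our own rendering with made-up names such as \texttt{ARexp} and
\texttt{asize}, not the Isabelle formalisation): a datatype mirroring the
bitcoded regular expressions together with the size function just given. Note
that the bitcodes do not contribute to the size.

\begin{verbatim}
type Bits = List[Boolean]

sealed trait ARexp
case object AZERO                                extends ARexp
case class  AONE(bs: Bits)                       extends ARexp
case class  ACHAR(bs: Bits, c: Char)             extends ARexp
case class  ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
case class  AALTS(bs: Bits, as: List[ARexp])     extends ARexp
case class  ASTAR(bs: Bits, a: ARexp)            extends ARexp

// size = number of nodes in the tree; bitcodes are ignored
def asize(a: ARexp): Int = a match {
  case AZERO           => 1
  case AONE(_)         => 1
  case ACHAR(_, _)     => 1
  case ASEQ(_, a1, a2) => asize(a1) + asize(a2) + 1
  case AALTS(_, as)    => as.map(asize).sum + 1
  case ASTAR(_, a1)    => asize(a1) + 1
}
\end{verbatim}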
\begin{figure}
	\begin{tikzpicture}[scale=2,
		every node/.style={minimum size=11mm},
		->,>=stealth',shorten >=1pt,auto,thick
		]
		\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$};
		\node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$};
		\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};

		\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=6mm]{$a_{1s}$};
		\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp$};

		\node (r2) [rectangle, draw=black, thick, right=of r1s, minimum size = 12mm]{$a_2$};
		\draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};

		\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=6mm]{$a_{2s}$};
		\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp$};

		\node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=6mm]{$a_{ns}$};
		\draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$};

		\node (v) [circle, thick, draw, right=of rns, minimum size=6mm, right=1.7cm]{$v$};
		\draw[->, line width=0.2mm](rns)--(v) node[above, midway] {\bmkeps} node [below, midway] {\decode};
	\end{tikzpicture}
	\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{simpShrinks}
\end{figure}
\noindent
Each time
a derivative is taken, the regular expression might grow.
However, the simplification that follows immediately afterwards often shrinks it so that
the overall size of the derivatives stays relatively small.
This intuition is depicted by the relative size
change between the black and blue nodes:
after $\simp$ the node shrinks.
Our proof states that all the blue nodes
stay below a size bound $N_a$ determined by the input $a$.

\noindent
Sulzmann and Lu assumed a similar picture for their algorithm,
though in fact the sizes in their algorithm
are better depicted by the following graph:
\begin{figure}[H]
	\begin{tikzpicture}[scale=2,
		every node/.style={minimum size=11mm},
		->,>=stealth',shorten >=1pt,auto,thick
		]
		\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$};
		\node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$};
		\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};

		\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$a_{1s}$};
		\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp'$};

		\node (r2) [rectangle, draw=black, thick, right=of r1s, minimum size = 17mm]{$a_2$};
		\draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};

		\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=14mm]{$a_{2s}$};
		\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp'$};

		\node (r3) [rectangle, draw = black, thick, right= of r2s, minimum size = 22mm]{$a_3$};
		\draw[->,line width=0.2mm](r2s)--(r3) node[above,midway] {$\backslash c_3$};

		\node (rns) [right = of r3, draw=blue, minimum size = 20mm]{$a_{3s}$};
		\draw[->,line width=0.2mm] (r3)--(rns) node [above, midway] {$\simp'$};

		\node (rnn) [right = of rns, minimum size = 1mm]{};
		\draw[->, dashed] (rns)--(rnn) node [above, midway] {$\ldots$};

	\end{tikzpicture}
	\caption{Regular expression size change during Sulzmann and Lu's algorithm}\label{sulzShrinks}
\end{figure}
\noindent
The picture means that in some cases their lexer (where they use $\simpsulz$
as the simplification function)
will have a size
explosion, causing the running time of each derivative step to grow continuously (for example in \ref{SulzmannLuLexerTime}).They tested out the run time of theirlexer on particular examples such as $(a+b+ab)^*$and claimed that their algorithm is linear w.r.t to the input.With our mechanised proof, we avoid this type of unintentionalgeneralisation.Before delving into the details of the formalisation,we are going to provide an overview of it in the following subsection.\subsection{Overview of the Proof}A high-level overview of the main components of the finiteness proofis as follows:\begin{figure}[H] \begin{tikzpicture}[scale=1,font=\bf, node/.style={ rectangle,rounded corners=3mm, ultra thick,draw=black!50,minimum height=18mm, minimum width=20mm, top color=white,bottom color=black!20}] \node (0) at (-5,0) [node, text width=1.8cm, text centered] {$\llbracket \bderssimp{a}{s} \rrbracket$}; \node (A) at (0,0) [node,text width=1.6cm, text centered] {$\llbracket \rderssimp{r}{s} \rrbracket_r$}; \node (B) at (3,0) [node,text width=3.0cm, anchor=west, minimum width = 40mm] {$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$}; \node (C) at (9.5,0) [node, minimum width=10mm] {$N_r$}; \draw [->,line width=0.5mm] (0) -- node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$(r = a \downarrow_r)$} (A); \draw [->,line width=0.5mm] (A) -- node [above,pos=0.35] {$\quad =\ldots=$} (B); \draw [->,line width=0.5mm] (B) -- node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); \end{tikzpicture} %\caption{\end{figure}\noindentWe explain the steps one by one:\begin{itemize} \item We first introduce the operations such as derivatives, simplification, size calculation, etc. associated with $\rrexp$s, which we have introduced in chapter \ref{Bitcoded2}. As promised we will discuss why they are needed in \ref{whyRerase}. The operations on $\rrexp$s are identical to those on annotated regular expressions except that they dispense with bitcodes. This means that all proofs about size of $\rrexp$s will apply to annotated regular expressions, because the size of a regular expression is independent of the bitcodes. \item We prove that $\rderssimp{r}{s} = \textit{ClosedForm}(r, s)$, where $\textit{ClosedForm}(r, s)$ is entirely given as the derivatives of their children regular expressions. We call the right-hand-side the \emph{Closed Form} of the derivative $\rderssimp{r}{s}$. \item Formally we give an estimate of $\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$. 
The key observation is that $\distinctBy$'s output is a list with a constant length bound.
\end{itemize}
We will expand on these steps in the next sections.\\

\section{The $\textit{Rrexp}$ Datatype}
The first step is to define $\textit{rrexp}$s.
They are annotated regular expressions without bitcodes,
which allows for a more convenient size-bound proof.
%Of course, the bits which encode the lexing information
%would grow linearly with respect
%to the input, which should be taken into account when we wish to tackle the runtime complexity.
%But for the sake of the structural size
%we can safely ignore them.\\
The datatype $\rrexp$ of \emph{r-regular expressions}
was initially defined in \ref{rrexpDef}.
The prefix $r$ is there to
distinguish them from basic regular expressions.
We repeat the definition of $\rrexp$ here:
\[	\rrexp ::=   \RZERO \mid  \RONE
			 \mid  \RCHAR{c}
			 \mid  \RSEQ{r_1}{r_2}
			 \mid  \RALTS{rs}
			 \mid \RSTAR{r}
\]
The size of an r-regular expression is
written $\llbracket r\rrbracket_r$;
its definition mirrors that of an annotated regular expression.
\begin{center}
	\begin{tabular}{lcl}
		$\llbracket \ONE \rrbracket_r$ & $\dn$ & $1$\\
		$\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\
		$\llbracket r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\
		$\llbracket \mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\
		$\llbracket \sum rs \rrbracket_r $ & $\dn$ & $\textit{sum}\; (\map \; (\llbracket \_ \rrbracket_r)\; rs) + 1$\\
		$\llbracket r^* \rrbracket_r $ & $\dn$ & $\llbracket r \rrbracket_r + 1$
	\end{tabular}
\end{center}
\noindent
The subscript $r$ in $\llbracket \_ \rrbracket_r$ differentiates this size function
from the one for annotated regular expressions.
Similar subscripts will be added for operations like $\rerase{}$:
\begin{center}
	\begin{tabular}{lcl}
		$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
		$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
		$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
		$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
		$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
		$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a} ^*$
	\end{tabular}
\end{center}
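\noindent
Continuing the Scala sketch from above (again our own rendering with made-up
names), the r-regular expressions, their size function and the $\rerase{}$
operation can be phrased as follows. The composition
$\texttt{rsize} \circ \texttt{rerase}$ agrees with \texttt{asize}, which is the
point of the construction.

\begin{verbatim}
sealed trait RRexp
case object RZERO                      extends RRexp
case object RONE                       extends RRexp
case class  RCHAR(c: Char)             extends RRexp
case class  RSEQ(r1: RRexp, r2: RRexp) extends RRexp
case class  RALTS(rs: List[RRexp])     extends RRexp
case class  RSTAR(r: RRexp)            extends RRexp

// size of an r-regular expression ( [[ _ ]]_r )
def rsize(r: RRexp): Int = r match {
  case RZERO        => 1
  case RONE         => 1
  case RCHAR(_)     => 1
  case RSEQ(r1, r2) => rsize(r1) + rsize(r2) + 1
  case RALTS(rs)    => rs.map(rsize).sum + 1
  case RSTAR(r1)    => rsize(r1) + 1
}

// rerase: strip the bitcodes but keep the structure intact,
// in particular the list alternative constructor
def rerase(a: ARexp): RRexp = a match {
  case AZERO           => RZERO
  case AONE(_)         => RONE
  case ACHAR(_, c)     => RCHAR(c)
  case ASEQ(_, a1, a2) => RSEQ(rerase(a1), rerase(a2))
  case AALTS(_, as)    => RALTS(as.map(rerase))
  case ASTAR(_, a1)    => RSTAR(rerase(a1))
}
\end{verbatim}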
\subsection{Why a New Datatype?}\label{whyRerase}
Originally the erase operation $(\_)_\downarrow$ was
used by Ausaf et al. in their proofs related to $\blexer$.
This function was not part of the lexing algorithm; its sole purpose was to
bridge the gap between the un-annotated ($\textit{rexp}$) and annotated ($\textit{arexp}$)
regular expression datatypes so as to leverage the correctness
theorem of $\lexer$.
%to establish the correctness of $\blexer$.
For example, lemma \ref{retrieveStepwise} %and \ref{bmkepsRetrieve}
uses $\erase$ to convert an annotated regular expression $a$ into
a plain one so that it can be used by $\inj$ to create the desired value
$\inj\; (a)_\downarrow \; c \; v$.
Ideally $\erase$ should only remove the auxiliary information that is not related to the
structure, namely the bitcodes.
However there is a complication: the alternative constructors
have different arities for $\textit{arexp}$
and $\textit{r}$:
\begin{center}
	\begin{tabular}{lcl}
		$\textit{r}$ & $::=$ & $\ldots \;|\; (\_ + \_) \; ::\; "\textit{r} \Rightarrow \textit{r} \Rightarrow \textit{r}" | \ldots$\\
		$\textit{arexp}$ & $::=$ & $\ldots\; |\; (\Sigma \_ ) \; ::\; "\textit{arexp} \; list \Rightarrow \textit{arexp}" | \ldots$
	\end{tabular}
\end{center}
\noindent
To convert between the two,
$\erase$ has to recursively disassemble a list into
nested binary applications of the $(\_ + \_)$ operator,
handling corner cases such as empty or
singleton alternative lists:
%becomes $r$ during the
%$\erase$ function.
%The annotated regular expression $\sum[a, b, c]$ would turn into
%$(a+(b+c))$.
\begin{center}
	\begin{tabular}{lcl}
		$ (_{bs}\sum [])_\downarrow $ & $\dn$ & $\ZERO$\\
		$ (_{bs}\sum [a])_\downarrow$ & $\dn$ & $(a)_\downarrow$\\
		$ (_{bs}\sum [a_1, a_2])_\downarrow$ & $\dn$ & $(a_1)_\downarrow + (a_2)_\downarrow$\\
		$ (_{bs}\sum (a :: as))_\downarrow$ & $\dn$ & $(a)_\downarrow + (_{[]}\sum as)_\downarrow$
	\end{tabular}
\end{center}
\noindent
These operations inevitably change the structure and size of
an annotated regular expression. For example,
$a_1 = \, _{Z}\sum [x]$ has size 2, but $(a_1)_\downarrow = x$ only has size 1.
%adding unnecessary
%complexities to the size bound proof.
%The reason we
%define a new datatype is that
%the $\erase$ function
%does not preserve the structure of annotated
%regular expressions.
%We initially started by using
%plain regular expressions and tried to prove
%lemma \ref{rsizeAsize},
%however the $\erase$ function messes with the structure of the
%annotated regular expression.
%The $+$ constructor
%of basic regular expressions is only binary, whereas $\sum$
%takes a list. Therefore we need to convert between
%annotated and normal regular expressions as follows:
Concretely, if we define the size of a plain regular expression in the usual way,
\begin{center}
	\begin{tabular}{lcl}
		$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
		$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
		$\llbracket r_1 + r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\
		$\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\
		$\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\
		$\llbracket r^* \rrbracket_p $ & $\dn$ & $\llbracket r \rrbracket_p + 1$
	\end{tabular}
\end{center}
\noindent
then the property
\begin{center}
	$\llbracket a \rrbracket \stackrel{?}{=} \llbracket a_\downarrow \rrbracket_p$
\end{center}
does not hold.
%With $\textit{rerase}$, however,
%only the bitcodes are thrown away.
This leads us to define the new regular expression datatype without
bitcodes but with a list alternative constructor, together with a new erase function
that is strictly structure-preserving:
\begin{center}
	\begin{tabular}{lcl}
		$\textit{rrexp}$ & $::=$ & $\ldots\; |\; (\sum \_ ) \; ::\; "\textit{rrexp} \; list \Rightarrow \textit{rrexp}" | \ldots$\\
		$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
	\end{tabular}
\end{center}
\noindent
%But
%Everything about the structure remains intact.
%Therefore it does not change the size
%of an annotated regular expression and we have:
\noindent
One might be able to prove an inequality such as
$\llbracket a \rrbracket \leq \llbracket a_\downarrow \rrbracket_p $
and then estimate $\llbracket a_\downarrow \rrbracket_p$,
but we found our approach more straightforward.\\
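\noindent
To make the discrepancy concrete, here is a continuation of the Scala sketch
(with hypothetical names): plain regular expressions with a binary $+$, their
size, a rendering of the list-to-binary erase, and a check of the
singleton-alternative example from above.

\begin{verbatim}
sealed trait Rexp
case object ZERO                    extends Rexp
case object ONE                     extends Rexp
case class  CHAR(c: Char)           extends Rexp
case class  ALT(r1: Rexp, r2: Rexp) extends Rexp
case class  SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class  STAR(r: Rexp)           extends Rexp

// the usual size of a plain regular expression ( [[ _ ]]_p )
def psize(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALT(r1, r2)          => psize(r1) + psize(r2) + 1
  case SEQ(r1, r2)          => psize(r1) + psize(r2) + 1
  case STAR(r1)             => psize(r1) + 1
}

// erase has to turn the list alternative into nested binary ALTs
def erase(a: ARexp): Rexp = a match {
  case AZERO                  => ZERO
  case AONE(_)                => ONE
  case ACHAR(_, c)            => CHAR(c)
  case ASEQ(_, a1, a2)        => SEQ(erase(a1), erase(a2))
  case ASTAR(_, a1)           => STAR(erase(a1))
  case AALTS(_, Nil)          => ZERO
  case AALTS(_, List(a1))     => erase(a1)
  case AALTS(_, List(a1, a2)) => ALT(erase(a1), erase(a2))
  case AALTS(_, a1 :: as)     => ALT(erase(a1), erase(AALTS(Nil, as)))
}

// the singleton alternative from the text: erase shrinks it,
// while rerase keeps its size
val a1 = AALTS(Nil, List(ACHAR(Nil, 'x')))
assert(asize(a1) == 2)
assert(psize(erase(a1)) == 1)
assert(rsize(rerase(a1)) == 2)
\end{verbatim}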
\subsection{Functions for R-regular Expressions}
The downside of our approach is that we need to redefine
several functions for $\rrexp$s.
In this section we define the r-regular expression versions
of $\bder$ and the $\textit{bsimp}$-related functions.
We use $r$ as a prefix or subscript to differentiate
them from the bitcoded versions.
%For example,$\backslash_r$, $\rdistincts$, and $\rsimp$
%as opposed to $\backslash$, $\distinctBy$, and $\bsimp$.
%As promised, they are much simpler than their bitcoded counterparts.
%The operations on r-regular expressions are
%almost identical to those of the annotated regular expressions,
%except that no bitcodes are used. For example,
The derivative operation for an r-regular expression is\\
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\
		$(\ONE)\,\backslash_r c$ & $\dn$ & $\ZERO$\\
		$(\mathbf{d})\,\backslash_r c$ & $\dn$ & $\textit{if}\;c=d\;\textit{then}\;\ONE\;\textit{else}\;\ZERO$\\
		$(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ & $\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\
		$(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ & $\textit{if}\;(\textit{rnullable}\,r_1)$\\
		& & $\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\
		& & $\phantom{\textit{then},\;\sum\,}(r_2\,\backslash_r c)]$\\
		& & $\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\
		$(r^*)\,\backslash_r c$ & $\dn$ & $( r\,\backslash_r c)\cdot (r^*)$
	\end{tabular}
\end{center}
\noindent
where we omit the definition of $\textit{rnullable}$.
The generalisation from derivatives w.r.t.\ a character to
derivatives w.r.t.\ strings is given as
\begin{center}
	\begin{tabular}{lcl}
		$r \backslash_{rs} []$ & $\dn$ & $r$\\
		$r \backslash_{rs} (c::s)$ & $\dn$ & $(r\backslash_r c) \backslash_{rs} s$
	\end{tabular}
\end{center}

The function $\distinctBy$ for r-regular expressions does not need
an additional function for checking equivalence because
there are no bit annotations.
Therefore we have
\begin{center}
	\begin{tabular}{lcl}
		$\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\
		$\rdistinct{r :: rs}{rset}$ & $\dn$ & $\textit{if}\;(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
		& & $\textit{else}\; \; r::\rdistinct{rs}{(rset \cup \{r\})}$
	\end{tabular}
\end{center}
%TODO: definition of rsimp (maybe only the alternative clause)
\noindent
%We would like to make clear
%a difference between our $\rdistincts$ and
%the Isabelle $\textit {distinct}$ predicate.
%In Isabelle $\textit{distinct}$ is a function that returns a boolean
%rather than a list.
%It tests if all the elements of a list are unique.\\
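As a concrete counterpart to the definitions just given, the Scala sketch
continues with the r-derivative (together with a standard $\textit{rnullable}$,
which we only sketch here), derivatives w.r.t.\ strings, and the de-duplication
function:

\begin{verbatim}
// standard nullability test on r-regular expressions
def rnullable(r: RRexp): Boolean = r match {
  case RZERO        => false
  case RONE         => true
  case RCHAR(_)     => false
  case RALTS(rs)    => rs.exists(rnullable)
  case RSEQ(r1, r2) => rnullable(r1) && rnullable(r2)
  case RSTAR(_)     => true
}

// derivative w.r.t. a single character
def rder(c: Char, r: RRexp): RRexp = r match {
  case RZERO     => RZERO
  case RONE      => RZERO
  case RCHAR(d)  => if (c == d) RONE else RZERO
  case RALTS(rs) => RALTS(rs.map(rder(c, _)))
  case RSEQ(r1, r2) =>
    if (rnullable(r1)) RALTS(List(RSEQ(rder(c, r1), r2), rder(c, r2)))
    else RSEQ(rder(c, r1), r2)
  case RSTAR(r1) => RSEQ(rder(c, r1), RSTAR(r1))
}

// derivative w.r.t. a string
def rders(r: RRexp, s: List[Char]): RRexp = s match {
  case Nil     => r
  case c :: cs => rders(rder(c, r), cs)
}

// rdistinct: keep the first occurrence of each element and drop
// anything already contained in the accumulator
def rdistinct(rs: List[RRexp], acc: Set[RRexp]): List[RRexp] = rs match {
  case Nil => Nil
  case r :: rest =>
    if (acc.contains(r)) rdistinct(rest, acc)
    else r :: rdistinct(rest, acc + r)
}
\end{verbatim}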
\noindent
With $\textit{rdistinct}$ in place,
the flatten function for $\rrexp$ is as follows:
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$\textit{rflts} \; ((\sum \textit{rs}) :: \textit{rs}')$ & $\dn$ & $rs \; @ \; \textit{rflts} \; rs' $ \\
		$\textit{rflts} \; (\ZERO :: rs')$ & $\dn$ & $ \textit{rflts} \; \textit{rs'} $ \\
		$\textit{rflts} \; (r :: rs')$ & $\dn$ & $r :: \textit{rflts} \; \textit{rs'}$ \quad(otherwise)
	\end{tabular}
\end{center}
\noindent
The function $\rsimpalts$ corresponds to $\textit{bsimp}_{ALTS}$:
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$\rsimpalts \;\; []$ & $\dn$ & $\RZERO$\\
		$\rsimpalts \;\; [r]$ & $\dn$ & $r$\\
		$\rsimpalts \;\; rs$ & $\dn$ & $\sum rs$ \quad(otherwise)\\
	\end{tabular}
\end{center}
\noindent
Similarly, we have $\rsimpseq$ which corresponds to $\textit{bsimp}_{SEQ}$:
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$\rsimpseq \;\; \RZERO \;\; \_ $ & $\dn$ & $\RZERO$\\
		$\rsimpseq \;\; \_ \;\; \RZERO $ & $\dn$ & $\RZERO$\\
		$\rsimpseq \;\; \RONE \;\; r_2$ & $\dn$ & $r_2$\\
		$\rsimpseq \;\; r_1 \;\; r_2$ & $\dn$ & $r_1 \cdot r_2$ \quad(otherwise)\\
	\end{tabular}
\end{center}
and get $\textit{rsimp}$ and $\rderssimp{\_}{\_}$:
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$\textit{rsimp} \; (r_1\cdot r_2)$ & $\dn$ & $\textit{rsimp}_{SEQ} \; (\textit{rsimp} \; r_1) \; (\textit{rsimp} \; r_2) $ \\
		$\textit{rsimp} \; (\sum \textit{rs})$ & $\dn$ & $\textit{rsimp}_{ALTS} \; (\textit{rdistinct} \; (\textit{rflts} \; (\textit{map} \; \textit{rsimp} \; rs)) \; \varnothing) $ \\
		$\textit{rsimp} \; r$ & $\dn$ & $r \qquad \textit{otherwise}$
	\end{tabular}
\end{center}
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$r\backslash_{rsimp} \, c$ & $\dn$ & $\rsimp \; (r\backslash_r \, c)$
	\end{tabular}
\end{center}
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$r \backslash_{rsimps} \; (c\!::\!s) $ & $\dn$ & $(r \backslash_{rsimp}\, c) \backslash_{rsimps}\, s$ \\
		$r \backslash_{rsimps} [\,] $ & $\dn$ & $r$
	\end{tabular}
\end{center}
\noindent
We do not define an r-regular expression version of $\blexersimp$,
as our proof does not depend on it.
Now we are ready to introduce how r-regular expressions allow
us to prove the size bound on bitcoded regular expressions.

\subsection{Using R-regular Expressions to Bound Bit-coded Regular Expressions}
Everything about the size of annotated regular expressions after the application
of the functions $\bsimp$ and $\backslash_{simps}$
can be calculated via the size of r-regular expressions after the application
of $\rsimp$ and $\backslash_{rsimps}$:
\begin{lemma}\label{sizeRelations}
	The following equalities hold:
	\begin{itemize}
		\item
			$\rsize{\rerase a} = \asize a$
		\item
			$\asize{\bsimps \; a} = \rsize{\rsimp{ \rerase{a}}}$
		\item
			$\asize{\bderssimp{a}{s}} = \rsize{\rderssimp{\rerase{a}}{s}}$
	\end{itemize}
\end{lemma}
\begin{proof}
	The first part follows from the definition of $(\_)_{\downarrow_r}$.
	The second part is by induction on the inductive cases
	of $\textit{bsimp}$.
	The third part is by induction on the string $s$,
	where the inductive step follows from part one.
\end{proof}
\noindent
With lemma \ref{sizeRelations},
we will be able to focus on estimating only
$\rsize{\rderssimp{\rerase{a}}{s}}$
in later parts because
\begin{center}
	$\rsize{\rderssimp{\rerase{a}}{s}} \leq N_r \quad$
	implies
	$\quad \llbracket a \backslash_{bsimps} s \rrbracket \leq N_r$.
\end{center}
%From now on we
%Unless stated otherwise in the rest of this
%chapter all regular expressions without
%bitcodes are seen as r-regular expressions ($\rrexp$s).
%For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$,
%we use the notation $r_1 + r_2$
%for brevity.
%-----------------------------------
%	SUB SECTION ROADMAP RREXP BOUND
%-----------------------------------
%\subsection{Roadmap to a Bound for $\textit{Rrexp}$}
%The way we obtain the bound for $\rrexp$s is by two steps:
%\begin{itemize}
%	\item
%		First, we rewrite $r\backslash s$ into something else that is easier
%		to bound. 
This step is crucial for the inductive case % $r_1 \cdot r_2$ and $r^*$, where the derivative can grow and bloat in a wild way,% but after simplification, they will always be equal or smaller to a form consisting of an alternative% list of regular expressions $f \; (g\; (\sum rs))$ with some functions applied to it, where each element will be distinct after the function application.% \item% Then, for such a sum list of regular expressions $f\; (g\; (\sum rs))$, we can control its size% by estimation, since $\distinctBy$ and $\flts$ are well-behaved and working together would only % reduce the size of a regular expression, not adding to it.%\end{itemize}%%\section{Step One: Closed Forms}%We transform the function application $\rderssimp{r}{s}$%into an equivalent %form $f\; (g \; (\sum rs))$.%The functions $f$ and $g$ can be anything from $\flts$, $\distinctBy$ and other helper functions from $\bsimp{\_}$.%This way we get a different but equivalent way of expressing : $r\backslash s = f \; (g\; (\sum rs))$, we call the%right hand side the "closed form" of $r\backslash s$.%%\begin{quote}\it% Claim: For regular expressions $r_1 \cdot r_2$, we claim that%\end{quote}%\noindent%We explain in detail how we reached those claims.If we attempt to prove \begin{center} $\forall r. \; \exists N_r.\;\; s.t. \llbracket r\backslash_{rsimps} s \rrbracket_r \leq N_r$\end{center}using a naive induction on the structure of $r$,then we are stuck at the inductive cases such as$r_1\cdot r_2$.The inductive hypotheses are:\begin{center} 1: $\text{for } r_1, \text{there exists } N_{r_1}.\;\; s.t. \;\;\forall s. \llbracket r_1 \backslash_{rsimps} s \rrbracket_r \leq N_{r_1}. $\\ 2: $\text{for } r_2, \text{there exists } N_{r_2}.\;\; s.t. \;\; \forall s. \llbracket r_2 \backslash_{rsimps} s \rrbracket_r \leq N_{r_2}. $\end{center}The inductive step to prove would be \begin{center} $\text{there exists } N_{r_1\cdot r_2}. \;\; s.t. \forall s. 
\llbracket (r_1 \cdot r_2) \backslash_{rsimps} s \rrbracket_r \leq N_{r_1\cdot r_2}.$\end{center}The problem is that it is not clear what $(r_1\cdot r_2) \backslash_{rsimps} s$ looks like,and therefore $N_{r_1}$ and $N_{r_2}$ in theinductive hypotheses cannot be directly used.%We have already seen that $(r_1 \cdot r_2)\backslash s$ %and $(r^*)\backslash s$ can grow in a wild way.The point however, is that they will be equivalent to a list ofterms $\sum rs$, where each term in $rs$ willbe made of $r_1 \backslash s' $, $r_2\backslash s'$,and $r \backslash s'$ with $s' \in \textit{SubString} \; s$ (which standsfor the set of substrings of $s$).The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$in the simplification, which prevents the $rs$ from growing indefinitely.Based on this idea, we develop a proof in two steps.First, we show the below equality (where$f$ and $g$ are functions that do not increase the size of the input)\begin{center}$r\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$,\end{center}where $r = r_1 \cdot r_2$ or $r = r_0^*$ and so on.For example, for $r_1 \cdot r_2$ we have the equality as \begin{center} $ \rderssimp{r_1 \cdot r_2}{s} = \rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r_2}{\_} \;(\vsuf{s}{r_1})))}$ \end{center}We call the right-hand-side the \emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$.Second, we will bound the closed form of r-regular expressionsusing some estimation techniquesand then applylemma \ref{sizeRelations} to show that the bitcoded regular expressionsin our $\blexersimp$ are finitely bounded.We will describe in detail the first step of the proofin the next section.\section{Closed Forms}In this section we introduce in detailhow to express the string derivativesof regular expressions (i.e. 
$r \backslash_r s$ where $s$ is a stringrather than a single character) in a different way than our previous definition.In previous chapters, the derivative of a regular expression $r$ w.r.t a string $s$was recursively defined on the string:\begin{center} $r \backslash_s (c::s) \dn (r \backslash c) \backslash_s s$\end{center}The problem is that this definition does not provide much informationon what $r \backslash_s s$ looks like.If we are interested in the size of a derivative like $(r_1 \cdot r_2)\backslash s$,we have to somehow get a more concrete form to begin.We call such more concrete representations the ``closed forms'' ofstring derivatives as opposed to their original definitions.The terminology ``closed form'' is borrowed from mathematics,which usually describe expressions that are solely comprised of finitely manywell-known and easy-to-compute operations such as additions, multiplications, and exponential functions.We start by proving some basic identitiesinvolving the simplification functions for r-regular expressions.After that we introduce the rewrite relations$\rightsquigarrow_h$, $\rightsquigarrow^*_{scf}$$\rightsquigarrow_f$ and $\rightsquigarrow_g$.These relations involve similar techniques as in chapter \ref{Bitcoded2}for annotated regular expressions.Finally, we use these identities to establish theclosed forms of the alternative regular expression,the sequence regular expression, and the star regular expression.%$r_1\cdot r_2$, $r^*$ and $\sum rs$.\subsection{Some Basic Identities}In what follows we will often convert between listsand sets.We use Isabelle's $set$ to refer to the function that converts a list $rs$ to the setcontaining all the elements in $rs$.\subsubsection{$\textit{rdistinct}$'s Does the Job of De-duplication}The $\textit{rdistinct}$ function, as its name suggests, willde-duplicate an r-regular expression list.It will also remove any elements that are already in the accumulator set.\begin{lemma}\label{rdistinctDoesTheJob} %The function $\textit{rdistinct}$ satisfies the following %properties: Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its recursive definition here. Its Isabelle counterpart would be $\textit{distinct}$.} for testing whether a list's elements are unique. Then the following properties about $\textit{rdistinct}$ hold: \begin{itemize} \item If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$. \item %If list $rs'$ is the result of $\rdistinct{rs}{acc}$, $\textit{isDistinct} \;\;\; (\rdistinct{rs}{acc})$. \item $\textit{set} \; (\rdistinct{rs}{acc}) = (\textit{set} \; rs) - acc$ \end{itemize}\end{lemma}\noindent\begin{proof} The first part is by an induction on $rs$. The second and third parts can be proven by using the inductive cases of $\textit{rdistinct}$.\end{proof}\noindent%$\textit{rdistinct}$ will out all regular expression terms%that are in the accumulator, therefore Concatenating a list $rs_a$ at the front of anotherlist $rs$ whose elements are all from the accumulator, and then calling $\textit{rdistinct}$on the merged list, the output will be as if we had called $\textit{rdistinct}$without the prepending of $rs$:\begin{lemma}\label{rdistinctConcat} The elements appearing in the accumulator will always be removed. More precisely, \begin{itemize} \item If $rs \subseteq rset$, then $\rdistinct{rs@rsa }{acc} = \rdistinct{rsa }{acc}$. 
\item More generally, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$, then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{rs'}{rset}$ \end{itemize}\end{lemma}\begin{proof} By induction on $rs$ and using \ref{rdistinctDoesTheJob}.\end{proof}\noindentOn the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated,then expanding the accumulator to include that element will not cause the output list to change:\begin{lemma}\label{rdistinctOnDistinct} The accumulator can be augmented to include elements not appearing in the input list, and the output will not change. \begin{itemize} \item If $r \notin rs$, then $\rdistinct{rs}{acc} = \rdistinct{rs}{(\{r\} \cup acc)}$. \item Particularly, if $\;\;\textit{isDistinct} \; rs$, then we have\\ \[ \rdistinct{rs}{\varnothing} = rs \] \end{itemize}\end{lemma}\begin{proof} The first half is by induction on $rs$. The second half is a corollary of the first.\end{proof}\noindentThe function $\textit{rdistinct}$ removes duplicates from anywhere in a list.Despite being seemingly obvious, the induction technique is not as straightforward.\begin{lemma}\label{distinctRemovesMiddle} The two properties hold if $r \in rs$: \begin{itemize} \item $\rdistinct{rs}{rset} = \rdistinct{(rs @ [r])}{rset}$\\ and\\ $\rdistinct{(ab :: rs @ [ab])}{rset'} = \rdistinct{(ab :: rs)}{rset'}$ \item $\rdistinct{ (rs @ rs') }{rset} = \rdistinct{rs @ [r] @ rs'}{rset}$\\ and\\ $\rdistinct{(ab :: rs @ [ab] @ rs'')}{rset'} = \rdistinct{(ab :: rs @ rs'')}{rset'}$ \end{itemize}\end{lemma}\noindent\begin{proof} By induction on $rs$. All other variables are allowed to be arbitrary. The second part of the lemma requires the first. Note that for each part, the two sub-propositions need to be proven at the same time, so that the induction goes through.\end{proof}\noindentThis allows us to prove a few more equivalence relations involving $\textit{rdistinct}$ (they will be useful later):\begin{lemma}\label{rdistinctConcatGeneral} \mbox{} \begin{itemize} \item $\rdistinct{(rs @ rs')}{\varnothing} = \rdistinct{((\rdistinct{rs}{\varnothing})@ rs')}{\varnothing}$ \item $\rdistinct{(rs @ rs')}{\varnothing} = \rdistinct{(\rdistinct{rs}{\varnothing} @ rs')}{\varnothing}$ \item If $rset' \subseteq rset$, then $\rdistinct{rs}{rset} = \rdistinct{(\rdistinct{rs}{rset'})}{rset}$. As a corollary of this, \item $\rdistinct{(rs @ rs')}{rset} = \rdistinct{ (\rdistinct{rs}{\varnothing}) @ rs')}{rset}$. 
This gives another corollary use later: \item If $a \in rset$, then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{ (\rdistinct{(a :: rs)}{\varnothing} @ rs')}{rset} $, \end{itemize}\end{lemma}\begin{proof} By \ref{rdistinctDoesTheJob} and \ref{distinctRemovesMiddle}.\end{proof}\noindentThe next lemma is a more general form of \ref{rdistinctConcat};It says that$\textit{rdistinct}$ is composable w.r.t list concatenation:\begin{lemma}\label{distinctRdistinctAppend} If $\;\; \textit{isDistinct} \; rs_1$, and $(set \; rs_1) \cap acc = \varnothing$, then applying $\textit{rdistinct}$ on $rs_1 @ rs_a$ does not have an effect on $rs_1$: \[\textit{rdistinct}\; (rs_1 @ rsa)\;\, acc = rs_1@(\textit{rdistinct} rsa \; (acc \cup rs_1))\]\end{lemma}\begin{proof} By an induction on $rs_1$, where $rsa$ and $acc$ are allowed to be arbitrary.\end{proof}\noindent$\textit{rdistinct}$ needs to be applied only once, and applying it multiple times does not make any difference:\begin{corollary}\label{distinctOnceEnough} $\textit{rdistinct} \; (rs @ rsa) {} = \textit{rdistinct} \; ( (rdistinct \; rs \; \{ \}) @ (\textit{rdistinct} \; rs_a \; (set \; rs)))$\end{corollary}\begin{proof} By lemma \ref{distinctRdistinctAppend}.\end{proof}\subsubsection{The Properties of $\textit{Rflts}$} We give in this subsection some propertiesinvolving $\backslash_r$, $\backslash_{rsimps}$, $\textit{rflts}$ and $\textit{rsimp}_{ALTS} $, together with any non-trivial lemmas that lead to them.These will be helpful in later closed-form proofs, whenwe want to transform derivative terms which have%the ways in which multiple functions involving%those are composed togetherinterleaving derivatives and simplifications applied to them.\noindent%When the function $\textit{Rflts}$ %is applied to the concatenation of two lists; the output can be calculated by first applying the%functions on two lists separately and then concatenating them together.$\textit{Rflts}$ is composable in terms of concatenation:\begin{lemma}\label{rfltsProps} The function $\rflts$ has the properties below:\\ \begin{itemize} \item $\rflts \; (rs_1 @ rs_2) = \rflts \; rs_1 @ \rflts \; rs_2$ \item If $r \neq \RZERO$ and $\nexists rs_1. r = \RALTS{rs}_1$, then $\rflts \; (r::rs) = r :: \rflts \; rs$ \item $\rflts \; (rs @ [\RZERO]) = \rflts \; rs$ \item $\rflts \; (rs' @ [\RALTS{rs}]) = \rflts \; rs'@rs$ \item $\rflts \; (rs @ [\RONE]) = \rflts \; rs @ [\RONE]$ \item If $r \neq \RZERO$ and $\nexists rs'. r = \RALTS{rs'}$ then $\rflts \; (rs @ [r]) = (\rflts \; rs) @ [r]$ \item If $r = \RALTS{rs}$ and $r \in rs'$ then for all $r_1 \in rs. r_1 \in \rflts \; rs'$. 
\item $\rflts \; (rs_a @ \RZERO :: rs_b) = \rflts \; (rs_a @ rs_b)$ \end{itemize}\end{lemma}\noindent\begin{proof} By induction on $rs_1$ in the first sub-lemma, and induction on $r$ in the second part, and induction on $rs$, $rs'$, $rs$, $rs'$, $rs_a$ in the third, fourth, fifth, sixth and last sub-lemma.\end{proof}\noindentNow we introduce the property that the operations derivative and $\rsimpalts$commute, this will be used later on when deriving the closed form forthe alternative regular expression:\begin{lemma}\label{rderRsimpAltsCommute} $\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$\end{lemma}\begin{proof} By induction on $rs$.\end{proof}\noindent\subsubsection{The $RL$ Function: Language Interpretation for $\textit{Rrexp}$s}Much like the definition of $L$ on plain regular expressions, one can also define the language interpretation for $\rrexp$s.\begin{center} \begin{tabular}{lcl} $RL \; (\ZERO_r)$ & $\dn$ & $\phi$\\ $RL \; (\ONE_r)$ & $\dn$ & $\{[]\}$\\ $RL \; (c)$ & $\dn$ & $\{[c]\}$\\ $RL \; \sum rs$ & $\dn$ & $ \bigcup_{r \in rs} (RL \; r)$\\ $RL \; (r_1 \cdot r_2)$ & $\dn$ & $ RL \; (r_1) @ RL \; (r_2)$\\ $RL \; (r^*)$ & $\dn$ & $ (RL(r))^*$ \end{tabular}\end{center}\noindentThe main use of $RL$ is to establish some connections between $\rsimp{}$ and $\rnullable{}$:\begin{lemma} The following properties hold: \begin{itemize} \item If $\rnullable{r}$, then $\rsimp{r} \neq \RZERO$. \item $\rnullable{r \backslash s} \quad $ if and only if $\quad \rnullable{\rderssimp{r}{s}}$. \end{itemize}\end{lemma}\begin{proof} The first part is by induction on $r$. The second part is true because property \[ RL \; r = RL \; (\rsimp{r})\] holds.\end{proof}\subsubsection{Simplified $\textit{Rrexp}$s are Good}We formalise the notion of ``good" regular expressions,which means regular expressions thatare fully simplified in terms of our $\textit{rsimp}$ function. For alternative regular expressions that means theydo not contain any nested alternatives, un-eliminated $\RZERO$sor duplicate elements (for example, $r_1 + (r_2 + r_3)$, $\RZERO + r$ and $ \sum [r, r, \ldots]$).The clauses for $\good$ are:\begin{center} \begin{tabular}{@{}lcl@{}} $\good\; \RZERO$ & $\dn$ & $\textit{false}$\\ $\good\; \RONE$ & $\dn$ & $\textit{true}$\\ $\good\; \RCHAR{c}$ & $\dn$ & $\btrue$\\ $\good\; \RALTS{[]}$ & $\dn$ & $\bfalse$\\ $\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\ $\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ & $\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\ & & $\land \; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \, \land \; \, \textit{nonAlt}\; r')$\\ $\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\ $\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\ $\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\ $\good \; \RSEQ{r_1}{r_2}$ & $\dn$ & $\good \; r_1 \;\, \textit{and} \;\, \good \; r_2$\\ $\good \; \RSTAR{r}$ & $\dn$ & $\btrue$\\ \end{tabular}\end{center}\noindentWe omit the recursive definition of the predicate $\textit{nonAlt}$,which evaluates to true when the regular expression is not analternative, and false otherwise.The $\good$ property is preserved under $\rsimp_{ALTS}$, provided thatits non-empty argument list of expressions are all good themselves, and $\textit{nonAlt}$, and unique:\begin{lemma}\label{rsimpaltsGood} If $rs \neq []$ and for all $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$, then $\good \; (\rsimpalts \; rs)$ if and only if forall $r \in rs. 
\; \good \; r$.\end{lemma}\noindentWe also note thatif a regular expression $r$ is good, then $\rflts$ on the singletonlist $[r]$ will not break goodness:\begin{lemma}\label{flts2} If $\good \; r$, then forall $r' \in \rflts \; [r]. \; \good \; r'$ and $\textit{nonAlt} \; r'$.\end{lemma}\begin{proof} By an induction on $r$.\end{proof}\noindentThe other observation we make about $\rsimp{r}$ is that it nevercomes with nested alternatives, which we describe as the $\nonnested$property:\begin{center} \begin{tabular}{lcl} $\nonnested \; \, \sum []$ & $\dn$ & $\btrue$\\ $\nonnested \; \, \sum ((\sum rs_1) :: rs_2)$ & $\dn$ & $\bfalse$\\ $\nonnested \; \, \sum (r :: rs)$ & $\dn$ & $\nonnested (\sum rs)$\\ $\nonnested \; \, r $ & $\dn$ & $\btrue$ \end{tabular}\end{center}\noindentThe $\rflts$ functionalways opens up nested alternatives,which enables $\rsimp$ to be non-nested:\begin{lemma}\label{nonnestedRsimp} It is always the case that \begin{center} $\nonnested \; (\rsimp{r})$ \end{center}\end{lemma}\begin{proof} By induction on $r$.\end{proof}\noindentWith this we can prove that a regular expressionafter simplification and flattening and de-duplication,will not contain any alternative regular expression directly:\begin{lemma}\label{nonaltFltsRd} If $x \in \rdistinct{\rflts\; (\map \; \rsimp{} \; rs)}{\varnothing}$ then $\textit{nonAlt} \; x$.\end{lemma}\begin{proof} By \ref{nonnestedRsimp}.\end{proof}\noindentThe other fact we know is that once $\rsimp{}$ has finishedprocessing an alternative regular expression, it will notcontain any $\RZERO$s. This is because all the recursive calls to the simplification on the children regular expressionsmake the children good, and $\rflts$ will not deleteany $\RZERO$s out of a good regular expression list,and $\rdistinct{}$ will not ``mess'' with the result.\begin{lemma}\label{flts3Obv} The following are true: \begin{itemize} \item If for all $r \in rs. \, \good \; r $ or $r = \RZERO$, then for all $r \in \rflts\; rs. \, \good \; r$. \item If $x \in \rdistinct{\rflts\; (\map \; rsimp{}\; rs)}{\varnothing}$ and for all $y$ such that $\llbracket y \rrbracket_r$ less than $\llbracket rs \rrbracket_r + 1$, either $\good \; (\rsimp{y})$ or $\rsimp{y} = \RZERO$, then $\good \; x$. \end{itemize}\end{lemma}\begin{proof} The first part is by induction, where the inductive cases are the inductive cases of $\rflts$. The second part is a corollary from the first part.\end{proof}This leads to good structural property of $\rsimp{}$,that after simplification, a regular expression iseither good or $\RZERO$:\begin{lemma}\label{good1} For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$.\end{lemma}\begin{proof} By an induction on $r$. The inductive measure is the size $\llbracket \rrbracket_r$. Lemma \ref{rsimpMono} says that $\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to $\llbracket r \rrbracket_r$. Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case, The inductive hypothesis applies to the children regular expressions $r_1$, $r_2$, etc. The lemma \ref{flts3Obv}'s precondition is satisfied by that as well. The lemmas \ref{nonnestedRsimp} and \ref{nonaltFltsRd} are used to ensure that goodness is preserved at the topmost level.\end{proof}We shall prove that any good regular expression is a fixed-point for $\textit{rsimp}$.First we prove an auxiliary lemma:\begin{lemma}\label{goodaltsNonalt} If $\good \; \sum rs$, then $\rflts\; rs = rs$.\end{lemma}\begin{proof} By an induction on $\sum rs$. 
The inductive rules are the cases for $\good$.\end{proof}\noindentNow we are ready to prove that good regular expressions are invariantwith respect to $\rsimp{}$:\begin{lemma}\label{test} If $\good \;r$ then $\rsimp{r} = r$.\end{lemma}\begin{proof} By an induction on the inductive cases of $\good$, using lemmas \ref{goodaltsNonalt} and \ref{rdistinctOnDistinct}. The lemma \ref{goodaltsNonalt} is used in the alternative case where 2 or more elements are present in the list.\end{proof}\noindentBelow we show a property involving $\rflts$, $\textit{rdistinct}$, $\rsimp{}$ and $\rsimp_{ALTS}$,which requires $\ref{good1}$ to go through smoothly:\begin{lemma}\label{flattenRsimpalts}An application of $\rsimp_{ALTS}$ can be ``absorbed'',if its output is concatenated with a list and then applied to $\rflts$.\begin{center} $\rflts \; ( (\rsimp_{ALTS} \; (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing})) :: \map \; \rsimp{} \; rs' ) = \rflts \; ( (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing}) @ ( \map \; \rsimp{rs'}))$\end{center}\end{lemma}\begin{proof} By \ref{good1}.\end{proof}\noindentWe are also ready to prove that $\textit{rsimp}$ is idempotent.\subsubsection{$\rsimp$ is Idempotent}The idempotency of $\rsimp$ is very useful in manipulating regular expression terms into desiredforms so that key steps allowing further rewriting to closed formsare possible.\begin{lemma}\label{rsimpIdem} $\rsimp{r} = \rsimp{(\rsimp{r})}$\end{lemma}\begin{proof} By \ref{test} and \ref{good1}.\end{proof}\noindentThis property means we do not have to repeatedlyapply simplification in each step, which justifiesour definition of $\blexersimp$.This is in contrast to the work of Sulzmann and Lu wherethe simplification is applied in a fixpoint manner.On the other hand, we can repeat the same $\rsimp{}$ applicationson regular expressions as many times as we want, if we have at leastone simplification applied to it, and apply it wherever we need to:\begin{corollary}\label{headOneMoreSimp} The following properties hold, directly from \ref{rsimpIdem}: \begin{itemize} \item $\map \; \rsimp{(r :: rs)} = \map \; \rsimp{} \; (\rsimp{r} :: rs)$ \item $\rsimp{(\RALTS{rs})} = \rsimp{(\RALTS{\map \; \rsimp{} \; rs})}$ \end{itemize}\end{corollary}\noindentThis will be useful in the later closed-form proof's rewriting steps.Similarly, we state the following useful facts below:\begin{lemma} The following equalities hold if $r = \rsimp{r'}$ for some $r'$: \begin{itemize} \item If $r = \sum rs$ then $\rsimpalts \; rs = \sum rs$. \item If $r = \sum rs$ then $\rdistinct{rs}{\varnothing} = rs$. \item $\rsimpalts \; (\rdistinct{\rflts \; [r]}{\varnothing}) = r$. 
\end{itemize}\end{lemma}\begin{proof} By application of lemmas \ref{rsimpIdem} and \ref{good1}.\end{proof}\noindentWith the idempotency of $\textit{rsimp}$ and its corollaries, we can start proving some key equalities leading to the closed forms.Next we present a few equivalent terms under $\textit{rsimp}$.To make the notation more conciseWe use $r_1 \sequal r_2 $ to denote that $\rsimp{r_1} = \rsimp{r_2}$.%\begin{center}%\begin{tabular}{lcl}% $a \sequal b$ & $ \dn$ & $ \textit{rsimp} \; a = \textit{rsimp} \; b$%\end{tabular}%\end{center}%\noindent%\vspace{0em}\begin{lemma} The following equivalence hold: \begin{itemize} \item $\rsimpalts \; (\RZERO :: rs) \sequal \rsimpalts\; rs$ \item $\rsimpalts \; rs \sequal \rsimpalts (\map \; \rsimp{} \; rs)$ \item $\RALTS{\RALTS{rs}} \sequal \RALTS{rs}$ \item $\sum ((\sum rs_a) :: rs_b) \sequal \sum rs_a @ rs_b$ \item $\RALTS{rs} \sequal \RALTS{\map \; \rsimp{} \; rs}$\end{itemize}\end{lemma}\begin{proof} By induction on the lists involved.\end{proof}\noindentThe above allows us to provetwo similar equalities (which are a bit more involved).It says that we could flatten the elementsbefore simplification and still get the same result.\begin{lemma}\label{simpFlatten3} One can flatten the inside $\sum$ of a $\sum$ if it is being simplified. Concretely, \begin{itemize} \item If for all $r \in rs, rs', rs''$, we have $\good \; r $ or $r = \RZERO$, then $\sum (rs' @ rs @ rs'') \sequal \sum (rs' @ [\sum rs] @ rs'')$ holds. As a corollary, \item $\sum (rs' @ [\sum rs] @ rs'') \sequal \sum (rs' @ rs @ rs'')$ \end{itemize}\end{lemma}\begin{proof} By rewriting steps involving the use of \ref{test} and \ref{rdistinctConcatGeneral}. The second sub-lemma is a corollary of the previous.\end{proof}%Rewriting steps not put in--too long and complicated-------------------------------\begin{comment} \begin{center} $\rsimp{\sum (rs' @ rs @ rs'')} \stackrel{def of bsimp}{=}$ \\ $\rsimpalts \; (\rdistinct{\rflts \; ((\map \; \rsimp{}\; rs') @ (\map \; \rsimp{} \; rs ) @ (\map \; \rsimp{} \; rs''))}{\varnothing})$ \\ $\stackrel{by \ref{test}}{=} \rsimpalts \; (\rdistinct{(\rflts \; rs' @ \rflts \; rs @ \rflts \; rs'')}{ \varnothing})$\\ $\stackrel{by \ref{rdistinctConcatGeneral}}{=} \rsimpalts \; (\rdistinct{\rflts \; rs'}{\varnothing} @ \rdistinct{( \rflts\; rs @ \rflts \; rs'')}{\rflts \; rs'})$\\ \end{center}\end{comment}%Rewriting steps not put in--too long and complicated-------------------------------\noindentWe need more equalities like the above to enable a closed form lemma,for which we need to introduce a few rewrite relationsto helpus obtain them.\subsection{The rewrite relation $\hrewrite$ , $\scfrewrites$ , $\frewrite$ and $\grewrite$}Inspired by the success we had in the correctness proof in \ref{Bitcoded2},we follow suit here, defining atomic simplificationsteps as ``small-step'' rewriting steps. 
This allows capturing similarities between terms that would be otherwisehard to express.We use $\hrewrite$ for one-step atomic rewrite of regular expression simplification, $\frewrite$ for rewrite of list of regular expressions that include all operations carried out in $\rflts$, and $\grewrite$ forrewriting a list of regular expressions possible in both $\rflts$ and $\textit{rdistinct}$.Their reflexive transitive closures are used to denote zero or many steps,as was the case in the previous chapter.As we have alreadydone something similar, the presentation aboutthese rewriting rules will be more concise than that in \ref{Bitcoded2}.To differentiate between the rewriting steps for annotated regular expressionsand $\rrexp$s, we add characters $h$ and $g$ below the squig arrow symbolto mean atomic simplification transitions of $\rrexp$s and $\rrexp$ lists, respectively.\begin{figure}[H]\begin{center} \begin{mathpar} \inferrule[RSEQ0L]{}{\RZERO \cdot r_2 \hrewrite \RZERO\\} \inferrule[RSEQ0R]{}{r_1 \cdot \RZERO \hrewrite \RZERO\\} \inferrule[RSEQ1]{}{(\RONE \cdot r) \hrewrite r\\}\\ \inferrule[RSEQL]{ r_1 \hrewrite r_2}{r_1 \cdot r_3 \hrewrite r_2 \cdot r_3\\} \inferrule[RSEQR]{ r_3 \hrewrite r_4}{r_1 \cdot r_3 \hrewrite r_1 \cdot r_4\\}\\ \inferrule[RALTSChild]{r \hrewrite r'}{\sum (rs_1 @ [r] @ rs_2) \hrewrite \sum (rs_1 @ [r'] @ rs_2)\\} \inferrule[RALTS0]{}{\sum (rs_a @ [\RZERO] @ rs_b) \hrewrite \sum (rs_a @ rs_b)} \inferrule[RALTSNested]{}{\sum (rs_a @ [\sum rs_1] @ rs_b) \hrewrite \sum (rs_a @ rs_1 @ rs_b)} \inferrule[RALTSNil]{}{ \sum [] \hrewrite \RZERO\\} \inferrule[RALTSSingle]{}{ \sum [r] \hrewrite r\\} \inferrule[RALTSDelete]{\\ r_1 = r_2}{\sum rs_a @ [r_1] @ rs_b @ [r_2] @ rsc \hrewrite \sum rs_a @ [r_1] @ rs_b @ rs_c} \end{mathpar}\end{center}\caption{List of one-step rewrite rules for r-regular expressions ($\hrewrite$)}\label{hRewrite}\end{figure}Like $\rightsquigarrow_s$, it isconvenient to define rewrite rules for a list of regular expressions,where each element can rewrite in many steps to the other (scf stands forli\emph{s}t \emph{c}losed \emph{f}orm). 
This relation is similar to the $\stackrel{s*}{\rightsquigarrow}$ for annotated regular expressions.\begin{figure}[H]\begin{center} \begin{mathpar} \inferrule{}{[] \scfrewrites [] } \inferrule{r \hrewrites r' \\ rs \scfrewrites rs'}{r :: rs \scfrewrites r' :: rs'} \end{mathpar}\end{center}\caption{List of one-step rewrite rules for a list of r-regular expressions}\label{scfRewrite}\end{figure}%frewriteList of one-step rewrite rules for flattening a list of regular expressions($\frewrite$):\begin{figure}[H]\begin{center} \begin{mathpar} \inferrule{}{\RZERO :: rs \frewrite rs \\} \inferrule{}{(\sum rs) :: rs_a \frewrite rs @ rs_a \\} \inferrule{rs_1 \frewrite rs_2}{r :: rs_1 \frewrite r :: rs_2} \end{mathpar}\end{center}\caption{List of one-step rewrite rules characterising the $\rflts$ operation on a list}\label{fRewrites}\end{figure}Lists of one-step rewrite rules for flattening and de-duplicatinga list of regular expressions ($\grewrite$):\begin{figure}[H]\begin{center} \begin{mathpar} \inferrule{}{\RZERO :: rs \grewrite rs \\} \inferrule{}{(\sum rs) :: rs_a \grewrite rs @ rs_a \\} \inferrule{rs_1 \grewrite rs_2}{r :: rs_1 \grewrite r :: rs_2} \inferrule[dB]{}{rs_a @ [a] @ rs_b @[a] @ rs_c \grewrite rs_a @ [a] @ rsb @ rsc} \end{mathpar}\end{center}\caption{List of one-step rewrite rules characterising the $\rflts$ and $\textit{rdistinct}$operations}\label{gRewrite}\end{figure}\noindentWe definetwo separate list rewriting relations $\frewrite$ and $\grewrite$.The rewriting steps that take place duringflattening are characterised by $\frewrite$.The rewrite relation $\grewrite$ characterises both flattening and de-duplicating.Sometimes $\grewrites$ is slightly too powerfulso we would rather use $\frewrites$ to prove%because we onlyequalities related to $\rflts$.%certain equivalence under the rewriting steps of $\frewrites$.For example, when proving the closed-form for the alternative regular expression,one of the equalities needed is:\begin{center} $\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal \sum (\rDistinct \;\; (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing) $\end{center}\noindentProving this is by first showing \begin{lemma}\label{earlyLaterDerFrewrites} $\map \; (\_ \backslash x) \; (\rflts \; rs) \frewrites \rflts \; (\map \; (\_ \backslash x) \; rs)$\end{lemma}\noindentand then the equivalence between two termsthat can reduce in many steps to each other:\begin{lemma}\label{frewritesSimpeq} If $rs_1 \frewrites rs_2 $, then $\sum (\rDistinct \; rs_1 \; \varnothing) \sequal \sum (\rDistinct \; rs_2 \; \varnothing)$.\end{lemma}\noindentThese two lemmas can both be proven using a straightforward induction (andthe proofs for them are therefore omitted).Now the above equalities can be derived with ease: \begin{corollary} $\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal \sum (\rDistinct \;\; (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing) $\end{corollary}\begin{proof} By lemmas \ref{earlyLaterDerFrewrites} and \ref{frewritesSimpeq}.\end{proof}But this trick will not work for $\grewrites$.For example, a rewriting step in provingclosed forms is:\begin{center} $\rsimp{(\rsimpalts \; (\map \; (\_ \backslash x) \; (\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs}))))}{\varnothing})))}$\\ $=$ \\ $\rsimp{(\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; (\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. 
\rderssimp{r}{xs})))) ) }{\varnothing}))} $ \noindent\end{center}For this, one would hope to have a rewriting relation between the two lists involved,similar to \ref{earlyLaterDerFrewrites}. However, it turns out that \begin{center} $\map \; (\_ \backslash x) \; (\rDistinct \; rs \; rset) \grewrites \rDistinct \; (\map \; (\_ \backslash x) \; rs) \; ( rset \backslash x)$\end{center}\noindentdoes $\mathbf{not}$ hold in general.For this rewriting step we will introduce some slightly more cumbersomeproof technique later.The point is that $\frewrite$allows us to prove equivalence in a straightforward way that is not possible for $\grewrite$. \subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$}In this part, we present lemmas statingpairs of r-regular expressions and r-regular expression listswhere one can rewrite from one in many steps to the other.Most of the proofs to these lemmas are straightforward, usingan induction on the corresponding rewriting relations.These proofs will therefore be omitted when this is the case.We present in the following lemma a few pairs of terms that are rewritable via $\grewrites$:\begin{lemma}\label{gstarRdistinctGeneral} \mbox{} \begin{itemize} \item $rs_1 @ rs \grewrites rs_1 @ (\rDistinct \; rs \; rs_1)$ \item $rs \grewrites \rDistinct \; rs \; \varnothing$ \item $rs_a @ (\rDistinct \; rs \; rs_a) \grewrites rs_a @ (\rDistinct \; rs \; (\{\RZERO\} \cup rs_a))$ \item $rs \;\; @ \;\; \rDistinct \; rs_a \; rset \grewrites rs @ \rDistinct \; rs_a \; (rest \cup rs)$ \end{itemize}\end{lemma}\noindentIf a pair of terms $rs_1, rs_2$ are rewritable via $\grewrites$ to each other,then they are equivalent under $\rsimp{}$:\begin{lemma}\label{grewritesSimpalts} \mbox{} If $rs_1 \grewrites rs_2$, then we have the following equivalence: \begin{itemize} \item $\sum rs_1 \sequal \sum rs_2$ \item $\rsimpalts \; rs_1 \sequal \rsimpalts \; rs_2$ \end{itemize}\end{lemma}\noindentHere are a few connecting lemmas showing thatif a list of regular expressions can be rewritten using $\grewrites$ or $\frewrites $ or$\scfrewrites$,then an alternative constructor taking the list can also be rewritten using $\hrewrites$:\begin{lemma} \begin{itemize} \item If $rs \grewrites rs'$ then $\sum rs \hrewrites \sum rs'$. 
\item If $rs \grewrites rs'$ then $\sum rs \hrewrites \rsimpalts \; rs'$ \item If $rs_1 \scfrewrites rs_2$ then $\sum (rs @ rs_1) \hrewrites \sum (rs @ rs_2)$ \item If $rs_1 \scfrewrites rs_2$ then $\sum rs_1 \hrewrites \sum rs_2$ \end{itemize}\end{lemma}\noindentNow comes the core of the proof, which says that once two lists are rewritable to each other,then they are equivalent under $\textit{rsimp}$:\begin{lemma} If $r_1 \hrewrites r_2$ then $r_1 \sequal r_2$.\end{lemma}\noindentSimilar to what we did in chapter \ref{Bitcoded2}, we prove that if one can rewrite from one r-regular expression ($r$)to the other ($r'$), after taking derivatives one can still rewritethe first ($r\backslash c$) to the other ($r'\backslash c$).\begin{lemma}\label{interleave} If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$\end{lemma}\noindentThis allows us to prove more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$.\begin{lemma}\label{insideSimpRemoval} $\rsimp{(\rder{c}{(\rsimp{r})})} = \rsimp{(\rder{c}{r})} $\end{lemma}\noindent\begin{proof} By \ref{interleave} and \ref{rsimpIdem}.\end{proof}\noindentAnd this unlocks more equivalent terms:\begin{lemma}\label{Simpders} As corollaries of \ref{insideSimpRemoval}, we have \begin{itemize} \item If $s \neq []$ then $\rderssimp{r}{s} = \rsimp{( r \backslash_{rs} s)}$. \item $\rsimpalts \; (\map \; (\_ \backslash_r x) \; (\rdistinct{rs}{\varnothing})) \sequal \rsimpalts \; (\rDistinct \; (\map \; (\_ \backslash_r x) rs) \;\varnothing )$ \end{itemize}\end{lemma}\begin{proof} Part 1 is by lemma \ref{insideSimpRemoval}, part 2 is by lemma \ref{insideSimpRemoval} .%and \ref{distinctDer}.\end{proof}\noindent\subsection{Closed Forms for $\sum rs$, $r_1\cdot r_2$ and $r^*$}Lemma \ref{Simpders} leads to our first closed form,which is for the alternative regular expression:\begin{theorem}\label{altsClosedForm} \mbox{} \begin{center} $\rderssimp{(\sum rs)}{s} \sequal \sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$ \end{center}\end{theorem}\noindent\begin{proof} By a reverse induction on the string $s$. One rewriting step, as we mentioned earlier, involves \begin{center} $\rsimpalts \; (\map \; (\_ \backslash x) \; (\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs}))))}{\varnothing})) \sequal \rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; (\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}) $. 
\noindent
This closed form has a variant which can be more convenient in later proofs:
\begin{corollary}\label{altsClosedForm1}
If $s \neq []$ then $\rderssimp \; (\sum \; rs) \; s = \rsimp{(\sum \; (\map \; \rderssimp{\_}{s} \; rs))}$.
\end{corollary}
\noindent
The harder closed forms are the sequence and star ones. Before we obtain them, some preliminary definitions are needed to make the proof statements concise.
\subsubsection{Closed Form for Sequence Regular Expressions}
For the sequence regular expression, let's first look at a series of derivative steps on it (assuming that each time a derivative is taken, the head of the sequence is nullable):
\begin{center}
\begin{tabular}{llll}
$r_1 \cdot r_2$ & $\longrightarrow_{\backslash c}$ & $r_1\backslash c \cdot r_2 + r_2 \backslash c$ & $ \longrightarrow_{\backslash c'} $ \\ \\
$(r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc'$ & $\longrightarrow_{\backslash c''} $ & $((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c''$ & $ \longrightarrow_{\backslash c'''} \quad \ldots$\\
\end{tabular}
\end{center}
Roughly speaking, $(r_1 \cdot r_2) \backslash s$ can be expressed as a giant alternative taking a list of terms
$[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$,
where the head of the list is always the term in which the derivative has been taken on $r_1$ only, and the tail consists of terms of the shape $r_2 \backslash_r s''$, with $s''$ being a suffix of $s$.
This intuition is also echoed by Murugesan and Sundaram \cite{Murugesan2014}, who gave a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$:
\begin{center}
\begin{tabular}{lc}
$L \; [ (r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) ]$ & $ =$\\ \\
\rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\nullable \; r_1) \; (r_2 \backslash_r c_1) )) \backslash_r (c_2 :: \ldots c_n) ]$ & $=$\\ \\
\rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2))) $ & \\ \\
$\quad + (\delta \; (\nullable \; (r_1 \backslash_r c_1))\; (r_2 \backslash_r c_2) )) \backslash_r (c_3 \ldots c_n) ]$ & $\ldots$ \\
\end{tabular}
\end{center}
\noindent
The $\delta$ function returns $r$ when the boolean condition $b$ evaluates to true and $\ZERO_r$ otherwise:
\begin{center}
\begin{tabular}{lcl}
$\delta \; b\; r$ & $\dn$ & $r \quad \textit{if} \; b \; \textit{is true}$\\
                  & $\dn$ & $\ZERO_r \quad \textit{otherwise}$
\end{tabular}
\end{center}
\noindent
Note that the term
\begin{center}
\begin{tabular}{lc}
\rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2))) $ & \\ \\
$\quad + (\delta \; (\nullable \; (r_1 \backslash_r c_1))\; (r_2 \backslash_r c_2) )) \backslash_r (c_3 \ldots c_n)$ &\\
\end{tabular}
\end{center}
\noindent
does not faithfully represent what the intermediate derivatives would actually look like when one or more of the intermediate results $r_1 \backslash s' \cdot r_2$ have a head that is not nullable.
For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable, the regular expression would look like
\[
	(r_1 \backslash_r c_1c_2) \cdot r_2
\]
instead of
\[
	((r_1 \backslash_r c_1c_2) \cdot r_2 + \ZERO_r ) + \ZERO_r.
\]
The redundant $\ZERO_r$s are simply not created in the first place. In a closed form one needs to take into
account this (becauseclosed forms require exact equality rather than language equivalence)and only generate the $r_2 \backslash_r s''$ terms satisfying the property\begin{center}$\exists s'. such \; that \; s'@s'' = s \;\; \land \;\;r_1 \backslash s' \; is \; nullable$.\end{center}Given the arguments $s$ and $r_1$, we denote the list of strings$s''$ satisfying the above property as $\vsuf{s}{r_1}$.The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string\footnote{ Perhaps a better name for it would be ``NullablePrefixSuffix'' to differentiate with the list of \emph{all} prefixes of $s$, but that is a bit too long for a function name and we are yet to finda more concise and easy-to-understand name.}\begin{center} \begin{tabular}{lcl} $\vsuf{[]}{\_} $ & $=$ & $[]$\\ $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} \; (\rnullable{r_1}) \; \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $ \end{tabular}\end{center}\noindentThe list starts with shorter suffixesand ends with longer ones (in other words, the string elements $s''$in the list $\vsuf{s}{r_1}$ are sortedin the same order as that of the terms $r_2\backslash s''$ appearing in $(r_1\cdot r_2)\backslash s$).In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing the entire result $(r_1 \cdot r_2) \backslash s$, it only stores strings,with each string $s''$ representing a term such that $r_2 \backslash s''$is occurring in $(r_1\cdot r_2)\backslash s$.With $\textit{Suffix}$ we are ready to express thesequence regular expression's closed form,but before doing so more definitions are needed.The first thing is the flattening function $\sflat{\_}$,which takes an alternative regular expression and produces a flattened versionof that alternative regular expression.It is needed to converta left-associative nested sequence of alternatives into a flattened list:\[ \sum(\ldots ((r_1 + r_2) + r_3) + \ldots) \stackrel{\sflat{\_}}{\rightarrow} \sum[r_1, r_2, r_3, \ldots]\]\noindentThe definitions of $\sflat{\_}$ and helper functions$\sflataux{\_}$ and $\llparenthesis \_ \rrparenthesis''$ are given below.\begin{center} \begin{tabular}{lcl} $\sflataux{\sum r :: rs}$ & $\dn$ & $\sflataux{r} @ rs$\\ $\sflataux{\sum []}$ & $ \dn $ & $ []$\\ $\sflataux r$ & $\dn$ & $ [r]$ \end{tabular}\end{center}\begin{center} \begin{tabular}{lcl} $\sflat{(\sum r :: rs)}$ & $\dn$ & $\sum (\sflataux{r} @ rs)$\\ $\sflat{\sum []}$ & $ \dn $ & $ \sum []$\\ $\sflat r$ & $\dn$ & $ r$ \end{tabular}\end{center}\begin{center} \begin{tabular}{lcl} $\sflataux{[]}'$ & $ \dn $ & $ []$\\ $\sflataux{ (r_1 + r_2) :: rs }'$ & $\dn$ & $r_1 :: r_2 :: rs$\\ $\sflataux{r :: rs}'$ & $\dn$ & $ r::rs$ \end{tabular}\end{center}\noindent$\sflataux{\_}$ breaks up nested alternative regular expressions of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shapeinto a "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.It will return the singleton list $[r]$ otherwise.$\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keepsthe output type a regular expression, not a list.$\sflataux{\_}$ and $\sflat{\_}$ are only recursive on the first element of the list.$\sflataux{\_}'$ takes a list of regular expressions as input, and outputsa list of regular expressions.The use of $\sflataux{\_}$ and $\sflataux{\_}'$ is clear once we have$\textit{createdBySequence}$ defined:\begin{center} \begin{mathpar} \inferrule{\mbox{}}{\textit{createdBySequence}\; (r_1 
\cdot r_2)} \inferrule{\textit{createdBySequence} \; r_1}{\textit{createdBySequence} \; (r_1 + r_2)} \end{mathpar}\end{center}\noindentThe predicate $\textit{createdBySequence}$ is used to describe the shape ofthe derivative regular expressions $(r_1\cdot r_2) \backslash s$:\begin{lemma}\label{recursivelyDerseq} It is always the case that \begin{center} $\textit{createdBySequence} \; ( (r_1\cdot r_2) \backslash_r s) $ \end{center} holds.\end{lemma}\begin{proof} By a reverse induction on the string $s$, where the inductive cases are $[]$ and $xs @ [x]$.\end{proof}\noindentIf we have a regular expression $r$ whose shape fits into those described by $\textit{createdBySequence}$,then we can convert between$r \backslash_r c$ and $(\sflataux{r}) \backslash_r c$ with$\sflataux{\_}'$:\begin{lemma}\label{sfauIdemDer} If $\textit{createdBySequence} \; r$, then \begin{center} $\sflataux{ r \backslash_r c} = \llparenthesis (\map \; (\_ \backslash_r c) \; (\sflataux{r}) ) \rrparenthesis''$ \end{center} holds.\end{lemma}\begin{proof} By a simple induction on the inductive cases of $\textit{createdBySequence}. $\end{proof}Now we are ready to expressthe shape of $r_1 \cdot r_2 \backslash s$\begin{lemma}\label{seqSfau0} $\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 :: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r_1))$ \end{lemma}\begin{proof} By a reverse induction on the string $s$, where the inductive cases are $[]$ and $xs @ [x]$. For the inductive case, we know that $\textit{createdBySequence} \; ((r_1 \cdot r_2) \backslash_r xs)$ holds from lemma \ref{recursivelyDerseq}, which can be used to prove \[ \map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\; \map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash \_) \; (\vsuf{xs}{r_1})) \] = \[ \map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1}) \] using lemma \ref{sfauIdemDer}. This equality enables the inductive case to go through.\end{proof}\noindent This lemma says that $(r_1\cdot r_2)\backslash s$can be flattened into a list whose head and tail meet the descriptionwe gave earlier.%Note that this lemma does $\mathbf{not}$ depend on any%specific definitions we used,%allowing people investigating derivatives to get an alternative%view of what $r_1 \cdot r_2$ is.We now use $\textit{createdBySequence}$ and$\sflataux{\_}$ to describe an intuitionbehind the sequence closed form.If two regular expressions only differ in the way theiralternatives are nested, then we should be able to get the same resultonce we apply simplification to both of them:\begin{lemma}\label{sflatRsimpeq} If $r$ is created from a sequence through a series of derivatives (i.e. if $\textit{createdBySequence} \; r$ holds), and that $\sflataux{r} = rs$, then we have that \begin{center} $\textit{rsimp} \; r = \textit{rsimp} \; (\sum \; rs)$ \end{center} holds.\end{lemma}\begin{proof} By an induction on the inductive cases of $\textit{createdBySequence}$. \end{proof}Now we are ready for the closed form for the sequence regular expressions (without the inner applicationsof simplifications):\begin{lemma}\label{seqClosedFormGeneral} $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} } =\rsimp{(\sum ( (r_1 \backslash s) \cdot r_2 :: \map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$\end{lemma}\begin{proof} We know that $\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 :: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r_1))$ holds by lemma \ref{seqSfau0}. 
This allows the theorem to go through because of lemma \ref{sflatRsimpeq}.
\end{proof}
Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}), it is possible to convert the above lemma into the proper closed form for $\backslash_{rsimps}$, that is, for derivatives nested with simplification, rather than for $\backslash_r$:
\begin{theorem}\label{seqClosedForm}
$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum (((r_1 \backslash s) \cdot r_2) :: (\map \; (r_2 \backslash \_) \; (\vsuf{s}{r_1}))))}$
\end{theorem}
\begin{proof}
By a case analysis of the string $s$. When $s$ is the empty list, the rewrite is straightforward. When $s$ is a non-empty list, the lemmas \ref{seqClosedFormGeneral} and \ref{Simpders} apply, making the proof go through.
\end{proof}
\subsubsection{Closed Forms for Star Regular Expressions}
The closed form for the star regular expression involves tricks similar to those used for the sequence regular expression.
The $\textit{Suffix}$ function is now replaced by something slightly more complex, because the growth pattern of star regular expressions' derivatives is a bit different:
\begin{center}
\begin{tabular}{lclc}
$r^* $ & $\longrightarrow_{\backslash c}$ & $(r\backslash c) \cdot r^*$ & $\longrightarrow_{\backslash c'}$\\ \\
$r \backslash cc' \cdot r^* + r \backslash c' \cdot r^*$ & $\longrightarrow_{\backslash c''}$ & $(r \backslash cc'c'' \cdot r^* + r \backslash c'' \cdot r^*) + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)$ & $\longrightarrow_{\backslash c'''}$ \\ \\
$\ldots$\\
\end{tabular}
\end{center}
When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc.\ are all nullable, the number of terms in $r^* \backslash s$ will grow exponentially, rather than linearly as in the sequence case.
The good news is that the function $\textit{rsimp}$ will again ignore the difference between different nesting patterns of alternatives, and an exponentially growing star derivative like
\begin{center}
$(r \backslash cc'c'' \cdot r^* + r \backslash c'' \cdot r^*) + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $
\end{center}
can be treated as
\begin{center}
$\RALTS{[r \backslash cc'c'' \cdot r^*, \; r \backslash c'' \cdot r^*, \; r \backslash c'c'' \cdot r^*, \; r \backslash c'' \cdot r^*]}$
\end{center}
which can be de-duplicated by $\rDistinct$ and therefore bounded finitely.
%and then de-duplicate terms of the form ($s'$ being a substring of $s$).
%This allows us to use a similar technique as $r_1 \cdot r_2$ case,
Now the crux of this section is finding a suitable description for $rs$ where
\begin{center}
$\rderssimp{r^*}{s} = \rsimp{\sum rs}$
\end{center}
holds.
In addition, the list $rs$ shall be in the form of $\map \; (\lambda s'.\; r\backslash s' \cdot r^*) \; Ss$.
Here $Ss$ is a list of strings; for example in the sequence closed form it is specified as $\textit{Suffix} \; s \; r_1$.
To get $Ss$ for the star regular expression, we need to introduce $\starupdate$ and $\starupdates$:
\begin{center}
\begin{tabular}{lcl}
$\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
$\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\
 & & $\textit{if} \; (\rnullable \; (r \backslash_{rs} s))$ \\
 & & $\textit{then} \;\; (s @ [c]) :: [c] :: ( \starupdate \; c \; r \; Ss)$ \\
 & & $\textit{else} \;\; (s @ [c]) :: ( \starupdate \; c \; r \; Ss)$
\end{tabular}
\end{center}
\begin{center}
\begin{tabular}{lcl}
$\starupdates \; [] \; r \; Ss$ & $\dn$ & $Ss$\\
$\starupdates \; (c :: cs) \; r \; Ss$ & $\dn$ & $\starupdates \; cs \; r \; ( \starupdate \; c \; r \; Ss)$
\end{tabular}
\end{center}
\noindent
Suppose that
\begin{center}
$\rderssimp{r^*}{s} = \rsimp{(\sum \; (\map \; (\lambda s'.\; r\backslash s' \cdot r^*) \; Ss))}$
\end{center}
holds.
The idea of $\starupdate$ and $\starupdates$ is to update $Ss$ when another derivative is taken on $\rderssimp{r^*}{s}$, w.r.t.\ a character $c$ or a string $s'$ respectively.
Both $\starupdate$ and $\starupdates$ take three arguments as input: the new character $c$ or string $s$ to take the derivative with, the regular expression $r$ under the star $r^*$, and the list of strings $Ss$ for the derivative $r^* \backslash s$ up until this point, such that
\begin{center}
$(r^*) \backslash s = \sum_{s' \in Ss} (r\backslash s') \cdot r^*$
\end{center}
is satisfied.
The functions $\starupdate$ and $\starupdates$ characterise what the star derivatives will look like once ``straightened out'' into lists.
The helper functions for such operations will be similar to $\sflat{\_}$, $\sflataux{\_}$ and $\sflataux{\_}'$, which we defined for sequences.
We use similar symbols to denote them, with a $*$ subscript to mark the difference.
\begin{center}
\begin{tabular}{lcl}
$\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
$\hflataux{r}$ & $\dn$ & $[r]$
\end{tabular}
\end{center}
\begin{center}
\begin{tabular}{lcl}
$\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\
$\hflat{r}$ & $\dn$ & $r$
\end{tabular}
\end{center}
\noindent
These definitions are tailor-made for dealing with alternatives that have originated from a star's derivatives.
A typical star derivative always has the structure of a balanced binary tree, for example
\begin{center}
$(r \backslash cc'c'' \cdot r^* + r \backslash c'' \cdot r^*) + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $
\end{center}
from above.
All of the nested structures of alternatives generated from derivatives are binary, and therefore $\hflat{\_}$ and $\hflataux{\_}$ only deal with binary alternatives.
$\hflat{\_}$ ``untangles'' nested alternatives like the following:
\[
	((r_1 + r_2) + (r_3 + r_4)) + \ldots \; \stackrel{\hflat{\_}}{\longrightarrow} \; \RALTS{[r_1, r_2, \ldots, r_n]}
\]
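\noindent
To give a more operational feel for how the list $Ss$ evolves, the following Scala fragment sketches $\starupdate$ and $\starupdates$. It is only an illustration: rather than fixing a concrete regular-expression datatype, the test $\rnullable \; (r \backslash_{rs} s)$ is passed in as a function parameter \texttt{nullableAfter}, and the names are ours rather than the Isabelle ones.
\begin{verbatim}
// strings are modelled as lists of characters, as in the text
type Str = List[Char]

// starupdate: one derivative step w.r.t. the character c;
// nullableAfter(s) stands for "rnullable (r \ s)"
def starupdate(nullableAfter: Str => Boolean, c: Char,
               ss: List[Str]): List[Str] = ss match {
  case Nil => Nil
  case s :: rest =>
    if (nullableAfter(s))
      (s :+ c) :: List(c) :: starupdate(nullableAfter, c, rest)
    else
      (s :+ c) :: starupdate(nullableAfter, c, rest)
}

// starupdates: iterate starupdate over a whole string
def starupdates(nullableAfter: Str => Boolean, cs: Str,
                ss: List[Str]): List[Str] = cs match {
  case Nil => ss
  case c :: rest =>
    starupdates(nullableAfter, rest, starupdate(nullableAfter, c, ss))
}
\end{verbatim}
\noindent
For example, for $r = a$ and the input string $aa$ (so $c = a$ and $s = a$ in theorem \ref{starClosedForm} below), $\starupdates \; [a] \; a \; [[a]]$ evaluates to $[[a,a],\, [a]]$, corresponding to the two terms $(a \backslash aa) \cdot a^*$ and $(a \backslash a) \cdot a^*$ in $a^* \backslash aa$.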
Here is a lemma stating the recursive property of $\starupdate$ and $\starupdates$, with the helpers $\hflat{\_}$ and $\hflataux{\_}$\footnote{The function $\textit{concat}$ takes a list of lists and concatenates the element lists into one flattened list.}:
\begin{lemma}\label{stupdateInduct1}
\mbox{}
For a list of strings $Ss$, the following hold.
\begin{itemize}
\item
If we take a derivative of the terms $r\backslash_r s \cdot r^*$ (where $s$ is taken from the list $Ss$), the result is the same as if we apply $\starupdate$ to $Ss$:
\begin{center}
\begin{tabular}{c}
$\textit{concat} \; (\map \; (\hflataux{\_} \circ ( (\_\backslash_r x) \circ (\lambda s.\;\; (r \backslash_r s) \cdot r^*)))\; Ss )\; $\\ \\
$=$ \\ \\
$\map \; (\lambda s. (r \backslash_r s) \cdot (r^*)) \; (\starupdate \; x \; r \; Ss)$
\end{tabular}
\end{center}
\item
$\starupdates$ is ``composable'' w.r.t.\ a derivative: taking one more derivative w.r.t.\ a character $x$ corresponds to appending $x$ to the end of the string $xs$ in $\starupdates$.
\begin{center}
\begin{tabular}{c}
$\textit{concat} \; (\map \; \hflataux{\_} \; (\map \; (\_\backslash_r x) \; (\map \; (\lambda s.\;\; (r \backslash_r s) \cdot (r^*) ) \; (\starupdates \; xs \; r \; Ss))))$\\ \\
$=$\\ \\
$\map \; (\lambda s.\;\; (r\backslash_r s) \cdot (r^*)) \; (\starupdates \; (xs @ [x]) \; r \; Ss)$
\end{tabular}
\end{center}
\end{itemize}
\end{lemma}
\begin{proof}
Part 1 is by induction on $Ss$. Part 2 is by induction on $xs$, where $Ss$ is left to take arbitrary values.
\end{proof}
Like $\textit{createdBySequence}$, we need a predicate for ``star-created'' regular expressions:
\begin{center}
\begin{mathpar}
\inferrule{\mbox{}}{ \textit{createdByStar}\; \RSEQ{ra}{\RSTAR{rb}} }
\inferrule{ \textit{createdByStar} \; r_1\; \land \; \textit{createdByStar} \; r_2 }{\textit{createdByStar} \; (r_1 + r_2) }
\end{mathpar}
\end{center}
\noindent
All regular expressions created by taking derivatives of $r_1 \cdot (r_2)^*$ satisfy the $\textit{createdByStar}$ predicate:
\begin{lemma}\label{starDersCbs}
$\textit{createdByStar} \; ((r_1 \cdot r_2^*) \backslash_r s) $ holds.
\end{lemma}
\begin{proof}
By a reverse induction on $s$.
\end{proof}
If a regular expression conforms to the shape of a star's derivative, then we can push an application of $\hflataux{\_}$ inside a derivative of it:
\begin{lemma}\label{hfauPushin}
If $\textit{createdByStar} \; r$ holds, then
\begin{center}
$\hflataux{r \backslash_r c} = \textit{concat} \; ( \map \; \hflataux{\_} (\map \; (\_\backslash_r c) \;(\hflataux{r})))$
\end{center}
holds.
\end{lemma}
\begin{proof}
By an induction on the inductive cases of $\textit{createdByStar}$.
\end{proof}
%This is not entirely true for annotated regular expressions:
%%TODO: bsimp bders \neq bderssimp
%\begin{center}
%	$(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
%\end{center}
%For bit-codes, the order in which simplification is applied
%might cause a difference in the location they are placed.
%If we want something like
%\begin{center}
%	$\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
%\end{center}
%Some "canonicalization" procedure is required,
%which either pushes all the common bitcodes to nodes
%as senior as possible:
%\begin{center}
%	$_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
%\end{center}
%or does the reverse. However bitcodes are not of interest if we are talking about
%the $\llbracket r \rrbracket$ size of a regex.
%Therefore for the ease and simplicity of producing a
%proof for a size bound, we are happy to restrict ourselves to
%unannotated regular expressions, and obtain such equalities as
%TODO: rsimp sflat
%The simplification of a flattened out regular expression, provided it comes
%from the derivative of a star, is the same as the one nested.
Now we introduce an inductive property for $\starupdate$ and $\hflataux{\_}$.
\begin{lemma}\label{starHfauInduct}
If we take derivatives of $r^*$ w.r.t.\ a string that starts with $c$ and then flatten the result, we obtain a list consisting of terms of the shape $(r\backslash_r s') \cdot r^*$, where $s'$ ranges over $sS = \starupdates \; s \; r \; [[c]]$.
Namely, \begin{center} $\hflataux{(( (\rder{c}{r_0})\cdot(r_0^*))\backslash_{rs} s)} = \map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; (\starupdates \; s \; r_0 \; [[c]])$ \end{center}holds.\end{lemma}\begin{proof} By an induction on $s$, the inductive cases being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used.\end{proof}\noindentThe function $\hflataux{\_}$ has a similar effect as $\textit{flatten}$:\begin{lemma}\label{hflatauxGrewrites} $a :: rs \grewrites \hflataux{a} @ rs$\end{lemma}\begin{proof} By induction on $a$. $rs$ is set to take arbitrary values.\end{proof}It is also not surprising that two regular expressions differing only in termsof thenesting of parentheses are equivalent w.r.t. $\textit{rsimp}$:\begin{lemma}\label{cbsHfauRsimpeq1} $\rsimp{(r_1 + r_2)} = \rsimp{(\RALTS{\hflataux{r_1} @ \hflataux{r_2}})}$\end{lemma}\begin{proof} By using the rewriting relation $\rightsquigarrow$\end{proof}And from this we obtain the following fact: a regular expression created by star is the same as its flattened version, up to equivalence under $\textit{bsimp}$.For example,\begin{lemma}\label{hfauRsimpeq2} $\textit{createdByStar} \; r \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$\end{lemma}\begin{proof} By structural induction on $r$, where the induction rules are these of $\createdByStar{\_}$. Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case.\end{proof}%Here is a corollary that states the lemma in%a more intuitive way:%\begin{corollary}% $\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot% (r^*))\; (\starupdates \; c\; r\; [[c]])$%\end{corollary}%\noindent%Note that this is also agnostic of the simplification%function we defined, and is therefore of more general interest.Together with the rewriting relation\begin{lemma}\label{starClosedForm6Hrewrites} We have the following set of rewriting relations or equalities: \begin{itemize} \item $\textit{rsimp} \; (r^* \backslash_r (c::s)) \sequal \sum \; ( ( \sum (\lambda s. (r\backslash_r s) \cdot r^*) \; ( \starupdates \; s \; r \; [ c::[]] ) ) )$ \item $r \backslash_{rsimps} (c::s) = \textit{rsimp} \; ( ( \sum ( (\map \; (\lambda s_1. (r\backslash s_1) \; r^*) \; (\starupdates \;s \; r \; [ c::[] ])))))$ \item $\sum ( (\map \; (\lambda s. (r\backslash s) \; r^*) \; Ss)) \sequal \sum ( (\map \; (\lambda s. \textit{rsimp} \; (r\backslash s) \; r^*) \; Ss) )$ \item $\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss \scfrewrites \map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$ \item $( ( \sum ( ( \map \ (\lambda s. \;\; (\textit{rsimp} \; (r \backslash_r s)) \cdot r^*) \; (\starupdates \; s \; r \; [ c::[] ])))))$\\ $\sequal$\\ $( ( \sum ( ( \map \ (\lambda s. \;\; ( r \backslash_{rsimps} s)) \cdot r^*) \; (\starupdates \; s \; r \; [ c::[] ]))))$\\ \end{itemize}\end{lemma}\begin{proof} Part 1 leads to part 2. The rest of them are routine.\end{proof}\noindentNext the closed form for star regular expressions can be derived:\begin{theorem}\label{starClosedForm} $\rderssimp{r^*}{c::s} = \rsimp{ (\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; (\starupdates \; s\; r \; [[c]]) ) ) } $\end{theorem}\begin{proof} By an induction on $s$. The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, \ref{starClosedForm6Hrewrites} and \ref{hfauRsimpeq2} are used. 
In \ref{starClosedForm6Hrewrites}, the equalities are used to link the LHS and RHS.\end{proof}%----------------------------------------------------------------------------------------% SECTION ??%----------------------------------------------------------------------------------------%-----------------------------------% SECTION syntactic equivalence under simp%-----------------------------------%----------------------------------------------------------------------------------------% SECTION ALTS CLOSED FORM%----------------------------------------------------------------------------------------%\section{A Closed Form for \textit{ALTS}}%Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.%%%There are a few key steps, one of these steps is%%%%One might want to prove this by something a simple statement like: %%For this to hold we want the $\textit{distinct}$ function to pick up%the elements before and after derivatives correctly:%$r \in rset \equiv (rder x r) \in (rder x rset)$.%which essentially requires that the function $\backslash$ is an injective mapping.%%Unfortunately the function $\backslash c$ is not an injective mapping.%%\subsection{function $\backslash c$ is not injective (1-to-1)}%\begin{center}% The derivative $w.r.t$ character $c$ is not one-to-one.% Formally,% $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$%\end{center}%This property is trivially true for the%character regex example:%\begin{center}% $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$%\end{center}%But apart from the cases where the derivative%output is $\ZERO$, are there non-trivial results%of derivatives which contain strings?%The answer is yes.%For example,%\begin{center}% Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\% where $a$ is not nullable.\\% $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\% $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$%\end{center}%We start with two syntactically different regular expressions,%and end up with the same derivative result.%This is not surprising as we have such %equality as below in the style of Arden's lemma:\\%\begin{center}% $L(A^*B) = L(A\cdot A^* \cdot B + B)$%\end{center}\section{Bounding Closed Forms}In this section, we introduce how we formalised the boundon closed forms.We first show that in general the number of regular expressions up to a certain size is finite.Then we prove that functions such as $\rflts$will not cause the size of r-regular expressions to grow.Putting this together with a general bound on the finiteness of distinct regular expressionsup to a specific size, we obtain a bound on the closed forms.\subsection{Finiteness of Distinct Regular Expressions}We define the set of regular expressions whose size is no more thana certain size $N$ as $\textit{sizeNregex} \; N$:\[ \textit{sizeNregex} \; N \dn \{r\; \mid \; \llbracket r \rrbracket_r \leq N \}\]We have that $\textit{sizeNregex} \; N$ is always a finite set:\begin{lemma}\label{finiteSizeN} $\textit{finite} \; (\textit{sizeNregex} \; N)$ holds.\end{lemma}\begin{proof} By splitting the set $\textit{sizeNregex} \; (N + 1)$ into subsets by their categories: $\{\ZERO_r, \ONE_r, c\}$, $\{r^* \mid r \in \textit{sizeNregex} \; N\}$, and so on. 
Each of these subsets is finite.
\end{proof}
\noindent
From this we get a corollary: if for all $r \in rs$ we have $\rsize{r} \leq N$, then the output of $\rdistinct{rs}{\varnothing}$ is a list of regular expressions whose overall size is bounded by a constant that depends on $N$ only.
\begin{corollary}\label{finiteSizeNCorollary}
$\rsize{\rdistinct{rs}{\varnothing}} \leq c_N * N$ holds, where the constant $c_N$ is equal to $\textit{card} \; (\textit{sizeNregex} \; N)$.
\end{corollary}
\begin{proof}
For all $r$ in $\textit{set} \; (\rdistinct{rs}{\varnothing})$, it is always the case that $\rsize{r} \leq N$. In addition, the list length is bounded by $c_N$, yielding the desired bound.
\end{proof}
\noindent
This fact will be handy in estimating the closed form sizes.
%We have proven that the size of the
%output of $\textit{rdistinct} \; rs' \; \varnothing$
%is bounded by a constant $N * c_N$ depending only on $N$,
%provided that each of $rs'$'s element
%is bounded by $N$.
\subsection{$\textit{rsimp}$ Does Not Increase the Size}
Although it seems evident, we need a series of non-trivial lemmas to establish that functions such as $\rflts$ do not cause the size of regular expressions to grow.
\begin{lemma}\label{rsimpMonoLemmas}
\mbox{}
\begin{itemize}
\item
\[
	\llbracket \rsimpalts \; rs \rrbracket_r \leq \llbracket \sum \; rs \rrbracket_r
\]
\item
\[
	\llbracket \rsimpseq \; r_1 \; r_2 \rrbracket_r \leq \llbracket r_1 \cdot r_2 \rrbracket_r
\]
\item
\[
	\llbracket \rflts \; rs \rrbracket_r \leq \llbracket rs \rrbracket_r
\]
\item
\[
	\llbracket \rDistinct \; rs \; ss \rrbracket_r \leq \llbracket rs \rrbracket_r
\]
\item
If all elements $a$ in the list $as$ satisfy the property that $\llbracket \textit{rsimp} \; a \rrbracket_r \leq \llbracket a \rrbracket_r$, then we have
\[
	\llbracket \; \rsimpalts \; (\textit{rdistinct} \; (\textit{rflts} \; (\map\;\textit{rsimp} \; as)) \; \{\}) \rrbracket_r \leq \llbracket \; \sum \; (\rDistinct \; (\rflts \;(\map \; \textit{rsimp} \; as))\; \{ \} ) \rrbracket_r
\]
\end{itemize}
\end{lemma}
\begin{proof}
Points 1, 3, and 4 can be proven by an induction on $rs$. Point 2 is by case analysis on $r_1$ and $r_2$. The last part is a corollary of the previous ones.
\end{proof}
\noindent
With the lemmas for each inductive case in place, we are ready to get the non-increasing property as a corollary:
\begin{corollary}\label{rsimpMono}
$\llbracket \textit{rsimp} \; r \rrbracket_r \leq \llbracket r \rrbracket_r$
\end{corollary}
\begin{proof}
By \ref{rsimpMonoLemmas}.
\end{proof}
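\noindent
To make the size measure concrete, here is an executable reading of $\llbracket \_ \rrbracket_r$, reusing the illustrative \texttt{Rexp} datatype from the earlier sketch (again, this is only a sketch and not the Isabelle definition): every constructor contributes one node, and an alternative additionally counts the sizes of all its children.
\begin{verbatim}
// the size of an r-regular expression: the number of nodes
// in its tree structure
def rsize(r: Rexp): Int = r match {
  case RZERO | RONE | RCHAR(_) => 1
  case RSEQ(r1, r2)            => 1 + rsize(r1) + rsize(r2)
  case RALTS(rs)               => 1 + rs.map(rsize).sum
  case RSTAR(r1)               => 1 + rsize(r1)
}
\end{verbatim}
\noindent
Corollary \ref{rsimpMono} then says that, for any \texttt{r}, the value \texttt{rsize(rsimp(r))} is never larger than \texttt{rsize(r)}, assuming an implementation \texttt{rsimp} that follows the simplification function of this chapter.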
\subsection{Estimating the Closed Forms' Sizes}
We recap the closed forms we obtained earlier:
\begin{itemize}
\item $\rderssimp{(\sum rs)}{s} \sequal \sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
\item $\rderssimp{(r_1 \cdot r_2)}{s} \sequal \sum (((r_1 \backslash s) \cdot r_2 ) :: (\map \; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))$
\item $\rderssimp{r^*}{c::s} = \rsimp{ (\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; (\starupdates \; s\; r \; [[c]]) ) ) } $
\end{itemize}
\noindent
The right-hand sides of these closed forms are all of the same shape: $\rsimp{ (\sum rs')} $ for some list $rs'$.
Such a regular expression is bounded by the size of $\sum rs'$, where every element in $rs'$ is distinct, and each element can be described by some inductive sub-structure. For example, when $r = r_1 \cdot r_2$ the list consists solely of terms of the form $r_1 \backslash s' \cdot r_2$ and $r_2 \backslash s''$, with $s'$ and $s''$ being sub-strings of $s$; each of these has a size upper bound according to the inductive hypotheses for $r_1$ and $r_2$.
We elaborate this reasoning by a series of lemmas below, where straightforward proofs are omitted.
%We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.
We show that $\textit{rdistinct}$ and $\rflts$ working together are at least as good as $\textit{rdistinct}$ alone, which can be written as
\begin{center}
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r \leq \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $.
\end{center}
We need this so that we know the outcome of our real simplification is better than or equal to a rough estimate, and can therefore be bounded by that estimate.
This is a bit harder to establish compared to proving that $\rflts$ does not make a list larger (which can be proven using routine induction):
\begin{center}
$\llbracket \textit{rflts}\; rs \rrbracket_r \leq \llbracket \textit{rs} \rrbracket_r$
\end{center}
We cannot simply prove that each helper function does not increase the size and then put the results together: from
\begin{center}
$\llbracket \textit{rflts}\; rs \rrbracket_r \leq \llbracket \textit{rs} \rrbracket_r$
\end{center}
and
\begin{center}
$\llbracket \textit{rdistinct} \; rs \; \varnothing \rrbracket_r \leq \llbracket rs \rrbracket_r$
\end{center}
one cannot infer
\begin{center}
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r \leq \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $.
\end{center}
What we can infer is that
\begin{center}
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r \leq \llbracket rs \rrbracket_r$
\end{center}
but this estimate is too rough, because $\llbracket rs \rrbracket_r$ is unbounded.
The way we get around this is by first proving a more general lemma (so that the inductive case goes through):
\begin{lemma}\label{fltsSizeReductionAlts}
If we have three accumulator sets $noalts\_set$, $alts\_set$ and $corr\_set$, satisfying
\begin{itemize}
\item $\forall r \in noalts\_set. \; \nexists xs.\; r = \sum xs$
\item $\forall r \in alts\_set. \; \exists xs. \; r = \sum xs \; \textit{and} \; set \; xs \subseteq corr\_set$
\end{itemize}
then we have that
\begin{center}
\begin{tabular}{lcl}
$\llbracket (\textit{rdistinct} \; (\textit{rflts} \; as) \; (noalts\_set \cup corr\_set)) \rrbracket_r$ & $\leq$ &\\
$\llbracket (\textit{rdistinct} \; as \; (noalts\_set \cup alts\_set \cup \{ \ZERO_r \} )) \rrbracket_r$ & & \\
\end{tabular}
\end{center}
holds.
\end{lemma}
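\noindent
Before explaining the role of the three accumulator sets, here is a small informal instance of the kind of inequality this lemma is designed to establish (see corollary \ref{interactionFltsDB} below), where the size of a list is read as the sum of the sizes of its elements. Take $rs = [\sum \, [a, b],\; a,\; \ZERO_r]$. Since $\rflts$ removes $\ZERO_r$ and spills out the contents of nested alternatives, we get $\rflts \; rs = [a, b, a]$ and therefore $\rdistinct{(\rflts \; rs)}{\varnothing} = [a, b]$, which has size $2$. On the other hand, the three elements of $rs$ are pairwise distinct, so $\rdistinct{rs}{\varnothing} = rs$, which has size $5$. The flattened and de-duplicated list is indeed no larger.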
\noindent
We split the accumulator into two parts: the part which contains alternative regular expressions ($alts\_set$), and the part without any of them ($noalts\_set$).
This is because $\rflts$ opens up the alternatives in $as$, causing the accumulators on both sides of the inequality to diverge slightly.
If we want to compare accumulators that are not perfectly in sync, we need to consider the alternatives and non-alternatives separately.
The set $corr\_set$ is the set corresponding to $alts\_set$ with all elements under the alternative constructors spilled out.
\begin{proof}
By induction on the list $as$. We make use of lemma \ref{rdistinctConcat}.
\end{proof}
By setting all three sets to the empty set, one gets the desired size estimate:
\begin{corollary}\label{interactionFltsDB}
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r \leq \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $.
\end{corollary}
\begin{proof}
By using lemma \ref{fltsSizeReductionAlts}.
\end{proof}
\noindent
The intuition for why this holds is that flattening only exposes more duplicates: any duplicate removed on the $\textit{RHS}$ is also removed on the $\textit{LHS}$.
Now $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$:
\begin{lemma}\label{altsSimpControl}
$\rsize{\rsimp{\sum rs}} \leq \rsize{\rdistinct{rs}{\varnothing}}+ 1$
\end{lemma}
\begin{proof}
By using corollary \ref{interactionFltsDB}.
\end{proof}
\noindent
This is a key lemma for establishing the bounds of all the closed forms.
With this we are now ready to control the sizes of $(r_1 \cdot r_2 )\backslash s$ and $r^* \backslash s$.
\begin{theorem}\label{rBound}
For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$
\end{theorem}
\noindent
\begin{proof}
We prove this by induction on $r$. The base cases for $\RZERO$, $\RONE $ and $\RCHAR{c}$ are straightforward.
In the sequence $r_1 \cdot r_2$ case, the inductive hypotheses state $\exists N_1. \forall s. \; \llbracket \rderssimp{r_1}{s} \rrbracket_r \leq N_1$ and $\exists N_2. \forall s. \; \llbracket \rderssimp{r_2}{s} \rrbracket_r \leq N_2$.
When the string $s$ is not empty, we can reason as follows % \begin{center} \begin{tabular}{lcll}& & $ \llbracket \rderssimp{r_1\cdot r_2 }{s} \rrbracket_r $\\& $ = $ & $\llbracket \rsimp{(\sum(r_1 \backslash_{rsimps} s \cdot r_2 \; \; :: \; \; \map \; (r_2\backslash_{rsimps} \_)\; (\vsuf{s}{r})))} \rrbracket_r $ & (1) \\ & $\leq$ & $\llbracket \rdistinct{(r_1 \backslash_{rsimps} s \cdot r_2 \; \; :: \; \; \map \; (r_2\backslash_{rsimps} \_)\; (\vsuf{s}{r}))}{\varnothing} \rrbracket_r + 1$ & (2) \\ & $\leq$ & $2 + N_1 + \rsize{r_2} + (N_2 * (card\;(\sizeNregex \; N_2)))$ & (3)\\\end{tabular}\end{center}\noindent(1) is by theorem \ref{seqClosedForm}.(2) is by \ref{altsSimpControl}.(3) is by \ref{finiteSizeNCorollary}.Combining the cases when $s = []$ and $s \neq []$, we get (4):\begin{center} \begin{tabular}{lcll} $\rsize{(r_1 \cdot r_2) \backslash_r s}$ & $\leq$ & $max \; (2 + N_1 + \llbracket r_2 \rrbracket_r + N_2 * (card\; (\sizeNregex \; N_2))) \; \rsize{r_1\cdot r_2}$ & (4) \end{tabular}\end{center}We reason similarly for $\STAR$.The inductive hypothesis is$\exists N. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N$.Let $n_r = \llbracket r^* \rrbracket_r$.When $s = c :: cs$ is not empty,\begin{center} \begin{tabular}{lcll}& & $ \llbracket \rderssimp{r^* }{c::cs} \rrbracket_r $\\& $ = $ & $\llbracket \rsimp{(\sum (\map \; (\lambda s. (r \backslash_{rsimps} s) \cdot r^*) \; (\starupdates\; cs \; r \; [[c]] )) )} \rrbracket_r $ & (5) \\ & $\leq$ & $\llbracket \rdistinct{ (\map \; (\lambda s. (r \backslash_{rsimps} s) \cdot r^*) \; (\starupdates\; cs \; r \; [[c]] ) )} {\varnothing} \rrbracket_r + 1$ & (6) \\ & $\leq$ & $1 + (\textit{card} (\sizeNregex \; (N + n_r))) * (1 + (N + n_r)) $ & (7)\\\end{tabular}\end{center}\noindent(5) is by theorem \ref{starClosedForm}.(6) is by \ref{altsSimpControl}.(7) is by corollary \ref{finiteSizeNCorollary}.Combining with the case when $s = []$, one obtains\begin{center} \begin{tabular}{lcll} $\rsize{r^* \backslash_r s}$ & $\leq$ & $max \; n_r \; 1 + (\textit{card} (\sizeNregex \; (N + n_r))) * (1 + (N + n_r)) $ & (8)\\ \end{tabular}\end{center}\noindentThe alternative case is slightly less involved.The inductive hypothesis is equivalent to $\exists N. \forall r \in (\map \; (\_ \backslash_r s) \; rs). \rsize{r} \leq N$.In the case when $s = c::cs$, we have \begin{center} \begin{tabular}{lcll}& & $ \llbracket \rderssimp{\sum rs }{c::cs} \rrbracket_r $\\& $ = $ & $\llbracket \rsimp{(\sum (\map \; (\_ \backslash_{rsimps} s) \; rs) )} \rrbracket_r $ & (9) \\ & $\leq$ & $\llbracket (\sum (\map \; (\_ \backslash_{rsimps} s) \; rs) ) \rrbracket_r $ & (10) \\& $\leq$ & $1 + N * (length \; rs) $ & (11)\\ \end{tabular}\end{center}\noindent(9) is by theorem \ref{altsClosedForm}, (10) by lemma \ref{rsimpMono} and (11) by inductive hypothesis.Combining with the case when $s = []$, we obtain \begin{center} \begin{tabular}{lcll} $\rsize{\sum rs \backslash_r s}$ & $\leq$ & $max \; \rsize{\sum rs} \; 1+N*(length \; rs)$ & (12)\\ \end{tabular}\end{center}We have all the inductive cases proven.\end{proof}This leads to our main result on the size bound:\begin{corollary} For any annotated regular expression $a$, $\exists N_r. \forall s. 
\; \rsize{\bderssimp{a}{s}} \leq N_r$
\end{corollary}
\begin{proof}
By lemma \ref{sizeRelations} and theorem \ref{rBound}.
\end{proof}
\noindent
%-----------------------------------
%	SECTION 2
%-----------------------------------
\section{Bounded Repetitions}
We have promised in chapter \ref{Introduction} that our lexing algorithm can potentially be extended to handle bounded repetitions in natural and elegant ways. Now we fulfill our promise by adding support for the ``exactly-$n$-times'' bounded regular expression $r^{\{n\}}$. We add clauses to our derivatives-based lexing algorithm (with simplification) introduced in chapter \ref{Bitcoded2}.
\subsection{Augmented Definitions}
There are a number of definitions that need to be augmented. The most notable one is the POSIX rule for $r^{\{n\}}$:
\begin{center}
\begin{mathpar}
\inferrule{\forall v \in vs_1. \vdash v:r \land |v| \neq []\\ \forall v \in vs_2. \vdash v:r \land |v| = []\\ \textit{length} \; (vs_1 @ vs_2) = n}{\textit{Stars} \; (vs_1 @ vs_2) : r^{\{n\}} }
\end{mathpar}
\end{center}
As Ausaf pointed out \cite{Ausaf}, sometimes empty iterations have to be taken to get a match with exactly $n$ repetitions, and hence the $vs_2$ part.
Another important definition is the size:
\begin{center}
\begin{tabular}{lcl}
$\llbracket r^{\{n\}} \rrbracket_r$ & $\dn$ & $\llbracket r \rrbracket_r + n$\\
\end{tabular}
\end{center}
\noindent
Arguably we should use $\log \; n$ for the size, because the number of digits needed to write down the counter grows only logarithmically w.r.t.\ $n$. For simplicity we choose to add the counter directly to the size.
The derivative of a bounded regular expression w.r.t.\ a character $c$ is given as
\begin{center}
\begin{tabular}{lcl}
$r^{\{n\}} \backslash_r c$ & $\dn$ & $r\backslash_r c \cdot r^{\{n-1\}} \;\; \textit{if} \; n \geq 1$\\
 & & $\RZERO \;\quad \quad\quad \quad \textit{otherwise}$\\
\end{tabular}
\end{center}
\noindent
For brevity, we sometimes use NTIMES to refer to bounded regular expressions. The $\mkeps$ function clause for NTIMES is
\begin{center}
\begin{tabular}{lcl}
$\mkeps \; r^{\{n\}} $ & $\dn$ & $\Stars \; (\textit{replicate} \; n\; (\mkeps \; r))$
\end{tabular}
\end{center}
\noindent
The injection clause looks like
\begin{center}
\begin{tabular}{lcl}
$\inj \; r^{\{n\}} \; c\; (\Seq \;v \; (\Stars \; vs)) $ & $\dn$ & $\Stars \; ((\inj \; r \;c \;v ) :: vs)$
\end{tabular}
\end{center}
\noindent
\subsection{Proofs for the Augmented Lexing Algorithm}
We need to maintain two proofs with the additional $r^{\{n\}}$ construct: the correctness proof in chapter \ref{Bitcoded2}, and the finiteness proof in chapter \ref{Finite}.
\subsubsection{Correctness Proof Augmentation}
The correctness of $\textit{lexer}$ and $\textit{blexer}$ with bounded repetitions has been proven by Ausaf and Urban \cite{AusafDyckhoffUrban2016}. As they have commented, once the definitions are in place, the proofs given for the basic regular expressions extend to bounded regular expressions, and there are no ``surprises''. We confirm this point: the correctness theorem also extends without surprises to $\blexersimp$. The rewrite rules such as $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$ and so on do not need to be changed, and only a few lemmas, such as lemma \ref{fltsPreserves}, need to be adjusted by adding one more case for $r^{\{n\}}$, which can be discharged by the Sledgehammer tool.
\subsubsection{Finiteness Proof Augmentation}
The bounded repetitions are very similar to stars, and therefore the treatment is similar, with minor changes to handle some slight complications when the counter reaches 0.
The exponential growth is similar:
\begin{center}
\begin{tabular}{ll}
$r^{\{n\}} $ & $\longrightarrow_{\backslash c}$\\
$(r\backslash c) \cdot r^{\{n - 1\}}$ & $\longrightarrow_{\backslash c'}$\\ \\
$r \backslash cc' \cdot r^{\{n - 1\}} + r \backslash c' \cdot r^{\{n - 2\}}$ & $\longrightarrow_{\backslash c''}$\\ \\
$(r \backslash cc'c'' \cdot r^{\{n-1\}} + r \backslash c''\cdot r^{\{n-2\}}) + (r \backslash c'c'' \cdot r^{\{n-2\}} + r \backslash c'' \cdot r^{\{n-3\}})$ & $\longrightarrow_{\backslash c'''}$ \\ \\
$\ldots$\\
\end{tabular}
\end{center}
Again, we assume that $r\backslash c$, $r \backslash cc'$ and so on are all nullable.
The flattened list of terms for $r^{\{n\}} \backslash_{rs} s$, namely
\begin{center}
$[r \backslash cc'c'' \cdot r^{\{n-1\}},\; r \backslash c''\cdot r^{\{n-2\}}, \; r \backslash c'c'' \cdot r^{\{n-2\}}, \; r \backslash c'' \cdot r^{\{n-3\}},\; \ldots ]$
\end{center}
which comes from
\begin{center}
$(r \backslash cc'c'' \cdot r^{\{n-1\}} + r \backslash c''\cdot r^{\{n-2\}}) + (r \backslash c'c'' \cdot r^{\{n-2\}} + r \backslash c'' \cdot r^{\{n-3\}})+ \ldots$
\end{center}
is made of sequences with different tails, where the counters in the tails might differ.
The observation for maintaining the bound is that these counters never exceed $n$, the original counter. Since the counters range over a finite set, $\rDistinct$ will deduplicate the terms and keep the list finite. We introduce this idea as a lemma once we have described all the necessary helper functions.
Similar to the star case, we want
\begin{center}
$\rderssimp{r^{\{n\}}}{s} = \rsimp{\sum rs}$
\end{center}
to hold, where $rs$ is of the form $\map \; f \; Ss$, with $f$ a function and $Ss$ a list of objects for $f$ to act on.
For star, the objects are strings: the list of strings $Ss$ is generated using the functions $\starupdate$ and $\starupdates$, and the function that takes a string and returns a regular expression is the anonymous function $(\lambda s'.
\; r\backslash s' \cdot r^{\{m\}})$.In the NTIMES setting,the $\starupdate$ and $\starupdates$ functions are replaced by $\textit{nupdate}$ and $\textit{nupdates}$:\begin{center} \begin{tabular}{lcl} $\nupdate \; c \; r \; [] $ & $\dn$ & $[]$\\ $\nupdate \; c \; r \; (\Some \; (s, \; n + 1) \; :: \; Ss)$ & $\dn$ & %\\ $\textit{if} \; (\rnullable \; (r \backslash_{rs} s))$ \\ & & $\;\;\textit{then} \;\; \Some \; (s @ [c], n + 1) :: \Some \; ([c], n) :: ( \nupdate \; c \; r \; Ss)$ \\ & & $\textit{else} \;\; \Some \; (s @ [c], n+1) :: ( \nupdate \; c \; r \; Ss)$\\ $\nupdate \; c \; r \; (\textit{None} :: Ss)$ & $\dn$ & $(\None :: \nupdate \; c \; r \; Ss)$\\ & & \\ %\end{tabular}%\end{center}%\begin{center} %\begin{tabular}{lcl} $\nupdates \; [] \; r \; Ss$ & $\dn$ & $Ss$\\ $\nupdates \; (c :: cs) \; r \; Ss$ & $\dn$ & $\nupdates \; cs \; r \; ( \nupdate \; c \; r \; Ss)$ \end{tabular}\end{center}\noindentwhich take into account when a subterm\begin{center} $r \backslash_s s \cdot r^{\{n\}}$\end{center}counter $n$is 0, and therefore expands to \begin{center}$r \backslash_s (s@[c]) \cdot r^{\{n\}} \;+\; \ZERO$ \end{center}after taking a derivative.The object now has type \begin{center}$\textit{option} \;(\textit{string}, \textit{nat})$\end{center}and therefore the function for converting such an option intoa regular expression term is called $\opterm$:\begin{center} \begin{tabular}{lcl} $\opterm \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ & & $\;\;\Some \; (s, n) \Rightarrow (r\backslash_{rs} s)\cdot r^{\{n\}}$\\ & & $\;\;\None \Rightarrow \ZERO$\\ \end{tabular}\end{center}\noindentPut together, the list $\map \; f \; Ss$ is instantiated as\begin{center} $\map \; (\opterm \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$.\end{center}For the closed form to be bounded, we would likesimplification to be applied to each term in the list.Therefore we introduce some variants of $\opterm$,which help conveniently express the rewriting steps needed in the closed form proof.We have $\optermOsimp$, $\optermosimp$ and $\optermsimp$with slightly different spellings because they help the proof to go through:\begin{center} \begin{tabular}{lcl} $\optermOsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ & & $\;\;\Some \; (s, n) \Rightarrow \textit{rsimp} \; ((r\backslash_{rs} s)\cdot r^{\{n\}})$\\ & & $\;\;\None \Rightarrow \ZERO$\\ \\ $\optermosimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ & & $\;\;\Some \; (s, n) \Rightarrow (\textit{rsimp} \; (r\backslash_{rs} s)) \cdot r^{\{n\}}$\\ & & $\;\;\None \Rightarrow \ZERO$\\ \\ $\optermsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ & & $\;\;\Some \; (s, n) \Rightarrow (r\backslash_{rsimps} s)\cdot r^{\{n\}}$\\ & & $\;\;\None \Rightarrow \ZERO$\\ \end{tabular}\end{center}For a list of $\textit{option} \;(\textit{string}, \textit{nat})$ elements,we define the highest power for it recursively:\begin{center} \begin{tabular}{lcl} $\hpa \; [] \; n $ & $\dn$ & $n$\\ $\hpa \; (\None :: os) \; n $ & $\dn$ & $\hpa \; os \; n$\\ $\hpa \; (\Some \; (s, n) :: os) \; m$ & $\dn$ & $\hpa \;os \; (\textit{max} \; n\; m)$\\ \\ $\hpower \; rs $ & $\dn$ & $\hpa \; rs \; 0$\\ \end{tabular}\end{center}Now the intuition that an NTIMES regular expression's powerdoes not increase can be easily expressed as\begin{lemma}\label{nupdatesMono2} $\hpower \; (\nupdates \;s \; r \; [\Some \; ([c], n)]) \leq n$\end{lemma}\begin{proof} Note that the power is non-increasing after a $\nupdate$ application: \begin{center} $\hpa \;\; (\nupdate \; c \; r \; Ss)\;\; 
m \leq \hpa\; \; Ss \; m$.
\end{center}
This is also the case for $\nupdates$:
\begin{center}
$\hpa \;\; (\nupdates \; s \; r \; Ss)\;\; m \leq \hpa\; \; Ss \; m$.
\end{center}
Therefore we have that
\begin{center}
$\hpower \;\; (\nupdates \; s \; r \; Ss) \leq \hpower \;\; Ss$
\end{center}
which leads to the lemma being proven.
\end{proof}
We also define the inductive rules for the shape of derivatives of NTIMES regular expressions:\\[-3em]
\begin{center}
\begin{mathpar}
\inferrule{\mbox{}}{\cbn \;\ZERO}
\inferrule{\mbox{}}{\cbn \; (r_a \cdot r^{\{n\}})}
\inferrule{\cbn \; r_1 \;\; \; \cbn \; r_2}{\cbn \; (r_1 + r_2)}
\inferrule{\cbn \; r}{\cbn \; (r + \ZERO)}
\end{mathpar}
\end{center}
\noindent
A derivative of NTIMES fits into the shape described by $\cbn$:
\begin{lemma}\label{ntimesDersCbn}
$\cbn \; ((r' \cdot r^{\{n\}}) \backslash_{rs} s)$ holds.
\end{lemma}
\begin{proof}
By a reverse induction on $s$. For the inductive case, note that if $\cbn \; r$ holds, then $\cbn \; (r\backslash_r c)$ holds.
\end{proof}
\noindent
In addition, for $\cbn$-shaped regular expressions, one can flatten them:
\begin{lemma}\label{ntimesHfauPushin}
If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} = \textit{concat} \; (\map \; \hflataux{\_} \; (\map \; (\_\backslash_r c) \; (\hflataux{r})))$
\end{lemma}
\begin{proof}
By an induction on the inductive cases of $\cbn$.
\end{proof}
\noindent
This time we do not need to define a separate flattening function for NTIMES, because $\hflat{\_}$ and $\hflataux{\_}$ already work on NTIMES regular expressions.
\begin{lemma}\label{ntimesHfauInduct}
$\hflataux{( (r\backslash_r c) \cdot r^{\{n\}}) \backslash_{rs} s} = \map \; (\opterm \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$
\end{lemma}
\begin{proof}
By a reverse induction on $s$. The lemmas \ref{ntimesHfauPushin} and \ref{ntimesDersCbn} are used.
\end{proof}
\noindent
We have a recursive property for NTIMES with $\nupdate$ similar to that for STAR, and one for $\nupdates$ as well:
\begin{lemma}\label{nupdateInduct1}
\mbox{}
\begin{itemize}
\item
\begin{center}
$\textit{concat} \; (\map \; (\hflataux{\_} \circ ((\_\backslash_r c) \circ ( \opterm \; r))) \; Ss) = \map \; (\opterm \; r) \; (\nupdate \; c \; r \; Ss)$\\
\end{center}
holds.
\item
\begin{center}
$\textit{concat} \; (\map \; \hflataux{\_}\; (\map \; (\_\backslash_r x) \; (\map \; (\opterm \; r) \; (\nupdates \; xs \; r \; Ss))))$\\
$=$\\
$\map \; (\opterm \; r) \; (\nupdates \;(xs@[x]) \; r\;Ss)$
\end{center}
holds.
\end{itemize}
\end{lemma}
\begin{proof}
(i) is by an induction on $Ss$.
(ii) is by an induction on $xs$.\end{proof}\noindentThe $\nString$ predicate is defined for convenientlyexpressing that there are no empty strings in the$\Some \;(s, n)$ elements generated by $\nupdate$:\begin{center} \begin{tabular}{lcl} $\nString \; \None$ & $\dn$ & $ \textit{true}$\\ $\nString \; (\Some \; ([], n))$ & $\dn$ & $ \textit{false}$\\ $\nString \; (\Some \; (c::s, n))$ & $\dn$ & $ \textit{true}$\\ \end{tabular}\end{center}\begin{lemma}\label{nupdatesNonempty} If for all elements $o \in \textit{set} \; Ss$, $\nString \; o$ holds, then we have that for all elements $o' \in \textit{set} \; (\nupdates \; s \; r \; Ss)$, $\nString \; o'$ holds.\end{lemma}\begin{proof} By an induction on $s$, where $Ss$ is set to vary over all possible values.\end{proof}\noindent\begin{lemma}\label{ntimesClosedFormsSteps} The following list of equalities or rewriting relations hold:\\ (i) $r^{\{n+1\}} \backslash_{rsimps} (c::s) = \textit{rsimp} \; (\sum (\map \; (\opterm \;r \;\_) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])))$\\ (ii) \begin{center} $\sum (\map \; (\opterm \; r) \; (\nupdates \; s \; r \; [ \Some \; ([c], n)]))$ \\ $ \sequal$\\ $\sum (\map \; (\textit{rsimp} \circ (\opterm \; r))\; (\nupdates \; s\;r \; [\Some \; ([c], n)]))$\\ \end{center} (iii) \begin{center} $\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \; ([c], n)]))$\\ $\sequal$\\ $\sum \;(\map \; (\optermsimp r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])) $\\ \end{center} (iv) \begin{center} $\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \; ([c], n)])) $ \\ $\sequal$\\ $\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])) $\\ \end{center} (v) \begin{center} $\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])) $ \\ $\sequal$\\ $\sum \; (\map \; (\textit{rsimp} \circ (\opterm \; r)) \; (\nupdates \; s \; r \; [\Some \; ([c], n)]))$ \end{center}\end{lemma}\begin{proof} Routine. (iii) and (iv) make use of the fact that all the strings $s$ inside $\Some \; (s, m)$ which are elements of the list $\nupdates \; s\;r\;[\Some\; ([c], n)]$ are non-empty, which is from lemma \ref{nupdatesNonempty}. Once the string in $o = \Some \; (s, n)$ is nonempty, $\optermsimp \; r \;o$, $\optermosimp \; r \; o$ and $\optermosimp \; \; o$ are guaranteed to be equal. 
(v) uses \ref{nupdateInduct1}.\end{proof}\noindentNow we are ready to present the closed form for NTIMES:\begin{theorem}\label{ntimesClosedForm} The derivative of $r^{\{n+1\}}$ can be described as an alternative containing a list of terms:\\ $r^{\{n+1\}} \backslash_{rsimps} (c::s) = \textit{rsimp} \; ( \sum (\map \; (\optermsimp \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])))$\end{theorem}\begin{proof} By the rewriting steps described in lemma \ref{ntimesClosedFormsSteps}.\end{proof}\noindentThe key observation for bounding this closed formis that the counter on $r^{\{n\}}$ will only decrement during derivatives:\begin{lemma}\label{nupdatesNLeqN} For an element $o$ in $\textit{set} \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$, either $o = \None$, or $o = \Some \; (s', m)$ for some string $s'$ and number $m \leq n$.\end{lemma}\noindentThe proof is routine and therefore omitted.This allows us to say what kind of termsare in the list $\textit{set} \; (\map \; (\optermsimp \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)]))$:only $\ZERO_r$s or a sequence with the tail an $r^{\{m\}}$with a small $m$:\begin{lemma}\label{ntimesClosedFormListElemShape} For any element $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; ( \nupdates \; s \; r \; [\Some \; ([c], n)]))$, we have that $r'$ is either $\ZERO$ or $r \backslash_{rsimps} s' \cdot r^{\{m\}}$ for some string $s'$ and number $m \leq n$.\end{lemma}\begin{proof} Using lemma \ref{nupdatesNLeqN}.\end{proof}\begin{theorem}\label{ntimesClosedFormBounded} Assuming that for any string $s$, $\llbracket r \backslash_{rsimps} s \rrbracket_r \leq N$ holds, then we have that\\ $\llbracket r^{\{n+1\}} \backslash_{rsimps} s \rrbracket_r \leq \textit{max} \; (c_N+1)* (N + \llbracket r^{\{n\}} \rrbracket+1)$, where $c_N = \textit{card} \; (\textit{sizeNregex} \; ( N + \llbracket r^{\{n\}} \rrbracket_r+1))$.\end{theorem}\begin{proof}We have that for all regular expressions $r'$ in \begin{center}$\textit{set} \; (\map \; (\optermsimp \; r) \; ( \nupdates \; s \; r \; [\Some \; ([c], n)]))$,\end{center} $r'$'s size is less than or equal to $N + \llbracket r^{\{n\}} \rrbracket_r + 1$because $r'$ can only be either a $\ZERO$ or $r \backslash_{rsimps} s' \cdotr^{\{m\}}$ for some string $s'$ and number $m \leq n$ (lemma \ref{ntimesClosedFormListElemShape}).In addition, we know that the list $\map \; (\optermsimp \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$'s size is at most$c_N = \textit{card} \; (\sizeNregex \; ((N + \llbracket r^{\{n\}} \rrbracket) + 1))$.This gives us $\llbracket r \backslash_{rsimps} \;s \rrbracket_r\leq N * c_N$.\end{proof}We aim to formalise the correctness and size boundfor constructs like $r^{\{\ldots n\}}$, $r^{\{n \ldots\}}$and so on, which is still work in progress.They should more or less follow the same recipe described in this section.Once we know how to deal with them recursively using suitable auxiliarydefinitions, we can routinely establish the proofs.%----------------------------------------------------------------------------------------% SECTION 3%----------------------------------------------------------------------------------------\section{Comments and Future Improvements}\subsection{Some Experimental Results}What guarantee does this bound give us?It states that whatever the regex is, it will not grow indefinitely.Take our previous example $(a + aa)^*$ as an example:\begin{center} \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} \begin{tikzpicture} \begin{axis}[ xlabel={number of $a$'s}, x 
label style={at={(1.05,-0.05)}}, ylabel={regex size}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax= 40, ytick={0,10,...,40}, scaled ticks=false, axis lines=left, width=5cm, height=4cm, legend entries={$(a + aa)^*$}, legend pos=south east, legend cell align=left] \addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data}; \end{axis} \end{tikzpicture} \end{tabular}\end{center}We are able to limit the size of the regex $(a + aa)^*$'s derivativeswith our simplificationrules very effectively.In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the boundis $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function$f(x) = x * 2^x$.This means the bound we have will surge up at leasttower-exponentially with a linear increase of the depth.One might be pretty skepticafl about what this non-elementarybound can bring us.It turns out that the giant bounds are far from being hit.Here we have some test data from randomly generated regular expressions:\begin{figure}[H] \begin{tabular}{@{}c@{\hspace{2mm}}c@{\hspace{0mm}}c@{}} \begin{tikzpicture} \begin{axis}[ xlabel={$n$}, x label style={at={(1.05,-0.05)}}, ylabel={regex size}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, %ymax=1000, %ytick={0,100,...,1000}, scaled ticks=false, axis lines=left, width=4.75cm, height=3.8cm, legend entries={regex1}, legend pos=north east, legend cell align=left] \addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data}; \end{axis} \end{tikzpicture} & \begin{tikzpicture} \begin{axis}[ xlabel={$n$}, x label style={at={(1.05,-0.05)}}, %ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, %ymax=1000, %ytick={0,100,...,1000}, scaled ticks=false, axis lines=left, width=4.75cm, height=3.8cm, legend entries={regex2}, legend pos=south east, legend cell align=left] \addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data}; \end{axis} \end{tikzpicture} & \begin{tikzpicture} \begin{axis}[ xlabel={$n$}, x label style={at={(1.05,-0.05)}}, %ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, %ymax=1000, %ytick={0,100,...,1000}, scaled ticks=false, axis lines=left, width=4.75cm, height=3.8cm, legend entries={regex3}, legend pos=south east, legend cell align=left] \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data}; \end{axis} \end{tikzpicture}\\ \multicolumn{3}{c}{} \end{tabular} \caption{Graphs: size change of 3 randomly generated regular expressions $w.r.t.$ input string length. 
One might be rather sceptical about what this non-elementary bound can bring us.
It turns out, however, that these gigantic bounds are far from being reached in practice.
Here is some test data from randomly generated regular expressions:
\begin{figure}[H]
	\begin{tabular}{@{}c@{\hspace{2mm}}c@{\hspace{0mm}}c@{}}
		\begin{tikzpicture}
			\begin{axis}[
				xlabel={$n$},
				x label style={at={(1.05,-0.05)}},
				ylabel={regex size},
				enlargelimits=false,
				xtick={0,5,...,30},
				xmax=33,
				%ymax=1000,
				%ytick={0,100,...,1000},
				scaled ticks=false,
				axis lines=left,
				width=4.75cm,
				height=3.8cm,
				legend entries={regex1},
				legend pos=north east,
				legend cell align=left]
				\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
			\end{axis}
		\end{tikzpicture}
		&
		\begin{tikzpicture}
			\begin{axis}[
				xlabel={$n$},
				x label style={at={(1.05,-0.05)}},
				%ylabel={time in secs},
				enlargelimits=false,
				xtick={0,5,...,30},
				xmax=33,
				%ymax=1000,
				%ytick={0,100,...,1000},
				scaled ticks=false,
				axis lines=left,
				width=4.75cm,
				height=3.8cm,
				legend entries={regex2},
				legend pos=south east,
				legend cell align=left]
				\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
			\end{axis}
		\end{tikzpicture}
		&
		\begin{tikzpicture}
			\begin{axis}[
				xlabel={$n$},
				x label style={at={(1.05,-0.05)}},
				%ylabel={time in secs},
				enlargelimits=false,
				xtick={0,5,...,30},
				xmax=33,
				%ymax=1000,
				%ytick={0,100,...,1000},
				scaled ticks=false,
				axis lines=left,
				width=4.75cm,
				height=3.8cm,
				legend entries={regex3},
				legend pos=south east,
				legend cell align=left]
				\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
			\end{axis}
		\end{tikzpicture}\\
		\multicolumn{3}{c}{}
	\end{tabular}
	\caption{Size change of three randomly generated regular expressions w.r.t.\ the length
	of the input string (the $n$ on the x-axis).}
\end{figure}
\noindent
Most of the derivative sizes appear to stay within a polynomial bound w.r.t.\ the size of the
original regular expression.
We will discuss improvements to this bound in the next chapter.
\subsection{Possible Further Improvements}
There are two problems with this finiteness result, though:\\
(i) First, it is not yet a direct formalisation of our lexer's complexity, as a complexity
proof would require looking into the time it takes to execute {\bf all} the operations
involved in the lexer (simp, collect, decode), not just the derivative.\\
(ii) Second, the bound is not yet tight, and we seek to improve $N_a$ so that it is
polynomial in $\llbracket a \rrbracket$.\\
Still, we believe this contribution is useful,
because
\begin{itemize}
	\item
		The size proof can serve as a starting point for a complexity formalisation.
		Derivatives are the most important phase of our lexer algorithm.
		Size properties of the derivatives therefore cover the majority of the algorithm
		and give a good indication of the complexity of the entire program.
	\item
		The bound is already a strong indication that catastrophic backtracking is much
		less likely to occur in our $\blexersimp$ algorithm.
		We refine $\blexersimp$ into $\blexerStrong$ in the next chapter, for which we
		conjecture that the bound becomes polynomial.
\end{itemize}
%----------------------------------------------------------------------------------------
%	SECTION 4
%----------------------------------------------------------------------------------------
One might wonder about the actual bound rather than the loose bound we gave
for the convenience of a more straightforward proof.
How much can the regex $r^* \backslash s$ grow?
As the graphs earlier have shown,
%TODO: reference that graph where size grows quickly
the derivatives can grow at a speed that is exponential in the number of input characters,
but they eventually level off when the string $s$ is long enough.
If they grow to a size that is exponential w.r.t.\ the original regex, our algorithm
would still be slow.
And unfortunately, there are concrete examples
where the derivatives grow exponentially large before levelling off:
\begin{center}
	$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + (\underbrace{a \ldots a}_{\text{n a's}})^*)^*$
\end{center}
already has a maximum derivative size that is exponential in the number $n$
under our current simplification rules:
%TODO: graph of a regex whose size increases exponentially.
\begin{center}
	\begin{tikzpicture}
		\begin{axis}[
			height=0.5\textwidth,
			width=\textwidth,
			xlabel=number of a's,
			xtick={0,...,9},
			ylabel=maximum size,
			ymode=log,
			log basis y={2}]
			\addplot[mark=*,blue] table {re-chengsong.data};
		\end{axis}
	\end{tikzpicture}
\end{center}
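\noindent
A quick way to experiment with this example is a small, unverified Scala sketch along the
following lines. It uses plain (un-bitcoded) Brzozowski derivatives together with a naive
flatten-and-\texttt{distinct} simplification rather than our $\blexersimp$, so it is only
meant for generating size data, not as a faithful model of our algorithm; all names in it
(\texttt{Rexp}, \texttt{simp}, \texttt{family}, \texttt{maxSize}) are ad hoc.
\begin{verbatim}
// Unverified sketch for measuring derivative sizes of
// (a* + (aa)* + ... + (a...a)*)*: plain Brzozowski derivatives plus a
// naive flatten/deduplicate simplification (NOT the bitcoded blexer_simp).
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALTS(rs: List[Rexp]) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALTS(rs) => rs.exists(nullable)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALTS(rs) => ALTS(rs.map(der(c, _)))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALTS(List(SEQ(der(c, r1), r2), der(c, r2)))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// flatten nested alternatives, drop ZEROs, contract ONE-sequences and
// remove syntactic duplicates (a rough analogue of distinctBy)
def simp(r: Rexp): Rexp = r match {
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _)  => ZERO
    case (_, ZERO)  => ZERO
    case (ONE, r2s) => r2s
    case (r1s, r2s) => SEQ(r1s, r2s)
  }
  case ALTS(rs) =>
    val rs1 = rs.map(simp).flatMap {
      case ALTS(rs2) => rs2
      case ZERO      => Nil
      case r1        => List(r1)
    }.distinct
    rs1 match { case Nil => ZERO; case List(r1) => r1; case _ => ALTS(rs1) }
  case _ => r
}

def size(r: Rexp): Int = r match {
  case ALTS(rs)    => 1 + rs.map(size).sum
  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
  case STAR(r1)    => 1 + size(r1)
  case _           => 1
}

// (a* + (aa)* + ... + (a^n)*)*  where a^i abbreviates i consecutive a's
def family(n: Int): Rexp = {
  def as(i: Int): Rexp =
    (2 to i).foldLeft(CHAR('a'): Rexp)((acc, _) => SEQ(acc, CHAR('a')))
  STAR(ALTS((1 to n).toList.map(i => STAR(as(i)))))
}

// largest derivative size seen while consuming m a's
def maxSize(n: Int, m: Int): Int =
  (1 to m).scanLeft(family(n))((r, _) => simp(der('a', r))).map(size).max
\end{verbatim}
\noindent
For instance, \texttt{maxSize(3, 12)} measures the largest derivative size encountered
while consuming twelve $a$'s against the $n = 3$ member of the family.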
For convenience we use
$(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
to denote
$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + (\underbrace{a \ldots a}_{\text{n a's}})^*)^*$
in the discussion below.
The exponential size is caused by the regular expression
$\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
inside the $(\ldots) ^*$ having exponentially many
different derivatives, even though the differences between them are minor.
$(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
will therefore contain the following terms (after flattening out all nested alternatives):
\begin{center}
	$(\sum_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\[1mm]
	$(1 \leq m' \leq m)$
\end{center}
There are at least exponentially
many such terms.\footnote{To be exact, these terms are pairwise distinct for
$m' \leq \textit{lcm}(1, \ldots, n)$; we omit the details, but the point is that their
number is exponential in $n$.}
With each new input character taking the derivative of the intermediate result,
more and more such distinct terms accumulate.
The function $\textit{distinctBy}$ is not able to de-duplicate any two of the terms
\begin{center}
	$(\sum_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
	$(\sum_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
\end{center}
\noindent
where $m' \neq m''$, because they are syntactically (if only slightly) different.
This means that with our current simplification methods,
we cannot prevent $\llbracket \bderssimp{r}{s} \rrbracket$ from growing beyond a polynomial bound.
%\leq O((\llbracket r\rrbracket)^c)$
These terms are similar in the sense that their heads
all consist of sub-terms of the form
$(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^*$.
For $\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there are at most
$n * (n + 1) / 2$ such head terms.
For example, the derivatives of $(a^* + (aa)^* + (aaa)^*) ^*$
can be described by 6 head terms:
$a^*$, $a\cdot (aa)^*$, $(aa)^*$, $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$ and $(aaa)^*$.
The total number of different head terms, $n * (n + 1) / 2$,
equals the number of characters in the regex
$(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
If we can make our de-duplication process smarter, so that it
only keeps track of these $n * (n+1) / 2$ head terms, then the
size growth stays polynomial.
This example also suggests a slightly different notion of size, which we call the
alphabetic width:
\begin{center}
	\begin{tabular}{lcl}
		$\textit{awidth} \; \ZERO$ & $\dn$ & $0$\\
		$\textit{awidth} \; \ONE$ & $\dn$ & $0$\\
		$\textit{awidth} \; c$ & $\dn$ & $1$\\
		$\textit{awidth} \; r_1 + r_2$ & $\dn$ & $\textit{awidth} \; r_1 + \textit{awidth} \; r_2$\\
		$\textit{awidth} \; r_1 \cdot r_2$ & $\dn$ & $\textit{awidth} \; r_1 + \textit{awidth} \; r_2$\\
		$\textit{awidth} \; r^*$ & $\dn$ & $\textit{awidth} \; r$\\
	\end{tabular}
\end{center}
Antimirov \parencite{Antimirov95} has proven that
$\vert \textit{PDER}_{UNIV}(r) \vert \leq \textit{awidth}(r)$,
where $\textit{PDER}_{UNIV}(r)$ is the set of all subterms
created by taking derivatives of $r$ against all possible strings.
If we can make sure that, at any moment during lexing, our intermediate
results hold at most one copy of each of these subterms, then we obtain the same
bound as Antimirov's.
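\noindent
Since $\textit{awidth}$ is a simple structural recursion, it is also cheap to compute in
experiments. The following Scala function is a direct transcription of the table above,
re-using the ad hoc \texttt{Rexp} datatype from the earlier sketch (whose alternatives are
$n$-ary rather than binary):
\begin{verbatim}
// alphabetic width: counts the character leaves of a regular expression
// (Rexp as in the earlier sketch; ALTS generalises the binary + case)
def awidth(r: Rexp): Int = r match {
  case ZERO        => 0
  case ONE         => 0
  case CHAR(_)     => 1
  case ALTS(rs)    => rs.map(awidth).sum
  case SEQ(r1, r2) => awidth(r1) + awidth(r2)
  case STAR(r1)    => awidth(r1)
}
// e.g. awidth(family(3)) == 6, matching the six head terms counted above
\end{verbatim}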
This leads to the algorithm in the next chapter.

%----------------------------------------------------------------------------------------
%	SECTION 1
%----------------------------------------------------------------------------------------

%-----------------------------------
%	SUBSECTION 1
%-----------------------------------
%\subsection{Syntactic Equivalence Under $\simp$}
%We prove that minor differences can be annihilated
%by $\simp$.
%For example,
%\begin{center}
%	$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
%	 \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
%\end{center}