ChengsongTanPhdThesis/Chapters/Finite.tex
changeset 618 233cf2b97d1a
parent 614 d5e9bcb384ec
child 620 ae6010c14e49
equal deleted inserted replaced
614:d5e9bcb384ec 618:233cf2b97d1a
     5 \label{Finite} 
     5 \label{Finite} 
     6 %  In Chapter 4 \ref{Chapter4} we give the second guarantee
     6 %  In Chapter 4 \ref{Chapter4} we give the second guarantee
     7 %of our bitcoded algorithm, that is a finite bound on the size of any 
     7 %of our bitcoded algorithm, that is a finite bound on the size of any 
     8 %regex's derivatives. 
     8 %regex's derivatives. 
     9 
     9 
    10 \begin{figure}
    10 
    11 \begin{center}
    11 In this chapter we give a guarantee in terms of size of derivatives: 
    12 	\begin{tabular}{ccc}
       
    13 		$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
       
    14 		$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
       
    15 		$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
       
    16 		$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
       
    17 		$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
       
    18 		$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$.
       
    19 	\end{tabular}
       
    20 \end{center}
       
    21 \caption{The size function of bitcoded regular expressions}\label{brexpSize}
       
    22 \end{figure}
       
    23 In this chapter we give a guarantee in terms of size: 
       
    24 given an annotated regular expression $a$, for any string $s$
    12 given an annotated regular expression $a$, for any string $s$
    25 our algorithm $\blexersimp$'s internal annotated regular expression 
    13 our algorithm $\blexersimp$'s internal annotated regular expression 
    26 size  is finitely bounded
    14 size  is finitely bounded
    27 by a constant $N_a$ that only depends on $a$:
    15 by a constant  that only depends on $a$.
       
    16 Formally we show that there exists an $N_a$ such that
    28 \begin{center}
    17 \begin{center}
    29 	$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
    18 	$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
    30 \end{center}
    19 \end{center}
    31 \noindent
    20 \noindent
    32 where the size of an annotated regular expression is defined
    21 where the size ($\llbracket \_ \rrbracket$) of 
    33 in terms of the number of nodes in its tree structure (see figure \ref{brexpSize}).
    22 an annotated regular expression is defined
       
    23 in terms of the number of nodes in its 
       
    24 tree structure (its recursive definition given in the next page).
    34 We believe this size bound
    25 We believe this size bound
    35 is important in the context of POSIX lexing, because 
    26 is important in the context of POSIX lexing, because 
    36 \begin{itemize}
    27 \begin{itemize}
    37 	\item
    28 	\item
    38 		It is a stepping stone towards an ``absence of catastrophic-backtracking''
    29 		It is a stepping stone towards the goal 
    39 		guarantee. 
    30 		of eliminating ``catastrophic backtracking''. 
    40 		If the internal data structures used by our algorithm cannot grow very large,
    31 		If the internal data structures used by our algorithm
    41 		then our algorithm (which traverses those structures) cannot be too slow.
    32 		grows beyond a finite bound, then clearly 
       
    33 		the algorithm (which traverses these structures) will
       
    34 		be slow.
    42 		The next step would be to refine the bound $N_a$ so that it
    35 		The next step would be to refine the bound $N_a$ so that it
    43 		is polynomial on $\llbracket a\rrbracket$.
    36 		is polynomial on $\llbracket a\rrbracket$.
    44 	\item
    37 	\item
    45 		Having it formalised gives us a higher confidence that
    38 		Having the finite bound formalised 
       
    39 		gives us a higher confidence that
    46 		our simplification algorithm $\simp$ does not ``mis-behave''
    40 		our simplification algorithm $\simp$ does not ``mis-behave''
    47 		like $\simpsulz$ does.
    41 		like $\simpsulz$ does.
    48 		The bound is universal, which is an advantage over work which 
    42 		The bound is universal for a given regular expression, 
    49 		only gives empirical evidence on some test cases.
    43 		which is an advantage over work which 
       
    44 		only gives empirical evidence on some test cases (see Verbatim++\cite{Verbatimpp}).
    50 \end{itemize}
    45 \end{itemize}
       
    46 In the next section we describe in more detail
       
    47 what the finite bound means in our algorithm
       
    48 and why the size of the internal data structures of
       
    49 a typical derivative-based lexer such as
       
    50 Sulzmann and Lu's need formal treatment.
       
    51 
    51 \section{Formalising About Size}
    52 \section{Formalising About Size}
    52 \noindent
    53 \noindent
    53 In our lexer ($\blexersimp$),
    54 In our lexer ($\blexersimp$),
    54 we take an annotated regular expression as input,
    55 we take an annotated regular expression as input,
    55 and repeately take derivative of and simplify it:
    56 and repeately take derivative of and simplify it.
    56 \begin{figure}[H]
    57 \begin{figure}
       
    58 	\begin{center}
       
    59 		\begin{tabular}{lcl}
       
    60 			$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
       
    61 			$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
       
    62 			$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
       
    63 			$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
       
    64 			$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
       
    65 			$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$.
       
    66 		\end{tabular}
       
    67 	\end{center}
       
    68 	\caption{The size function of bitcoded regular expressions}\label{brexpSize}
       
    69 \end{figure}
       
    70 
       
    71 \begin{figure}
    57 	\begin{tikzpicture}[scale=2,
    72 	\begin{tikzpicture}[scale=2,
    58 		every node/.style={minimum size=11mm},
    73 		every node/.style={minimum size=11mm},
    59 		->,>=stealth',shorten >=1pt,auto,thick
    74 		->,>=stealth',shorten >=1pt,auto,thick
    60 		]
    75 		]
    61 		\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$};
    76 		\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$};
    77 		\node (v) [circle, thick, draw, right=of rns, minimum size=6mm, right=1.7cm]{$v$};
    92 		\node (v) [circle, thick, draw, right=of rns, minimum size=6mm, right=1.7cm]{$v$};
    78 		\draw[->, line width=0.2mm](rns)--(v) node[above, midway] {\bmkeps} node [below, midway] {\decode};
    93 		\draw[->, line width=0.2mm](rns)--(v) node[above, midway] {\bmkeps} node [below, midway] {\decode};
    79 	\end{tikzpicture}
    94 	\end{tikzpicture}
    80 	\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{simpShrinks}
    95 	\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{simpShrinks}
    81 \end{figure}
    96 \end{figure}
       
    97 
    82 \noindent
    98 \noindent
    83 Each time
    99 Each time
    84 a derivative is taken, the regular expression might grow.
   100 a derivative is taken, the regular expression might grow.
    85 However, the simplification that is immediately afterwards will always shrink it so that 
   101 However, the simplification that is immediately afterwards will always shrink it so that 
    86 it stays small.
   102 it stays relatively small.
    87 This intuition is depicted by the relative size
   103 This intuition is depicted by the relative size
    88 change between the black and blue nodes:
   104 change between the black and blue nodes:
    89 After $\simp$ the node always shrinks.
   105 After $\simp$ the node always shrinks.
    90 Our proof says that all the blue nodes
   106 Our proof states that all the blue nodes
    91 stay below a size bound $N_a$ determined by the input $a$.
   107 stay below a size bound $N_a$ determined by the input $a$.
    92 
   108 
    93 \noindent
   109 \noindent
    94 Sulzmann and Lu's assumed a similar picture about their algorithm,
   110 Sulzmann and Lu's assumed a similar picture about their algorithm,
    95 though in fact their algorithm's size might be better depicted by the following graph:
   111 though in fact their algorithm's size might be better depicted by the following graph:
   124 	\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{sulzShrinks}
   140 	\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{sulzShrinks}
   125 \end{figure}
   141 \end{figure}
   126 \noindent
   142 \noindent
   127 The picture means that on certain cases their lexer (where they use $\simpsulz$ 
   143 The picture means that on certain cases their lexer (where they use $\simpsulz$ 
   128 as the simplification function)
   144 as the simplification function)
   129 will have an indefinite size explosion, causing the running time 
   145 will have a size explosion, causing the running time 
   130 of each derivative step to grow continuously (for example 
   146 of each derivative step to grow continuously (for example 
   131 in \ref{SulzmannLuLexerTime}).
   147 in \ref{SulzmannLuLexerTime}).
   132 They tested out the run time of their
   148 They tested out the run time of their
   133 lexer on particular examples such as $(a+b+ab)^*$
   149 lexer on particular examples such as $(a+b+ab)^*$
   134 and claimed that their algorithm is linear w.r.t to the input.
   150 and claimed that their algorithm is linear w.r.t to the input.
   135 With our mecahnised proof, we avoid this type of unintentional
   151 With our mecahnised proof, we avoid this type of unintentional
   136 generalisation.\\
   152 generalisation.\\
   137 
   153 
   138 Before delving in to the details of the formalisation,
   154 Before delving in to the details of the formalisation,
   139 we are going to provide an overview of it.
   155 we are going to provide an overview of it.
   140 In the next subsection, we draw a picture of the bird's eye view
   156 In the next subsection, we give a high-level view
   141 of the proof.
   157 of the proof.
   142 
   158 
   143 
   159 
   144 \subsection{Overview of the Proof}
   160 \subsection{Overview of the Proof}
   145 Here is a bird's eye view of the main components of the finiteness proof,
   161 A high-level overview of the main components of the finiteness proof
   146 which involves three steps:
   162 is as follows:
   147 \begin{figure}[H]
   163 \begin{figure}[H]
   148 	\begin{tikzpicture}[scale=1,font=\bf,
   164 	\begin{tikzpicture}[scale=1,font=\bf,
   149 		node/.style={
   165 		node/.style={
   150 			rectangle,rounded corners=3mm,
   166 			rectangle,rounded corners=3mm,
   151 			ultra thick,draw=black!50,minimum height=18mm, 
   167 			ultra thick,draw=black!50,minimum height=18mm, 
   177 We explain the steps one by one:
   193 We explain the steps one by one:
   178 \begin{itemize}
   194 \begin{itemize}
   179 	\item
   195 	\item
   180 		We first introduce the operations such as 
   196 		We first introduce the operations such as 
   181 		derivatives, simplification, size calculation, etc.
   197 		derivatives, simplification, size calculation, etc.
   182 		associated with $\rrexp$s, which we have given
   198 		associated with $\rrexp$s, which we have introduced
   183 		a very brief introduction to in chapter \ref{Bitcoded2}.
   199 		in chapter \ref{Bitcoded2}.
   184 		The operations on $\rrexp$s are identical to those on
   200 		The operations on $\rrexp$s are identical to those on
   185 		annotated regular expressions except that they are unaware
   201 		annotated regular expressions except that they dispense with
   186 		of bitcodes. This means that all proofs about size of $\rrexp$s will apply to
   202 		bitcodes. This means that all proofs about size of $\rrexp$s will apply to
   187 		annotated regular expressions.
   203 		annotated regular expressions, because the size of a regular
       
   204 		expression is independent of the bitcodes.
   188 	\item
   205 	\item
   189 		We prove that $\rderssimp{r}{s} = \textit{ClosedForm}(r, s)$,
   206 		We prove that $\rderssimp{r}{s} = \textit{ClosedForm}(r, s)$,
   190 		where $\textit{ClosedForm}(r, s)$ is entirely 
   207 		where $\textit{ClosedForm}(r, s)$ is entirely 
   191 		written in the derivatives of their children regular 
   208 		given as the derivatives of their children regular 
   192 		expressions.
   209 		expressions.
   193 		We call the right-hand-side the \emph{Closed Form}
   210 		We call the right-hand-side the \emph{Closed Form}
   194 		of the derivative $\rderssimp{r}{s}$.
   211 		of the derivative $\rderssimp{r}{s}$.
   195 	\item
   212 	\item
   196 		We estimate $\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$.
   213 		Formally we give an estimate of 
       
   214 		$\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$.
   197 		The key observation is that $\distinctBy$'s output is
   215 		The key observation is that $\distinctBy$'s output is
   198 		a list with a constant length bound.
   216 		a list with a constant length bound.
   199 \end{itemize}
   217 \end{itemize}
   200 We will expand on these steps in the next sections.\\
   218 We will expand on these steps in the next sections.\\
   201 
   219 
   202 \section{The $\textit{Rrexp}$ Datatype}
   220 \section{The $\textit{Rrexp}$ Datatype}
   203 The first step is to define 
   221 The first step is to define 
   204 $\textit{rrexp}$s.
   222 $\textit{rrexp}$s.
   205 They are without bitcodes,
   223 They are annotated regular expressions without bitcodes,
   206 allowing a much simpler size bound proof.
   224 allowing a much simpler size bound proof.
   207 Of course, the bits which encode the lexing information 
   225 %Of course, the bits which encode the lexing information 
   208 would grow linearly with respect 
   226 %would grow linearly with respect 
   209 to the input, which should be taken into account when we wish to tackle the runtime comlexity.
   227 %to the input, which should be taken into account when we wish to tackle the runtime comlexity.
   210 But for the sake of the structural size 
   228 %But for the sake of the structural size 
   211 we can safely ignore them.\\
   229 %we can safely ignore them.\\
   212 To recapitulate, the datatype 
   230 The datatype 
   213 definition of the $\rrexp$, called
   231 definition of the $\rrexp$, called
   214 \emph{r-regular expressions},
   232 \emph{r-regular expressions},
   215 was initially defined in \ref{rrexpDef}.
   233 was initially defined in \ref{rrexpDef}.
   216 The reason for the prefix $r$ is
   234 The reason for the prefix $r$ is
   217 to make a distinction  
   235 to make a distinction  
   224 \]
   242 \]
   225 The size of an r-regular expression is
   243 The size of an r-regular expression is
   226 written $\llbracket r\rrbracket_r$, 
   244 written $\llbracket r\rrbracket_r$, 
   227 whose definition mirrors that of an annotated regular expression. 
   245 whose definition mirrors that of an annotated regular expression. 
   228 \begin{center}
   246 \begin{center}
   229 	\begin{tabular}{ccc}
   247 	\begin{tabular}{lcl}
   230 		$\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\
   248 		$\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\
   231 		$\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\
   249 		$\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\
   232 		$\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\
   250 		$\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\
   233 		$\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\
   251 		$\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\
   234 		$\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as   + 1$\\
   252 		$\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as   + 1$\\
   252 		$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a} ^*$
   270 		$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a} ^*$
   253 	\end{tabular}
   271 	\end{tabular}
   254 \end{center}
   272 \end{center}
   255 
   273 
   256 \subsection{Why a New Datatype?}
   274 \subsection{Why a New Datatype?}
   257 The reason we take all the trouble 
   275 The reason we
   258 defining a new datatype is that $\erase$ makes things harder.
   276 define a new datatype is that 
       
   277 the $\erase$ function 
       
   278 does not preserve the structure of annotated
       
   279 regular expressions.
   259 We initially started by using 
   280 We initially started by using 
   260 plain regular expressions and tried to prove
   281 plain regular expressions and tried to prove
   261 the lemma \ref{rsizeAsize},
   282 lemma \ref{rsizeAsize},
   262 however the $\erase$ function unavoidbly messes with the structure of the 
   283 however the $\erase$ function messes with the structure of the 
   263 annotated regular expression.
   284 annotated regular expression.
   264 The $+$ constructor
   285 The $+$ constructor
   265 of basic regular expressions is binary whereas $\sum$ 
   286 of basic regular expressions is only binary, whereas $\sum$ 
   266 takes a list, and one has to convert between them:
   287 takes a list. Therefore we need to convert between
   267 \begin{center}
   288 annotated and normal regular expressions as follows:
   268 	\begin{tabular}{ccc}
   289 \begin{center}
       
   290 	\begin{tabular}{lcl}
   269 		$\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\
   291 		$\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\
   270 		$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\
   292 		$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\
   271 		$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$
   293 		$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$
   272 	\end{tabular}
   294 	\end{tabular}
   273 \end{center}
   295 \end{center}
   274 \noindent
   296 \noindent
   275 An alternative regular expression with an empty argument list
   297 As can be seen alternative regular expression with an empty argument list
   276 will be turned into a $\ZERO$.
   298 will be turned into a $\ZERO$.
   277 The singleton alternative $\sum [r]$ would have $r$ during the
   299 The singleton alternative $\sum [r]$ becomes $r$ during the
   278 $\erase$ function.
   300 $\erase$ function.
   279 The  annotated regular expression $\sum[a, b, c]$ would turn into
   301 The  annotated regular expression $\sum[a, b, c]$ would turn into
   280 $(a+(b+c))$.
   302 $(a+(b+c))$.
   281 All these operations change the size and structure of
   303 All these operations change the size and structure of
   282 an annotated regular expressions, adding unnecessary 
   304 an annotated regular expressions, adding unnecessary 
   283 complexities to the size bound proof.\\
   305 complexities to the size bound proof.\\
   284 For example, if we define the size of a basic plain regular expression 
   306 For example, if we define the size of a basic plain regular expression 
   285 in the usual way,
   307 in the usual way,
   286 \begin{center}
   308 \begin{center}
   287 	\begin{tabular}{ccc}
   309 	\begin{tabular}{lcl}
   288 		$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
   310 		$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
   289 		$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
   311 		$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
   290 		$\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\
   312 		$\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\
   291 		$\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\
   313 		$\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\
   292 		$\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\
   314 		$\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\
   301 does not hold.
   323 does not hold.
   302 With $\textit{rerase}$, however, 
   324 With $\textit{rerase}$, however, 
   303 only the bitcodes are thrown away.
   325 only the bitcodes are thrown away.
   304 Everything about the structure remains intact.
   326 Everything about the structure remains intact.
   305 Therefore it does not change the size
   327 Therefore it does not change the size
   306 of an annotated regular expression:
   328 of an annotated regular expression and we have:
   307 \begin{lemma}\label{rsizeAsize}
   329 \begin{lemma}\label{rsizeAsize}
   308 	$\rsize{\rerase a} = \asize a$
   330 	$\rsize{\rerase a} = \asize a$
   309 \end{lemma}
   331 \end{lemma}
   310 \begin{proof}
   332 \begin{proof}
   311 	By routine structural induction on $a$.
   333 	By routine structural induction on $a$.
   315 $\llbracket a \rrbracket  \leq \llbracket  a_\downarrow \rrbracket_p $
   337 $\llbracket a \rrbracket  \leq \llbracket  a_\downarrow \rrbracket_p $
   316 and then estimate $\llbracket  a_\downarrow \rrbracket_p$,
   338 and then estimate $\llbracket  a_\downarrow \rrbracket_p$,
   317 but we found our approach more straightforward.\\
   339 but we found our approach more straightforward.\\
   318 
   340 
   319 \subsection{Functions for R-regular Expressions}
   341 \subsection{Functions for R-regular Expressions}
   320 We shall define the r-regular expression version
   342 In this section we shall define the r-regular expression version
   321 of $\blexer$ and $\blexersimp$ related functions.
   343 of $\blexer$, and $\blexersimp$ related functions.
   322 We use $r$ as the prefix or subscript to differentiate
   344 We use $r$ as the prefix or subscript to differentiate
   323 with the bitcoded version.
   345 with the bitcoded version.
   324 For example,$\backslash_r$, $\rdistincts$, and $\rsimp$
   346 %For example,$\backslash_r$, $\rdistincts$, and $\rsimp$
   325 as opposed to $\backslash$, $\distinctBy$, and $\bsimp$.
   347 %as opposed to $\backslash$, $\distinctBy$, and $\bsimp$.
   326 As promised, they are much simpler than their bitcoded counterparts.
   348 %As promised, they are much simpler than their bitcoded counterparts.
   327 %The operations on r-regular expressions are 
   349 %The operations on r-regular expressions are 
   328 %almost identical to those of the annotated regular expressions,
   350 %almost identical to those of the annotated regular expressions,
   329 %except that no bitcodes are used. For example,
   351 %except that no bitcodes are used. For example,
   330 The derivative operation becomes simpler:\\
   352 The derivative operation for an r-regular expression is\\
   331 \begin{center}
   353 \begin{center}
   332 	\begin{tabular}{@{}lcl@{}}
   354 	\begin{tabular}{@{}lcl@{}}
   333 		$(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\  
   355 		$(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\  
   334 		$(\ONE)\,\backslash_r c$ & $\dn$ &
   356 		$(\ONE)\,\backslash_r c$ & $\dn$ &
   335 		$\textit{if}\;c=d\; \;\textit{then}\;
   357 		$\textit{if}\;c=d\; \;\textit{then}\;
   345 		$( r\,\backslash_r c)\cdot
   367 		$( r\,\backslash_r c)\cdot
   346 		(_{[]}r^*))$
   368 		(_{[]}r^*))$
   347 	\end{tabular}    
   369 	\end{tabular}    
   348 \end{center}  
   370 \end{center}  
   349 \noindent
   371 \noindent
   350 Similarly, $\distinctBy$ does not need 
   372 where we omit the definition of $\textit{rnullable}$.
       
   373 
       
   374 The function $\distinctBy$ for r-regular expressions does not need 
   351 a function checking equivalence because
   375 a function checking equivalence because
   352 there are no bit annotations causing superficial differences
   376 there are no bit annotations.
   353 between syntactically equal terms.
   377 Therefore we have
   354 \begin{center}
   378 \begin{center}
   355 	\begin{tabular}{lcl}
   379 	\begin{tabular}{lcl}
   356 		$\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\
   380 		$\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\
   357 		$\rdistinct{r :: rs}{rset}$ & $\dn$ & 
   381 		$\rdistinct{r :: rs}{rset}$ & $\dn$ & 
   358 		$\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
   382 		$\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
   360 					    r::\rdistinct{rs}{(rset \cup \{r\})}$
   384 					    r::\rdistinct{rs}{(rset \cup \{r\})}$
   361 	\end{tabular}
   385 	\end{tabular}
   362 \end{center}
   386 \end{center}
   363 %TODO: definition of rsimp (maybe only the alternative clause)
   387 %TODO: definition of rsimp (maybe only the alternative clause)
   364 \noindent
   388 \noindent
   365 We would like to make clear
   389 %We would like to make clear
   366 a difference between our $\rdistincts$ and
   390 %a difference between our $\rdistincts$ and
   367 the Isabelle $\textit {distinct}$ predicate.
   391 %the Isabelle $\textit {distinct}$ predicate.
   368 In Isabelle $\textit{distinct}$ is a function that returns a boolean
   392 %In Isabelle $\textit{distinct}$ is a function that returns a boolean
   369 rather than a list.
   393 %rather than a list.
   370 It tests if all the elements of a list are unique.\\
   394 %It tests if all the elements of a list are unique.\\
   371 With $\textit{rdistinct}$,
   395 With $\textit{rdistinct}$ in place,
   372 and the flatten function for $\rrexp$s:
   396 the flatten function for $\rrexp$ is as follows:
   373  \begin{center}
   397  \begin{center}
   374   \begin{tabular}{@{}lcl@{}}
   398   \begin{tabular}{@{}lcl@{}}
   375   $\textit{rflts} \; (\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $as \; @ \; \textit{rflts} \; as' $ \\
   399   $\textit{rflts} \; (\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $as \; @ \; \textit{rflts} \; as' $ \\
   376   $\textit{rflts} \; \ZERO :: as'$ & $\dn$ & $ \textit{rflts} \;  \textit{as'} $ \\
   400   $\textit{rflts} \; \ZERO :: as'$ & $\dn$ & $ \textit{rflts} \;  \textit{as'} $ \\
   377     $\textit{rflts} \; a :: as'$ & $\dn$ & $a :: \textit{rflts} \; \textit{as'}$ \quad(otherwise) 
   401     $\textit{rflts} \; a :: as'$ & $\dn$ & $a :: \textit{rflts} \; \textit{as'}$ \quad(otherwise) 
   378 \end{tabular}    
   402 \end{tabular}    
   379 \end{center}  
   403 \end{center}  
   380 \noindent
   404 \noindent
   381 one can chain together all the other modules
   405 The function 
   382 such as $\rsimpalts$:
   406 $\rsimpalts$ corresponds to $\textit{bsimp}_{ALTS}$:
   383 \begin{center}
   407 \begin{center}
   384   \begin{tabular}{@{}lcl@{}}
   408   \begin{tabular}{@{}lcl@{}}
   385 	  $\rsimpalts \;\; nil$ & $\dn$ & $\RZERO$\\
   409 	  $\rsimpalts \;\; nil$ & $\dn$ & $\RZERO$\\
   386 	  $\rsimpalts \;\; r::nil$ & $\dn$ & $r$\\
   410 	  $\rsimpalts \;\; r::nil$ & $\dn$ & $r$\\
   387 	  $\rsimpalts \;\; rs$ & $\dn$ & $\sum rs$\\
   411 	  $\rsimpalts \;\; rs$ & $\dn$ & $\sum rs$\\
   388 \end{tabular}    
   412 \end{tabular}    
   389 \end{center}  
   413 \end{center}  
   390 \noindent
   414 \noindent
   391 and $\rsimpseq$:
   415 Similarly, we have $\rsimpseq$ which corresponds to $\textit{bsimp}_{SEQ}$:
   392 \begin{center}
   416 \begin{center}
   393   \begin{tabular}{@{}lcl@{}}
   417   \begin{tabular}{@{}lcl@{}}
   394 	  $\rsimpseq \;\; \RZERO \; \_ $ &   $=$ &   $\RZERO$\\
   418 	  $\rsimpseq \;\; \RZERO \; \_ $ &   $=$ &   $\RZERO$\\
   395 	  $\rsimpseq \;\; \_ \; \RZERO $ &   $=$ &   $\RZERO$\\
   419 	  $\rsimpseq \;\; \_ \; \RZERO $ &   $=$ &   $\RZERO$\\
   396 	  $\rsimpseq \;\; \RONE \cdot r_2$ & $\dn$ & $r_2$\\
   420 	  $\rsimpseq \;\; \RONE \cdot r_2$ & $\dn$ & $r_2$\\
   418 $r \backslash_{rsimps} [\,] $ & $\dn$ & $r$
   442 $r \backslash_{rsimps} [\,] $ & $\dn$ & $r$
   419 	\end{tabular}
   443 	\end{tabular}
   420 \end{center}
   444 \end{center}
   421 \noindent
   445 \noindent
   422 We do not define an r-regular expression version of $\blexersimp$,
   446 We do not define an r-regular expression version of $\blexersimp$,
   423 as our proof does not involve its use 
   447 as our proof does not depend on it.
   424 (and there is no bitcode to decode into a lexical value). 
       
   425 Now we are ready to introduce how r-regular expressions allow
   448 Now we are ready to introduce how r-regular expressions allow
   426 us to prove the size bound on bitcoded regular expressions.
   449 us to prove the size bound on bitcoded regular expressions.
   427 
   450 
   428 \subsection{Using R-regular Expressions to Bound Bit-coded Regular Expressions}
   451 \subsection{Using R-regular Expressions to Bound Bit-coded Regular Expressions}
   429 Everything about the size of annotated regular expressions after the application
   452 Everything about the size of annotated regular expressions after the application
   430 of function $\bsimp$ and $\backslash_{simps}$
   453 of function $\bsimp$ and $\backslash_{simps}$
   431 can be calculated via the size of r-regular expressions after the application
   454 can be calculated via the size of r-regular expressions after the application
   432 of $\rsimp$ and $\backslash_{rsimps}$:
   455 of $\rsimp$ and $\backslash_{rsimps}$:
   433 \begin{lemma}\label{sizeRelations}
   456 \begin{lemma}\label{sizeRelations}
   434 	The following equalities hold:
   457 	The following two equalities hold:
   435 	\begin{itemize}
   458 	\begin{itemize}
   436 		\item
   459 		\item
   437 			$\asize{\bsimps \; a} = \rsize{\rsimp{ \rerase{a}}}$
   460 			$\asize{\bsimps \; a} = \rsize{\rsimp{ \rerase{a}}}$
   438 		\item
   461 		\item
   439 			$\asize{\bderssimp{a}{s}} =  \rsize{\rderssimp{\rerase{a}}{s}}$
   462 			$\asize{\bderssimp{a}{s}} =  \rsize{\rderssimp{\rerase{a}}{s}}$
   454 \begin{center}
   477 \begin{center}
   455 	$\rsize{\rderssimp{\rerase{a}}{s}} \leq N_r \quad$
   478 	$\rsize{\rderssimp{\rerase{a}}{s}} \leq N_r \quad$
   456 	implies
   479 	implies
   457 	$\quad \llbracket a \backslash_{bsimps} s \rrbracket \leq N_r$.
   480 	$\quad \llbracket a \backslash_{bsimps} s \rrbracket \leq N_r$.
   458 \end{center}
   481 \end{center}
   459 From now on we 
   482 %From now on we 
   460 Unless stated otherwise in the rest of this 
   483 %Unless stated otherwise in the rest of this 
   461 chapter all regular expressions without
   484 %chapter all regular expressions without
   462 bitcodes are seen as r-regular expressions ($\rrexp$s).
   485 %bitcodes are seen as r-regular expressions ($\rrexp$s).
   463 For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$,
   486 %For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$,
   464 we use the notation $r_1 + r_2$
   487 %we use the notation $r_1 + r_2$
   465 for brevity.
   488 %for brevity.
   466 
   489 
   467 
   490 
   468 %-----------------------------------
   491 %-----------------------------------
   469 %	SUB SECTION ROADMAP RREXP BOUND
   492 %	SUB SECTION ROADMAP RREXP BOUND
   470 %-----------------------------------
   493 %-----------------------------------
   519 \end{center}
   542 \end{center}
   520 The problem is that it is not clear what 
   543 The problem is that it is not clear what 
   521 $(r_1\cdot r_2) \backslash_{rsimps} s$ looks like,
   544 $(r_1\cdot r_2) \backslash_{rsimps} s$ looks like,
   522 and therefore $N_{r_1}$ and $N_{r_2}$ in the
   545 and therefore $N_{r_1}$ and $N_{r_2}$ in the
   523 inductive hypotheses cannot be directly used.
   546 inductive hypotheses cannot be directly used.
   524 We have already seen that $(r_1 \cdot r_2)\backslash s$ 
   547 %We have already seen that $(r_1 \cdot r_2)\backslash s$ 
   525 and $(r^*)\backslash s$ can grow in a wild way.
   548 %and $(r^*)\backslash s$ can grow in a wild way.
   526 
   549 
   527 The point is that they will be equivalent to a list of
   550 The point however, is that they will be equivalent to a list of
   528 terms $\sum rs$, where each term in $rs$ will
   551 terms $\sum rs$, where each term in $rs$ will
   529 be made of $r_1 \backslash s' $, $r_2\backslash s'$,
   552 be made of $r_1 \backslash s' $, $r_2\backslash s'$,
   530 and $r \backslash s'$ with $s' \in \textit{SubString} \; s$.
   553 and $r \backslash s'$ with $s' \in \textit{SubString} \; s$ (which stands
       
   554 for the set of substrings of $s$).
   531 The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$
   555 The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$
   532 in the simplification which saves $rs$ from growing indefinitely.
   556 in the simplification, which prevents the $rs$ from growing indefinitely.
   533 
   557 
   534 Based on this idea, we develop a proof in two steps.
   558 Based on this idea, we develop a proof in two steps.
   535 First, we show the equality (where
   559 First, we show the equality (where
   536 $f$ and $g$ are functions that do not increase the size of the input)
   560 $f$ and $g$ are functions that do not increase the size of the input)
   537 \begin{center}
   561 \begin{center}
   545 	\end{center}
   569 	\end{center}
   546 We call the right-hand-side the 
   570 We call the right-hand-side the 
   547 \emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$.
   571 \emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$.
   548 Second, we will bound the closed form of r-regular expressions
   572 Second, we will bound the closed form of r-regular expressions
   549 using some estimation techniques
   573 using some estimation techniques
   550 and then piece it together
   574 and then apply
   551 with lemma \ref{sizeRelations} to show the bitcoded regular expressions
   575 lemma \ref{sizeRelations} to show that the bitcoded regular expressions
   552 in our $\blexersimp$ are finitely bounded.
   576 in our $\blexersimp$ are finitely bounded.
   553 
   577 
   554 We will flesh out the first step of the proof we 
   578 We will describe in detail the first step of the proof
   555 sketched just now in the next section.
   579 in the next section.
   556 
   580 
   557 \section{Closed Forms}
   581 \section{Closed Forms}
   558 In this section we introduce in detail
   582 In this section we introduce in detail
   559 how the closed forms are obtained for regular expressions'
   583 how we express the string derivatives
   560 derivatives.
   584 of regular expressions (i.e. $r \backslash_r s$ where $s$ is a string
       
   585 rather than a single character) in a different way than 
       
   586 our previous definition.
       
   587 In previous chapters, the derivative of a 
       
   588 regular expression $r$ w.r.t a string $s$
       
   589 was recursively defined on the string:
       
   590 \begin{center}
       
   591 	$r \backslash_s (c::s) \dn (r \backslash c) \backslash_s s$
       
   592 \end{center}
       
   593 The problem is that 
       
   594 this definition does not provide much information
       
   595 on what $r \backslash_s s$ looks like.
       
   596 If we are interested in the size of a derivative like 
       
   597 $(r_1 \cdot r_2)\backslash s$,
       
   598 we have to somehow get a more concrete form to begin.
       
   599 We call such more concrete representations the ``closed forms'' of
       
   600 string derivatives as opposed to their original definitions.
       
   601 The terminology ``closed form'' is borrowed from mathematics,
       
   602 which usually describe expressions that are solely comprised of
       
   603 well-known and easy-to-compute operations such as 
       
   604 additions, multiplications, exponential functions.
       
   605 
   561 We start by proving some basic identities
   606 We start by proving some basic identities
   562 involving the simplification functions for r-regular expressions.
   607 involving the simplification functions for r-regular expressions.
   563 After that we introduce the rewrite relations
   608 After that we introduce the rewrite relations
   564 $\rightsquigarrow_h$, $\rightsquigarrow^*_{scf}$
   609 $\rightsquigarrow_h$, $\rightsquigarrow^*_{scf}$
   565 $\rightsquigarrow_f$ and $\rightsquigarrow_g$.
   610 $\rightsquigarrow_f$ and $\rightsquigarrow_g$.
   566 These relations involves similar techniques in chapter \ref{Bitcoded2}.
   611 These relations involves similar techniques as in chapter \ref{Bitcoded2}
       
   612 for annotated regular expressions.
   567 Finally, we use these identities to establish the
   613 Finally, we use these identities to establish the
   568 closed forms of the alternative regular expression,
   614 closed forms of the alternative regular expression,
   569 the sequence regular expression, and the star regular expression.
   615 the sequence regular expression, and the star regular expression.
   570 %$r_1\cdot r_2$, $r^*$ and $\sum rs$.
   616 %$r_1\cdot r_2$, $r^*$ and $\sum rs$.
   571 
   617 
   572 
   618 
   573 
   619 
   574 \subsection{Some Basic Identities}
   620 \subsection{Some Basic Identities}
   575 
   621 
   576 We now introduce lemmas 
   622 In what follows we will often convert between lists
   577 that are repeatedly used in later proofs.
   623 and sets.
   578 Note that for the $\textit{rdistinct}$ function there
   624 We use Isabelle's $set$ to refere to the 
   579 will be a lot of conversion from lists to sets.
       
   580 We use $set$ to refere to the 
       
   581 function that converts a list $rs$ to the set
   625 function that converts a list $rs$ to the set
   582 containing all the elements in $rs$.
   626 containing all the elements in $rs$.
   583 \subsubsection{$\textit{rdistinct}$'s Does the Job of De-duplication}
   627 \subsubsection{$\textit{rdistinct}$'s Does the Job of De-duplication}
   584 The $\textit{rdistinct}$ function, as its name suggests, will
   628 The $\textit{rdistinct}$ function, as its name suggests, will
   585 de-duplicate an r-regular expression list.
   629 de-duplicate an r-regular expression list.
   587 is already in the accumulator set.
   631 is already in the accumulator set.
   588 \begin{lemma}\label{rdistinctDoesTheJob}
   632 \begin{lemma}\label{rdistinctDoesTheJob}
   589 	%The function $\textit{rdistinct}$ satisfies the following
   633 	%The function $\textit{rdistinct}$ satisfies the following
   590 	%properties:
   634 	%properties:
   591 	Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its
   635 	Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its
   592 	recursive definition here, its Isabelle counterpart would be $\textit{distinct}$.} 
   636 	recursive definition here. Its Isabelle counterpart would be $\textit{distinct}$.} 
   593 	readily defined
       
   594 	for testing
   637 	for testing
   595 	whether a list's elements are unique. Then the following
   638 	whether a list's elements are unique. Then the following
   596 	properties about $\textit{rdistinct}$ hold:
   639 	properties about $\textit{rdistinct}$ hold:
   597 	\begin{itemize}
   640 	\begin{itemize}
   598 		\item
   641 		\item
   673 \end{lemma}
   716 \end{lemma}
   674 \noindent
   717 \noindent
   675 \begin{proof}
   718 \begin{proof}
   676 	By induction on $rs$. All other variables are allowed to be arbitrary.
   719 	By induction on $rs$. All other variables are allowed to be arbitrary.
   677 	The second part of the lemma requires the first.
   720 	The second part of the lemma requires the first.
   678 	Note that for each part, the two sub-propositions need to be proven concurrently,
   721 	Note that for each part, the two sub-propositions need to be proven 
       
   722 	at the same time,
   679 	so that the induction goes through.
   723 	so that the induction goes through.
   680 \end{proof}
   724 \end{proof}
   681 \noindent
   725 \noindent
   682 This allows us to prove a few more equivalence relations involving 
   726 This allows us to prove a few more equivalence relations involving 
   683 $\textit{rdistinct}$ (it will be useful later):
   727 $\textit{rdistinct}$ (they will be useful later):
   684 \begin{lemma}\label{rdistinctConcatGeneral}
   728 \begin{lemma}\label{rdistinctConcatGeneral}
   685 	\mbox{}
   729 	\mbox{}
   686 	\begin{itemize}
   730 	\begin{itemize}
   687 		\item
   731 		\item
   688 			$\rdistinct{(rs @ rs')}{\varnothing} = \rdistinct{((\rdistinct{rs}{\varnothing})@ rs')}{\varnothing}$
   732 			$\rdistinct{(rs @ rs')}{\varnothing} = \rdistinct{((\rdistinct{rs}{\varnothing})@ rs')}{\varnothing}$
   721 	By an induction on 
   765 	By an induction on 
   722 	$rs_1$, where $rsa$ and $acc$ are allowed to be arbitrary.
   766 	$rs_1$, where $rsa$ and $acc$ are allowed to be arbitrary.
   723 \end{proof}
   767 \end{proof}
   724 \noindent
   768 \noindent
   725 $\textit{rdistinct}$ needs to be applied only once, and 
   769 $\textit{rdistinct}$ needs to be applied only once, and 
   726 applying it multiple times does not cause any difference:
   770 applying it multiple times does not make any difference:
   727 \begin{corollary}\label{distinctOnceEnough}
   771 \begin{corollary}\label{distinctOnceEnough}
   728 	$\textit{rdistinct} \; (rs @ rsa) {} = \textit{rdistinct} \; (rdistinct \; 
   772 	$\textit{rdistinct} \; (rs @ rsa) {} = \textit{rdistinct} \; (rdistinct \; 
   729 	rs \{ \} @ (\textit{rdistinct} \; rs_a \; (set \; rs)))$
   773 	rs \{ \} @ (\textit{rdistinct} \; rs_a \; (set \; rs)))$
   730 \end{corollary}
   774 \end{corollary}
   731 \begin{proof}
   775 \begin{proof}
   746 %When the function $\textit{Rflts}$ 
   790 %When the function $\textit{Rflts}$ 
   747 %is applied to the concatenation of two lists, the output can be calculated by first applying the
   791 %is applied to the concatenation of two lists, the output can be calculated by first applying the
   748 %functions on two lists separately, and then concatenating them together.
   792 %functions on two lists separately, and then concatenating them together.
   749 $\textit{Rflts}$ is composable in terms of concatenation:
   793 $\textit{Rflts}$ is composable in terms of concatenation:
   750 \begin{lemma}\label{rfltsProps}
   794 \begin{lemma}\label{rfltsProps}
   751 	The function $\rflts$ has the below properties:\\
   795 	The function $\rflts$ has the properties below:\\
   752 	\begin{itemize}
   796 	\begin{itemize}
   753 		\item
   797 		\item
   754 			$\rflts \; (rs_1 @ rs_2) = \rflts \; rs_1 @ \rflts \; rs_2$
   798 			$\rflts \; (rs_1 @ rs_2) = \rflts \; rs_1 @ \rflts \; rs_2$
   755 		\item
   799 		\item
   756 			If $r \neq \RZERO$ and $\nexists rs_1. r = \RALTS{rs}_1$, then $\rflts \; (r::rs) = r :: \rflts \; rs$
   800 			If $r \neq \RZERO$ and $\nexists rs_1. r = \RALTS{rs}_1$, then $\rflts \; (r::rs) = r :: \rflts \; rs$
   777 	last sub-lemma.
   821 	last sub-lemma.
   778 \end{proof}
   822 \end{proof}
   779 \noindent
   823 \noindent
   780 Now we introduce the property that the operations 
   824 Now we introduce the property that the operations 
   781 derivative and $\rsimpalts$
   825 derivative and $\rsimpalts$
   782 commute, this will be used later in deriving the closed form for
   826 commute, this will be used later on, when deriving the closed form for
   783 the alternative regular expression:
   827 the alternative regular expression:
   784 \begin{lemma}\label{rderRsimpAltsCommute}
   828 \begin{lemma}\label{rderRsimpAltsCommute}
   785 	$\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$
   829 	$\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$
   786 \end{lemma}
   830 \end{lemma}
       
   831 \begin{proof}
       
   832 	By induction on $rs$.
       
   833 \end{proof}
   787 \noindent
   834 \noindent
   788 
   835 
   789 \subsubsection{The $RL$ Function: Language Interpretation of $\textit{Rrexp}$s}
   836 \subsubsection{The $RL$ Function: Language Interpretation of $\textit{Rrexp}$s}
   790 Much like the definition of $L$ on plain regular expressions, one could also 
   837 Much like the definition of $L$ on plain regular expressions, one can also 
   791 define the language interpretation of $\rrexp$s.
   838 define the language interpretation of $\rrexp$s.
   792 \begin{center}
   839 \begin{center}
   793 	\begin{tabular}{lcl}
   840 	\begin{tabular}{lcl}
   794 		$RL \; (\ZERO)$ & $\dn$ & $\phi$\\
   841 		$RL \; (\ZERO)$ & $\dn$ & $\phi$\\
   795 		$RL \; (\ONE)$ & $\dn$ & $\{[]\}$\\
   842 		$RL \; (\ONE)$ & $\dn$ & $\{[]\}$\\
   818 \end{proof}
   865 \end{proof}
   819 
   866 
   820 \subsubsection{Simplified $\textit{Rrexp}$s are Good}
   867 \subsubsection{Simplified $\textit{Rrexp}$s are Good}
   821 We formalise the notion of ``good" regular expressions,
   868 We formalise the notion of ``good" regular expressions,
   822 which means regular expressions that
   869 which means regular expressions that
   823 are not fully simplified. For alternative regular expressions that means they
   870 are fully simplified in terms of our $\textit{rsimp}$ function. 
   824 do not contain any nested alternatives like 
   871 For alternative regular expressions that means they
   825 \[ r_1 + (r_2 + r_3) \], un-removed $\RZERO$s like \[\RZERO + r\]
   872 do not contain any nested alternatives, un-eliminated $\RZERO$s
   826 or duplicate elements in a children regular expression list like \[ \sum [r, r, \ldots]\]:
   873 or duplicate elements (for example, 
       
   874 $r_1 + (r_2 + r_3)$, $\RZERO + r$ and $ \sum [r, r, \ldots]$).
       
   875 The clauses for $\good$ are:
   827 \begin{center}
   876 \begin{center}
   828 	\begin{tabular}{@{}lcl@{}}
   877 	\begin{tabular}{@{}lcl@{}}
   829 		$\good\; \RZERO$ & $\dn$ & $\textit{false}$\\
   878 		$\good\; \RZERO$ & $\dn$ & $\textit{false}$\\
   830 		$\good\; \RONE$ & $\dn$ & $\textit{true}$\\
   879 		$\good\; \RONE$ & $\dn$ & $\textit{true}$\\
   831 		$\good\; \RCHAR{c}$ & $\dn$ & $\btrue$\\
   880 		$\good\; \RCHAR{c}$ & $\dn$ & $\btrue$\\
   832 		$\good\; \RALTS{[]}$ & $\dn$ & $\bfalse$\\
   881 		$\good\; \RALTS{[]}$ & $\dn$ & $\bfalse$\\
   833 		$\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\
   882 		$\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\
   834 		$\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ & 
   883 		$\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ & 
   835 		$\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\
   884 		$\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\
   836 						   & & $\textit{and}\; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \,  \textit{and}\; \, \textit{nonAlt}\; r')$\\
   885 						   & & $\land \; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \,  \land \; \, \textit{nonAlt}\; r')$\\
   837 		$\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\
   886 		$\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\
   838 		$\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\
   887 		$\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\
   839 		$\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\
   888 		$\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\
   840 		$\good \; \RSEQ{r_1}{r_2}$ & $\dn$ & $\good \; r_1 \;\, \textit{and} \;\, \good \; r_2$\\
   889 		$\good \; \RSEQ{r_1}{r_2}$ & $\dn$ & $\good \; r_1 \;\, \textit{and} \;\, \good \; r_2$\\
   841 		$\good \; \RSTAR{r}$ & $\dn$ & $\btrue$\\
   890 		$\good \; \RSTAR{r}$ & $\dn$ & $\btrue$\\
   842 	\end{tabular}
   891 	\end{tabular}
   843 \end{center}
   892 \end{center}
   844 \noindent
   893 \noindent
   845 The predicate $\textit{nonAlt}$ evaluates to true when the regular expression is not an
   894 We omit the recursive definition of the predicate $\textit{nonAlt}$,
       
   895 which evaluates to true when the regular expression is not an
   846 alternative, and false otherwise.
   896 alternative, and false otherwise.
   847 The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that
   897 The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that
   848 its non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$, 
   898 its non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$, 
   849 and unique:
   899 and unique:
   850 \begin{lemma}\label{rsimpaltsGood}
   900 \begin{lemma}\label{rsimpaltsGood}
   877 The $\rflts$ function
   927 The $\rflts$ function
   878 always opens up nested alternatives,
   928 always opens up nested alternatives,
   879 which enables $\rsimp$ to be non-nested:
   929 which enables $\rsimp$ to be non-nested:
   880 
   930 
   881 \begin{lemma}\label{nonnestedRsimp}
   931 \begin{lemma}\label{nonnestedRsimp}
   882 	$\nonnested \; (\rsimp{r})$
   932 	It is always the case that
   883 \end{lemma}
   933 	\begin{center}
   884 \begin{proof}
   934 		$\nonnested \; (\rsimp{r})$
   885 	By an induction on $r$.
   935 	\end{center}
       
   936 \end{lemma}
       
   937 \begin{proof}
       
   938 	By induction on $r$.
   886 \end{proof}
   939 \end{proof}
   887 \noindent
   940 \noindent
   888 With this we could prove that a regular expressions
   941 With this we could prove that a regular expressions
   889 after simplification and flattening and de-duplication,
   942 after simplification and flattening and de-duplication,
   890 will not contain any alternative regular expression directly:
   943 will not contain any alternative regular expression directly:
   894 \end{lemma}
   947 \end{lemma}
   895 \begin{proof}
   948 \begin{proof}
   896 	By \ref{nonnestedRsimp}.
   949 	By \ref{nonnestedRsimp}.
   897 \end{proof}
   950 \end{proof}
   898 \noindent
   951 \noindent
   899 The other thing we know is that once $\rsimp{}$ had finished
   952 The other fact we know is that once $\rsimp{}$ has finished
   900 processing an alternative regular expression, it will not
   953 processing an alternative regular expression, it will not
   901 contain any $\RZERO$s, this is because all the recursive 
   954 contain any $\RZERO$s. This is because all the recursive 
   902 calls to the simplification on the children regular expressions
   955 calls to the simplification on the children regular expressions
   903 make the children good, and $\rflts$ will not take out
   956 make the children good, and $\rflts$ will not delete
   904 any $\RZERO$s out of a good regular expression list,
   957 any $\RZERO$s out of a good regular expression list,
   905 and $\rdistinct{}$ will not mess with the result.
   958 and $\rdistinct{}$ will not ``mess'' with the result.
   906 \begin{lemma}\label{flts3Obv}
   959 \begin{lemma}\label{flts3Obv}
   907 	The following are true:
   960 	The following are true:
   908 	\begin{itemize}
   961 	\begin{itemize}
   909 		\item
   962 		\item
   910 			If for all $r \in rs. \, \good \; r $ or $r = \RZERO$,
   963 			If for all $r \in rs. \, \good \; r $ or $r = \RZERO$,
   929 \begin{lemma}\label{good1}
   982 \begin{lemma}\label{good1}
   930 	For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$.
   983 	For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$.
   931 \end{lemma}
   984 \end{lemma}
   932 \begin{proof}
   985 \begin{proof}
   933 	By an induction on $r$. The inductive measure is the size $\llbracket \rrbracket_r$.
   986 	By an induction on $r$. The inductive measure is the size $\llbracket \rrbracket_r$.
   934 	Lemma \ref{rsimpSize} says that 
   987 	Lemma \ref{rsimpMono} says that 
   935 	$\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to
   988 	$\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to
   936 	$\llbracket r \rrbracket_r$.
   989 	$\llbracket r \rrbracket_r$.
   937 	Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case,
   990 	Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case,
   938 	Inductive hypothesis applies to the children regular expressions
   991 	Inductive hypothesis applies to the children regular expressions
   939 	$r_1$, $r_2$, etc. The lemma \ref{flts3Obv}'s precondition is satisfied
   992 	$r_1$, $r_2$, etc. The lemma \ref{flts3Obv}'s precondition is satisfied
   951 	By an induction on $\sum rs$. The inductive rules are the cases
  1004 	By an induction on $\sum rs$. The inductive rules are the cases
   952 	for $\good$.
  1005 	for $\good$.
   953 \end{proof}
  1006 \end{proof}
   954 \noindent
  1007 \noindent
   955 Now we are ready to prove that good regular expressions are invariant
  1008 Now we are ready to prove that good regular expressions are invariant
   956 of $\rsimp{}$ application:
  1009 with respect to $\rsimp{}$:
   957 \begin{lemma}\label{test}
  1010 \begin{lemma}\label{test}
   958 	If $\good \;r$ then $\rsimp{r} = r$.
  1011 	If $\good \;r$ then $\rsimp{r} = r$.
   959 \end{lemma}
  1012 \end{lemma}
   960 \begin{proof}
  1013 \begin{proof}
   961 	By an induction on the inductive cases of $\good$, using lemmas
  1014 	By an induction on the inductive cases of $\good$, using lemmas
   962 	\ref{goodaltsNonalt} and \ref{rdistinctOnDistinct}.
  1015 	\ref{goodaltsNonalt} and \ref{rdistinctOnDistinct}.
   963 	The lemma \ref{goodaltsNonalt} is used in the alternative
  1016 	The lemma \ref{goodaltsNonalt} is used in the alternative
   964 	case where 2 or more elements are present in the list.
  1017 	case where 2 or more elements are present in the list.
   965 \end{proof}
  1018 \end{proof}
   966 \noindent
  1019 \noindent
   967 Given below is a property involving $\rflts$, $\rdistinct{}{}$, $\rsimp{}$ and $\rsimp_{ALTS}$,
  1020 Given below is a property involving $\rflts$, $\textit{rdistinct}$, 
   968 which requires $\ref{good1}$ to go through smoothly.
  1021 $\rsimp{}$ and $\rsimp_{ALTS}$,
   969 It says that an application of $\rsimp_{ALTS}$ can be "absorbed",
  1022 which requires $\ref{good1}$ to go through smoothly:
   970 if it its output is concatenated with a list and then applied to $\rflts$.
       
   971 \begin{lemma}\label{flattenRsimpalts}
  1023 \begin{lemma}\label{flattenRsimpalts}
       
  1024 An application of $\rsimp_{ALTS}$ can be ``absorbed'',
       
  1025 if its output is concatenated with a list and then applied to $\rflts$.
       
  1026 \begin{center}
   972 	$\rflts \; ( (\rsimp_{ALTS} \; 
  1027 	$\rflts \; ( (\rsimp_{ALTS} \; 
   973 	(\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing})) :: 
  1028 	(\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing})) :: 
   974 	\map \; \rsimp{} \; rs' ) = 
  1029 	\map \; \rsimp{} \; rs' ) = 
   975 	\rflts \; ( (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing}) @ (
  1030 	\rflts \; ( (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing}) @ (
   976 	\map \; \rsimp{rs'}))$
  1031 	\map \; \rsimp{rs'}))$
       
  1032 \end{center}
   977 
  1033 
   978 
  1034 
   979 \end{lemma}
  1035 \end{lemma}
   980 \begin{proof}
  1036 \begin{proof}
   981 	By \ref{good1}.
  1037 	By \ref{good1}.
   991 The idempotency of $\rsimp$ is very useful in 
  1047 The idempotency of $\rsimp$ is very useful in 
   992 manipulating regular expression terms into desired
  1048 manipulating regular expression terms into desired
   993 forms so that key steps allowing further rewriting to closed forms
  1049 forms so that key steps allowing further rewriting to closed forms
   994 are possible.
  1050 are possible.
   995 \begin{lemma}\label{rsimpIdem}
  1051 \begin{lemma}\label{rsimpIdem}
   996 	$\rsimp{r} = \rsimp{\rsimp{r}}$
  1052 	$\rsimp{r} = \rsimp{(\rsimp{r})}$
   997 \end{lemma}
  1053 \end{lemma}
   998 
  1054 
   999 \begin{proof}
  1055 \begin{proof}
  1000 	By \ref{test} and \ref{good1}.
  1056 	By \ref{test} and \ref{good1}.
  1001 \end{proof}
  1057 \end{proof}
  1003 This property means we do not have to repeatedly
  1059 This property means we do not have to repeatedly
  1004 apply simplification in each step, which justifies
  1060 apply simplification in each step, which justifies
  1005 our definition of $\blexersimp$.
  1061 our definition of $\blexersimp$.
  1006 
  1062 
  1007 
  1063 
  1008 On the other hand, we could repeat the same $\rsimp{}$ applications
  1064 On the other hand, we can repeat the same $\rsimp{}$ applications
  1009 on regular expressions as many times as we want, if we have at least
  1065 on regular expressions as many times as we want, if we have at least
  1010 one simplification applied to it, and apply it wherever we would like to:
  1066 one simplification applied to it, and apply it wherever we would like to:
  1011 \begin{corollary}\label{headOneMoreSimp}
  1067 \begin{corollary}\label{headOneMoreSimp}
  1012 	The following properties hold, directly from \ref{rsimpIdem}:
  1068 	The following properties hold, directly from \ref{rsimpIdem}:
  1013 
  1069 
  1017 		\item
  1073 		\item
  1018 			$\rsimp{(\RALTS{rs})} = \rsimp{(\RALTS{\map \; \rsimp{} \; rs})}$
  1074 			$\rsimp{(\RALTS{rs})} = \rsimp{(\RALTS{\map \; \rsimp{} \; rs})}$
  1019 	\end{itemize}
  1075 	\end{itemize}
  1020 \end{corollary}
  1076 \end{corollary}
  1021 \noindent
  1077 \noindent
  1022 This will be useful in later closed form proof's rewriting steps.
  1078 This will be useful in the later closed form proof's rewriting steps.
  1023 Similarly, we point out the following useful facts below:
  1079 Similarly, we state the following useful facts below:
  1024 \begin{lemma}
  1080 \begin{lemma}
  1025 	The following equalities hold if $r = \rsimp{r'}$ for some $r'$:
  1081 	The following equalities hold if $r = \rsimp{r'}$ for some $r'$:
  1026 	\begin{itemize}
  1082 	\begin{itemize}
  1027 		\item
  1083 		\item
  1028 			If $r  = \sum rs$ then $\rsimpalts \; rs = \sum rs$.
  1084 			If $r  = \sum rs$ then $\rsimpalts \; rs = \sum rs$.
  1039 \noindent
  1095 \noindent
  1040 With the idempotency of $\rsimp{}$ and its corollaries, 
  1096 With the idempotency of $\rsimp{}$ and its corollaries, 
  1041 we can start proving some key equalities leading to the 
  1097 we can start proving some key equalities leading to the 
  1042 closed forms.
  1098 closed forms.
  1043 Now presented are a few equivalent terms under $\rsimp{}$.
  1099 Now presented are a few equivalent terms under $\rsimp{}$.
  1044 We use $r_1 \sequal r_2 $ here to denote $\rsimp{r_1} = \rsimp{r_2}$.
  1100 To make the notation more concise
       
  1101 We use $r_1 \sequal r_2 $ to denote that $\rsimp{r_1} = \rsimp{r_2}$:
       
  1102 \begin{center}
       
  1103 \begin{tabular}{lcl}
       
  1104 	$a \sequal b$ & $ \dn$ & $ \textit{rsimp} \; a = \textit{rsimp} \; b$
       
  1105 \end{tabular}
       
  1106 \end{center}
       
  1107 \noindent
       
  1108 %\vspace{0em}
  1045 \begin{lemma}
  1109 \begin{lemma}
       
  1110 	The following equivalence hold:
  1046 	\begin{itemize}
  1111 	\begin{itemize}
  1047 		The following equivalence hold:
       
  1048 	\item
  1112 	\item
  1049 		$\rsimpalts \; (\RZERO :: rs) \sequal \rsimpalts\; rs$
  1113 		$\rsimpalts \; (\RZERO :: rs) \sequal \rsimpalts\; rs$
  1050 	\item
  1114 	\item
  1051 		$\rsimpalts \; rs \sequal \rsimpalts (\map \; \rsimp{} \; rs)$
  1115 		$\rsimpalts \; rs \sequal \rsimpalts (\map \; \rsimp{} \; rs)$
  1052 	\item
  1116 	\item
  1061 	By induction on the lists involved.
  1125 	By induction on the lists involved.
  1062 \end{proof}
  1126 \end{proof}
  1063 \noindent
  1127 \noindent
  1064 The above allows us to prove
  1128 The above allows us to prove
  1065 two similar equalities (which are a bit more involved).
  1129 two similar equalities (which are a bit more involved).
  1066 It says that we could flatten out the elements
  1130 It says that we could flatten the elements
  1067 before simplification and still get the same result.
  1131 before simplification and still get the same result.
  1068 \begin{lemma}\label{simpFlatten3}
  1132 \begin{lemma}\label{simpFlatten3}
  1069 	One can flatten the inside $\sum$ of a $\sum$ if it is being 
  1133 	One can flatten the inside $\sum$ of a $\sum$ if it is being 
  1070 	simplified. Concretely,
  1134 	simplified. Concretely,
  1071 	\begin{itemize}
  1135 	\begin{itemize}
  1097 \end{comment}
  1161 \end{comment}
  1098 %Rewriting steps not put in--too long and complicated-------------------------------
  1162 %Rewriting steps not put in--too long and complicated-------------------------------
  1099 \noindent
  1163 \noindent
  1100 
  1164 
  1101 
  1165 
  1102 We need more equalities like the above to enable a closed form,
  1166 We need more equalities like the above to enable a closed form lemma,
  1103 for which we need to introduce a few rewrite relations
  1167 for which we need to introduce a few rewrite relations
  1104 to help
  1168 to help
  1105 us obtain them.
  1169 us obtain them.
  1106 
  1170 
  1107 \subsection{The rewrite relation $\hrewrite$ , $\scfrewrites$ , $\frewrite$ and $\grewrite$}
  1171 \subsection{The rewrite relation $\hrewrite$ , $\scfrewrites$ , $\frewrite$ and $\grewrite$}
  1209 \end{center}
  1273 \end{center}
  1210 \caption{List of one-step rewrite rules characterising the $\rflts$ and $\textit{rdistinct}$
  1274 \caption{List of one-step rewrite rules characterising the $\rflts$ and $\textit{rdistinct}$
  1211 operations}\label{gRewrite}
  1275 operations}\label{gRewrite}
  1212 \end{figure}
  1276 \end{figure}
  1213 \noindent
  1277 \noindent
  1214 We defined
  1278 We define
  1215 two separate list rewriting relations $\frewrite$ and $\grewrite$.
  1279 two separate list rewriting relations $\frewrite$ and $\grewrite$.
  1216 The rewriting steps that take place during
  1280 The rewriting steps that take place during
  1217 flattening is characterised by $\frewrite$.
  1281 flattening is characterised by $\frewrite$.
  1218 $\grewrite$ characterises both flattening and de-duplicating.
  1282 The rewrite relation $\grewrite$ characterises both flattening and de-duplicating.
  1219 Sometimes $\grewrites$ is slightly too powerful
  1283 Sometimes $\grewrites$ is slightly too powerful
  1220 so we would rather use $\frewrites$ to prove
  1284 so we would rather use $\frewrites$ to prove
  1221 %because we only
  1285 %because we only
  1222 equalities related to $\rflts$.
  1286 equalities related to $\rflts$.
  1223 %certain equivalence under the rewriting steps of $\frewrites$.
  1287 %certain equivalence under the rewriting steps of $\frewrites$.
  1234 	$\map \; (\_ \backslash x) \;  (\rflts \; rs) \frewrites
  1298 	$\map \; (\_ \backslash x) \;  (\rflts \; rs) \frewrites
  1235 	\rflts \; (\map \; (\_ \backslash x) \; rs)$
  1299 	\rflts \; (\map \; (\_ \backslash x) \; rs)$
  1236 \end{lemma}
  1300 \end{lemma}
  1237 \noindent
  1301 \noindent
  1238 and then the equivalence between two terms
  1302 and then the equivalence between two terms
  1239 that can reduce in many steps to each other.
  1303 that can reduce in many steps to each other:
  1240 \begin{lemma}\label{frewritesSimpeq}
  1304 \begin{lemma}\label{frewritesSimpeq}
  1241 	If $rs_1 \frewrites rs_2 $, then $\sum (\rDistinct \; rs_1 \; \varnothing) \sequal 
  1305 	If $rs_1 \frewrites rs_2 $, then $\sum (\rDistinct \; rs_1 \; \varnothing) \sequal 
  1242 	\sum (\rDistinct \;  rs_2 \;  \varnothing)$.
  1306 	\sum (\rDistinct \;  rs_2 \;  \varnothing)$.
  1243 \end{lemma}
  1307 \end{lemma}
  1244 \noindent
  1308 \noindent
  1245 
  1309 These two lemmas can both be proven using a straightforward induction (and
       
  1310 the proofs for them are therefore omitted).
       
  1311 
       
  1312 Now the above equalities can be derived like a breeze:
  1246 \begin{corollary}
  1313 \begin{corollary}
  1247 	$\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal
  1314 	$\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal
  1248 	\sum (\rDistinct \;\;  (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing)
  1315 	\sum (\rDistinct \;\;  (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing)
  1249 	$
  1316 	$
  1250 \end{corollary}
  1317 \end{corollary}
  1251 
  1318 \begin{proof}
       
  1319 	By lemmas \ref{earlyLaterDerFrewrites} and \ref{frewritesSimpeq}.
       
  1320 \end{proof}
  1252 But this trick will not work for $\grewrites$.
  1321 But this trick will not work for $\grewrites$.
  1253 For example, a rewriting step in proving
  1322 For example, a rewriting step in proving
  1254 closed forms is:
  1323 closed forms is:
  1255 \begin{center}
  1324 \begin{center}
  1256 	$\rsimp{(\rsimpalts \; (\map \; (\_ \backslash x) \; (\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs}))))}{\varnothing})))}$\\
  1325 	$\rsimp{(\rsimpalts \; (\map \; (\_ \backslash x) \; (\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs}))))}{\varnothing})))}$\\
  1257 	$=$ \\
  1326 	$=$ \\
  1258 	$\rsimp{(\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; (\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}))} $
  1327 	$\rsimp{(\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; (\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}))} $
  1259 	\noindent
  1328 	\noindent
  1260 \end{center}
  1329 \end{center}
  1261 For this one would hope to have a rewriting relation between the two lists involved,
  1330 For this, one would hope to have a rewriting relation between the two lists involved,
  1262 similar to \ref{earlyLaterDerFrewrites}. However, it turns out that 
  1331 similar to \ref{earlyLaterDerFrewrites}. However, it turns out that 
  1263 \begin{center}
  1332 \begin{center}
  1264 	$\map \; (\_ \backslash x) \; (\rDistinct \; rs \; rset) \grewrites \rDistinct \; (\map \;
  1333 	$\map \; (\_ \backslash x) \; (\rDistinct \; rs \; rset) \grewrites \rDistinct \; (\map \;
  1265 	(\_ \backslash x) \; rs) \; ( rset \backslash x)$
  1334 	(\_ \backslash x) \; rs) \; ( rset \backslash x)$
  1266 \end{center}
  1335 \end{center}
  1267 \noindent
  1336 \noindent
  1268 does $\mathbf{not}$ hold in general.
  1337 does $\mathbf{not}$ hold in general.
  1269 For this rewriting step we will introduce some slightly more cumbersome
  1338 For this rewriting step we will introduce some slightly more cumbersome
  1270 proof technique in later sections.
  1339 proof technique later.
  1271 The point is that $\frewrite$
  1340 The point is that $\frewrite$
  1272 allows us to prove equivalence in a straightforward way that is 
  1341 allows us to prove equivalence in a straightforward way that is 
  1273 not possible for $\grewrite$. 
  1342 not possible for $\grewrite$. 
  1274 
  1343 
  1275 
  1344 
  1276 \subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$}
  1345 \subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$}
  1277 In this part, we present lemmas stating
  1346 In this part, we present lemmas stating
  1278 pairs of r-regular expressions or r-regular expression lists
  1347 pairs of r-regular expressions and r-regular expression lists
  1279 where one could rewrite from one in many steps to the other.
  1348 where one could rewrite from one in many steps to the other.
  1280 Most of the proofs to these lemmas are straightforward, using
  1349 Most of the proofs to these lemmas are straightforward, using
  1281 an induction on the inductive cases of the corresponding rewriting relations.
  1350 an induction on the corresponding rewriting relations.
  1282 These proofs will therefore be omitted when this is the case.
  1351 These proofs will therefore be omitted when this is the case.
  1283 We present in the below lemma a few pairs of terms that are rewritable via 
  1352 We present in the following lemma a few pairs of terms that are rewritable via 
  1284 $\grewrites$:
  1353 $\grewrites$:
  1285 \begin{lemma}\label{gstarRdistinctGeneral}
  1354 \begin{lemma}\label{gstarRdistinctGeneral}
  1286 	\mbox{}
  1355 	\mbox{}
  1287 	\begin{itemize}
  1356 	\begin{itemize}
  1288 		\item
  1357 		\item
  1300 \end{lemma}
  1369 \end{lemma}
  1301 \noindent
  1370 \noindent
  1302 If a pair of terms $rs_1, rs_2$ are rewritable via $\grewrites$ to each other,
  1371 If a pair of terms $rs_1, rs_2$ are rewritable via $\grewrites$ to each other,
  1303 then they are equivalent under $\rsimp{}$:
  1372 then they are equivalent under $\rsimp{}$:
  1304 \begin{lemma}\label{grewritesSimpalts}
  1373 \begin{lemma}\label{grewritesSimpalts}
       
  1374 	\mbox{}
  1305 	If $rs_1 \grewrites rs_2$, then
  1375 	If $rs_1 \grewrites rs_2$, then
  1306 	we have the following equivalence:
  1376 	we have the following equivalence:
  1307 	\begin{itemize}
  1377 	\begin{itemize}
  1308 		\item
  1378 		\item
  1309 			$\sum rs_1 \sequal \sum rs_2$
  1379 			$\sum rs_1 \sequal \sum rs_2$
  1328 			If $rs_1 \scfrewrites rs_2$ then $\sum rs_1 \hrewrites \sum rs_2$
  1398 			If $rs_1 \scfrewrites rs_2$ then $\sum rs_1 \hrewrites \sum rs_2$
  1329 
  1399 
  1330 	\end{itemize}
  1400 	\end{itemize}
  1331 \end{lemma}
  1401 \end{lemma}
  1332 \noindent
  1402 \noindent
  1333 Here comes the meat of the proof, 
  1403 Now comes the core of the proof, 
  1334 which says that once two lists are rewritable to each other,
  1404 which says that once two lists are rewritable to each other,
  1335 then they are equivalent under $\rsimp{}$:
  1405 then they are equivalent under $\rsimp{}$:
  1336 \begin{lemma}
  1406 \begin{lemma}
  1337 	If $r_1 \hrewrites r_2$ then $r_1 \sequal r_2$.
  1407 	If $r_1 \hrewrites r_2$ then $r_1 \sequal r_2$.
  1338 \end{lemma}
  1408 \end{lemma}
  1339 
  1409 
  1340 \noindent
  1410 \noindent
  1341 And similar to \ref{Bitcoded2} one can preserve rewritability after taking derivative
  1411 Similar to what we did in \ref{Bitcoded2}, 
  1342 of two regular expressions on both sides:
  1412 we prove that if one can rewrite from one r-regular expression ($r$)
       
  1413 to the other ($r'$), after taking derivatives one could still rewrite
       
  1414 the first ($r\backslash c$) to the other ($r'\backslash c$).
  1343 \begin{lemma}\label{interleave}
  1415 \begin{lemma}\label{interleave}
  1344 	If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$
  1416 	If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$
  1345 \end{lemma}
  1417 \end{lemma}
  1346 \noindent
  1418 \noindent
  1347 This allows proving more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$ now.
  1419 This allows us proving more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$.
  1348 \begin{lemma}\label{insideSimpRemoval}
  1420 \begin{lemma}\label{insideSimpRemoval}
  1349 	$\rsimp{\rder{c}{\rsimp{r}}} = \rsimp{\rder{c}{r}}  $
  1421 	$\rsimp{(\rder{c}{(\rsimp{r})})} = \rsimp{(\rder{c}{r})}  $
  1350 \end{lemma}
  1422 \end{lemma}
  1351 \noindent
  1423 \noindent
  1352 \begin{proof}
  1424 \begin{proof}
  1353 	By \ref{interleave} and \ref{rsimpIdem}.
  1425 	By \ref{interleave} and \ref{rsimpIdem}.
  1354 \end{proof}
  1426 \end{proof}
  1371 	part 2 is by lemma \ref{insideSimpRemoval} .%and \ref{distinctDer}.
  1443 	part 2 is by lemma \ref{insideSimpRemoval} .%and \ref{distinctDer}.
  1372 \end{proof}
  1444 \end{proof}
  1373 \noindent
  1445 \noindent
  1374 
  1446 
  1375 \subsection{Closed Forms for $\sum rs$, $r_1\cdot r_2$ and $r^*$}
  1447 \subsection{Closed Forms for $\sum rs$, $r_1\cdot r_2$ and $r^*$}
  1376 \subsubsection{Closed Form for Alternative Regular Expression}
  1448 Lemma \ref{Simpders} leads to our first closed form,
  1377 Lemma \ref{Simpders} leads to the first closed form--
  1449 which is for the alternative regular expression:
  1378 \begin{lemma}\label{altsClosedForm}
  1450 \begin{theorem}\label{altsClosedForm}
       
  1451 	\mbox{}
  1379 	\begin{center}
  1452 	\begin{center}
  1380 		$\rderssimp{(\sum rs)}{s} \sequal
  1453 		$\rderssimp{(\sum rs)}{s} \sequal
  1381 		\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
  1454 		\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
  1382 	\end{center}
  1455 	\end{center}
  1383 \end{lemma}
  1456 \end{theorem}
  1384 
       
  1385 \noindent
  1457 \noindent
  1386 \begin{proof}
  1458 \begin{proof}
  1387 	By a reverse induction on the string $s$.
  1459 	By a reverse induction on the string $s$.
  1388 	One rewriting step, as we mentioned earlier,
  1460 	One rewriting step, as we mentioned earlier,
  1389 	involves
  1461 	involves
  1400 	\ref{grewritesSimpalts}, \ref{gstarRdistinctGeneral}, \ref{rderRsimpAltsCommute}, and
  1472 	\ref{grewritesSimpalts}, \ref{gstarRdistinctGeneral}, \ref{rderRsimpAltsCommute}, and
  1401 	\ref{insideSimpRemoval}.
  1473 	\ref{insideSimpRemoval}.
  1402 \end{proof}
  1474 \end{proof}
  1403 \noindent
  1475 \noindent
  1404 This closed form has a variant which can be more convenient in later proofs:
  1476 This closed form has a variant which can be more convenient in later proofs:
  1405 \begin{corollary}{altsClosedForm1}
  1477 \begin{corollary}\label{altsClosedForm1}
  1406 	If $s \neq []$ then 
  1478 	If $s \neq []$ then 
  1407 	$\rderssimp \; (\sum \; rs) \; s = 
  1479 	$\rderssimp \; (\sum \; rs) \; s = 
  1408 	\rsimp{(\sum \; (\map \; \rderssimp{\_}{s} \; rs))}$.
  1480 	\rsimp{(\sum \; (\map \; \rderssimp{\_}{s} \; rs))}$.
  1409 \end{corollary}
  1481 \end{corollary}
  1410 \noindent
  1482 \noindent
  1412 Before we obtain them, some preliminary definitions
  1484 Before we obtain them, some preliminary definitions
  1413 are needed to make proof statements concise.
  1485 are needed to make proof statements concise.
  1414 
  1486 
  1415 
  1487 
  1416 \subsubsection{Closed Form for Sequence Regular Expressions}
  1488 \subsubsection{Closed Form for Sequence Regular Expressions}
  1417 First let's look at a series of derivatives steps on a sequence 
  1489 For the sequence regular expression,
  1418 regular expression, assuming that each time the first
  1490 let's first look at a series of derivative steps on it 
  1419 component of the sequence is always nullable):
  1491 (assuming that each time when a derivative is taken,
  1420 \begin{center}
  1492 the head of the sequence is always nullable):
  1421 
  1493 \begin{center}
  1422 	$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c}  \quad   r_1  \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
  1494 	\begin{tabular}{llll}
  1423 	$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c''   \longrightarrow_{\backslash c''} \quad
  1495 		$r_1 \cdot r_2$ &  
  1424 	\ldots$
  1496 		$\longrightarrow_{\backslash c}$ &  
  1425 
  1497 		$r_1\backslash c \cdot r_2 + r_2 \backslash c$ &
       
  1498 		$ \longrightarrow_{\backslash c'} $ \\ 
       
  1499 		\\
       
  1500 		$(r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc'$ & 
       
  1501 		$\longrightarrow_{\backslash c''} $ &
       
  1502 		$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') 
       
  1503 		+ r_2 \backslash cc'c''$ & 
       
  1504 		$   \longrightarrow_{\backslash c''} \quad \ldots$\\
       
  1505 	\end{tabular}
  1426 \end{center}
  1506 \end{center}
  1427 Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as 
  1507 Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as 
  1428 a giant alternative taking a list of terms 
  1508 a giant alternative taking a list of terms 
  1429 $[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$,
  1509 $[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$,
  1430 where the head of the list is always the term
  1510 where the head of the list is always the term
  1431 representing a match involving only $r_1$, and the tail of the list consisting of
  1511 representing a match involving only $r_1$, and the tail of the list consisting of
  1432 terms of the shape $r_2 \backslash_r s''$, $s''$ being a suffix of $s$.
  1512 terms of the shape $r_2 \backslash_r s''$, $s''$ being a suffix of $s$.
  1433 This intuition is also echoed by IndianPaper, where they gave
  1513 This intuition is also echoed by Murugesan and Sundaram \cite{Murugesan2014}, 
       
  1514 where they gave
  1434 a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$:
  1515 a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$:
  1435 \begin{center}
  1516 \begin{center}
  1436 	\begin{tabular}{c}
  1517 	\begin{tabular}{lc}
  1437 		$(r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) \myequiv$\\ 
  1518 		$L \; [ (r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) ]$ & $ =$\\
  1438 		\rule{0pt}{3ex} $((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\rnullable \; r_1) \; r_2 \backslash_r c_1)) \backslash_r (c_2 :: \ldots c_n) 
  1519 		\\
  1439 		\myequiv$\\
  1520 		\rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1) \cdot r_2 + 
  1440 		\rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable \; r_1) \; r_2 \backslash_r c_1c_2))
  1521 		(\delta\; (\nullable \; r_1) \; (r_2 \backslash_r c_1) )) \backslash_r 
  1441 		+ (\delta \ (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)) \backslash_r (c_3 \ldots c_n)
  1522 		(c_2 :: \ldots c_n) ]$ &
  1442 		$
  1523 		$=$\\
  1443 	\end{tabular}
  1524 		\\
  1444 \end{center}
  1525 		\rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1c_2 \cdot r_2 + 
  1445 \noindent
  1526 		(\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2)))
  1446 The equality in above should be interpretated
  1527 		$ & \\
  1447 as language equivalence. 
  1528 		\\
  1448 The $\delta$ function works similarly to that of
  1529 		$\quad + (\delta \ (\nullable \; r_1 \backslash_r c)\; (r_2 \backslash_r c_2) )) 
  1449 a Kronecker delta function:
  1530 		\backslash_r (c_3 \ldots c_n) ]$ & $\ldots$ \\
  1450 \[ \delta \; b\; r\]
  1531 	\end{tabular}
  1451 will produce $r$
  1532 \end{center}
  1452 if $b$ evaluates to true, 
  1533 \noindent
  1453 and $\RZERO$ otherwise.
  1534 The $\delta$ function 
  1454 Note that their formulation  
  1535 returns $r$ when the boolean condition
  1455 \[
  1536 $b$ evaluates to true and
  1456 	((r_1 \backslash_r \, c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2)
  1537 $\ZERO$ otherwise:
  1457 	+ (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)
  1538 \begin{center}
  1458 \]
  1539 	\begin{tabular}{lcl}
       
  1540 		$\delta \; b\; r$ & $\dn$ & $r \quad \textit{if} \; b \; is \;\textit{true}$\\
       
  1541 				  & $\dn$ & $\ZERO \quad otherwise$
       
  1542 	\end{tabular}
       
  1543 \end{center}
       
  1544 \noindent
       
  1545 Note that the term
       
  1546 \begin{center}
       
  1547 	\begin{tabular}{lc}
       
  1548 		\rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + 
       
  1549 		(\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2)))
       
  1550 		$ & \\
       
  1551 		\\
       
  1552 		$\quad + (\delta \ (\nullable \; r_1 \backslash_r c)\; (r_2 \backslash_r c_2) )) 
       
  1553 		\backslash_r (c_3 \ldots c_n)$ &\\
       
  1554 	\end{tabular}
       
  1555 \end{center}
       
  1556 \noindent
  1459 does not faithfully
  1557 does not faithfully
  1460 represent what the intermediate derivatives would actually look like
  1558 represent what the intermediate derivatives would actually look like
  1461 when one or more intermediate results $r_1 \backslash s' \cdot r_2$ are not 
  1559 when one or more intermediate results $r_1 \backslash s' \cdot r_2$ are not 
  1462 nullable in the head of the sequence.
  1560 nullable in the head of the sequence.
  1463 For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable,
  1561 For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable,
  1464 the regular expression would not look like 
  1562 the regular expression would not look like 
  1465 \[
  1563 \[
  1466 	(r_1 \backslash_r c_1c_2 + \RZERO ) + \RZERO,
  1564 	r_1 \backslash_r c_1c_2
  1467 \]
  1565 \]
  1468 but actually $r_1 \backslash_r c_1c_2$, the redundant $\RZERO$s will not be created in the
  1566 instead of
       
  1567 \[
       
  1568 	(r_1 \backslash_r c_1c_2 + \ZERO ) + \ZERO.
       
  1569 \]
       
  1570 The redundant $\ZERO$s will not be created in the
  1469 first place.
  1571 first place.
  1470 In a closed-form one would want to take into account this 
  1572 In a closed-form one needs to take into account this (because
  1471 and generate the list of
  1573 closed forms require exact equality rather than language equivalence)
  1472 regular expressions $r_2 \backslash_r s''$ with
  1574 and only generate the 
  1473 string pairs $(s', s'')$ where $s'@s'' = s$ and
  1575 $r_2 \backslash_r s''$ terms satisfying the property
  1474 $r_1 \backslash s'$ nullable.
  1576 \begin{center}
  1475 We denote the list consisting of such 
  1577 $\exists  s'.  such \; that \; s'@s'' = s \;\; \land \;\;
  1476 strings $s''$ as $\vsuf{s}{r_1}$.
  1578 r_1 \backslash s' \; is \; nullable$.
  1477 
  1579 \end{center}
  1478 The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
  1580 Given the arguments $s$ and $r_1$, we denote the list of strings
       
  1581 $s''$ satisfying the above property as $\vsuf{s}{r_1}$.
       
  1582 The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string\footnote{
       
  1583 	Perhaps a better name of it would be ``NullablePrefixSuffix''
       
  1584 	to differentiate with the list of \emph{all} prefixes of $s$, but
       
  1585 	that is a bit too long for a function name and we are yet to find
       
  1586 a more concise and easy-to-understand name.}
  1479 \begin{center}
  1587 \begin{center}
  1480 	\begin{tabular}{lcl}
  1588 	\begin{tabular}{lcl}
  1481 		$\vsuf{[]}{\_} $ & $=$ &  $[]$\\
  1589 		$\vsuf{[]}{\_} $ & $=$ &  $[]$\\
  1482 		$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
  1590 		$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} \; (\rnullable{r_1}) \; \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
  1483 				    && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) })  $
  1591 				    && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) })  $
  1484 	\end{tabular}
  1592 	\end{tabular}
  1485 \end{center}
  1593 \end{center}
  1486 \noindent
  1594 \noindent
  1487 The list is sorted in the order $r_2\backslash s''$ 
  1595 The list starts with shorter suffixes
  1488 appears in $(r_1\cdot r_2)\backslash s$.
  1596 and ends with longer ones (in other words, the string elements $s''$
       
  1597 in the list $\vsuf{s}{r_1}$ are sorted
       
  1598 in the same order as that of the terms $r_2\backslash s''$ 
       
  1599 appearing in $(r_1\cdot r_2)\backslash s$).
  1489 In essence, $\vsuf{\_}{\_}$ is doing a 
  1600 In essence, $\vsuf{\_}{\_}$ is doing a 
  1490 "virtual derivative" of $r_1 \cdot r_2$, but instead of producing 
  1601 "virtual derivative" of $r_1 \cdot r_2$, but instead of producing 
  1491 the entire result $(r_1 \cdot r_2) \backslash s$, 
  1602 the entire result $(r_1 \cdot r_2) \backslash s$, 
  1492 it only stores all the strings $s''$ such that $r_2 \backslash s''$
  1603 it only stores strings,
  1493 are occurring terms in $(r_1\cdot r_2)\backslash s$.
  1604 with each string $s''$ representing a term such that $r_2 \backslash s''$
  1494 
  1605 is occurring in $(r_1\cdot r_2)\backslash s$.
  1495 To make the closed form representation 
  1606 
  1496 more straightforward,
  1607 With $\textit{Suffix}$ we are ready to express the
  1497 the flattetning function $\sflat{\_}$ is used to enable the transformation from 
  1608 sequence regular expression's closed form,
       
  1609 but before doing so 
       
  1610 more devices are needed.
       
  1611 The first thing is the flattening function $\sflat{\_}$,
       
  1612 which takes an alternative regular expression and produces a flattened version
       
  1613 of that alternative regular expression.
       
  1614 It is needed to convert
  1498 a left-associative nested sequence of alternatives into 
  1615 a left-associative nested sequence of alternatives into 
  1499 a flattened list:
  1616 a flattened list:
  1500 \[
  1617 \[
  1501 	\sum [r_1, r_2, r_3, \ldots] \stackrel{\sflat{\_}}{\rightarrow} 
  1618 	\sum(\ldots ((r_1 + r_2) + r_3) + \ldots)
  1502 	(\ldots ((r_1 + r_2) + r_3) + \ldots)
  1619 	\stackrel{\sflat{\_}}{\rightarrow} 
       
  1620 	\sum[r_1, r_2, r_3, \ldots]
  1503 \]
  1621 \]
  1504 \noindent
  1622 \noindent
  1505 The definitions $\sflat{\_}$, $\sflataux{\_}$ are given below.
  1623 The definitions of $\sflat{\_}$ and helper functions
       
  1624 $\sflataux{\_}$ and $\llparenthesis \_ \rrparenthesis''$ are given below.
  1506 \begin{center}  
  1625 \begin{center}  
  1507 	\begin{tabular}{ccc}
  1626 	\begin{tabular}{lcl}
  1508 		$\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
  1627 		$\sflataux{\sum r :: rs}$ & $\dn$ & $\sflataux{r} @ rs$\\
  1509 		$\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
  1628 		$\sflataux{\sum []}$ & $ \dn $ & $ []$\\
  1510 		$\sflataux r$ & $=$ & $ [r]$
  1629 		$\sflataux r$ & $\dn$ & $ [r]$
  1511 	\end{tabular}
  1630 	\end{tabular}
  1512 \end{center}
  1631 \end{center}
  1513 
  1632 
  1514 \begin{center} 
  1633 \begin{center} 
  1515 	\begin{tabular}{ccc}
  1634 	\begin{tabular}{lcl}
  1516 		$\sflat{(\sum r :: rs)}$ & $=$ & $\sum (\sflataux{r} @ rs)$\\
  1635 		$\sflat{(\sum r :: rs)}$ & $\dn$ & $\sum (\sflataux{r} @ rs)$\\
  1517 		$\sflat{\sum []}$ & $ = $ & $ \sum []$\\
  1636 		$\sflat{\sum []}$ & $ \dn $ & $ \sum []$\\
  1518 		$\sflat r$ & $=$ & $ r$
  1637 		$\sflat r$ & $\dn$ & $ r$
       
  1638 	\end{tabular}
       
  1639 \end{center}
       
  1640 
       
  1641 \begin{center}  
       
  1642 	\begin{tabular}{lcl}
       
  1643 		$\sflataux{[]}'$ & $ \dn $ & $ []$\\
       
  1644 		$\sflataux{ (r_1 + r_2) :: rs }'$ & $\dn$ & $r_1 :: r_2 :: rs$\\
       
  1645 		$\sflataux{r :: rs}$ & $\dn$ & $ r::rs$
  1519 	\end{tabular}
  1646 	\end{tabular}
  1520 \end{center}
  1647 \end{center}
  1521 \noindent
  1648 \noindent
  1522 $\sflataux{\_}$ breaks up nested alternative regular expressions 
  1649 $\sflataux{\_}$ breaks up nested alternative regular expressions 
  1523 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
  1650 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
  1525 It will return the singleton list $[r]$ otherwise.
  1652 It will return the singleton list $[r]$ otherwise.
  1526 $\sflat{\_}$ works the same  as $\sflataux{\_}$, except that it keeps
  1653 $\sflat{\_}$ works the same  as $\sflataux{\_}$, except that it keeps
  1527 the output type a regular expression, not a list.
  1654 the output type a regular expression, not a list.
  1528 $\sflataux{\_}$  and $\sflat{\_}$ are only recursive on the  
  1655 $\sflataux{\_}$  and $\sflat{\_}$ are only recursive on the  
  1529 first element of the list.
  1656 first element of the list.
  1530 
  1657 $\sflataux{\_}'$ takes a list of regular expressions as input, and outputs
  1531 With $\sflataux{}$ a preliminary to the closed form can be stated,
  1658 a list of regular expressions.
  1532 where the derivative of $r_1 \cdot r_2 \backslash s$ can be
  1659 The use of $\sflataux{\_}$ and $\sflataux{\_}'$ is clear once we have
  1533 flattened into a list whose head and tail meet the description
  1660 $\textit{createdBySequence}$ defined:
  1534 we gave earlier.
  1661 \begin{center}
       
  1662 	\begin{mathpar}
       
  1663 		\inferrule{\mbox{}}{\textit{createdBySequence}\; (r_1 \cdot r_2)}
       
  1664 
       
  1665 		\inferrule{\textit{createdBySequence} \; r_1}{\textit{createdBySequence} \;
       
  1666 		(r_1 + r_2)}
       
  1667 	\end{mathpar}
       
  1668 \end{center}
       
  1669 \noindent
       
  1670 The predicate $\textit{createdBySequence}$ is used to describe the shape of
       
  1671 the derivative regular expressions $(r_1\cdot r_2) \backslash s$:
       
  1672 \begin{lemma}\label{recursivelyDerseq}
       
  1673 	It is always the case that
       
  1674 	\begin{center}
       
  1675 		$\textit{createdBySequence} \; ( (r_1\cdot r_2) \backslash_r s) $
       
  1676 	\end{center}
       
  1677 	holds.
       
  1678 \end{lemma}
       
  1679 \begin{proof}
       
  1680 	By a reverse induction on the string $s$, where the inductive cases are $[]$
       
  1681 	and $xs  @ [x]$.
       
  1682 \end{proof}
       
  1683 \noindent
       
  1684 If we have a regular expression $r$ whose shpae 
       
  1685 fits into those described by $\textit{createdBySequence}$,
       
  1686 then we can convert between 
       
  1687 $r \backslash_r c$ and $(\sflataux{r}) \backslash_r c$ with
       
  1688 $\sflataux{\_}'$:
       
  1689 \begin{lemma}\label{sfauIdemDer}
       
  1690 	If $\textit{createdBySequence} \; r$, then 
       
  1691 	\begin{center}
       
  1692 		$\sflataux{ r \backslash_r c} = 
       
  1693 		\llparenthesis (\map \; (\_ \backslash_r c) \; (\sflataux{r}) ) \rrparenthesis''$
       
  1694 	\end{center}
       
  1695 	holds.
       
  1696 \end{lemma}
       
  1697 \begin{proof}
       
  1698 	By a simple induction on the inductive cases of $\textit{createdBySequence}.
       
  1699 	$
       
  1700 \end{proof}
       
  1701 
       
  1702 Now we are ready to express
       
  1703 the shape of $r_1 \cdot r_2 \backslash s$
  1535 \begin{lemma}\label{seqSfau0}
  1704 \begin{lemma}\label{seqSfau0}
  1536 	$\sflataux{\rders{(r_1 \cdot r_2) \backslash s }} = (r_1 \backslash_r s) \cdot r_2 
  1705 	$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 
  1537 	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$ 
  1706 	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$ 
  1538 \end{lemma}
  1707 \end{lemma}
  1539 \begin{proof}
  1708 \begin{proof}
  1540 	By an induction on the string $s$, where the inductive cases 
  1709 	By a reverse induction on the string $s$, where the inductive cases 
  1541 	are split as $[]$ and $xs @ [x]$.
  1710 	are $[]$ and $xs @ [x]$.
  1542 	Note the key identify holds:
  1711 	For the inductive case, we know that $\textit{createdBySequence} \; ((r_1 \cdot r_2)
       
  1712 	\backslash_r xs)$ holds from lemma \ref{recursivelyDerseq},
       
  1713 	which can be used to prove
  1543 	\[
  1714 	\[
  1544 		\map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\;
  1715 		\map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\;
  1545 		\map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash \_) \; (\vsuf{xs}{r_1}))
  1716 		\map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash \_) \; (\vsuf{xs}{r_1}))
  1546 	\]
  1717 	\]
  1547 	=
  1718 	=
  1548 	\[
  1719 	\[
  1549 		\map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1})
  1720 		\map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1})
  1550 	\]
  1721 	\]
  1551 	This enables the inductive case to go through.
  1722 	using lemma \ref{sfauIdemDer}.
       
  1723 	This equality enables the inductive case to go through.
  1552 \end{proof}
  1724 \end{proof}
  1553 \noindent 
  1725 \noindent 
  1554 Note that this lemma does $\mathbf{not}$ depend on any
  1726 This lemma says that $(r_1\cdot r_2)\backslash s$
  1555 specific definitions we used,
  1727 can be flattened into a list whose head and tail meet the description
  1556 allowing people investigating derivatives to get an alternative
  1728 we gave earlier.
  1557 view of what $r_1 \cdot r_2$ is.
  1729 %Note that this lemma does $\mathbf{not}$ depend on any
  1558 
  1730 %specific definitions we used,
  1559 Now we are able to use this for the intuition that 
  1731 %allowing people investigating derivatives to get an alternative
  1560 the different ways in which regular expressions are 
  1732 %view of what $r_1 \cdot r_2$ is.
  1561 nested do not matter under $\rsimp{}$:
  1733 
  1562 \begin{center}
  1734 We now use $\textit{createdBySequence}$ and
  1563 	$\rsimp{r} \stackrel{?}{\sequal} \rsimp{r'}$ if $r = \sum [r_1, r_2, r_3, \ldots]$ 
  1735 $\sflataux{\_}$ to describe an intuition
  1564 	and $r' =(\ldots ((r_1 + r_2) + r_3) + \ldots)$
  1736 behind the closed form of the sequence closed form.
  1565 \end{center}
  1737 If two regular expressions only differ in the way their
  1566 Simply wrap with $\sum$ constructor and add 
  1738 alternatives are nested, then we should be able to get the same result
  1567 simplifications to both sides of \ref{seqSfau0}
  1739 once we apply simplification to both of them:
  1568 and one gets
  1740 \begin{lemma}\label{sflatRsimpeq}
  1569 \begin{corollary}\label{seqClosedFormGeneral}
  1741 	If $r$ is created from a sequence through
       
  1742 	a series of derivatives 
       
  1743 	(i.e. if $\textit{createdBySequence} \; r$ holds), 
       
  1744 	and that $\sflataux{r} = rs$,
       
  1745 	then we have
       
  1746 	that 
       
  1747 	\begin{center}
       
  1748 		$\textit{rsimp} \; r = \textit{rsimp} \; (\sum \; rs)$
       
  1749 	\end{center}
       
  1750 	holds.
       
  1751 \end{lemma}
       
  1752 \begin{proof}
       
  1753 	By an induction on the inductive cases of $\textit{createdBySequence}$. 
       
  1754 \end{proof}
       
  1755 
       
  1756 Now we are ready for the closed form 
       
  1757 for the sequence regular expressions (without the inner applications
       
  1758 of simplifications):
       
  1759 \begin{lemma}\label{seqClosedFormGeneral}
  1570 	$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }
  1760 	$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }
  1571 	=\rsimp{(\sum (  (r_1 \backslash s) \cdot r_2 :: 
  1761 	=\rsimp{(\sum (  (r_1 \backslash s) \cdot r_2 :: 
  1572 	\map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$
  1762 	\map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$
  1573 \end{corollary}
  1763 \end{lemma}
       
  1764 \begin{proof}
       
  1765 	We know that 
       
  1766 	$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 
       
  1767 	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$
       
  1768 	holds
       
  1769 	by lemma \ref{seqSfau0}.
       
  1770 	This allows the theorem to go through because of lemma \ref{sflatRsimpeq}.
       
  1771 \end{proof}
  1574 Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}),
  1772 Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}),
  1575 it is possible to convert the above lemma to obtain a "closed form"
  1773 it is possible to convert the above lemma to obtain the
       
  1774 proper closed form for $\backslash_{rsimps}$ rather than $\backslash_r$:
  1576 for  derivatives nested with simplification:
  1775 for  derivatives nested with simplification:
  1577 \begin{lemma}\label{seqClosedForm}
  1776 \begin{theorem}\label{seqClosedForm}
  1578 	$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) 
  1777 	$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) 
  1579 	:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$
  1778 	:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$
  1580 \end{lemma}
  1779 \end{theorem}
  1581 \begin{proof}
  1780 \begin{proof}
  1582 	By a case analysis of string $s$.
  1781 	By a case analysis of the string $s$.
  1583 	When $s$ is empty list, the rewrite is straightforward.
  1782 	When $s$ is an empty list, the rewrite is straightforward.
  1584 	When $s$ is a list, one could use the corollary \ref{seqSfau0},
  1783 	When $s$ is a non-empty list, the
  1585 	and lemma \ref{Simpders} to rewrite the left-hand-side.
  1784 	lemmas \ref{seqClosedFormGeneral} and \ref{Simpders} apply,
  1586 \end{proof}
  1785 	making the proof go through.
  1587 As a corollary for this closed form, one can estimate the size 
  1786 \end{proof}
  1588 of the sequence derivative $r_1 \cdot r_2 \backslash_r s$ using 
       
  1589 an easier-to-handle expression:
       
  1590 \begin{corollary}\label{seqEstimate1}
       
  1591 	\begin{center}
       
  1592 
       
  1593 		$\llbracket \rderssimp{(r_1 \cdot r_2)}{s} \rrbracket_r = \llbracket \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) 
       
  1594 		:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))} \rrbracket_r$
       
  1595 
       
  1596 	\end{center}
       
  1597 \end{corollary}
       
  1598 \noindent
       
  1599 \subsubsection{Closed Forms for Star Regular Expressions}
  1787 \subsubsection{Closed Forms for Star Regular Expressions}
  1600 We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
  1788 The closed form for the star regular expression involves similar tricks
  1601 the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then 
  1789 for the sequence regular expression.
  1602 the property of the $\distinct$ function.
  1790 The $\textit{Suffix}$ function is now replaced by something
  1603 Now we try to get a bound on $r^* \backslash s$ as well.
  1791 slightly more complex, because the growth pattern of star
  1604 Again, we first look at how a star's derivatives evolve, if they grow maximally: 
  1792 regular expressions' derivatives is a bit different:
  1605 \begin{center}
  1793 \begin{center}
  1606 
  1794 	\begin{tabular}{lclc}
  1607 	$r^* \quad \longrightarrow_{\backslash c}  \quad   (r\backslash c)  \cdot  r^* \quad \longrightarrow_{\backslash c'}  \quad
  1795 		$r^* $ & $\longrightarrow_{\backslash c}$ & 
  1608 	r \backslash cc'  \cdot r^* + r \backslash c' \cdot r^*  \quad \longrightarrow_{\backslash c''} \quad 
  1796 		$(r\backslash c)  \cdot  r^*$ & $\longrightarrow_{\backslash c'}$\\
  1609 	(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)   \quad \longrightarrow_{\backslash c'''}
  1797 		\\
  1610 	\quad \ldots$
  1798 		$r \backslash cc'  \cdot r^* + r \backslash c' \cdot r^*$ &
  1611 
  1799 		$\longrightarrow_{\backslash c''}$ & 
       
  1800 		$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + 
       
  1801 		(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)$ & 
       
  1802 		$\longrightarrow_{\backslash c'''}$ \\
       
  1803 		\\
       
  1804 		$\ldots$\\
       
  1805 	\end{tabular}
  1612 \end{center}
  1806 \end{center}
  1613 When we have a string $s = c :: c' :: c'' \ldots$  such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, 
  1807 When we have a string $s = c :: c' :: c'' \ldots$  such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, 
  1614 $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
  1808 $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
  1615 the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
  1809 the number of terms in $r^* \backslash s$ will grow exponentially rather than linearly
  1616 of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not 
  1810 in the sequence case.
  1617 count the possible size explosions of $r \backslash c$ themselves.
  1811 The good news is that the function $\textit{rsimp}$ will again
  1618 
  1812 ignore the difference between differently nesting patterns of alternatives,
  1619 Thanks to $\rflts$ and $\rDistinct$, we are able to open up regular expressions like
  1813 and the exponentially growing star derivative like
  1620 $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + 
  1814 \begin{center}
  1621 (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
  1815 	$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + 
  1622 into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', 
  1816 	(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
  1623 r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
  1817 \end{center}
  1624 and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
  1818 can be treated as
       
  1819 \begin{center}
       
  1820 	$\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', 
       
  1821 	r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
       
  1822 \end{center}
       
  1823 which can be de-duplicated by $\rDistinct$
       
  1824 and therefore bounded finitely.
       
  1825 
       
  1826 and then de-duplicate terms of the form  ($s'$ being a substring of $s$).
  1625 This allows us to use a similar technique as $r_1 \cdot r_2$ case,
  1827 This allows us to use a similar technique as $r_1 \cdot r_2$ case,
  1626 where the crux is to get an equivalent form of 
  1828 
  1627 $\rderssimp{r^*}{s}$ with shape $\rsimp{\sum rs}$.
  1829 Now the crux of this section is finding a suitable description
  1628 This requires generating 
  1830 for $rs$ where
  1629 all possible sub-strings $s'$ of $s$
  1831 \begin{center}
  1630 such that $r\backslash s' \cdot r^*$ will appear 
  1832 	$\rderssimp{r^*}{s} = \rsimp{\sum rs}$.
  1631 as a term in $(r^*) \backslash s$.
  1833 \end{center}
  1632 The first function we define is a single-step
  1834 holds.
  1633 updating function $\starupdate$, which takes three arguments as input:
  1835 In addition, the list $rs$
  1634 the new character $c$ to take derivative with, 
  1836 shall be in the form of 
  1635 the regular expression
  1837 $\map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss$.
  1636 $r$ directly under the star $r^*$, and the
  1838 The $Ss$ is a list of strings, and for example in the sequence
  1637 list of strings $sSet$ for the derivative $r^* \backslash s$ 
  1839 closed form it is specified as $\textit{Suffix} \; s \; r_1$.
  1638 up til this point  
  1840 To get $Ss$ for the star regular expression,
  1639 such that $(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$ 
  1841 we need to introduce $\starupdate$ and $\starupdates$:
  1640 (the equality is not exact, more on this later).
       
  1641 \begin{center}
  1842 \begin{center}
  1642 	\begin{tabular}{lcl}
  1843 	\begin{tabular}{lcl}
  1643 		$\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
  1844 		$\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
  1644 		$\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\
  1845 		$\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\
  1645 						     & & $\textit{if} \; 
  1846 						     & & $\textit{if} \; 
  1648 						     \starupdate \; c \; r \; Ss)$ \\
  1849 						     \starupdate \; c \; r \; Ss)$ \\
  1649 						     & & $\textit{else} \;\; (s @ [c]) :: (
  1850 						     & & $\textit{else} \;\; (s @ [c]) :: (
  1650 						     \starupdate \; c \; r \; Ss)$
  1851 						     \starupdate \; c \; r \; Ss)$
  1651 	\end{tabular}
  1852 	\end{tabular}
  1652 \end{center}
  1853 \end{center}
  1653 \noindent
       
  1654 As a generalisation from characters to strings,
       
  1655 $\starupdates$ takes a string instead of a character
       
  1656 as the first input argument, and is otherwise the same
       
  1657 as $\starupdate$.
       
  1658 \begin{center}
  1854 \begin{center}
  1659 	\begin{tabular}{lcl}
  1855 	\begin{tabular}{lcl}
  1660 		$\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\
  1856 		$\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\
  1661 		$\starupdates \; (c :: cs) \; r \; Ss$ &  $=$ &  $\starupdates \; cs \; r \; (
  1857 		$\starupdates \; (c :: cs) \; r \; Ss$ &  $=$ &  $\starupdates \; cs \; r \; (
  1662 		\starupdate \; c \; r \; Ss)$
  1858 		\starupdate \; c \; r \; Ss)$
  1663 	\end{tabular}
  1859 	\end{tabular}
  1664 \end{center}
  1860 \end{center}
  1665 \noindent
  1861 \noindent
  1666 For the star regular expression,
  1862 Assuming we have that
  1667 its derivatives can be seen as  a nested gigantic
  1863 \begin{center}
  1668 alternative similar to that of sequence regular expression's derivatives, 
  1864 	$\rderssimp{r^*}{s} = \rsimp{(\sum \map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss)}$.
  1669 and therefore need
  1865 \end{center}
  1670 to be ``straightened out" as well.
  1866 holds.
  1671 The function for this would be $\hflat{}$ and $\hflataux{}$.
  1867 The idea of $\starupdate$ and $\starupdates$
       
  1868 is to update $Ss$ when another
       
  1869 derivative is taken on $\rderssimp{r^*}{s}$
       
  1870 w.r.t a character $c$ and a string $s'$
       
  1871 respectively.
       
  1872 Both $\starupdate$ and $\starupdates$ take three arguments as input:
       
  1873 the new character $c$ or string $s$ to take derivative with, 
       
  1874 the regular expression
       
  1875 $r$ under the star $r^*$, and the
       
  1876 list of strings $Ss$ for the derivative $r^* \backslash s$ 
       
  1877 up until this point  
       
  1878 such that 
       
  1879 \begin{center}
       
  1880 $(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$ 
       
  1881 \end{center}
       
  1882 is satisfied.
       
  1883 
       
  1884 Functions $\starupdate$ and $\starupdates$ characterise what the 
       
  1885 star derivatives will look like once ``straightened out'' into lists.
       
  1886 The helper functions for such operations will be similar to
       
  1887 $\sflat{\_}$, $\sflataux{\_}$ and $\sflataux{\_}$, which we defined for sequence.
       
  1888 We use similar symbols to denote them, with a $*$ subscript to mark the difference.
  1672 \begin{center}
  1889 \begin{center}
  1673 	\begin{tabular}{lcl}
  1890 	\begin{tabular}{lcl}
  1674 		$\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
  1891 		$\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
  1675 		$\hflataux{r}$ & $\dn$ & $[r]$
  1892 		$\hflataux{r}$ & $\dn$ & $[r]$
  1676 	\end{tabular}
  1893 	\end{tabular}
  1681 		$\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\
  1898 		$\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\
  1682 		$\hflat{r}$ & $\dn$ & $r$
  1899 		$\hflat{r}$ & $\dn$ & $r$
  1683 	\end{tabular}
  1900 	\end{tabular}
  1684 \end{center}
  1901 \end{center}
  1685 \noindent
  1902 \noindent
  1686 %MAYBE TODO: introduce createdByStar
  1903 These definitions are tailor-made for dealing with alternatives that have
  1687 Again these definitions are tailor-made for dealing with alternatives that have
  1904 originated from a star's derivatives.
  1688 originated from a star's derivatives, so we do not attempt to open up all possible 
  1905 A typical star derivative always has the structure of a balanced binary tree:
  1689 regular expressions of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
  1906 \begin{center}
  1690 elements.
  1907 	$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + 
  1691 We give a predicate for such "star-created" regular expressions:
  1908 	(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
  1692 \begin{center}
  1909 \end{center}
  1693 	\begin{tabular}{lcr}
  1910 All of the nested structures of alternatives
  1694 	 &    &       $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
  1911 generated from derivatives are binary, and therefore
  1695 		$\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$
  1912 $\hflat{\_}$ and $\hflataux{\_}$ only deal with binary alternatives.
  1696 	\end{tabular}
  1913 $\hflat{\_}$ ``untangles'' like the following:
  1697 \end{center}
  1914 \[
  1698 
  1915 	\sum ((r_1 + r_2) + (r_3 + r_4))  + \ldots \;
  1699 These definitions allows us the flexibility to talk about 
  1916 	\stackrel{\hflat{\_}}{\longrightarrow} \;
  1700 regular expressions in their most convenient format,
  1917 	\RALTS{[r_1, r_2, \ldots, r_n]} 
  1701 for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $
  1918 \]
  1702 instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.
  1919 Here is a lemma stating the recursive property of $\starupdate$ and $\starupdates$,
  1703 These definitions help express that certain classes of syntatically 
  1920 with the helpers $\hflat{\_}$ and $\hflataux{\_}$\footnote{The function $\textit{concat}$ takes a list of lists 
  1704 distinct regular expressions are actually the same under simplification.
  1921 			and merges each of the element lists to form a flattened list.}:
  1705 This is not entirely true for annotated regular expressions: 
  1922 \begin{lemma}\label{stupdateInduct1}
  1706 %TODO: bsimp bders \neq bderssimp
  1923 	\mbox
  1707 \begin{center}
  1924 	For a list of strings $Ss$, the following hold.
  1708 	$(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
  1925 	\begin{itemize}
  1709 \end{center}
  1926 		\item
  1710 For bit-codes, the order in which simplification is applied
  1927 			If we do a derivative on the terms 
  1711 might cause a difference in the location they are placed.
  1928 			$r\backslash_r s \cdot r^*$ (where $s$ is taken from the list $Ss$),
  1712 If we want something like
  1929 			the result will be the same as if we apply $\starupdate$ to $Ss$.
  1713 \begin{center}
  1930 			\begin{center}
  1714 	$\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
  1931 				\begin{tabular}{c}
  1715 \end{center}
  1932 			$\textit{concat} \; (\map \; (\hflataux{\_} \circ ( (\_\backslash_r x)
  1716 Some "canonicalization" procedure is required,
  1933 			\circ (\lambda s.\;\; (r \backslash_r s) \cdot r^*)))\; Ss )\;
  1717 which either pushes all the common bitcodes to nodes
  1934 			$\\
  1718 as senior as possible:
  1935 			\\
  1719 \begin{center}
  1936 			$=$ \\
  1720 	$_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
  1937 			\\
  1721 \end{center}
  1938 			$\map \; (\lambda s. (r \backslash_r s) \cdot (r^*)) \; 
  1722 or does the reverse. However bitcodes are not of interest if we are talking about
  1939 			(\starupdate \; x \; r \; Ss)$
  1723 the $\llbracket r \rrbracket$ size of a regex.
  1940 				\end{tabular}
  1724 Therefore for the ease and simplicity of producing a
  1941 			\end{center}
  1725 proof for a size bound, we are happy to restrict ourselves to 
  1942 		\item
  1726 unannotated regular expressions, and obtain such equalities as
  1943 			$\starupdates$ is ``composable'' w.r.t a derivative.
  1727 \begin{lemma}
  1944 			It piggybacks the character $x$ to the tail of the string
  1728 	$\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
  1945 			$xs$.
       
  1946 			\begin{center}
       
  1947 				\begin{tabular}{c}
       
  1948 					$\textit{concat} \; (\map \; \hflataux{\_} \; 
       
  1949 					(\map \; (\_\backslash_r x) \; 
       
  1950 					(\map \; (\lambda s.\;\; (r \backslash_r s) \cdot 
       
  1951 					(r^*) ) \; (\starupdates \; xs \; r \; Ss))))$\\
       
  1952 					\\
       
  1953 					$=$\\
       
  1954 					\\
       
  1955 					$\map \; (\lambda s.\;\; (r\backslash_r s) \cdot (r^*)) \;
       
  1956 					(\starupdates \; (xs @ [x]) \; r \; Ss)$
       
  1957 				\end{tabular}
       
  1958 			\end{center}
       
  1959 	\end{itemize}
       
  1960 \end{lemma}
       
  1961 			
       
  1962 \begin{proof}
       
  1963 	Part 1 is by induction on $Ss$.
       
  1964 	Part 2 is by induction on $xs$, where $Ss$ is left to take arbitrary values.
       
  1965 \end{proof}
       
  1966 			
       
  1967 
       
  1968 Like $\textit{createdBySequence}$, we need
       
  1969 a predicate for ``star-created'' regular expressions:
       
  1970 \begin{center}
       
  1971 	\begin{mathpar}
       
  1972 		\inferrule{\mbox{}}{ \textit{createdByStar}\; \RSEQ{ra}{\RSTAR{rb}} }
       
  1973 
       
  1974 		\inferrule{  \textit{createdByStar} \; r_1\; \land  \; \textit{createdByStar} \; r_2 }{\textit{createdByStar} \; (r_1 + r_2) } 
       
  1975 	\end{mathpar}
       
  1976 \end{center}
       
  1977 \noindent
       
  1978 All regular expressions created by taking derivatives of
       
  1979 $r_1 \cdot (r_2)^*$ satisfy the $\textit{createdByStar}$ predicate:
       
  1980 \begin{lemma}\label{starDersCbs}
       
  1981 	$\textit{createdByStar} \; ((r_1 \cdot r_2^*) \backslash_r s) $ holds.
       
  1982 \end{lemma}
       
  1983 \begin{proof}
       
  1984 	By a reverse induction on $s$.
       
  1985 \end{proof}
       
  1986 If a regular expression conforms to the shape of a star's derivative,
       
  1987 then we can push an application of $\hflataux{\_}$ inside a derivative of it:
       
  1988 \begin{lemma}\label{hfauPushin}
       
  1989 	If $\textit{createdByStar} \; r$ holds, then
       
  1990 	\begin{center}
       
  1991 		$\hflataux{r \backslash_r c} = \textit{concat} \; (
       
  1992 		\map \; \hflataux{\_} (\map \; (\_\backslash_r c) \;(\hflataux{r})))$
       
  1993 	\end{center}
       
  1994 	holds.
       
  1995 \end{lemma}
       
  1996 \begin{proof}
       
  1997 	By an induction on the inductivev cases of $\textit{createdByStar}$.
       
  1998 \end{proof}
       
  1999 %This is not entirely true for annotated regular expressions: 
       
  2000 %%TODO: bsimp bders \neq bderssimp
       
  2001 %\begin{center}
       
  2002 %	$(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
       
  2003 %\end{center}
       
  2004 %For bit-codes, the order in which simplification is applied
       
  2005 %might cause a difference in the location they are placed.
       
  2006 %If we want something like
       
  2007 %\begin{center}
       
  2008 %	$\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
       
  2009 %\end{center}
       
  2010 %Some "canonicalization" procedure is required,
       
  2011 %which either pushes all the common bitcodes to nodes
       
  2012 %as senior as possible:
       
  2013 %\begin{center}
       
  2014 %	$_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
       
  2015 %\end{center}
       
  2016 %or does the reverse. However bitcodes are not of interest if we are talking about
       
  2017 %the $\llbracket r \rrbracket$ size of a regex.
       
  2018 %Therefore for the ease and simplicity of producing a
       
  2019 %proof for a size bound, we are happy to restrict ourselves to 
       
  2020 %unannotated regular expressions, and obtain such equalities as
       
  2021 %TODO: rsimp sflat
       
  2022 % The simplification of a flattened out regular expression, provided it comes
       
  2023 %from the derivative of a star, is the same as the one nested.
       
  2024 
       
  2025 
       
  2026 
       
  2027 Now we introduce an inductive property
       
  2028 for $\starupdate$ and $\hflataux{\_}$.
       
  2029 \begin{lemma}\label{starHfauInduct}
       
  2030 	If we do derivatives of $r^*$
       
  2031 	with a string that starts with $c$,
       
  2032 	then flatten it out,
       
  2033 	we obtain a list
       
  2034 	of the shape $\sum_{s' \in sS} (r\backslash_r s') \cdot r^*$,
       
  2035 	where $sS = \starupdates \; s \; r \; [[c]]$. Namely,
       
  2036 	\begin{center}
       
  2037 		$\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} = 
       
  2038 		\map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; 
       
  2039 		(\starupdates \; s \; r_0 \; [[c]])$
       
  2040 	\end{center}
       
  2041 holds.
       
  2042 \end{lemma}
       
  2043 \begin{proof}
       
  2044 	By an induction on $s$, the inductive cases
       
  2045 	being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used.
       
  2046 \end{proof}
       
  2047 \noindent
       
  2048 
       
  2049 $\hflataux{\_}$ has a similar effect as $\textit{flatten}$:
       
  2050 \begin{lemma}\label{hflatauxGrewrites}
       
  2051 	$a :: rs \grewrites \hflataux{a} @ rs$
       
  2052 \end{lemma}
       
  2053 \begin{proof}
       
  2054 	By induction on $a$. $rs$ is set to take arbitrary values.
       
  2055 \end{proof}
       
  2056 It is also not surprising that $\textit{rsimp}$ will cover
       
  2057 the effect of $\hflataux{\_}$:
       
  2058 \begin{lemma}\label{cbsHfauRsimpeq1}
       
  2059 	$\rsimp{(r_1 + r_2)} = \rsimp{(\RALTS{\hflataux{r_1} @ \hflataux{r_2}})}$
  1729 \end{lemma}
  2060 \end{lemma}
  1730 
  2061 
  1731 \begin{proof}
  2062 \begin{proof}
  1732 	By using the rewriting relation $\rightsquigarrow$
  2063 	By using the rewriting relation $\rightsquigarrow$
  1733 \end{proof}
  2064 \end{proof}
  1734 %TODO: rsimp sflat
       
  1735 And from this we obtain a proof that a star's derivative will be the same
  2065 And from this we obtain a proof that a star's derivative will be the same
  1736 as if it had all its nested alternatives created during deriving being flattened out:
  2066 as if it had all its nested alternatives created during deriving being flattened out:
  1737 For example,
  2067 For example,
  1738 \begin{lemma}
  2068 \begin{lemma}\label{hfauRsimpeq2}
  1739 	$\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
  2069 	$\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
  1740 \end{lemma}
  2070 \end{lemma}
  1741 \begin{proof}
  2071 \begin{proof}
  1742 	By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$.
  2072 	By structural induction on $r$, where the induction rules 
  1743 \end{proof}
  2073 	are these of $\createdByStar{\_}$.
  1744 % The simplification of a flattened out regular expression, provided it comes
  2074 	Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case.
  1745 %from the derivative of a star, is the same as the one nested.
  2075 \end{proof}
  1746 
  2076 
  1747 
  2077 
  1748 
  2078 %Here is a corollary that states the lemma in
  1749 We first introduce an inductive property
  2079 %a more intuitive way:
  1750 for $\starupdate$ and $\hflataux{\_}$, 
  2080 %\begin{corollary}
  1751 it says if we do derivatives of $r^*$
  2081 %	$\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot
  1752 with a string that starts with $c$,
  2082 %	(r^*))\; (\starupdates \; c\; r\; [[c]])$
  1753 then flatten it out,
  2083 %\end{corollary}
  1754 we obtain a list
  2084 %\noindent
  1755 of the shape $\sum_{s' \in sSet} (r\backslash_r s') \cdot r^*$,
  2085 %Note that this is also agnostic of the simplification
  1756 where $sSet = \starupdates \; s \; r \; [[c]]$.
  2086 %function we defined, and is therefore of more general interest.
  1757 \begin{lemma}\label{starHfauInduct}
  2087 
  1758 	$\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} = 
       
  1759 	\map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; 
       
  1760 	(\starupdates \; s \; r_0 \; [[c]])$
       
  1761 \end{lemma}
       
  1762 \begin{proof}
       
  1763 	By an induction on $s$, the inductive cases
       
  1764 	being $[]$ and $s@[c]$.
       
  1765 \end{proof}
       
  1766 \noindent
       
  1767 Here is a corollary that states the lemma in
       
  1768 a more intuitive way:
       
  1769 \begin{corollary}
       
  1770 	$\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot
       
  1771 	(r^*))\; (\starupdates \; c\; r\; [[c]])$
       
  1772 \end{corollary}
       
  1773 \noindent
       
  1774 Note that this is also agnostic of the simplification
       
  1775 function we defined, and is therefore of more general interest.
       
  1776 
       
  1777 Now adding the $\rsimp{}$ bit for closed forms,
       
  1778 we have
       
  1779 \begin{lemma}
       
  1780 	$a :: rs \grewrites \hflataux{a} @ rs$
       
  1781 \end{lemma}
       
  1782 \noindent
       
  1783 giving us
       
  1784 \begin{lemma}\label{cbsHfauRsimpeq1}
       
  1785 	$\rsimp{a+b} = \rsimp{(\sum \hflataux{a} @ \hflataux{b})}$.
       
  1786 \end{lemma}
       
  1787 \noindent
       
  1788 This yields
       
  1789 \begin{lemma}\label{hfauRsimpeq2}
       
  1790 	$\rsimp{r} = \rsimp{(\sum \hflataux{r})}$
       
  1791 \end{lemma}
       
  1792 \noindent
       
  1793 Together with the rewriting relation
  2088 Together with the rewriting relation
  1794 \begin{lemma}\label{starClosedForm6Hrewrites}
  2089 \begin{lemma}\label{starClosedForm6Hrewrites}
  1795 	$\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss
  2090 	We have the following set of rewriting relations or equalities:
  1796 	\scfrewrites
  2091 	\begin{itemize}
  1797 	\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$
  2092 		\item
  1798 \end{lemma}
  2093 			$\textit{rsimp} \; (r^* \backslash_r (c::s)) 
  1799 \noindent
  2094 			\sequal
  1800 We obtain the closed form for star regular expression:
  2095 			\sum \; ( ( \sum (\lambda s. (r\backslash_r s) \cdot r^*) \; (
  1801 \begin{lemma}\label{starClosedForm}
  2096 			\starupdates \; s \; r \; [ c::[]] ) ) )$
       
  2097 		\item
       
  2098 			$r \backslash_{rsimp} (c::s) = \textit{rsimp} \; ( (
       
  2099 			\sum ( (\map \; (\lambda s_1. (r\backslash s_1) \; r^*) \;
       
  2100 			(\starupdates \;s \; r \; [ c::[] ])))))$
       
  2101 		\item
       
  2102 			$\sum ( (\map \; (\lambda s. (r\backslash s) \; r^*) \; Ss))
       
  2103 			\sequal
       
  2104 			 \sum ( (\map \; (\lambda s. \textit{rsimp} \; (r\backslash s) \;
       
  2105 			 r^*) \; Ss) )$
       
  2106 		\item
       
  2107 			$\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss
       
  2108 			\scfrewrites
       
  2109 			\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$
       
  2110 		\item
       
  2111 			$( ( \sum ( ( \map \ (\lambda s. \;\;
       
  2112 			(\textit{rsimp} \; (r \backslash_r s)) \cdot r^*) \; (\starupdates \;
       
  2113 			s \; r \; [ c::[] ])))))$\\
       
  2114 			$\sequal$\\
       
  2115 			$( ( \sum ( ( \map \ (\lambda s. \;\;
       
  2116 			( r \backslash_{rsimp} s)) \cdot r^*) \; (\starupdates \;
       
  2117 			s \; r \; [ c::[] ]))))$\\
       
  2118 	\end{itemize}
       
  2119 \end{lemma}
       
  2120 \begin{proof}
       
  2121 	Part 1 leads to part 2.
       
  2122 	The rest of them are routine.
       
  2123 \end{proof}
       
  2124 \noindent
       
  2125 Next the closed form for star regular expressions can be derived:
       
  2126 \begin{theorem}\label{starClosedForm}
  1802 	$\rderssimp{r^*}{c::s} = 
  2127 	$\rderssimp{r^*}{c::s} = 
  1803 	\rsimp{
  2128 	\rsimp{
  1804 		(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; 
  2129 		(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; 
  1805 		(\starupdates \; s\; r \; [[c]])
  2130 		(\starupdates \; s\; r \; [[c]])
  1806 		)
  2131 		)
  1807 		)
  2132 		)
  1808 	}
  2133 	}
  1809 	$
  2134 	$
  1810 \end{lemma}
  2135 \end{theorem}
  1811 \begin{proof}
  2136 \begin{proof}
  1812 	By an induction on $s$.
  2137 	By an induction on $s$.
  1813 	The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, and \ref{hfauRsimpeq2}
  2138 	The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, \ref{starClosedForm6Hrewrites} 
       
  2139 	and \ref{hfauRsimpeq2}
  1814 	are used.	
  2140 	are used.	
       
  2141 	In \ref{starClosedForm6Hrewrites}, the equalities are
       
  2142 	used to link the LHS and RHS.
  1815 \end{proof}
  2143 \end{proof}
  1816 
  2144 
  1817 
  2145 
  1818 
  2146 
  1819 
  2147 
  1879 %\end{center}
  2207 %\end{center}
  1880 \section{Bounding Closed Forms}
  2208 \section{Bounding Closed Forms}
  1881 
  2209 
  1882 In this section, we introduce how we formalised the bound
  2210 In this section, we introduce how we formalised the bound
  1883 on closed forms.
  2211 on closed forms.
  1884 We first prove that functions such as $\rflts$
  2212 We first show that in general regular expressions up to a certain 
       
  2213 size are finite.
       
  2214 Then we prove that functions such as $\rflts$
  1885 will not cause the size of r-regular expressions to grow.
  2215 will not cause the size of r-regular expressions to grow.
  1886 Putting this together with a general bound 
  2216 Putting this together with a general bound 
  1887 on the finiteness of distinct regular expressions
  2217 on the finiteness of distinct regular expressions
  1888 smaller than a certain size, we obtain a bound on 
  2218 up to a certain size, we obtain a bound on 
  1889 the closed forms.
  2219 the closed forms.
       
  2220 
       
  2221 \subsection{Finiteness of Distinct Regular Expressions}
       
  2222 We define the set of regular expressions whose size are no more than
       
  2223 a certain size $N$ as $\textit{sizeNregex} \; N$:
       
  2224 \[
       
  2225 	\textit{sizeNregex} \; N \dn \{r\; \mid \;  \llbracket r \rrbracket_r \leq N \}
       
  2226 \]
       
  2227 We have that $\textit{sizeNregex} \; N$ is always a finite set:
       
  2228 \begin{lemma}\label{finiteSizeN}
       
  2229 	$\textit{finite} \; (\textit{sizeNregex} \; N)$ holds.
       
  2230 \end{lemma}
       
  2231 \begin{proof}
       
  2232 	By splitting the set $\textit{sizeNregex} \; (N + 1)$ into
       
  2233 	subsets by their categories:
       
  2234 	$\{\ZERO, \ONE, c\}$, $\{* `` \textit{sizeNregex} \; N\}$,
       
  2235 	and so on. Each of these subsets are finitely bounded.
       
  2236 \end{proof}
       
  2237 \noindent
       
  2238 From this we get a corollary that
       
  2239 if forall $r \in rs$, $\rsize{r} \leq N$, then the output of 
       
  2240 $\rdistinct{rs}{\varnothing}$ is a list of regular
       
  2241 expressions of finite size depending on $N$ only. 
       
  2242 \begin{corollary}\label{finiteSizeNCorollary}
       
  2243 	$\rsize{\rdistinct{rs}{\varnothing}} \leq c_N * N$ holds,
       
  2244 	where the constant $c_N$ is equal to $\textit{card} \; (\textit{sizeNregex} \;
       
  2245 	N)$.
       
  2246 \end{corollary}
       
  2247 \begin{proof}
       
  2248 	For all $r$ in 
       
  2249 	$\textit{set} \; (\rdistinct{rs}{\varnothing})$,
       
  2250 	it is always the case that $\rsize{r} \leq N$.
       
  2251 	In addition, the list length is bounded by
       
  2252 	$c_N$, yielding the desired bound.
       
  2253 \end{proof}
       
  2254 \noindent
       
  2255 This fact will be handy in estimating the closed form sizes.
       
  2256 %We have proven that the size of the
       
  2257 %output of $\textit{rdistinct} \; rs' \; \varnothing$
       
  2258 %is bounded by a constant $N * c_N$ depending only on $N$,
       
  2259 %provided that each of $rs'$'s element
       
  2260 %is bounded by $N$.
  1890 
  2261 
  1891 \subsection{$\textit{rsimp}$ Does Not Increment the Size}
  2262 \subsection{$\textit{rsimp}$ Does Not Increment the Size}
  1892 Although it seems evident, we need a series
  2263 Although it seems evident, we need a series
  1893 of non-trivial lemmas to establish that functions such as $\rflts$
  2264 of non-trivial lemmas to establish that functions such as $\rflts$
  1894 do not cause the regular expressions to grow.
  2265 do not cause the regular expressions to grow.
  1942 \begin{proof}
  2313 \begin{proof}
  1943 	By \ref{rsimpMonoLemmas}.
  2314 	By \ref{rsimpMonoLemmas}.
  1944 \end{proof}
  2315 \end{proof}
  1945 
  2316 
  1946 \subsection{Estimating the Closed Forms' sizes}
  2317 \subsection{Estimating the Closed Forms' sizes}
  1947 We now summarize the closed forms below:
  2318 We recap the closed forms we obtained
       
  2319 earlier by putting them together down below:
  1948 \begin{itemize}
  2320 \begin{itemize}
  1949 	\item
  2321 	\item
  1950 		$\rderssimp{(\sum rs)}{s} \sequal
  2322 		$\rderssimp{(\sum rs)}{s} \sequal
  1951 		\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
  2323 		\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
  1952 	\item
  2324 	\item
  1976 which will each have a size uppder bound 
  2348 which will each have a size uppder bound 
  1977 according to inductive hypothesis, which controls $r \backslash s$.
  2349 according to inductive hypothesis, which controls $r \backslash s$.
  1978 
  2350 
  1979 We elaborate the above reasoning by a series of lemmas
  2351 We elaborate the above reasoning by a series of lemmas
  1980 below, where straightforward proofs are omitted.
  2352 below, where straightforward proofs are omitted.
  1981 \begin{lemma}
  2353 
  1982 	If $\forall r \in rs. \rsize{r} $ is less than or equal to $N$,
  2354 
  1983 	and $\textit{length} \; rs$ is less than or equal to $l$,
  2355 
  1984 	then $\rsize{\sum rs}$ is less than or equal to $l*N + 1$.
  2356 
  1985 \end{lemma}
       
  1986 \noindent
       
  1987 If we define all regular expressions with size no
       
  1988 more than $N$ as $\sizeNregex \; N$:
       
  1989 \[
       
  1990 	\sizeNregex \; N \dn  \{r \mid \rsize{r} \leq N \}
       
  1991 \]
       
  1992 Then such set is finite:
       
  1993 \begin{lemma}\label{finiteSizeN}
       
  1994 	$\textit{isFinite}\; (\sizeNregex \; N)$
       
  1995 \end{lemma}
       
  1996 \begin{proof}
       
  1997 	By overestimating the set $\sizeNregex \; N + 1$
       
  1998 	using union of sets like
       
  1999 	$\{r_1 \cdot r_2 \mid r_1 \in A
       
  2000 		\text{and}
       
  2001 	r_2 \in A\}
       
  2002 	$ where $A = \sizeNregex \; N$.
       
  2003 \end{proof}
       
  2004 \noindent
       
  2005 From this we get a corollary that
       
  2006 if forall $r \in rs$, $\rsize{r} \leq N$, then the output of 
       
  2007 $\rdistinct{rs}{\varnothing}$ is a list of regular
       
  2008 expressions of finite size depending on $N$ only. 
       
  2009 \begin{corollary}\label{finiteSizeNCorollary}
       
  2010 	Assumes that for all $r \in rs. \rsize{r} \leq N$,
       
  2011 	and the cardinality of $\sizeNregex \; N$ is $c_N$
       
  2012 	then$\rsize{\rdistinct{rs}{\varnothing}} \leq c*N$.
       
  2013 \end{corollary}
       
  2014 \noindent
       
  2015 We have proven that the output of $\rdistinct{rs'}{\varnothing}$
       
  2016 is bounded by a constant $c_N$ depending only on $N$,
       
  2017 provided that each of $rs'$'s element
       
  2018 is bounded by $N$.
       
  2019 We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.
  2357 We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.
  2020 
  2358 
  2021 We show that $\rdistinct$ and $\rflts$
  2359 We show that $\rdistinct$ and $\rflts$
  2022 working together is at least as 
  2360 working together is at least as 
  2023 good as $\rdistinct{}{}$ alone, which can be written as
  2361 good as $\rdistinct{}{}$ alone, which can be written as
  2039 We cannot simply prove how each helper function
  2377 We cannot simply prove how each helper function
  2040 reduces the size and then put them together:
  2378 reduces the size and then put them together:
  2041 From
  2379 From
  2042 \begin{center}
  2380 \begin{center}
  2043 $\llbracket  \textit{rflts}\; rs \rrbracket_r \leq
  2381 $\llbracket  \textit{rflts}\; rs \rrbracket_r \leq
  2044 	\llbracket \; \textit{rs} \rrbracket_r$
  2382 	\llbracket  \textit{rs} \rrbracket_r$
  2045 \end{center}
  2383 \end{center}
  2046 and
  2384 and
  2047 \begin{center}
  2385 \begin{center}
  2048      $\llbracket  \textit{rdistinct} \; rs \; \varnothing \leq
  2386      $\llbracket  \textit{rdistinct} \; rs \; \varnothing \leq
  2049      \llbracket rs \rrbracket_r$
  2387      \llbracket rs \rrbracket_r$
  2050 \end{center}
  2388 \end{center}
  2051 one cannot imply
  2389 one cannot infer 
  2052 \begin{center}
  2390 \begin{center}
  2053 	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
  2391 	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
  2054 	\leq 
  2392 	\leq 
  2055 	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
  2393 	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
  2056 \end{center}
  2394 \end{center}
  2057 What we can imply is that 
  2395 What we can infer is that 
  2058 \begin{center}
  2396 \begin{center}
  2059 	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
  2397 	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
  2060 	\leq
  2398 	\leq
  2061 	\llbracket rs \rrbracket_r$
  2399 	\llbracket rs \rrbracket_r$
  2062 \end{center}
  2400 \end{center}
  2063 but this estimate is too rough and $\llbracket rs \rrbracket_r$	is unbounded.
  2401 but this estimate is too rough and $\llbracket rs \rrbracket_r$	is unbounded.
  2064 The way we 
  2402 The way we 
  2065 get through this is by first proving a more general lemma 
  2403 get around this is by first proving a more general lemma 
  2066 (so that the inductive case goes through):
  2404 (so that the inductive case goes through):
  2067 \begin{lemma}\label{fltsSizeReductionAlts}
  2405 \begin{lemma}\label{fltsSizeReductionAlts}
  2068 	If we have three accumulator sets:
  2406 	If we have three accumulator sets:
  2069 	$noalts\_set$, $alts\_set$ and $corr\_set$,
  2407 	$noalts\_set$, $alts\_set$ and $corr\_set$,
  2070 	satisfying:
  2408 	satisfying:
  2085 	\end{tabular}
  2423 	\end{tabular}
  2086 	\end{center}
  2424 	\end{center}
  2087 		holds.
  2425 		holds.
  2088 \end{lemma}
  2426 \end{lemma}
  2089 \noindent
  2427 \noindent
  2090 We need to split the accumulator into two parts: the part
  2428 We split the accumulator into two parts: the part
  2091 which contains alternative regular expressions ($alts\_set$), and 
  2429 which contains alternative regular expressions ($alts\_set$), and 
  2092 the part without any of them($noalts\_set$).
  2430 the part without any of them($noalts\_set$).
       
  2431 This is because $\rflts$ opens up the alternatives in $as$,
       
  2432 causing the accumulators on both sides of the inequality
       
  2433 to diverge slightly.
       
  2434 If we want to compare the accumulators that are not
       
  2435 perfectly in sync, we need to consider the alternatives and non-alternatives
       
  2436 separately.
  2093 The set $corr\_set$ is the corresponding set
  2437 The set $corr\_set$ is the corresponding set
  2094 of $alts\_set$ with all elements under the $\sum$ constructor
  2438 of $alts\_set$ with all elements under the alternative constructor
  2095 spilled out.
  2439 spilled out.
  2096 \begin{proof}
  2440 \begin{proof}
  2097 	By induction on the list $as$. We make use of lemma \ref{rdistinctConcat}.
  2441 	By induction on the list $as$. We make use of lemma \ref{rdistinctConcat}.
  2098 \end{proof}
  2442 \end{proof}
  2099 By setting all three sets to the empty set, one gets the desired size estimate:
  2443 By setting all three sets to the empty set, one gets the desired size estimate:
  2104 \end{corollary}
  2448 \end{corollary}
  2105 \begin{proof}
  2449 \begin{proof}
  2106 	By using the lemma \ref{fltsSizeReductionAlts}.
  2450 	By using the lemma \ref{fltsSizeReductionAlts}.
  2107 \end{proof}
  2451 \end{proof}
  2108 \noindent
  2452 \noindent
  2109 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of 
  2453 The intuition for why this is true
       
  2454 is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of 
  2110 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. 
  2455 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. 
  2111 
  2456 
  2112 Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$:
  2457 Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$:
  2113 \begin{lemma}\label{altsSimpControl}
  2458 \begin{lemma}\label{altsSimpControl}
  2114 	$\rsize{\rsimp{\sum rs}} \leq \rsize{\rdistinct{rs}{\varnothing}}+ 1$
  2459 	$\rsize{\rsimp{\sum rs}} \leq \rsize{\rdistinct{rs}{\varnothing}}+ 1$
  2115 \end{lemma}
  2460 \end{lemma}
  2116 \begin{proof}
  2461 \begin{proof}
  2117 	By using \ref{interactionFltsDB}.
  2462 	By using corollary \ref{interactionFltsDB}.
  2118 \end{proof}
  2463 \end{proof}
  2119 \noindent
  2464 \noindent
  2120 This is a key lemma in establishing the bounds on all the 
  2465 This is a key lemma in establishing the bounds on all the 
  2121 closed forms.
  2466 closed forms.
  2122 With this we are now ready to control the sizes of
  2467 With this we are now ready to control the sizes of
  2123 $(r_1 \cdot r_2 )\backslash s$, $r^* \backslash s$.
  2468 $(r_1 \cdot r_2 )\backslash s$ and $r^* \backslash s$.
  2124 \begin{theorem}
  2469 \begin{theorem}\label{rBound}
  2125 	For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$
  2470 	For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$
  2126 \end{theorem}
  2471 \end{theorem}
  2127 \noindent
  2472 \noindent
  2128 \begin{proof}
  2473 \begin{proof}
  2129 	We prove this by induction on $r$. The base cases for $\RZERO$,
  2474 	We prove this by induction on $r$. The base cases for $\RZERO$,
  2144 	\map \; (r_2\backslash_{rsimp} \_)\; (\vsuf{s}{r}))}{\varnothing} \rrbracket_r  + 1$ & (2) \\
  2489 	\map \; (r_2\backslash_{rsimp} \_)\; (\vsuf{s}{r}))}{\varnothing} \rrbracket_r  + 1$ & (2) \\
  2145 											     & $\leq$ & $2 + N_1 + \rsize{r_2} + (N_2 * (card\;(\sizeNregex \; N_2)))$ & (3)\\
  2490 											     & $\leq$ & $2 + N_1 + \rsize{r_2} + (N_2 * (card\;(\sizeNregex \; N_2)))$ & (3)\\
  2146 \end{tabular}
  2491 \end{tabular}
  2147 \end{center}
  2492 \end{center}
  2148 \noindent
  2493 \noindent
  2149 (1) is by the corollary \ref{seqEstimate1}.
  2494 (1) is by theorem \ref{seqClosedForm}.
  2150 (2) is by \ref{altsSimpControl}.
  2495 (2) is by \ref{altsSimpControl}.
  2151 (3) is by \ref{finiteSizeNCorollary}.
  2496 (3) is by \ref{finiteSizeNCorollary}.
  2152 
  2497 
  2153 
  2498 
  2154 Combining the cases when $s = []$ and $s \neq []$, we get (4):
  2499 Combining the cases when $s = []$ and $s \neq []$, we get (4):
  2181 					 & $\leq$ & $1 + (\textit{card} (\sizeNregex \; (N + n_r)))
  2526 					 & $\leq$ & $1 + (\textit{card} (\sizeNregex \; (N + n_r)))
  2182 	* (1 + (N + n_r)) $ & (7)\\
  2527 	* (1 + (N + n_r)) $ & (7)\\
  2183 \end{tabular}
  2528 \end{tabular}
  2184 \end{center}
  2529 \end{center}
  2185 \noindent
  2530 \noindent
  2186 (5) is by the lemma  \ref{starClosedForm}.
  2531 (5) is by theorem \ref{starClosedForm}.
  2187 (6) is by \ref{altsSimpControl}.
  2532 (6) is by \ref{altsSimpControl}.
  2188 (7) is by \ref{finiteSizeNCorollary}.
  2533 (7) is by corollary \ref{finiteSizeNCorollary}.
  2189 Combining with the case when $s = []$, one gets
  2534 Combining with the case when $s = []$, one gets
  2190 \begin{center}
  2535 \begin{center}
  2191 	\begin{tabular}{lcll}
  2536 	\begin{tabular}{lcll}
  2192 		$\rsize{r^* \backslash_r s}$ & $\leq$ & $max \; n_r \; 1 + (\textit{card} (\sizeNregex \; (N + n_r)))
  2537 		$\rsize{r^* \backslash_r s}$ & $\leq$ & $max \; n_r \; 1 + (\textit{card} (\sizeNregex \; (N + n_r)))
  2193 		* (1 + (N + n_r)) $ & (8)\\
  2538 		* (1 + (N + n_r)) $ & (8)\\
  2206 & $\leq$ & $\llbracket (\sum (\map \; (\_ \backslash_{rsimp} s)  \; rs) ) \rrbracket_r $  & (10) \\
  2551 & $\leq$ & $\llbracket (\sum (\map \; (\_ \backslash_{rsimp} s)  \; rs) ) \rrbracket_r $  & (10) \\
  2207 & $\leq$ & $1 + N * (length \; rs) $ & (11)\\
  2552 & $\leq$ & $1 + N * (length \; rs) $ & (11)\\
  2208 	\end{tabular}
  2553 	\end{tabular}
  2209 \end{center}
  2554 \end{center}
  2210 \noindent
  2555 \noindent
  2211 (9) is by \ref{altsClosedForm}, (10) by \ref{rsimpSize} and (11) by inductive hypothesis.
  2556 (9) is by theorem \ref{altsClosedForm}, (10) by lemma \ref{rsimpMono} and (11) by inductive hypothesis.
  2212 
  2557 
  2213 Combining with the case when $s = []$, one gets
  2558 Combining with the case when $s = []$, one gets
  2214 \begin{center}
  2559 \begin{center}
  2215 	\begin{tabular}{lcll}
  2560 	\begin{tabular}{lcll}
  2216 		$\rsize{\sum rs \backslash_r s}$ & $\leq$ & $max \; \rsize{\sum rs} \; 1+N*(length \; rs)$ 
  2561 		$\rsize{\sum rs \backslash_r s}$ & $\leq$ & $max \; \rsize{\sum rs} \; 1+N*(length \; rs)$ 
  2217 						 & (12)\\
  2562 						 & (12)\\
  2218 	\end{tabular}
  2563 	\end{tabular}
  2219 \end{center}
  2564 \end{center}
  2220 (4), (8), and (12) are all the inductive cases proven.
  2565 We have all the inductive cases proven.
  2221 \end{proof}
  2566 \end{proof}
  2222 
  2567 
  2223 
  2568 This leads to our main result on the size bound:
  2224 \begin{corollary}
  2569 \begin{corollary}
  2225 	For any regex $a$, $\exists N_r. \forall s. \; \rsize{\bderssimp{a}{s}} \leq N_r$
  2570 	For any annotated regular expression $a$, $\exists N_r. \forall s. \; \rsize{\bderssimp{a}{s}} \leq N_r$
  2226 \end{corollary}
  2571 \end{corollary}
  2227 \begin{proof}
  2572 \begin{proof}
  2228 	By \ref{sizeRelations}.
  2573 	By lemma \ref{sizeRelations} and theorem \ref{rBound}.
  2229 \end{proof}
  2574 \end{proof}
  2230 \noindent
  2575 \noindent
  2231 
  2576 
  2232 
  2577 
  2233 
  2578 
  2241 %----------------------------------------------------------------------------------------
  2586 %----------------------------------------------------------------------------------------
  2242 %	SECTION 3
  2587 %	SECTION 3
  2243 %----------------------------------------------------------------------------------------
  2588 %----------------------------------------------------------------------------------------
  2244 
  2589 
  2245 
  2590 
  2246 \subsection{A Closed Form for the Sequence Regular Expression}
  2591 \section{Comments and Future Improvements}
  2247 \noindent
  2592 \subsection{Some Experimental Results}
  2248 
       
  2249 Before we get to the proof that says the intermediate result of our lexer will
       
  2250 remain finitely bounded, which is an important efficiency/liveness guarantee,
       
  2251 we shall first develop a few preparatory properties and definitions to 
       
  2252 make the process of proving that a breeze.
       
  2253 
       
  2254 We define rewriting relations for $\rrexp$s, which allows us to do the 
       
  2255 same trick as we did for the correctness proof,
       
  2256 but this time we will have stronger equalities established.
       
  2257 
       
  2258 
       
  2259 
       
  2260 What guarantee does this bound give us?
  2593 What guarantee does this bound give us?
  2261 
       
  2262 Whatever the regex is, it will not grow indefinitely.
  2594 Whatever the regex is, it will not grow indefinitely.
  2263 Take our previous example $(a + aa)^*$ as an example:
  2595 Take our previous example $(a + aa)^*$ as an example:
  2264 \begin{center}
  2596 \begin{center}
  2265 	\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
  2597 	\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
  2266 		\begin{tikzpicture}
  2598 		\begin{tikzpicture}
  2276 				scaled ticks=false,
  2608 				scaled ticks=false,
  2277 				axis lines=left,
  2609 				axis lines=left,
  2278 				width=5cm,
  2610 				width=5cm,
  2279 				height=4cm, 
  2611 				height=4cm, 
  2280 				legend entries={$(a + aa)^*$},  
  2612 				legend entries={$(a + aa)^*$},  
  2281 				legend pos=north west,
  2613 				legend pos=south east,
  2282 				legend cell align=left]
  2614 				legend cell align=left]
  2283 				\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
  2615 				\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
  2284 			\end{axis}
  2616 			\end{axis}
  2285 		\end{tikzpicture}
  2617 		\end{tikzpicture}
  2286 	\end{tabular}
  2618 	\end{tabular}
  2295 Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
  2627 Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
  2296 inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function
  2628 inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function
  2297 $f(x) = x * 2^x$.
  2629 $f(x) = x * 2^x$.
  2298 This means the bound we have will surge up at least
  2630 This means the bound we have will surge up at least
  2299 tower-exponentially with a linear increase of the depth.
  2631 tower-exponentially with a linear increase of the depth.
  2300 For a regex of depth $n$, the bound
  2632 
  2301 would be approximately $4^n$.
  2633 One might be quite skeptical about what this non-elementary
  2302 
  2634 bound can bring us.
  2303 Test data in the graphs from randomly generated regular expressions
  2635 It turns out that the giant bounds are far from being hit.
  2304 shows that the giant bounds are far from being hit.
  2636 Here we have some test data from randomly generated regular expressions:
  2305 %a few sample regular experessions' derivatives
  2637 \begin{figure}[H]
  2306 %size change
  2638 	\begin{tabular}{@{}c@{\hspace{2mm}}c@{\hspace{0mm}}c@{}}
  2307 %TODO: giving regex1_size_change.data showing a few regular expressions' size changes 
       
  2308 %w;r;t the input characters number, where the size is usually cubic in terms of original size
       
  2309 %a*, aa*, aaa*, .....
       
  2310 %randomly generated regular expressions
       
  2311 \begin{figure}{H}
       
  2312 	\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
  2313 		\begin{tikzpicture}
  2639 		\begin{tikzpicture}
  2314 			\begin{axis}[
  2640 			\begin{axis}[
  2315 				xlabel={number of characters},
  2641 				xlabel={$n$},
  2316 				x label style={at={(1.05,-0.05)}},
  2642 				x label style={at={(1.05,-0.05)}},
  2317 				ylabel={regex size},
  2643 				ylabel={regex size},
  2318 				enlargelimits=false,
  2644 				enlargelimits=false,
  2319 				xtick={0,5,...,30},
  2645 				xtick={0,5,...,30},
  2320 				xmax=33,
  2646 				xmax=33,
  2321 				%ymax=1000,
  2647 				%ymax=1000,
  2322 				%ytick={0,100,...,1000},
  2648 				%ytick={0,100,...,1000},
  2323 				scaled ticks=false,
  2649 				scaled ticks=false,
  2324 				axis lines=left,
  2650 				axis lines=left,
  2325 				width=5cm,
  2651 				width=4.75cm,
  2326 				height=4cm, 
  2652 				height=3.8cm, 
  2327 				legend entries={regex1},  
  2653 				legend entries={regex1},  
  2328 				legend pos=north west,
  2654 				legend pos=north east,
  2329 				legend cell align=left]
  2655 				legend cell align=left]
  2330 				\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
  2656 				\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
  2331 			\end{axis}
  2657 			\end{axis}
  2332 		\end{tikzpicture}
  2658 		\end{tikzpicture}
  2333   &
  2659  & 
  2334   \begin{tikzpicture}
  2660   \begin{tikzpicture}
  2335 	  \begin{axis}[
  2661 	  \begin{axis}[
  2336 		  xlabel={$n$},
  2662 		  xlabel={$n$},
  2337 		  x label style={at={(1.05,-0.05)}},
  2663 		  x label style={at={(1.05,-0.05)}},
  2338 		  %ylabel={time in secs},
  2664 		  %ylabel={time in secs},
  2341 		  xmax=33,
  2667 		  xmax=33,
  2342 		  %ymax=1000,
  2668 		  %ymax=1000,
  2343 		  %ytick={0,100,...,1000},
  2669 		  %ytick={0,100,...,1000},
  2344 		  scaled ticks=false,
  2670 		  scaled ticks=false,
  2345 		  axis lines=left,
  2671 		  axis lines=left,
  2346 		  width=5cm,
  2672 		  width=4.75cm,
  2347 		  height=4cm, 
  2673 		  height=3.8cm, 
  2348 		  legend entries={regex2},  
  2674 		  legend entries={regex2},  
  2349 		  legend pos=north west,
  2675 		  legend pos=south east,
  2350 		  legend cell align=left]
  2676 		  legend cell align=left]
  2351 		  \addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
  2677 		  \addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
  2352 	  \end{axis}
  2678 	  \end{axis}
  2353   \end{tikzpicture}
  2679   \end{tikzpicture}
  2354   &
  2680  & 
  2355   \begin{tikzpicture}
  2681   \begin{tikzpicture}
  2356 	  \begin{axis}[
  2682 	  \begin{axis}[
  2357 		  xlabel={$n$},
  2683 		  xlabel={$n$},
  2358 		  x label style={at={(1.05,-0.05)}},
  2684 		  x label style={at={(1.05,-0.05)}},
  2359 		  %ylabel={time in secs},
  2685 		  %ylabel={time in secs},
  2362 		  xmax=33,
  2688 		  xmax=33,
  2363 		  %ymax=1000,
  2689 		  %ymax=1000,
  2364 		  %ytick={0,100,...,1000},
  2690 		  %ytick={0,100,...,1000},
  2365 		  scaled ticks=false,
  2691 		  scaled ticks=false,
  2366 		  axis lines=left,
  2692 		  axis lines=left,
  2367 		  width=5cm,
  2693 		  width=4.75cm,
  2368 		  height=4cm, 
  2694 		  height=3.8cm, 
  2369 		  legend entries={regex3},  
  2695 		  legend entries={regex3},  
  2370 		  legend pos=north west,
  2696 		  legend pos=south east,
  2371 		  legend cell align=left]
  2697 		  legend cell align=left]
  2372 		  \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
  2698 		  \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
  2373 	  \end{axis}
  2699 	  \end{axis}
  2374   \end{tikzpicture}\\
  2700   \end{tikzpicture}\\
  2375   \multicolumn{3}{c}{Graphs: size change of 3 randomly generated regular expressions $w.r.t.$ input string length.}
  2701   \multicolumn{3}{c}{}
  2376 	\end{tabular}    
  2702 	\end{tabular}    
       
  2703   \caption{Graphs: size change of 3 randomly generated 
       
  2704   regular expressions $w.r.t.$ input string length. 
       
  2705   The x axis represents the length of input.}
  2377 \end{figure}  
  2706 \end{figure}  
  2378 \noindent
  2707 \noindent
  2379 Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the 
  2708 Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the 
  2380 original size.
  2709 original size.
  2381 We will discuss improvements to this bound in the next chapter.
  2710 We will discuss improvements to this bound in the next chapter.
  2382 
  2711 
  2383 
  2712 
  2384 
  2713 
  2385 \section{Possible Further Improvements}
  2714 \subsection{Possible Further Improvements}
  2386 There are two problems with this finiteness result, though.
  2715 There are two problems with this finiteness result, though.\\
  2387 \begin{itemize}
  2716 (i)	
  2388 	\item
  2717 		First, it is not yet a direct formalisation of our lexer's complexity,
  2389 		First, It is not yet a direct formalisation of our lexer's complexity,
       
  2390 		as a complexity proof would require looking into 
  2718 		as a complexity proof would require looking into 
  2391 		the time it takes to execute {\bf all} the operations
  2719 		the time it takes to execute {\bf all} the operations
  2392 		involved in the lexer (simp, collect, decode), not just the derivative.
  2720 		involved in the lexer (simp, collect, decode), not just the derivative.\\
  2393 	\item
  2721 (ii)
  2394 		Second, the bound is not yet tight, and we seek to improve $N_a$ so that
  2722 		Second, the bound is not yet tight, and we seek to improve $N_a$ so that
  2395 		it is polynomial on $\llbracket a \rrbracket$.
  2723 		it is polynomial on $\llbracket a \rrbracket$.\\
  2396 \end{itemize}
  2724 Still, we believe this contribution is useful,
  2397 Still, we believe this contribution is fruitful,
       
  2398 because
  2725 because
  2399 \begin{itemize}
  2726 \begin{itemize}
  2400 	\item
  2727 	\item
  2401 
  2728 
  2402 		The size proof can serve as a cornerstone for a complexity
  2729 		The size proof can serve as a starting point for a complexity
  2403 		formalisation.
  2730 		formalisation.
  2404 		Derivatives are the most important phases of our lexer algorithm.
  2731 		Derivatives are the most important phases of our lexer algorithm.
  2405 		Size properties about derivatives covers the majority of the algorithm
  2732 		Size properties about derivatives covers the majority of the algorithm
  2406 		and is therefore a good indication of complexity of the entire program.
  2733 		and is therefore a good indication of complexity of the entire program.
  2407 	\item
  2734 	\item
  2433 but will eventually level off when the string $s$ is long enough.
  2760 but will eventually level off when the string $s$ is long enough.
  2434 If they grow to a size exponential $w.r.t$ the original regex, our algorithm
  2761 If they grow to a size exponential $w.r.t$ the original regex, our algorithm
  2435 would still be slow.
  2762 would still be slow.
  2436 And unfortunately, we have concrete examples
  2763 And unfortunately, we have concrete examples
  2437 where such regular expressions grew exponentially large before levelling off:
  2764 where such regular expressions grew exponentially large before levelling off:
       
  2765 \begin{center}
  2438 $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
  2766 $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
  2439 (\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
  2767 (\underbrace{a \ldots a}_{\text{n a's}})^*)^*$ 
       
  2768 \end{center}
       
  2769 will already have a maximum
  2440 size that is  exponential on the number $n$ 
  2770 size that is  exponential on the number $n$ 
  2441 under our current simplification rules:
  2771 under our current simplification rules:
  2442 %TODO: graph of a regex whose size increases exponentially.
  2772 %TODO: graph of a regex whose size increases exponentially.
  2443 \begin{center}
  2773 \begin{center}
  2444 	\begin{tikzpicture}
  2774 	\begin{tikzpicture}
  2454 			\addplot[mark=*,blue] table {re-chengsong.data};
  2784 			\addplot[mark=*,blue] table {re-chengsong.data};
  2455 		\end{axis}
  2785 		\end{axis}
  2456 	\end{tikzpicture}
  2786 	\end{tikzpicture}
  2457 \end{center}
  2787 \end{center}
  2458 
  2788 
  2459 For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
  2789 For convenience we use $(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
  2460 to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
  2790 to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
  2461 (\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.
  2791 (\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.
  2462 The exponential size is triggered by that the regex
  2792 The exponential size is triggered by that the regex
  2463 $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
  2793 $\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
  2464 inside the $(\ldots) ^*$ having exponentially many
  2794 inside the $(\ldots) ^*$ having exponentially many
  2465 different derivatives, despite those difference being minor.
  2795 different derivatives, despite those difference being minor.
  2466 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
  2796 $(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
  2467 will therefore contain the following terms (after flattening out all nested 
  2797 will therefore contain the following terms (after flattening out all nested 
  2468 alternatives):
  2798 alternatives):
  2469 \begin{center}
  2799 \begin{center}
  2470 	$(\oplus_{i = 1]{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
  2800 $(\sum_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
       
  2801 [1mm]
  2471 	$(1 \leq m' \leq m )$
  2802 	$(1 \leq m' \leq m )$
  2472 \end{center}
  2803 \end{center}
  2473 These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix).
  2804 There at at least exponentially
       
  2805 many such terms.\footnote{To be exact, these terms are 
       
  2806 distinct for $m' \leq L.C.M.(1, \ldots, n)$, the details are omitted,
       
  2807 but the point is that the number is exponential.
       
  2808 } 
  2474 With each new input character taking the derivative against the intermediate result, more and more such distinct
  2809 With each new input character taking the derivative against the intermediate result, more and more such distinct
  2475 terms will accumulate, 
  2810 terms will accumulate.
  2476 until the length reaches $L.C.M.(1, \ldots, n)$.
  2811 The function $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms 
  2477 $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms 
  2812 \begin{center}
  2478 $(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
  2813 $(\sum_{i = 1}^{n}  
  2479 
  2814 (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  
  2480 $(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
  2815 (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot 
  2481 where $m' \neq m''$ \\
  2816 (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
       
  2817 $(\sum_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot  
       
  2818 (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot 
       
  2819 (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
       
  2820 \end{center}
       
  2821 \noindent
       
  2822 where $m' \neq m''$
  2482 as they are slightly different.
  2823 as they are slightly different.
  2483 This means that with our current simplification methods,
  2824 This means that with our current simplification methods,
  2484 we will not be able to control the derivative so that
  2825 we will not be able to control the derivative so that
  2485 $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$
  2826 $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial. %\leq O((\llbracket r\rrbacket)^c)$
  2486 as there are already exponentially many terms.
       
  2487 These terms are similar in the sense that the head of those terms
  2827 These terms are similar in the sense that the head of those terms
  2488 are all consisted of sub-terms of the form: 
  2828 are all consisted of sub-terms of the form: 
  2489 $(\underbrace{a \ldots a}_{\text{j a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* $.
  2829 $(\underbrace{a \ldots a}_{\text{j a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* $.
  2490 For  $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
  2830 For  $\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
  2491 $n * (n + 1) / 2$ such terms. 
  2831 $n * (n + 1) / 2$ such terms. 
  2492 For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives
  2832 For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives
  2493 can be described by 6 terms:
  2833 can be described by 6 terms:
  2494 $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, 
  2834 $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, 
  2495 $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
  2835 $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
  2496 The total number of different "head terms",  $n * (n + 1) / 2$,
  2836 The total number of different "head terms",  $n * (n + 1) / 2$,
  2497 is proportional to the number of characters in the regex 
  2837 is proportional to the number of characters in the regex 
  2498 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
  2838 $(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
  2499 This suggests a slightly different notion of size, which we call the 
  2839 If we can improve our deduplication process so that it becomes smarter
       
  2840 and only keep track of these $n * (n+1) /2$ terms, then we can keep
       
  2841 the size growth polynomial again.
       
  2842 This example also suggests a slightly different notion of size, which we call the 
  2500 alphabetic width:
  2843 alphabetic width:
  2501 %TODO:
  2844 \begin{center}
  2502 (TODO: Alphabetic width def.)
  2845 	\begin{tabular}{lcl}
       
  2846 		$\textit{awidth} \; \ZERO$ & $\dn$ & $0$\\
       
  2847 		$\textit{awidth} \; \ONE$ & $\dn$ & $0$\\
       
  2848 		$\textit{awidth} \; c$ & $\dn$ & $1$\\
       
  2849 		$\textit{awidth} \; r_1 + r_2$ & $\dn$ & $\textit{awidth} \; 
       
  2850 		r_1 + \textit{awidth} \; r_2$\\
       
  2851 		$\textit{awidth} \; r_1 \cdot r_2$ & $\dn$ & $\textit{awidth} \;
       
  2852 		r_1 + \textit{awidth} \; r_2$\\
       
  2853 		$\textit{awidth} \; r^*$ & $\dn$ & $\textit{awidth} \; r$\\
       
  2854 	\end{tabular}
       
  2855 \end{center}
       
  2856 
  2503 
  2857 
  2504 
  2858 
  2505 Antimirov\parencite{Antimirov95} has proven that 
  2859 Antimirov\parencite{Antimirov95} has proven that 
  2506 $\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$.
  2860 $\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$,
  2507 where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms
  2861 where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms
  2508 created by doing derivatives of $r$ against all possible strings.
  2862 created by doing derivatives of $r$ against all possible strings.
  2509 If we can make sure that at any moment in our lexing algorithm our 
  2863 If we can make sure that at any moment in our lexing algorithm our 
  2510 intermediate result hold at most one copy of each of the 
  2864 intermediate result hold at most one copy of each of the 
  2511 subterms then we can get the same bound as Antimirov's.
  2865 subterms then we can get the same bound as Antimirov's.
  2521 
  2875 
  2522 
  2876 
  2523 %-----------------------------------
  2877 %-----------------------------------
  2524 %	SUBSECTION 1
  2878 %	SUBSECTION 1
  2525 %-----------------------------------
  2879 %-----------------------------------
  2526 \subsection{Syntactic Equivalence Under $\simp$}
  2880 %\subsection{Syntactic Equivalence Under $\simp$}
  2527 We prove that minor differences can be annhilated
  2881 %We prove that minor differences can be annhilated
  2528 by $\simp$.
  2882 %by $\simp$.
  2529 For example,
  2883 %For example,
  2530 \begin{center}
  2884 %\begin{center}
  2531 	$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
  2885 %	$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = 
  2532 	\simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
  2886 %	\simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
  2533 \end{center}
  2887 %\end{center}
  2534 
  2888