ChengsongTanPhdThesis/Chapters/Finite.tex
changeset 639 80cc6dc4c98b
parent 638 dd9dde2d902b
child 640 bd1354127574
equal deleted inserted replaced
638:dd9dde2d902b 639:80cc6dc4c98b
    37 		is not just finite but polynomial in $\llbracket a\rrbracket$.
    37 		is not just finite but polynomial in $\llbracket a\rrbracket$.
    38 	\item
    38 	\item
    39 		Having the finite bound formalised 
    39 		Having the finite bound formalised 
    40 		gives us a higher confidence that
    40 		gives us a higher confidence that
    41 		our simplification algorithm $\simp$ does not ``mis-behave''
    41 		our simplification algorithm $\simp$ does not ``mis-behave''
    42 		like $\simpsulz$ does.
    42 		like $\textit{simpSL}$ does.
    43 		The bound is universal for a given regular expression, 
    43 		The bound is universal for a given regular expression, 
    44 		which is an advantage over work which 
    44 		which is an advantage over work which 
    45 		only gives empirical evidence on 
    45 		only gives empirical evidence on 
    46 		some test cases (see Verbatim++ work\cite{Verbatimpp}).
    46 		some test cases (see for example Verbatim work \cite{Verbatimpp}).
    47 \end{itemize}
    47 \end{itemize}
    48 \noindent
    48 \noindent
    49 We then extend our $\blexersimp$
    49 We then extend our $\blexersimp$
    50 to support bounded repetitions ($r^{\{n\}}$).
    50 to support bounded repetitions ($r^{\{n\}}$).
    51 We update our formalisation of 
    51 We update our formalisation of 
    52 the correctness and finiteness properties to
    52 the correctness and finiteness properties to
    53 include this new construct. 
    53 include this new construct. 
    54 we can out-compete other verified lexers such as
    54 We show that we can out-compete other verified lexers such as
    55 Verbatim++ on bounded regular expressions.
    55 Verbatim++ on bounded regular expressions.
    56 
    56 
    57 In the next section we describe in more detail
    57 In the next section we describe in more detail
    58 what the finite bound means in our algorithm
    58 what the finite bound means in our algorithm
    59 and why the size of the internal data structures of
    59 and why the size of the internal data structures of
    60 a typical derivative-based lexer such as
    60 a typical derivative-based lexer such as
    61 Sulzmann and Lu's need formal treatment.
    61 Sulzmann and Lu's needs formal treatment.
    62 
    62 
    63 
    63 
    64 
    64 
    65 
    65 
    66 \section{Formalising About Size}
    66 \section{Formalising Size Bound of Derivatives}
    67 \noindent
    67 \noindent
    68 In our lexer ($\blexersimp$),
    68 In our lexer ($\blexersimp$),
    69 we take an annotated regular expression as input,
    69 we take an annotated regular expression as input,
    70 and repeately take derivative of and simplify it.
    70 and repeately take derivative of and simplify it.
    71 \begin{figure}
    71 \begin{figure}
   110 \end{figure}
   110 \end{figure}
   111 
   111 
   112 \noindent
   112 \noindent
   113 Each time
   113 Each time
   114 a derivative is taken, the regular expression might grow.
   114 a derivative is taken, the regular expression might grow.
   115 However, the simplification that is immediately afterwards will always shrink it so that 
   115 However, the simplification that is immediately afterwards will often shrink it so that 
   116 it stays relatively small.
   116 the overall size of the derivatives stays relatively small.
   117 This intuition is depicted by the relative size
   117 This intuition is depicted by the relative size
   118 change between the black and blue nodes:
   118 change between the black and blue nodes:
   119 After $\simp$ the node always shrinks.
   119 After $\simp$ the node shrinks.
   120 Our proof states that all the blue nodes
   120 Our proof states that all the blue nodes
   121 stay below a size bound $N_a$ determined by the input $a$.
   121 stay below a size bound $N_a$ determined by the input $a$.
   122 
   122 
   123 \noindent
   123 \noindent
   124 Sulzmann and Lu's assumed a similar picture about their algorithm,
   124 Sulzmann and Lu's assumed a similar picture about their algorithm,
   152 
   152 
   153 	\end{tikzpicture}
   153 	\end{tikzpicture}
   154 	\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{sulzShrinks}
   154 	\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{sulzShrinks}
   155 \end{figure}
   155 \end{figure}
   156 \noindent
   156 \noindent
   157 The picture means that on certain cases their lexer (where they use $\simpsulz$ 
   157 The picture means that in some cases their lexer (where they use $\simpsulz$ 
   158 as the simplification function)
   158 as the simplification function)
   159 will have a size explosion, causing the running time 
   159 will have a size explosion, causing the running time 
   160 of each derivative step to grow continuously (for example 
   160 of each derivative step to grow continuously (for example 
   161 in \ref{SulzmannLuLexerTime}).
   161 in \ref{SulzmannLuLexerTime}).
   162 They tested out the run time of their
   162 They tested out the run time of their
   163 lexer on particular examples such as $(a+b+ab)^*$
   163 lexer on particular examples such as $(a+b+ab)^*$
   164 and claimed that their algorithm is linear w.r.t to the input.
   164 and claimed that their algorithm is linear w.r.t to the input.
   165 With our mecahnised proof, we avoid this type of unintentional
   165 With our mecahnised proof, we avoid this type of unintentional
   166 generalisation.\\
   166 generalisation.
   167 
   167 
   168 Before delving in to the details of the formalisation,
   168 Before delving in to the details of the formalisation,
   169 we are going to provide an overview of it.
   169 we are going to provide an overview of it in the next subsection.
   170 In the next subsection, we give a high-level view
       
   171 of the proof.
       
   172 
   170 
   173 
   171 
   174 \subsection{Overview of the Proof}
   172 \subsection{Overview of the Proof}
   175 A high-level overview of the main components of the finiteness proof
   173 A high-level overview of the main components of the finiteness proof
   176 is as follows:
   174 is as follows:
   233 
   231 
   234 \section{The $\textit{Rrexp}$ Datatype}
   232 \section{The $\textit{Rrexp}$ Datatype}
   235 The first step is to define 
   233 The first step is to define 
   236 $\textit{rrexp}$s.
   234 $\textit{rrexp}$s.
   237 They are annotated regular expressions without bitcodes,
   235 They are annotated regular expressions without bitcodes,
   238 allowing a much simpler size bound proof.
   236 allowing a more convenient size bound proof.
   239 %Of course, the bits which encode the lexing information 
   237 %Of course, the bits which encode the lexing information 
   240 %would grow linearly with respect 
   238 %would grow linearly with respect 
   241 %to the input, which should be taken into account when we wish to tackle the runtime comlexity.
   239 %to the input, which should be taken into account when we wish to tackle the runtime comlexity.
   242 %But for the sake of the structural size 
   240 %But for the sake of the structural size 
   243 %we can safely ignore them.\\
   241 %we can safely ignore them.\\
   246 \emph{r-regular expressions},
   244 \emph{r-regular expressions},
   247 was initially defined in \ref{rrexpDef}.
   245 was initially defined in \ref{rrexpDef}.
   248 The reason for the prefix $r$ is
   246 The reason for the prefix $r$ is
   249 to make a distinction  
   247 to make a distinction  
   250 with basic regular expressions.
   248 with basic regular expressions.
       
   249 We give here again the definition of $\rrexp$.
   251 \[			\rrexp ::=   \RZERO \mid  \RONE
   250 \[			\rrexp ::=   \RZERO \mid  \RONE
   252 	\mid  \RCHAR{c}  
   251 	\mid  \RCHAR{c}  
   253 	\mid  \RSEQ{r_1}{r_2}
   252 	\mid  \RSEQ{r_1}{r_2}
   254 	\mid  \RALTS{rs}
   253 	\mid  \RALTS{rs}
   255 	\mid \RSTAR{r}        
   254 	\mid \RSTAR{r}        
   306 		$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\
   305 		$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\
   307 		$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$
   306 		$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$
   308 	\end{tabular}
   307 	\end{tabular}
   309 \end{center}
   308 \end{center}
   310 \noindent
   309 \noindent
   311 As can be seen alternative regular expression with an empty argument list
   310 As can be seen, alternative regular expressions with an empty argument list
   312 will be turned into a $\ZERO$.
   311 will be turned into a $\ZERO$.
   313 The singleton alternative $\sum [r]$ becomes $r$ during the
   312 The singleton alternative $\sum [r]$ becomes $r$ during the
   314 $\erase$ function.
   313 $\erase$ function.
   315 The  annotated regular expression $\sum[a, b, c]$ would turn into
   314 The  annotated regular expression $\sum[a, b, c]$ would turn into
   316 $(a+(b+c))$.
   315 $(a+(b+c))$.
   317 All these operations change the size and structure of
   316 All these operations change the size and structure of
   318 an annotated regular expressions, adding unnecessary 
   317 an annotated regular expressions, adding unnecessary 
   319 complexities to the size bound proof.\\
   318 complexities to the size bound proof.
       
   319 
   320 For example, if we define the size of a basic plain regular expression 
   320 For example, if we define the size of a basic plain regular expression 
   321 in the usual way,
   321 in the usual way,
   322 \begin{center}
   322 \begin{center}
   323 	\begin{tabular}{lcl}
   323 	\begin{tabular}{lcl}
   324 		$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
   324 		$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
   351 $\llbracket a \rrbracket  \leq \llbracket  a_\downarrow \rrbracket_p $
   351 $\llbracket a \rrbracket  \leq \llbracket  a_\downarrow \rrbracket_p $
   352 and then estimate $\llbracket  a_\downarrow \rrbracket_p$,
   352 and then estimate $\llbracket  a_\downarrow \rrbracket_p$,
   353 but we found our approach more straightforward.\\
   353 but we found our approach more straightforward.\\
   354 
   354 
   355 \subsection{Functions for R-regular Expressions}
   355 \subsection{Functions for R-regular Expressions}
       
   356 The downside of our approach is that we need to redefine
       
   357 several functions for $\rrexp$.
   356 In this section we shall define the r-regular expression version
   358 In this section we shall define the r-regular expression version
   357 of $\blexer$, and $\blexersimp$ related functions.
   359 of $\bder$, and $\textit{bsimp}$ related functions.
   358 We use $r$ as the prefix or subscript to differentiate
   360 We use $r$ as the prefix or subscript to differentiate
   359 with the bitcoded version.
   361 with the bitcoded version.
   360 %For example,$\backslash_r$, $\rdistincts$, and $\rsimp$
   362 %For example,$\backslash_r$, $\rdistincts$, and $\rsimp$
   361 %as opposed to $\backslash$, $\distinctBy$, and $\bsimp$.
   363 %as opposed to $\backslash$, $\distinctBy$, and $\bsimp$.
   362 %As promised, they are much simpler than their bitcoded counterparts.
   364 %As promised, they are much simpler than their bitcoded counterparts.
   382 		(_{[]}r^*))$
   384 		(_{[]}r^*))$
   383 	\end{tabular}    
   385 	\end{tabular}    
   384 \end{center}  
   386 \end{center}  
   385 \noindent
   387 \noindent
   386 where we omit the definition of $\textit{rnullable}$.
   388 where we omit the definition of $\textit{rnullable}$.
   387 The generalisation from derivative w.r.t to character to
   389 The generalisation from the derivatives w.r.t a character to
   388 derivative w.r.t string is given as
   390 derivatives w.r.t strings is given as
   389 \begin{center}
   391 \begin{center}
   390 	\begin{tabular}{lcl}
   392 	\begin{tabular}{lcl}
   391 		$r \backslash_{rs} []$ & $\dn$ & $r$\\
   393 		$r \backslash_{rs} []$ & $\dn$ & $r$\\
   392 		$r \backslash_{rs} c::s$ & $\dn$ & $(r\backslash_r c) \backslash_{rs} s$
   394 		$r \backslash_{rs} c::s$ & $\dn$ & $(r\backslash_r c) \backslash_{rs} s$
   393 	\end{tabular}
   395 	\end{tabular}
   585 \end{center}
   587 \end{center}
   586 where $r = r_1 \cdot r_2$ or $r = r_0^*$ and so on.
   588 where $r = r_1 \cdot r_2$ or $r = r_0^*$ and so on.
   587 For example, for $r_1 \cdot r_2$ we have the equality as
   589 For example, for $r_1 \cdot r_2$ we have the equality as
   588 	\begin{center}
   590 	\begin{center}
   589 		$ \rderssimp{r_1 \cdot r_2}{s} = 
   591 		$ \rderssimp{r_1 \cdot r_2}{s} = 
   590 		\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$
   592 		\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r_2}{\_} \;(\vsuf{s}{r_1})))}$
   591 	\end{center}
   593 	\end{center}
   592 We call the right-hand-side the 
   594 We call the right-hand-side the 
   593 \emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$.
   595 \emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$.
   594 Second, we will bound the closed form of r-regular expressions
   596 Second, we will bound the closed form of r-regular expressions
   595 using some estimation techniques
   597 using some estimation techniques
   665 		\item
   667 		\item
   666 			%If list $rs'$ is the result of $\rdistinct{rs}{acc}$,
   668 			%If list $rs'$ is the result of $\rdistinct{rs}{acc}$,
   667 			$\textit{isDistinct} \;\;\; (\rdistinct{rs}{acc})$.
   669 			$\textit{isDistinct} \;\;\; (\rdistinct{rs}{acc})$.
   668 		\item
   670 		\item
   669 			$\textit{set} \; (\rdistinct{rs}{acc}) 
   671 			$\textit{set} \; (\rdistinct{rs}{acc}) 
   670 			= (textit{set} \; rs) - acc$
   672 			= (\textit{set} \; rs) - acc$
   671 	\end{itemize}
   673 	\end{itemize}
   672 \end{lemma}
   674 \end{lemma}
   673 \noindent
   675 \noindent
   674 \begin{proof}
   676 \begin{proof}
   675 	The first part is by an induction on $rs$.
   677 	The first part is by an induction on $rs$.
   770 \end{lemma}
   772 \end{lemma}
   771 \begin{proof}
   773 \begin{proof}
   772 	By \ref{rdistinctDoesTheJob} and \ref{distinctRemovesMiddle}.
   774 	By \ref{rdistinctDoesTheJob} and \ref{distinctRemovesMiddle}.
   773 \end{proof}
   775 \end{proof}
   774 \noindent
   776 \noindent
   775 The next lemma is a more general form of \ref{rdistinctConcat},
   777 The next lemma is a more general form of \ref{rdistinctConcat};
   776 it says that
   778 It says that
   777 $\textit{rdistinct}$ is composable w.r.t list concatenation:
   779 $\textit{rdistinct}$ is composable w.r.t list concatenation:
   778 \begin{lemma}\label{distinctRdistinctAppend}
   780 \begin{lemma}\label{distinctRdistinctAppend}
   779 			If $\;\; \textit{isDistinct} \; rs_1$, 
   781 			If $\;\; \textit{isDistinct} \; rs_1$, 
   780 			and $(set \; rs_1) \cap acc = \varnothing$,
   782 			and $(set \; rs_1) \cap acc = \varnothing$,
   781 			then applying $\textit{rdistinct}$ on $rs_1 @ rs_a$ does not 
   783 			then applying $\textit{rdistinct}$ on $rs_1 @ rs_a$ does not 
   789 \end{proof}
   791 \end{proof}
   790 \noindent
   792 \noindent
   791 $\textit{rdistinct}$ needs to be applied only once, and 
   793 $\textit{rdistinct}$ needs to be applied only once, and 
   792 applying it multiple times does not make any difference:
   794 applying it multiple times does not make any difference:
   793 \begin{corollary}\label{distinctOnceEnough}
   795 \begin{corollary}\label{distinctOnceEnough}
   794 	$\textit{rdistinct} \; (rs @ rsa) {} = \textit{rdistinct} \; (rdistinct \; 
   796 	$\textit{rdistinct} \; (rs @ rsa) {} = \textit{rdistinct} \; ( (rdistinct \; 
   795 	rs \{ \} @ (\textit{rdistinct} \; rs_a \; (set \; rs)))$
   797 	rs \; \{ \}) @ (\textit{rdistinct} \; rs_a \; (set \; rs)))$
   796 \end{corollary}
   798 \end{corollary}
   797 \begin{proof}
   799 \begin{proof}
   798 	By lemma \ref{distinctRdistinctAppend}.
   800 	By lemma \ref{distinctRdistinctAppend}.
   799 \end{proof}
   801 \end{proof}
   800 
   802 
   843 	last sub-lemma.
   845 	last sub-lemma.
   844 \end{proof}
   846 \end{proof}
   845 \noindent
   847 \noindent
   846 Now we introduce the property that the operations 
   848 Now we introduce the property that the operations 
   847 derivative and $\rsimpalts$
   849 derivative and $\rsimpalts$
   848 commute, this will be used later on, when deriving the closed form for
   850 commute, this will be used later on when deriving the closed form for
   849 the alternative regular expression:
   851 the alternative regular expression:
   850 \begin{lemma}\label{rderRsimpAltsCommute}
   852 \begin{lemma}\label{rderRsimpAltsCommute}
   851 	$\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$
   853 	$\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$
   852 \end{lemma}
   854 \end{lemma}
   853 \begin{proof}
   855 \begin{proof}
   854 	By induction on $rs$.
   856 	By induction on $rs$.
   855 \end{proof}
   857 \end{proof}
   856 \noindent
   858 \noindent
   857 
   859 
   858 \subsubsection{The $RL$ Function: Language Interpretation of $\textit{Rrexp}$s}
   860 \subsubsection{The $RL$ Function: Language Interpretation for $\textit{Rrexp}$s}
   859 Much like the definition of $L$ on plain regular expressions, one can also 
   861 Much like the definition of $L$ on plain regular expressions, one can also 
   860 define the language interpretation of $\rrexp$s.
   862 define the language interpretation for $\rrexp$s.
   861 \begin{center}
   863 \begin{center}
   862 	\begin{tabular}{lcl}
   864 	\begin{tabular}{lcl}
   863 		$RL \; (\ZERO)$ & $\dn$ & $\phi$\\
   865 		$RL \; (\ZERO_r)$ & $\dn$ & $\phi$\\
   864 		$RL \; (\ONE)$ & $\dn$ & $\{[]\}$\\
   866 		$RL \; (\ONE_r)$ & $\dn$ & $\{[]\}$\\
   865 		$RL \; (c)$ & $\dn$ & $\{[c]\}$\\
   867 		$RL \; (c)$ & $\dn$ & $\{[c]\}$\\
   866 		$RL \; \sum rs$ & $\dn$ & $ \bigcup_{r \in rs} (RL \; r)$\\
   868 		$RL \; \sum rs$ & $\dn$ & $ \bigcup_{r \in rs} (RL \; r)$\\
   867 		$RL \; (r_1 \cdot r_2)$ & $\dn$ & $ RL \; (r_1) @ RL \; (r_2)$\\
   869 		$RL \; (r_1 \cdot r_2)$ & $\dn$ & $ RL \; (r_1) @ RL \; (r_2)$\\
   868 		$RL \; (r^*)$ & $\dn$ & $ (RL(r))^*$
   870 		$RL \; (r^*)$ & $\dn$ & $ (RL(r))^*$
   869 	\end{tabular}
   871 	\end{tabular}
   958 \end{lemma}
   960 \end{lemma}
   959 \begin{proof}
   961 \begin{proof}
   960 	By induction on $r$.
   962 	By induction on $r$.
   961 \end{proof}
   963 \end{proof}
   962 \noindent
   964 \noindent
   963 With this we could prove that a regular expressions
   965 With this we can prove that a regular expressions
   964 after simplification and flattening and de-duplication,
   966 after simplification and flattening and de-duplication,
   965 will not contain any alternative regular expression directly:
   967 will not contain any alternative regular expression directly:
   966 \begin{lemma}\label{nonaltFltsRd}
   968 \begin{lemma}\label{nonaltFltsRd}
   967 	If $x \in \rdistinct{\rflts\; (\map \; \rsimp{} \; rs)}{\varnothing}$ 
   969 	If $x \in \rdistinct{\rflts\; (\map \; \rsimp{} \; rs)}{\varnothing}$ 
   968 	then $\textit{nonAlt} \; x$.
   970 	then $\textit{nonAlt} \; x$.
   996 	The first part is by induction on $rs$, where the induction
   998 	The first part is by induction on $rs$, where the induction
   997 	rule is the inductive cases for $\rflts$.
   999 	rule is the inductive cases for $\rflts$.
   998 	The second part is a corollary from the first part.
  1000 	The second part is a corollary from the first part.
   999 \end{proof}
  1001 \end{proof}
  1000 
  1002 
  1001 And this leads to good structural property of $\rsimp{}$,
  1003 This leads to good structural property of $\rsimp{}$,
  1002 that after simplification, a regular expression is
  1004 that after simplification, a regular expression is
  1003 either good or $\RZERO$:
  1005 either good or $\RZERO$:
  1004 \begin{lemma}\label{good1}
  1006 \begin{lemma}\label{good1}
  1005 	For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$.
  1007 	For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$.
  1006 \end{lemma}
  1008 \end{lemma}
  1015 	by that as well.
  1017 	by that as well.
  1016 	The lemmas \ref{nonnestedRsimp} and  \ref{nonaltFltsRd} are used
  1018 	The lemmas \ref{nonnestedRsimp} and  \ref{nonaltFltsRd} are used
  1017 	to ensure that goodness is preserved at the topmost level.
  1019 	to ensure that goodness is preserved at the topmost level.
  1018 \end{proof}
  1020 \end{proof}
  1019 We shall prove that any good regular expression is 
  1021 We shall prove that any good regular expression is 
  1020 a fixed-point for $\rsimp{}$.
  1022 a fixed-point for $\textit{rsimp}$.
  1021 First we prove an auxiliary lemma:
  1023 First we prove an auxiliary lemma:
  1022 \begin{lemma}\label{goodaltsNonalt}
  1024 \begin{lemma}\label{goodaltsNonalt}
  1023 	If $\good \; \sum rs$, then $\rflts\; rs = rs$.
  1025 	If $\good \; \sum rs$, then $\rflts\; rs = rs$.
  1024 \end{lemma}
  1026 \end{lemma}
  1025 \begin{proof}
  1027 \begin{proof}
  1037 	\ref{goodaltsNonalt} and \ref{rdistinctOnDistinct}.
  1039 	\ref{goodaltsNonalt} and \ref{rdistinctOnDistinct}.
  1038 	The lemma \ref{goodaltsNonalt} is used in the alternative
  1040 	The lemma \ref{goodaltsNonalt} is used in the alternative
  1039 	case where 2 or more elements are present in the list.
  1041 	case where 2 or more elements are present in the list.
  1040 \end{proof}
  1042 \end{proof}
  1041 \noindent
  1043 \noindent
  1042 Given below is a property involving $\rflts$, $\textit{rdistinct}$, 
  1044 Below we show a property involving $\rflts$, $\textit{rdistinct}$, 
  1043 $\rsimp{}$ and $\rsimp_{ALTS}$,
  1045 $\rsimp{}$ and $\rsimp_{ALTS}$,
  1044 which requires $\ref{good1}$ to go through smoothly:
  1046 which requires $\ref{good1}$ to go through smoothly:
  1045 \begin{lemma}\label{flattenRsimpalts}
  1047 \begin{lemma}\label{flattenRsimpalts}
  1046 An application of $\rsimp_{ALTS}$ can be ``absorbed'',
  1048 An application of $\rsimp_{ALTS}$ can be ``absorbed'',
  1047 if its output is concatenated with a list and then applied to $\rflts$.
  1049 if its output is concatenated with a list and then applied to $\rflts$.
  1079 \end{proof}
  1081 \end{proof}
  1080 \noindent
  1082 \noindent
  1081 This property means we do not have to repeatedly
  1083 This property means we do not have to repeatedly
  1082 apply simplification in each step, which justifies
  1084 apply simplification in each step, which justifies
  1083 our definition of $\blexersimp$.
  1085 our definition of $\blexersimp$.
       
  1086 This is in contrast to the work of Sulzmann and Lu where
       
  1087 the simplification is applied in a fixpoint manner.
  1084 
  1088 
  1085 
  1089 
  1086 On the other hand, we can repeat the same $\rsimp{}$ applications
  1090 On the other hand, we can repeat the same $\rsimp{}$ applications
  1087 on regular expressions as many times as we want, if we have at least
  1091 on regular expressions as many times as we want, if we have at least
  1088 one simplification applied to it, and apply it wherever we would like to:
  1092 one simplification applied to it, and apply it wherever we need to:
  1089 \begin{corollary}\label{headOneMoreSimp}
  1093 \begin{corollary}\label{headOneMoreSimp}
  1090 	The following properties hold, directly from \ref{rsimpIdem}:
  1094 	The following properties hold, directly from \ref{rsimpIdem}:
  1091 
  1095 
  1092 	\begin{itemize}
  1096 	\begin{itemize}
  1093 		\item
  1097 		\item
  1113 \begin{proof}
  1117 \begin{proof}
  1114 	By application of lemmas \ref{rsimpIdem} and \ref{good1}.
  1118 	By application of lemmas \ref{rsimpIdem} and \ref{good1}.
  1115 \end{proof}
  1119 \end{proof}
  1116 
  1120 
  1117 \noindent
  1121 \noindent
  1118 With the idempotency of $\rsimp{}$ and its corollaries, 
  1122 With the idempotency of $\textit{rsimp}$ and its corollaries, 
  1119 we can start proving some key equalities leading to the 
  1123 we can start proving some key equalities leading to the 
  1120 closed forms.
  1124 closed forms.
  1121 Now presented are a few equivalent terms under $\rsimp{}$.
  1125 Next we present a few equivalent terms under $\textit{rsimp}$.
  1122 To make the notation more concise
  1126 To make the notation more concise
  1123 We use $r_1 \sequal r_2 $ to denote that $\rsimp{r_1} = \rsimp{r_2}$:
  1127 We use $r_1 \sequal r_2 $ to denote that $\rsimp{r_1} = \rsimp{r_2}$.
  1124 \begin{center}
  1128 %\begin{center}
  1125 \begin{tabular}{lcl}
  1129 %\begin{tabular}{lcl}
  1126 	$a \sequal b$ & $ \dn$ & $ \textit{rsimp} \; a = \textit{rsimp} \; b$
  1130 %	$a \sequal b$ & $ \dn$ & $ \textit{rsimp} \; a = \textit{rsimp} \; b$
  1127 \end{tabular}
  1131 %\end{tabular}
  1128 \end{center}
  1132 %\end{center}
  1129 \noindent
  1133 %\noindent
  1130 %\vspace{0em}
  1134 %\vspace{0em}
  1131 \begin{lemma}
  1135 \begin{lemma}
  1132 	The following equivalence hold:
  1136 	The following equivalence hold:
  1133 	\begin{itemize}
  1137 	\begin{itemize}
  1134 	\item
  1138 	\item
  1138 	\item
  1142 	\item
  1139 		$\RALTS{\RALTS{rs}} \sequal \RALTS{rs}$
  1143 		$\RALTS{\RALTS{rs}} \sequal \RALTS{rs}$
  1140 	\item
  1144 	\item
  1141 		$\sum ((\sum rs_a) :: rs_b) \sequal \sum rs_a @ rs_b$
  1145 		$\sum ((\sum rs_a) :: rs_b) \sequal \sum rs_a @ rs_b$
  1142 	\item
  1146 	\item
  1143 		$\RALTS{rs} = \RALTS{\map \; \rsimp{} \; rs}$
  1147 		$\RALTS{rs} \sequal \RALTS{\map \; \rsimp{} \; rs}$
  1144 \end{itemize}
  1148 \end{itemize}
  1145 \end{lemma}
  1149 \end{lemma}
  1146 \begin{proof}
  1150 \begin{proof}
  1147 	By induction on the lists involved.
  1151 	By induction on the lists involved.
  1148 \end{proof}
  1152 \end{proof}
  1298 \end{figure}
  1302 \end{figure}
  1299 \noindent
  1303 \noindent
  1300 We define
  1304 We define
  1301 two separate list rewriting relations $\frewrite$ and $\grewrite$.
  1305 two separate list rewriting relations $\frewrite$ and $\grewrite$.
  1302 The rewriting steps that take place during
  1306 The rewriting steps that take place during
  1303 flattening is characterised by $\frewrite$.
  1307 flattening are characterised by $\frewrite$.
  1304 The rewrite relation $\grewrite$ characterises both flattening and de-duplicating.
  1308 The rewrite relation $\grewrite$ characterises both flattening and de-duplicating.
  1305 Sometimes $\grewrites$ is slightly too powerful
  1309 Sometimes $\grewrites$ is slightly too powerful
  1306 so we would rather use $\frewrites$ to prove
  1310 so we would rather use $\frewrites$ to prove
  1307 %because we only
  1311 %because we only
  1308 equalities related to $\rflts$.
  1312 equalities related to $\rflts$.
  1329 \end{lemma}
  1333 \end{lemma}
  1330 \noindent
  1334 \noindent
  1331 These two lemmas can both be proven using a straightforward induction (and
  1335 These two lemmas can both be proven using a straightforward induction (and
  1332 the proofs for them are therefore omitted).
  1336 the proofs for them are therefore omitted).
  1333 
  1337 
  1334 Now the above equalities can be derived like a breeze:
  1338 Now the above equalities can be derived with ease: 
  1335 \begin{corollary}
  1339 \begin{corollary}
  1336 	$\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal
  1340 	$\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal
  1337 	\sum (\rDistinct \;\;  (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing)
  1341 	\sum (\rDistinct \;\;  (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing)
  1338 	$
  1342 	$
  1339 \end{corollary}
  1343 \end{corollary}
  1365 
  1369 
  1366 
  1370 
  1367 \subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$}
  1371 \subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$}
  1368 In this part, we present lemmas stating
  1372 In this part, we present lemmas stating
  1369 pairs of r-regular expressions and r-regular expression lists
  1373 pairs of r-regular expressions and r-regular expression lists
  1370 where one could rewrite from one in many steps to the other.
  1374 where one can rewrite from one in many steps to the other.
  1371 Most of the proofs to these lemmas are straightforward, using
  1375 Most of the proofs to these lemmas are straightforward, using
  1372 an induction on the corresponding rewriting relations.
  1376 an induction on the corresponding rewriting relations.
  1373 These proofs will therefore be omitted when this is the case.
  1377 These proofs will therefore be omitted when this is the case.
  1374 We present in the following lemma a few pairs of terms that are rewritable via 
  1378 We present in the following lemma a few pairs of terms that are rewritable via 
  1375 $\grewrites$:
  1379 $\grewrites$:
  1422 	\end{itemize}
  1426 	\end{itemize}
  1423 \end{lemma}
  1427 \end{lemma}
  1424 \noindent
  1428 \noindent
  1425 Now comes the core of the proof, 
  1429 Now comes the core of the proof, 
  1426 which says that once two lists are rewritable to each other,
  1430 which says that once two lists are rewritable to each other,
  1427 then they are equivalent under $\rsimp{}$:
  1431 then they are equivalent under $\textit{rsimp}$:
  1428 \begin{lemma}
  1432 \begin{lemma}
  1429 	If $r_1 \hrewrites r_2$ then $r_1 \sequal r_2$.
  1433 	If $r_1 \hrewrites r_2$ then $r_1 \sequal r_2$.
  1430 \end{lemma}
  1434 \end{lemma}
  1431 
  1435 
  1432 \noindent
  1436 \noindent
  1433 Similar to what we did in \ref{Bitcoded2}, 
  1437 Similar to what we did in chapter \ref{Bitcoded2}, 
  1434 we prove that if one can rewrite from one r-regular expression ($r$)
  1438 we prove that if one can rewrite from one r-regular expression ($r$)
  1435 to the other ($r'$), after taking derivatives one could still rewrite
  1439 to the other ($r'$), after taking derivatives one can still rewrite
  1436 the first ($r\backslash c$) to the other ($r'\backslash c$).
  1440 the first ($r\backslash c$) to the other ($r'\backslash c$).
  1437 \begin{lemma}\label{interleave}
  1441 \begin{lemma}\label{interleave}
  1438 	If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$
  1442 	If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$
  1439 \end{lemma}
  1443 \end{lemma}
  1440 \noindent
  1444 \noindent
  1554 \end{center}
  1558 \end{center}
  1555 \noindent
  1559 \noindent
  1556 The $\delta$ function 
  1560 The $\delta$ function 
  1557 returns $r$ when the boolean condition
  1561 returns $r$ when the boolean condition
  1558 $b$ evaluates to true and
  1562 $b$ evaluates to true and
  1559 $\ZERO$ otherwise:
  1563 $\ZERO_r$ otherwise:
  1560 \begin{center}
  1564 \begin{center}
  1561 	\begin{tabular}{lcl}
  1565 	\begin{tabular}{lcl}
  1562 		$\delta \; b\; r$ & $\dn$ & $r \quad \textit{if} \; b \; is \;\textit{true}$\\
  1566 		$\delta \; b\; r$ & $\dn$ & $r \quad \textit{if} \; b \; is \;\textit{true}$\\
  1563 				  & $\dn$ & $\ZERO \quad otherwise$
  1567 				  & $\dn$ & $\ZERO_r \quad otherwise$
  1564 	\end{tabular}
  1568 	\end{tabular}
  1565 \end{center}
  1569 \end{center}
  1566 \noindent
  1570 \noindent
  1567 Note that the term
  1571 Note that the term
  1568 \begin{center}
  1572 \begin{center}
  1585 \[
  1589 \[
  1586 	r_1 \backslash_r c_1c_2
  1590 	r_1 \backslash_r c_1c_2
  1587 \]
  1591 \]
  1588 instead of
  1592 instead of
  1589 \[
  1593 \[
  1590 	(r_1 \backslash_r c_1c_2 + \ZERO ) + \ZERO.
  1594 	(r_1 \backslash_r c_1c_2 + \ZERO_r ) + \ZERO_r.
  1591 \]
  1595 \]
  1592 The redundant $\ZERO$s will not be created in the
  1596 The redundant $\ZERO_r$s will not be created in the
  1593 first place.
  1597 first place.
  1594 In a closed-form one needs to take into account this (because
  1598 In a closed-form one needs to take into account this (because
  1595 closed forms require exact equality rather than language equivalence)
  1599 closed forms require exact equality rather than language equivalence)
  1596 and only generate the 
  1600 and only generate the 
  1597 $r_2 \backslash_r s''$ terms satisfying the property
  1601 $r_2 \backslash_r s''$ terms satisfying the property
  1627 is occurring in $(r_1\cdot r_2)\backslash s$.
  1631 is occurring in $(r_1\cdot r_2)\backslash s$.
  1628 
  1632 
  1629 With $\textit{Suffix}$ we are ready to express the
  1633 With $\textit{Suffix}$ we are ready to express the
  1630 sequence regular expression's closed form,
  1634 sequence regular expression's closed form,
  1631 but before doing so 
  1635 but before doing so 
  1632 more devices are needed.
  1636 more definitions are needed.
  1633 The first thing is the flattening function $\sflat{\_}$,
  1637 The first thing is the flattening function $\sflat{\_}$,
  1634 which takes an alternative regular expression and produces a flattened version
  1638 which takes an alternative regular expression and produces a flattened version
  1635 of that alternative regular expression.
  1639 of that alternative regular expression.
  1636 It is needed to convert
  1640 It is needed to convert
  1637 a left-associative nested sequence of alternatives into 
  1641 a left-associative nested sequence of alternatives into 
  1662 
  1666 
  1663 \begin{center}  
  1667 \begin{center}  
  1664 	\begin{tabular}{lcl}
  1668 	\begin{tabular}{lcl}
  1665 		$\sflataux{[]}'$ & $ \dn $ & $ []$\\
  1669 		$\sflataux{[]}'$ & $ \dn $ & $ []$\\
  1666 		$\sflataux{ (r_1 + r_2) :: rs }'$ & $\dn$ & $r_1 :: r_2 :: rs$\\
  1670 		$\sflataux{ (r_1 + r_2) :: rs }'$ & $\dn$ & $r_1 :: r_2 :: rs$\\
  1667 		$\sflataux{r :: rs}$ & $\dn$ & $ r::rs$
  1671 		$\sflataux{r :: rs}'$ & $\dn$ & $ r::rs$
  1668 	\end{tabular}
  1672 	\end{tabular}
  1669 \end{center}
  1673 \end{center}
  1670 \noindent
  1674 \noindent
  1671 $\sflataux{\_}$ breaks up nested alternative regular expressions 
  1675 $\sflataux{\_}$ breaks up nested alternative regular expressions 
  1672 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
  1676 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
  1701 \begin{proof}
  1705 \begin{proof}
  1702 	By a reverse induction on the string $s$, where the inductive cases are $[]$
  1706 	By a reverse induction on the string $s$, where the inductive cases are $[]$
  1703 	and $xs  @ [x]$.
  1707 	and $xs  @ [x]$.
  1704 \end{proof}
  1708 \end{proof}
  1705 \noindent
  1709 \noindent
  1706 If we have a regular expression $r$ whose shpae 
  1710 If we have a regular expression $r$ whose shape 
  1707 fits into those described by $\textit{createdBySequence}$,
  1711 fits into those described by $\textit{createdBySequence}$,
  1708 then we can convert between 
  1712 then we can convert between
  1709 $r \backslash_r c$ and $(\sflataux{r}) \backslash_r c$ with
  1713 $r \backslash_r c$ and $(\sflataux{r}) \backslash_r c$ with
  1710 $\sflataux{\_}'$:
  1714 $\sflataux{\_}'$:
  1711 \begin{lemma}\label{sfauIdemDer}
  1715 \begin{lemma}\label{sfauIdemDer}
  1712 	If $\textit{createdBySequence} \; r$, then 
  1716 	If $\textit{createdBySequence} \; r$, then 
  1713 	\begin{center}
  1717 	\begin{center}
  1723 
  1727 
  1724 Now we are ready to express
  1728 Now we are ready to express
  1725 the shape of $r_1 \cdot r_2 \backslash s$
  1729 the shape of $r_1 \cdot r_2 \backslash s$
  1726 \begin{lemma}\label{seqSfau0}
  1730 \begin{lemma}\label{seqSfau0}
  1727 	$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 
  1731 	$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 
  1728 	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$ 
  1732 	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r_1))$ 
  1729 \end{lemma}
  1733 \end{lemma}
  1730 \begin{proof}
  1734 \begin{proof}
  1731 	By a reverse induction on the string $s$, where the inductive cases 
  1735 	By a reverse induction on the string $s$, where the inductive cases 
  1732 	are $[]$ and $xs @ [x]$.
  1736 	are $[]$ and $xs @ [x]$.
  1733 	For the inductive case, we know that $\textit{createdBySequence} \; ((r_1 \cdot r_2)
  1737 	For the inductive case, we know that $\textit{createdBySequence} \; ((r_1 \cdot r_2)
  1753 %allowing people investigating derivatives to get an alternative
  1757 %allowing people investigating derivatives to get an alternative
  1754 %view of what $r_1 \cdot r_2$ is.
  1758 %view of what $r_1 \cdot r_2$ is.
  1755 
  1759 
  1756 We now use $\textit{createdBySequence}$ and
  1760 We now use $\textit{createdBySequence}$ and
  1757 $\sflataux{\_}$ to describe an intuition
  1761 $\sflataux{\_}$ to describe an intuition
  1758 behind the closed form of the sequence closed form.
  1762 behind the sequence closed form.
  1759 If two regular expressions only differ in the way their
  1763 If two regular expressions only differ in the way their
  1760 alternatives are nested, then we should be able to get the same result
  1764 alternatives are nested, then we should be able to get the same result
  1761 once we apply simplification to both of them:
  1765 once we apply simplification to both of them:
  1762 \begin{lemma}\label{sflatRsimpeq}
  1766 \begin{lemma}\label{sflatRsimpeq}
  1763 	If $r$ is created from a sequence through
  1767 	If $r$ is created from a sequence through
  1784 	\map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$
  1788 	\map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$
  1785 \end{lemma}
  1789 \end{lemma}
  1786 \begin{proof}
  1790 \begin{proof}
  1787 	We know that 
  1791 	We know that 
  1788 	$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 
  1792 	$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 
  1789 	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$
  1793 	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r_1))$
  1790 	holds
  1794 	holds
  1791 	by lemma \ref{seqSfau0}.
  1795 	by lemma \ref{seqSfau0}.
  1792 	This allows the theorem to go through because of lemma \ref{sflatRsimpeq}.
  1796 	This allows the theorem to go through because of lemma \ref{sflatRsimpeq}.
  1793 \end{proof}
  1797 \end{proof}
  1794 Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}),
  1798 Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}),
  1843 	r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
  1847 	r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
  1844 \end{center}
  1848 \end{center}
  1845 which can be de-duplicated by $\rDistinct$
  1849 which can be de-duplicated by $\rDistinct$
  1846 and therefore bounded finitely.
  1850 and therefore bounded finitely.
  1847 
  1851 
  1848 and then de-duplicate terms of the form  ($s'$ being a substring of $s$).
  1852 %and then de-duplicate terms of the form  ($s'$ being a substring of $s$).
  1849 This allows us to use a similar technique as $r_1 \cdot r_2$ case,
  1853 %This allows us to use a similar technique as $r_1 \cdot r_2$ case,
  1850 
  1854 
  1851 Now the crux of this section is finding a suitable description
  1855 Now the crux of this section is finding a suitable description
  1852 for $rs$ where
  1856 for $rs$ where
  1853 \begin{center}
  1857 \begin{center}
  1854 	$\rderssimp{r^*}{s} = \rsimp{\sum rs}$.
  1858 	$\rderssimp{r^*}{s} = \rsimp{\sum rs}$.
  2066 	By an induction on $s$, the inductive cases
  2070 	By an induction on $s$, the inductive cases
  2067 	being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used.
  2071 	being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used.
  2068 \end{proof}
  2072 \end{proof}
  2069 \noindent
  2073 \noindent
  2070 
  2074 
  2071 $\hflataux{\_}$ has a similar effect as $\textit{flatten}$:
  2075 The function $\hflataux{\_}$ has a similar effect as $\textit{flatten}$:
  2072 \begin{lemma}\label{hflatauxGrewrites}
  2076 \begin{lemma}\label{hflatauxGrewrites}
  2073 	$a :: rs \grewrites \hflataux{a} @ rs$
  2077 	$a :: rs \grewrites \hflataux{a} @ rs$
  2074 \end{lemma}
  2078 \end{lemma}
  2075 \begin{proof}
  2079 \begin{proof}
  2076 	By induction on $a$. $rs$ is set to take arbitrary values.
  2080 	By induction on $a$. $rs$ is set to take arbitrary values.
  2084 \end{lemma}
  2088 \end{lemma}
  2085 
  2089 
  2086 \begin{proof}
  2090 \begin{proof}
  2087 	By using the rewriting relation $\rightsquigarrow$
  2091 	By using the rewriting relation $\rightsquigarrow$
  2088 \end{proof}
  2092 \end{proof}
  2089 And from this we obtain a proof that a star's derivative will be the same
  2093 And from this we obtain that a 
  2090 as if it had all its nested alternatives created during deriving being flattened out:
  2094 regular expression created by star 
       
  2095 is the same as its flattened version, up to equivalence under $\textit{bsimp}$.
  2091 For example,
  2096 For example,
  2092 \begin{lemma}\label{hfauRsimpeq2}
  2097 \begin{lemma}\label{hfauRsimpeq2}
  2093 	$\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
  2098 	$\textit{createdByStar} \; r \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
  2094 \end{lemma}
  2099 \end{lemma}
  2095 \begin{proof}
  2100 \begin{proof}
  2096 	By structural induction on $r$, where the induction rules 
  2101 	By structural induction on $r$, where the induction rules 
  2097 	are these of $\createdByStar{\_}$.
  2102 	are these of $\createdByStar{\_}$.
  2098 	Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case.
  2103 	Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case.
  2253 	$\textit{finite} \; (\textit{sizeNregex} \; N)$ holds.
  2258 	$\textit{finite} \; (\textit{sizeNregex} \; N)$ holds.
  2254 \end{lemma}
  2259 \end{lemma}
  2255 \begin{proof}
  2260 \begin{proof}
  2256 	By splitting the set $\textit{sizeNregex} \; (N + 1)$ into
  2261 	By splitting the set $\textit{sizeNregex} \; (N + 1)$ into
  2257 	subsets by their categories:
  2262 	subsets by their categories:
  2258 	$\{\ZERO, \ONE, c\}$, $\{* `` \textit{sizeNregex} \; N\}$,
  2263 	$\{\ZERO_r, \ONE_r, c\}$, $\{r^* \mid r \in \textit{sizeNregex} \; N\}$,
  2259 	and so on. Each of these subsets are finitely bounded.
  2264 	and so on. Each of these subsets are finitely bounded.
  2260 \end{proof}
  2265 \end{proof}
  2261 \noindent
  2266 \noindent
  2262 From this we get a corollary that
  2267 From this we get a corollary that
  2263 if forall $r \in rs$, $\rsize{r} \leq N$, then the output of 
  2268 if forall $r \in rs$, $\rsize{r} \leq N$, then the output of 
  2281 %output of $\textit{rdistinct} \; rs' \; \varnothing$
  2286 %output of $\textit{rdistinct} \; rs' \; \varnothing$
  2282 %is bounded by a constant $N * c_N$ depending only on $N$,
  2287 %is bounded by a constant $N * c_N$ depending only on $N$,
  2283 %provided that each of $rs'$'s element
  2288 %provided that each of $rs'$'s element
  2284 %is bounded by $N$.
  2289 %is bounded by $N$.
  2285 
  2290 
  2286 \subsection{$\textit{rsimp}$ Does Not Increment the Size}
  2291 \subsection{$\textit{rsimp}$ Does Not Increase the Size}
  2287 Although it seems evident, we need a series
  2292 Although it seems evident, we need a series
  2288 of non-trivial lemmas to establish that functions such as $\rflts$
  2293 of non-trivial lemmas to establish that functions such as $\rflts$
  2289 do not cause the regular expressions to grow.
  2294 do not cause the regular expressions to grow.
  2290 \begin{lemma}\label{rsimpMonoLemmas}
  2295 \begin{lemma}\label{rsimpMonoLemmas}
  2291 	\mbox{}
  2296 	\mbox{}
  2338 	By \ref{rsimpMonoLemmas}.
  2343 	By \ref{rsimpMonoLemmas}.
  2339 \end{proof}
  2344 \end{proof}
  2340 
  2345 
  2341 \subsection{Estimating the Closed Forms' sizes}
  2346 \subsection{Estimating the Closed Forms' sizes}
  2342 We recap the closed forms we obtained
  2347 We recap the closed forms we obtained
  2343 earlier by putting them together down below:
  2348 earlier:
  2344 \begin{itemize}
  2349 \begin{itemize}
  2345 	\item
  2350 	\item
  2346 		$\rderssimp{(\sum rs)}{s} \sequal
  2351 		$\rderssimp{(\sum rs)}{s} \sequal
  2347 		\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
  2352 		\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
  2348 	\item
  2353 	\item
  2372 which will each have a size uppder bound 
  2377 which will each have a size uppder bound 
  2373 according to inductive hypothesis, which controls $r \backslash s$.
  2378 according to inductive hypothesis, which controls $r \backslash s$.
  2374 
  2379 
  2375 We elaborate the above reasoning by a series of lemmas
  2380 We elaborate the above reasoning by a series of lemmas
  2376 below, where straightforward proofs are omitted.
  2381 below, where straightforward proofs are omitted.
  2377 
  2382 %We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.
  2378 
  2383 We show that $\textit{rdistinct}$ and $\rflts$
  2379 
       
  2380 
       
  2381 We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.
       
  2382 
       
  2383 We show that $\rdistinct$ and $\rflts$
       
  2384 working together is at least as 
  2384 working together is at least as 
  2385 good as $\rdistinct{}{}$ alone, which can be written as
  2385 good as $\textit{rdistinct}$ alone, which can be written as
  2386 \begin{center}
  2386 \begin{center}
  2387 	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
  2387 	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
  2388 	\leq 
  2388 	\leq 
  2389 	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
  2389 	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
  2390 \end{center}
  2390 \end{center}
  2441 	\begin{center}
  2441 	\begin{center}
  2442 	\begin{tabular}{lcl}
  2442 	\begin{tabular}{lcl}
  2443 	$\llbracket  (\textit{rdistinct} \; (\textit{rflts} \; as) \;
  2443 	$\llbracket  (\textit{rdistinct} \; (\textit{rflts} \; as) \;
  2444 	(noalts\_set \cup corr\_set)) \rrbracket_r$ & $\leq$ &\\
  2444 	(noalts\_set \cup corr\_set)) \rrbracket_r$ & $\leq$ &\\
  2445 						    $\llbracket  (\textit{rdistinct} \; as \; (noalts\_set \cup alts\_set \cup
  2445 						    $\llbracket  (\textit{rdistinct} \; as \; (noalts\_set \cup alts\_set \cup
  2446 	\{ \ZERO \} )) \rrbracket_r$ & & \\ 
  2446 	\{ \ZERO_r \} )) \rrbracket_r$ & & \\ 
  2447 	\end{tabular}
  2447 	\end{tabular}
  2448 	\end{center}
  2448 	\end{center}
  2449 		holds.
  2449 		holds.
  2450 \end{lemma}
  2450 \end{lemma}
  2451 \noindent
  2451 \noindent
  2553 \end{center}
  2553 \end{center}
  2554 \noindent
  2554 \noindent
  2555 (5) is by theorem \ref{starClosedForm}.
  2555 (5) is by theorem \ref{starClosedForm}.
  2556 (6) is by \ref{altsSimpControl}.
  2556 (6) is by \ref{altsSimpControl}.
  2557 (7) is by corollary \ref{finiteSizeNCorollary}.
  2557 (7) is by corollary \ref{finiteSizeNCorollary}.
  2558 Combining with the case when $s = []$, one gets
  2558 Combining with the case when $s = []$, one obtains
  2559 \begin{center}
  2559 \begin{center}
  2560 	\begin{tabular}{lcll}
  2560 	\begin{tabular}{lcll}
  2561 		$\rsize{r^* \backslash_r s}$ & $\leq$ & $max \; n_r \; 1 + (\textit{card} (\sizeNregex \; (N + n_r)))
  2561 		$\rsize{r^* \backslash_r s}$ & $\leq$ & $max \; n_r \; 1 + (\textit{card} (\sizeNregex \; (N + n_r)))
  2562 		* (1 + (N + n_r)) $ & (8)\\
  2562 		* (1 + (N + n_r)) $ & (8)\\
  2563 	\end{tabular}
  2563 	\end{tabular}
  2577 	\end{tabular}
  2577 	\end{tabular}
  2578 \end{center}
  2578 \end{center}
  2579 \noindent
  2579 \noindent
  2580 (9) is by theorem \ref{altsClosedForm}, (10) by lemma \ref{rsimpMono} and (11) by inductive hypothesis.
  2580 (9) is by theorem \ref{altsClosedForm}, (10) by lemma \ref{rsimpMono} and (11) by inductive hypothesis.
  2581 
  2581 
  2582 Combining with the case when $s = []$, one gets
  2582 Combining with the case when $s = []$, we obtain 
  2583 \begin{center}
  2583 \begin{center}
  2584 	\begin{tabular}{lcll}
  2584 	\begin{tabular}{lcll}
  2585 		$\rsize{\sum rs \backslash_r s}$ & $\leq$ & $max \; \rsize{\sum rs} \; 1+N*(length \; rs)$ 
  2585 		$\rsize{\sum rs \backslash_r s}$ & $\leq$ & $max \; \rsize{\sum rs} \; 1+N*(length \; rs)$ 
  2586 						 & (12)\\
  2586 						 & (12)\\
  2587 	\end{tabular}
  2587 	\end{tabular}
  2826 For the closed form to be bounded, we would like
  2826 For the closed form to be bounded, we would like
  2827 simplification to be applied to each term in the list.
  2827 simplification to be applied to each term in the list.
  2828 Therefore we introduce some variants of $\opterm$,
  2828 Therefore we introduce some variants of $\opterm$,
  2829 which help conveniently express the rewriting steps 
  2829 which help conveniently express the rewriting steps 
  2830 needed in the closed form proof.
  2830 needed in the closed form proof.
       
  2831 We have $\optermOsimp$, $\optermosimp$ and $\optermsimp$
       
  2832 with slightly different spellings because they help the proof to go through:
  2831 \begin{center}
  2833 \begin{center}
  2832 	\begin{tabular}{lcl}
  2834 	\begin{tabular}{lcl}
  2833 	$\optermOsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
  2835 	$\optermOsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
  2834 				 & & $\;\;\Some \; (s, n) \Rightarrow 
  2836 				 & & $\;\;\Some \; (s, n) \Rightarrow 
  2835 				 \textit{rsimp} \; ((r\backslash_{rs} s)\cdot r^{\{n\}})$\\
  2837 				 \textit{rsimp} \; ((r\backslash_{rs} s)\cdot r^{\{n\}})$\\
  3082 	\textit{max} \; (c_N+1)* (N + \llbracket r^{\{n\}} \rrbracket+1)$,
  3084 	\textit{max} \; (c_N+1)* (N + \llbracket r^{\{n\}} \rrbracket+1)$,
  3083 	where $c_N = \textit{card} \; (\textit{sizeNregex} \; (
  3085 	where $c_N = \textit{card} \; (\textit{sizeNregex} \; (
  3084 	N + \llbracket r^{\{n\}} \rrbracket_r+1))$.
  3086 	N + \llbracket r^{\{n\}} \rrbracket_r+1))$.
  3085 \end{theorem}
  3087 \end{theorem}
  3086 \begin{proof}
  3088 \begin{proof}
  3087 We have that for all regular expressions $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; (
  3089 We have that for all regular expressions $r'$ in 
       
  3090 \begin{center}
       
  3091 $\textit{set} \; (\map \; (\optermsimp \; r) \; (
  3088 	\nupdates \; s \; r \; [\Some \; ([c], n)]))$,
  3092 	\nupdates \; s \; r \; [\Some \; ([c], n)]))$,
       
  3093 \end{center}
  3089 	$r'$'s size is less than or equal to $N + \llbracket r^{\{n\}} 
  3094 	$r'$'s size is less than or equal to $N + \llbracket r^{\{n\}} 
  3090 	\rrbracket_r + 1$
  3095 	\rrbracket_r + 1$
  3091 because $r'$ can only be either a $\ZERO$ or $r \backslash_{rsimps} s' \cdot
  3096 because $r'$ can only be either a $\ZERO$ or $r \backslash_{rsimps} s' \cdot
  3092 r^{\{m\}}$ for some string $s'$ and number 
  3097 r^{\{m\}}$ for some string $s'$ and number 
  3093 $m \leq n$ (lemma \ref{ntimesClosedFormListElemShape}).
  3098 $m \leq n$ (lemma \ref{ntimesClosedFormListElemShape}).
  3114 
  3119 
  3115 
  3120 
  3116 \section{Comments and Future Improvements}
  3121 \section{Comments and Future Improvements}
  3117 \subsection{Some Experimental Results}
  3122 \subsection{Some Experimental Results}
  3118 What guarantee does this bound give us?
  3123 What guarantee does this bound give us?
  3119 Whatever the regex is, it will not grow indefinitely.
  3124 It states that whatever the regex is, it will not grow indefinitely.
  3120 Take our previous example $(a + aa)^*$ as an example:
  3125 Take our previous example $(a + aa)^*$ as an example:
  3121 \begin{center}
  3126 \begin{center}
  3122 	\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
  3127 	\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
  3123 		\begin{tikzpicture}
  3128 		\begin{tikzpicture}
  3124 			\begin{axis}[
  3129 			\begin{axis}[
  3235 We will discuss improvements to this bound in the next chapter.
  3240 We will discuss improvements to this bound in the next chapter.
  3236 
  3241 
  3237 
  3242 
  3238 
  3243 
  3239 \subsection{Possible Further Improvements}
  3244 \subsection{Possible Further Improvements}
  3240 There are two problems with this finiteness result, though.\\
  3245 There are two problems with this finiteness result, though:\\
  3241 (i)	
  3246 (i)	
  3242 		First, it is not yet a direct formalisation of our lexer's complexity,
  3247 		First, it is not yet a direct formalisation of our lexer's complexity,
  3243 		as a complexity proof would require looking into 
  3248 		as a complexity proof would require looking into 
  3244 		the time it takes to execute {\bf all} the operations
  3249 		the time it takes to execute {\bf all} the operations
  3245 		involved in the lexer (simp, collect, decode), not just the derivative.\\
  3250 		involved in the lexer (simp, collect, decode), not just the derivative.\\
  3259 	\item
  3264 	\item
  3260 		The bound is already a strong indication that catastrophic
  3265 		The bound is already a strong indication that catastrophic
  3261 		backtracking is much less likely to occur in our $\blexersimp$
  3266 		backtracking is much less likely to occur in our $\blexersimp$
  3262 		algorithm.
  3267 		algorithm.
  3263 		We refine $\blexersimp$ with $\blexerStrong$ in the next chapter
  3268 		We refine $\blexersimp$ with $\blexerStrong$ in the next chapter
  3264 		so that the bound becomes polynomial.
  3269 		so that we conjecture the bound becomes polynomial.
  3265 \end{itemize}
  3270 \end{itemize}
  3266 
  3271 
  3267 %----------------------------------------------------------------------------------------
  3272 %----------------------------------------------------------------------------------------
  3268 %	SECTION 4
  3273 %	SECTION 4
  3269 %----------------------------------------------------------------------------------------
  3274 %----------------------------------------------------------------------------------------
  3324 \begin{center}
  3329 \begin{center}
  3325 $(\sum_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
  3330 $(\sum_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
  3326 [1mm]
  3331 [1mm]
  3327 	$(1 \leq m' \leq m )$
  3332 	$(1 \leq m' \leq m )$
  3328 \end{center}
  3333 \end{center}
  3329 There at at least exponentially
  3334 There are at least exponentially
  3330 many such terms.\footnote{To be exact, these terms are 
  3335 many such terms.\footnote{To be exact, these terms are 
  3331 distinct for $m' \leq L.C.M.(1, \ldots, n)$, the details are omitted,
  3336 distinct for $m' \leq L.C.M.(1, \ldots, n)$, the details are omitted,
  3332 but the point is that the number is exponential.
  3337 but the point is that the number is exponential.
  3333 } 
  3338 } 
  3334 With each new input character taking the derivative against the intermediate result, more and more such distinct
  3339 With each new input character taking the derivative against the intermediate result, more and more such distinct