ChengsongTanPhdThesis/Chapters/Finite.tex
changeset 558 671a83abccf3
parent 557 812e5d112f49
child 559 9d18f3eac484
equal deleted inserted replaced
557:812e5d112f49 558:671a83abccf3
   263 		reduce the size of a regular expression, not adding to it.
   263 		reduce the size of a regular expression, not adding to it.
   264 \end{itemize}
   264 \end{itemize}
   265 
   265 
   266 \section{Step One: Closed Forms}
   266 \section{Step One: Closed Forms}
   267 		We transform the function application $\rderssimp{r}{s}$
   267 		We transform the function application $\rderssimp{r}{s}$
   268 		into an equivalent form $f\; (g \; (\sum rs))$.
   268 		into an equivalent 
       
   269 		form $f\; (g \; (\sum rs))$.
   269 		The functions $f$ and $g$ can be anything from $\flts$, $\distinctBy$ and other helper functions from $\bsimp{\_}$.
   270 		The functions $f$ and $g$ can be anything from $\flts$, $\distinctBy$ and other helper functions from $\bsimp{\_}$.
   270 		This way we get a different but equivalent way of expressing : $r\backslash s = f \; (g\; (\sum rs))$, we call the
   271 		This way we get a different but equivalent way of expressing : $r\backslash s = f \; (g\; (\sum rs))$, we call the
   271 		right hand side the "closed form" of $r\backslash s$.
   272 		right hand side the "closed form" of $r\backslash s$.
       
   273 
       
   274 \begin{quote}\it
       
   275 	Claim: For regular expressions $r_1 \cdot r_2$, we claim that
       
   276 	\begin{center}
       
   277 		$ \rderssimp{r_1 \cdot r_2}{s} = 
       
   278 	\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$
       
   279 	\end{center}
       
   280 \end{quote}
       
   281 \noindent
       
   282 We explain in detail how we reached those claims.
   272 \subsection{Basic Properties needed for Closed Forms}
   283 \subsection{Basic Properties needed for Closed Forms}
   273 
   284 
   274 \subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully}
   285 \subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully}
   275 The $\textit{rdistinct}$ function, as its name suggests, will
   286 The $\textit{rdistinct}$ function, as its name suggests, will
   276 remove duplicates in an \emph{r}$\textit{rexp}$ list, 
   287 remove duplicates in an \emph{r}$\textit{rexp}$ list, 
  1043 \noindent
  1054 \noindent
  1044 The harder closed forms are the sequence and star ones.
  1055 The harder closed forms are the sequence and star ones.
  1045 Before we go on to obtain them, some preliminary definitions
  1056 Before we go on to obtain them, some preliminary definitions
  1046 are needed to make proof statements concise.
  1057 are needed to make proof statements concise.
  1047 
  1058 
  1048 \section{"Closed Forms" of regular expressions' derivatives w.r.t strings}
  1059 \section{"Closed Forms" of Sequence Regular Expressions}
  1049 We note that the different ways in which regular expressions are 
  1060 The problem of obataining a closed-form for sequence regular expression 
  1050 nested do not matter under $\rsimp{}$:
  1061 is constructing $(r_1 \cdot r_2) \backslash_r s$
  1051 \begin{center}
  1062 if we are only allowed to use a combination of $r_1 \backslash s''$ 
  1052 	To be proven:\\
  1063 and  $r_2 \backslash s''$ , where $s''$ is from $s$.
  1053 	$\rsimp{r} \stackrel{?}{=} \rsimp{r'}$ if $r = \sum [r_1, r_2, r_3, \ldots]$ 
  1064 First let's look at a series of derivatives steps on a sequence 
  1054 and $r' =(\ldots ((r_1 + r_2) + r_3) + \ldots)$
  1065 regular expression, assuming that each time the first
  1055 \end{center}
  1066 component of the sequence is always nullable):
  1056 \noindent
  1067 \begin{center}
       
  1068 
       
  1069 $r_1 \cdot r_2 \quad \longrightarrow_{\backslash c}  \quad   r_1  \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
       
  1070 $((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c''   \longrightarrow_{\backslash c''} \quad
       
  1071  \ldots$
       
  1072 
       
  1073 \end{center}
       
  1074 Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as 
       
  1075 a giant alternative taking a list of terms 
       
  1076 $[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$,
       
  1077 where the head of the list is always the term
       
  1078 representing a match involving only $r_1$, and the tail of the list consisting of
       
  1079 terms of the shape $r_2 \backslash_r s''$, $s''$ being a suffix of $s$.
  1057 This intuition is also echoed by IndianPaper, where they gave
  1080 This intuition is also echoed by IndianPaper, where they gave
  1058 a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$:
  1081 a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$:
  1059 \begin{center}
  1082 \begin{center}
  1060 	$((r_1 \cdot r_2) \backslash_r c_1) \stackrel{\backslash_r c_2}{=}
  1083 	\begin{tabular}{c}
  1061 	((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\rnullable \; r_1),\; r_2 \backslash_r c_1)) 
  1084 	$(r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) \myequiv$\\ 
  1062 	\stackrel{\backslash_r c_2}{=}
  1085 	\rule{0pt}{3ex} $((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\rnullable \; r_1) \; r_2 \backslash_r c_1)) \backslash_r (c_2 :: \ldots c_n) 
  1063 	((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2)
  1086 	\myequiv$\\
       
  1087 	\rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable \; r_1) \; r_2 \backslash_r c_1c_2))
       
  1088 	+ (\delta \ (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)) \backslash_r (c_3 \ldots c_n)
       
  1089 	$
       
  1090 	\end{tabular}
       
  1091 \end{center}
       
  1092 \noindent
       
  1093 The equality in above should be interpretated
       
  1094 as language equivalence. 
       
  1095 The $\delta$ function works similarly to that of
       
  1096 a Kronecker delta function:
       
  1097 \[ \delta \; b\; r\]
       
  1098 will produce $r$
       
  1099 if $b$ evaluates to true, 
       
  1100 and $\RZERO$ otherwise.
       
  1101 Note that their formulation  
       
  1102 \[
       
  1103 	((r_1 \backslash_r \, c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2)
  1064 	+ (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)
  1104 	+ (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)
  1065 	$
  1105 \]
  1066 \end{center}
  1106 does not faithfully
  1067 \noindent
  1107 represent what the intermediate derivatives would actually look like
  1068 The $\delta \; b \; r$ function above turns the entire term into $\RZERO$
  1108 when one or more intermediate results $r_1 \backslash s' \cdot r_2$ are not 
  1069 if the boolean condition $b$ evaluates to false, and outputs $r$ otherwise.
  1109 nullable in the head of the sequence.
  1070 The equality in their derivation steps should be interpretated
  1110 For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable,
  1071 as language equivalence. To pin down the intuition into rigorous terms,
  1111 the regular expression would not look like 
  1072 $\sflat{}$ is used to enable the transformation from 
  1112 \[
       
  1113 	(r_1 \backslash_r c_1c_2 + \RZERO ) + \RZERO,
       
  1114 \]
       
  1115 but actually $r_1 \backslash_r c_1c_2$, the redundant $\RZERO$s will not be created in the
       
  1116 first place.
       
  1117 In a closed-form one would want to take into account this 
       
  1118 and generate the list of
       
  1119 regular expressions $r_2 \backslash_r s''$ with
       
  1120 string pairs $(s', s'')$ where $s'@s'' = s$ and
       
  1121 $r_1 \backslash s'$ nullable.
       
  1122 We denote the list consisting of such 
       
  1123 strings $s''$ as $\vsuf{s}{r_1}$.
       
  1124 
       
  1125 The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
       
  1126 \begin{center}
       
  1127 \begin{tabular}{lcl}
       
  1128 $\vsuf{[]}{\_} $ & $=$ &  $[]$\\
       
  1129 $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
       
  1130                                      && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) })  $
       
  1131 \end{tabular}
       
  1132 \end{center}
       
  1133 \noindent
       
  1134 The list is sorted in the order $r_2\backslash s''$ 
       
  1135 appears in $(r_1\cdot r_2)\backslash s$.
       
  1136 In essence, $\vsuf{\_}{\_}$ is doing a 
       
  1137 "virtual derivative" of $r_1 \cdot r_2$, but instead of producing 
       
  1138 the entire result $(r_1 \cdot r_2) \backslash s$, 
       
  1139 it only stores all the strings $s''$ such that $r_2 \backslash s''$
       
  1140 are occurring terms in $(r_1\cdot r_2)\backslash s$.
       
  1141 
       
  1142 To make the closed form representation 
       
  1143 more straightforward,
       
  1144 the flattetning function $\sflat{\_}$ is used to enable the transformation from 
  1073 a left-associative nested sequence of alternatives into 
  1145 a left-associative nested sequence of alternatives into 
  1074 a flattened list:
  1146 a flattened list:
  1075 $r = \sum [r_1, r_2, r_3, \ldots] \stackrel{\sflat{}}{\rightarrow} 
  1147 \[
  1076 (\ldots ((r_1 + r_2) + r_3) + \ldots)$
  1148   \sum [r_1, r_2, r_3, \ldots] \stackrel{\sflat{\_}}{\rightarrow} 
  1077 The definitions $\sflat{}$, $\sflataux{}$ are given below.
  1149 (\ldots ((r_1 + r_2) + r_3) + \ldots)
  1078 
  1150 \]
       
  1151 \noindent
       
  1152 The definitions $\sflat{\_}$, $\sflataux{\_}$ are given below.
  1079  \begin{center}  
  1153  \begin{center}  
  1080  \begin{tabular}{ccc}
  1154  \begin{tabular}{ccc}
  1081  $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
  1155  $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
  1082 $\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
  1156 $\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
  1083 $\sflataux r$ & $=$ & $ [r]$
  1157 $\sflataux r$ & $=$ & $ [r]$
  1089 	 $\sflat{(\sum r :: rs)}$ & $=$ & $\sum (\sflataux{r} @ rs)$\\
  1163 	 $\sflat{(\sum r :: rs)}$ & $=$ & $\sum (\sflataux{r} @ rs)$\\
  1090 	 $\sflat{\sum []}$ & $ = $ & $ \sum []$\\
  1164 	 $\sflat{\sum []}$ & $ = $ & $ \sum []$\\
  1091 $\sflat r$ & $=$ & $ r$
  1165 $\sflat r$ & $=$ & $ r$
  1092 \end{tabular}
  1166 \end{tabular}
  1093 \end{center}
  1167 \end{center}
  1094 The intuition behind $\sflataux{\_}$ is to break up nested regexes 
  1168 \noindent
       
  1169 $\sflataux{\_}$ breaks up nested alternative regexes 
  1095 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
  1170 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
  1096 into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
  1171 into a "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
  1097 It will return the singleton list $[r]$ otherwise.
  1172 It will return the singleton list $[r]$ otherwise.
  1098 $\sflat{\_}$ works the same  as $\sflataux{\_}$, except that it keeps
  1173 $\sflat{\_}$ works the same  as $\sflataux{\_}$, except that it keeps
  1099 the output type a regular expression, not a list.
  1174 the output type a regular expression, not a list.
  1100 $\sflataux{\_}$  and $\sflat{\_}$ is only recursive in terms of the
  1175 $\sflataux{\_}$  and $\sflat{\_}$ are only recursive on the  
  1101  first element of the list of children of
  1176 first element of the list.
  1102 an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for  dealing with the regular expression
  1177 
  1103 $r_1 \cdot r_2 \backslash s$.
  1178 With $\sflataux{}$ a preliminary to the closed form can be stated,
  1104 \subsection{The $\textit{vsuf}$ Function}
  1179 where the derivative of $r_1 \cdot r_2 \backslash s$ can be
  1105 Note that $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2)
  1180 flattened into a list whose head and tail meet the description
  1106 	+ (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)$ does not faithfully
  1181 we gave earlier.
  1107 represent what would the intermediate derivatives would actually look like.
  1182 \begin{lemma}\label{seqSfau0}
  1108 For example, when $r_1$ and $r_1 \backslash_r c_1$ are nullable,
  1183 	$\sflataux{\rders{(r_1 \cdot r_2) \backslash s }} = (r_1 \backslash_r s) \cdot r_2 
  1109 the regular expression would not look like $(r_1 \backslash_r c_1c_2 + \RZERO ) + \RZERO$,
  1184 	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$ 
  1110 but actually $r_1\backslash_r c_1c_2$, the redundant $\RZERO$s will not be created in the
  1185 \end{lemma}
  1111 first place.
  1186 \begin{proof}
  1112 In a closed-form one would want to take into account this 
  1187 	By an induction on the string $s$, where the inductive cases 
  1113 and generate the list of string pairs $(s', s'')$ where $s'@s'' = s$ such that
  1188 	are split as $[]$ and $xs @ [x]$.
  1114 only $r_1 \backslash s'$ nullable.
  1189 	Note the key identify holds:
  1115 With $\sflat{\_}$ and $\sflataux{\_}$ we make 
  1190 	\[
  1116 precise what  "closed forms" we have for the sequence derivatives and their simplifications,
  1191 		\map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\;
  1117 in other words, how can we construct $(r_1 \cdot r_2) \backslash s$
  1192 		\map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash \_) \; (\vsuf{xs}{r_1}))
  1118 and $\bderssimp{(r_1\cdot r_2)}{s}$,
  1193 	\]
  1119 if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)
  1194 		=
  1120 and  $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$  ranges over 
  1195 	\[
  1121 the substring of $s$?
  1196 		\map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1})
  1122 First let's look at a series of derivatives steps on a sequence 
  1197 	\]
  1123 regular expression,  (assuming) that each time the first
  1198 	This enables the inductive case to go through.
  1124 component of the sequence is always nullable):
  1199 \end{proof}
  1125 \begin{center}
  1200 \noindent 
  1126 
  1201 Note that this lemma does $\mathbf{not}$ depend on any
  1127 $r_1 \cdot r_2 \quad \longrightarrow_{\backslash c}  \quad   r_1  \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
  1202 specific definitions we used,
  1128 $((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c''   \longrightarrow_{\backslash c''} \quad
  1203 allowing people investigating derivatives to get an alternative
  1129  \ldots$
  1204 view of what $r_1 \cdot r_2$ is.
  1130 
  1205 
  1131 \end{center}
  1206 Now we are able to use this for the intuition that 
  1132 %TODO: cite indian paper
  1207 the different ways in which regular expressions are 
  1133 Indianpaper have  come up with a slightly more formal way of putting the above process:
  1208 nested do not matter under $\rsimp{}$:
  1134 \begin{center}
  1209 \begin{center}
  1135 $r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) +
  1210 	$\rsimp{r} \stackrel{?}{\sequal} \rsimp{r'}$ if $r = \sum [r_1, r_2, r_3, \ldots]$ 
  1136 \delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots
  1211 and $r' =(\ldots ((r_1 + r_2) + r_3) + \ldots)$
  1137 + \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$
  1212 \end{center}
  1138 \end{center}
  1213 Simply wrap with $\sum$ constructor and add 
  1139 where  $\delta(b, r)$ will produce $r$
  1214 simplifications to both sides of \ref{seqSfau0}
  1140 if $b$ evaluates to true, 
  1215 and one gets
  1141 and $\ZERO$ otherwise.
  1216 \begin{corollary}\label{seqClosedFormGeneral}
  1142 
  1217 	$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }
  1143  But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical
  1218 	=\rsimp{(\sum (  (r_1 \backslash s) \cdot r_2 :: 
  1144  equivalence. To make this intuition useful 
  1219 		\map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$
  1145  for a formal proof, we need something
  1220 \end{corollary}
  1146 stronger than language equivalence.
  1221 Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}),
  1147 With the help of $\sflat{\_}$ we can state the equation in Indianpaper
       
  1148 more rigorously:
       
  1149 \begin{lemma}
       
  1150 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
       
  1151 \end{lemma}
       
  1152 
       
  1153 The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
       
  1154 
       
  1155 \begin{center}
       
  1156 \begin{tabular}{lcl}
       
  1157 $\vsuf{[]}{\_} $ & $=$ &  $[]$\\
       
  1158 $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
       
  1159                                      && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) })  $
       
  1160 \end{tabular}
       
  1161 \end{center}                   
       
  1162 It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable,
       
  1163 and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in
       
  1164 the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.
       
  1165 In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing 
       
  1166 the entire result of  $(r_1 \cdot r_2) \backslash s$, 
       
  1167 it only stores all the $s''$ such that $r_2 \backslash s''$  are occurring terms in $(r_1\cdot r_2)\backslash s$.
       
  1168 
       
  1169 With this we can also add simplifications to both sides and get
       
  1170 \begin{lemma}
       
  1171 $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
       
  1172 \end{lemma}
       
  1173 Together with the idempotency property of $\rsimp{}$\ref{rsimpIdem},
       
  1174 %TODO: state the idempotency property of rsimp
       
  1175 it is possible to convert the above lemma to obtain a "closed form"
  1222 it is possible to convert the above lemma to obtain a "closed form"
  1176 for  derivatives nested with simplification:
  1223 for  derivatives nested with simplification:
       
  1224 \begin{lemma}\label{seqClosedForm}
       
  1225 	$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) 
       
  1226 	:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$
       
  1227 \end{lemma}
       
  1228 \begin{proof}
       
  1229 	By a case analysis of string $s$.
       
  1230 	When $s$ is empty list, the rewrite is straightforward.
       
  1231 	When $s$ is a list, one could use the corollary \ref{seqSfau0},
       
  1232 	and lemma \ref{Simpders} to rewrite the left-hand-side.
       
  1233 \end{proof}
       
  1234 As a corollary for this closed form, one can estimate the size 
       
  1235 of the sequence derivative $r_1 \cdot r_2 \backslash_r s$ using 
       
  1236 an easier-to-handle expression:
       
  1237 \begin{corollary}\label{seqEstimate1}
       
  1238 	\begin{center}
       
  1239 
       
  1240 	$\llbracket \rderssimp{(r_1 \cdot r_2)}{s} \rrbracket_r = \llbracket \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) 
       
  1241 	:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))} \rrbracket_r$
       
  1242     	
       
  1243 	\end{center}
       
  1244 \end{corollary}
       
  1245 \noindent
       
  1246 \subsection{Closed Forms for Star Regular Expressions}
       
  1247 We use a similar technique as $r_1 \cdot r_2$ case,
       
  1248 generating 
       
  1249 all possible sub-strings $s'$ of $s$
       
  1250 such that $r\backslash s' \cdot r^*$ will appear 
       
  1251 as a term in $(r^*) \backslash s$.
       
  1252 The first function we define is a single-step
       
  1253 updating function $\starupdate$, which takes three arguments as input:
       
  1254 the new character $c$ to take derivative with, 
       
  1255 the regular expression
       
  1256 $r$ directly under the star $r^*$, and the
       
  1257 list of strings $sSet$ for the derivative $r^* \backslash s$ 
       
  1258 up til this point  
       
  1259 such that $(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$ 
       
  1260 (the equality is not exact, more on this later).
       
  1261 \begin{center}
       
  1262 	\begin{tabular}{lcl}
       
  1263 		$\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
       
  1264 		$\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\
       
  1265 						     & & $\textit{if} \; 
       
  1266 						     (\rnullable \; (\rders \; r \; s))$ \\
       
  1267 						     & & $\textit{then} \;\; (s @ [c]) :: [c] :: (
       
  1268 						     \starupdate \; c \; r \; Ss)$ \\
       
  1269 						     & & $\textit{else} \;\; (s @ [c]) :: (
       
  1270 						     \starupdate \; c \; r \; Ss)$
       
  1271 	\end{tabular}
       
  1272 \end{center}
       
  1273 \noindent
       
  1274 As a generalisation from characters to strings,
       
  1275 $\starupdates$ takes a string instead of a character
       
  1276 as the first input argument, and is otherwise the same
       
  1277 as $\starupdate$.
       
  1278 \begin{center}
       
  1279 	\begin{tabular}{lcl}
       
  1280 		$\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\
       
  1281 		$\starupdates \; (c :: cs) \; r \; Ss$ &  $=$ &  $\starupdates \; cs \; r \; (
       
  1282 		\starupdate \; c \; r \; Ss)$
       
  1283 	\end{tabular}
       
  1284 \end{center}
       
  1285 \noindent
       
  1286 For the star regular expression,
       
  1287 its derivatives can be seen as  a nested gigantic
       
  1288 alternative similar to that of sequence regular expression's derivatives, 
       
  1289 and therefore need
       
  1290 to be ``straightened out" as well.
       
  1291 The function for this would be $\hflat{}$ and $\hflataux{}$.
       
  1292 \begin{center}
       
  1293 	\begin{tabular}{lcl}
       
  1294 		$\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
       
  1295 		$\hflataux{r}$ & $\dn$ & $[r]$
       
  1296 	\end{tabular}
       
  1297 \end{center}
       
  1298 
       
  1299 \begin{center}
       
  1300 	\begin{tabular}{lcl}
       
  1301 		$\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\
       
  1302 		$\hflat{r}$ & $\dn$ & $r$
       
  1303 	\end{tabular}
       
  1304 \end{center}
       
  1305 \noindent
       
  1306 %MAYBE TODO: introduce createdByStar
       
  1307 We first introduce an inductive property
       
  1308 for $\starupdate$ and $\hflataux{\_}$, 
       
  1309 it says if we do derivatives of $r^*$
       
  1310 with a string that starts with $c$,
       
  1311 then flatten it out,
       
  1312 we obtain a list
       
  1313 of the shape $\sum_{s' \in sSet} (r\backslash_r s') \cdot r^*$,
       
  1314 where $sSet = \starupdates \; s \; r \; [[c]]$.
       
  1315 \begin{lemma}\label{starHfauInduct}
       
  1316 	$\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} = 
       
  1317 	\map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; 
       
  1318 	(\starupdates \; s \; r_0 \; [[c]])$
       
  1319 \end{lemma}
       
  1320 \begin{proof}
       
  1321 	By an induction on $s$, the inductive cases
       
  1322 	being $[]$ and $s@[c]$.
       
  1323 \end{proof}
       
  1324 \noindent
       
  1325 Here is a corollary that states the lemma in
       
  1326 a more intuitive way:
       
  1327 \begin{corollary}
       
  1328 	$\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot
       
  1329 	(r^*))\; (\starupdates \; c\; r\; [[c]])$
       
  1330 \end{corollary}
       
  1331 \noindent
       
  1332 Note that this is also agnostic of the simplification
       
  1333 function we defined, and is therefore of more general interest.
       
  1334 
       
  1335 Now adding the $\rsimp{}$ bit for closed forms,
       
  1336 we have
  1177 \begin{lemma}
  1337 \begin{lemma}
  1178 $\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
  1338 	$a :: rs \grewrites \hflataux{a} @ rs$
  1179 \end{lemma}
  1339 \end{lemma}
  1180 And now the reason we have (1) in section 1 is clear:
  1340 \noindent
  1181 $\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$, 
  1341 giving us
  1182 is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$
  1342 \begin{lemma}\label{cbsHfauRsimpeq1}
  1183     as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$.
  1343 	$\rsimp{a+b} = \rsimp{(\sum \hflataux{a} @ \hflataux{b})}$.
  1184 
  1344 \end{lemma}
  1185 %----------------------------------------------------------------------------------------
  1345 \noindent
  1186 %	SECTION 3
  1346 This yields
  1187 %----------------------------------------------------------------------------------------
  1347 \begin{lemma}\label{hfauRsimpeq2}
  1188 
  1348 $\rsimp{r} = \rsimp{(\sum \hflataux{r})}$
  1189 \section{interaction between $\distinctWith$ and $\flts$}
  1349 \end{lemma}
  1190 Note that it is not immediately obvious that 
  1350 \noindent
  1191 \begin{center}
  1351 Together with the rewriting relation
  1192 $\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket   \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket  $.
  1352 \begin{lemma}\label{starClosedForm6Hrewrites}
  1193 \end{center}
  1353 	$\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss
       
  1354 	\scfrewrites
       
  1355 	 \map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$
       
  1356 \end{lemma}
       
  1357 \noindent
       
  1358 We obtain the closed form for star regular expression:
       
  1359 \begin{lemma}\label{starClosedForm}
       
  1360 	$\rderssimp{r^*}{c::s} = 
       
  1361 	\rsimp{
       
  1362 		(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; 
       
  1363 			(\starupdates \; s\; r \; [[c]])
       
  1364 		      )
       
  1365 		)
       
  1366 	      }
       
  1367 	$
       
  1368 \end{lemma}
       
  1369 \begin{proof}
       
  1370 	By an induction on $s$.
       
  1371 	The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, and \ref{hfauRsimpeq2}
       
  1372 	are used.	
       
  1373 \end{proof}
       
  1374 \section{Estimating the Closed Forms' sizes}
       
  1375 We now summarize the closed forms below:
       
  1376 \begin{itemize}
       
  1377 	\item
       
  1378 	$\rderssimp{(\sum rs)}{s} \sequal
       
  1379 	 \sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
       
  1380 	\item
       
  1381 	$\rderssimp{(r_1 \cdot r_2)}{s} \sequal \sum ((r_1 \backslash s) \cdot r_2 ) 
       
  1382 	:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1}))$
       
  1383 	\item
       
  1384 
       
  1385 	$\rderssimp{r^*}{c::s} = 
       
  1386 	\rsimp{
       
  1387 		(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; 
       
  1388 			(\starupdates \; s\; r \; [[c]])
       
  1389 		      )
       
  1390 		)
       
  1391 	      }
       
  1392 	$
       
  1393 \end{itemize}	
       
  1394 \noindent	
       
  1395 The closed forms on the left-hand-side
       
  1396 are all of the same shape: $\rsimp{ (\sum rs)} $.
       
  1397 Such regular expression will be bounded by the size of $\sum rs'$, 
       
  1398 where every element in $rs'$ is distinct, and each element 
       
  1399 can be described by some inductive sub-structures 
       
  1400 (for example when $r = r_1 \cdot r_2$ then $rs'$ 
       
  1401 will be solely comprised of $r_1 \backslash s'$ 
       
  1402 and $r_2 \backslash s''$, $s'$ and $s''$ being 
       
  1403 sub-strings of $s$).
       
  1404 which will each have a size uppder bound 
       
  1405 according to inductive hypothesis, which controls $r \backslash s$.
       
  1406 
       
  1407 We elaborate the above reasoning by a series of lemmas
       
  1408 below, where straightforward proofs are omitted.
       
  1409 \begin{lemma}
       
  1410 	If $\forall r \in rs. \rsize{r} $ is less than or equal to $N$,
       
  1411 	and $\textit{length} \; rs$ is less than or equal to $l$,
       
  1412 	then $\rsize{\sum rs}$ is less than or equal to $l*N + 1$.
       
  1413 \end{lemma}
       
  1414 \noindent
       
  1415 If we define all regular expressions with size no
       
  1416 more than $N$ as $\sizeNregex \; N$:
       
  1417 \[
       
  1418 	\sizeNregex \; N \dn  \{r \mid \rsize{r} \leq N \}
       
  1419 \]
       
  1420 Then such set is finite:
       
  1421 \begin{lemma}\label{finiteSizeN}
       
  1422 	$\textit{isFinite}\; (\sizeNregex \; N)$
       
  1423 \end{lemma}
       
  1424 \begin{proof}
       
  1425 	By overestimating the set $\sizeNregex \; N + 1$
       
  1426 	using union of sets like
       
  1427 	$\{r_1 \cdot r_2 \mid r_1 \in A
       
  1428 		\text{and}
       
  1429 		r_2 \in A\}
       
  1430 	$ where $A = \sizeNregex \; N$.
       
  1431 \end{proof}
       
  1432 \noindent
       
  1433 From this we get a corollary that
       
  1434 if forall $r \in rs$, $\rsize{r} \leq N$, then the output of 
       
  1435 $\rdistinct{rs}{\varnothing}$ is a list of regular
       
  1436 expressions of finite size depending on $N$ only. 
       
  1437 \begin{corollary}
       
  1438 	Assumes that for all $r \in rs. \rsize{r} \leq N$,
       
  1439 	and the cardinality of $\sizeNregex \; N$ is $c_N$
       
  1440 	then$\rsize{\rdistinct{rs}{\varnothing}} \leq c*N$.
       
  1441 \end{corollary}
       
  1442 \noindent
       
  1443 We have proven that the output of $\rdistinct{rs'}{\varnothing}$
       
  1444 is bounded by a constant $c_N$ depending only on $N$,
       
  1445 provided that each of $rs'$'s element
       
  1446 is bounded by $N$.
       
  1447 We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.
       
  1448 
       
  1449 We show how $\rdistinct$ and $\rflts$
       
  1450 in the simplification function together is at least as 
       
  1451 good as $\rdistinct{}{}$ alone.
       
  1452 \begin{lemma}\label{interactionFltsDB}
       
  1453 	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r 
       
  1454 	\leq 
       
  1455 	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
       
  1456 \end{lemma}
       
  1457 \noindent
  1194 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of 
  1458 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of 
  1195 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. 
  1459 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. 
  1196 
  1460 
  1197 
  1461 Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$:
  1198 \subsection{A Closed Form for the Sequence Regular Expression}
  1462 \begin{lemma}\label{altsSimpControl}
  1199 \begin{quote}\it
  1463 	$\rsize{\rsimp{\sum rs}} \leq \rsize{\rdistinct{rs}{\varnothing}}+ 1$
  1200 	Claim: For regular expressions $r_1 \cdot r_2$, we claim that
  1464 \end{lemma}
  1201 	\begin{center}
  1465 \begin{proof}
  1202 		$ \rderssimp{r_1 \cdot r_2}{s} = 
  1466 	By using \ref{interactionFltsDB}.
  1203 	\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$
  1467 \end{proof}
  1204 	\end{center}
  1468 \noindent
  1205 \end{quote}
  1469 which says that the size of regular expression
  1206 \noindent
  1470 is always smaller if we apply the full simplification
  1207 
  1471 rather than just one component ($\rdistinct{}{}$).
  1208 Before we get to the proof that says the intermediate result of our lexer will
  1472 
  1209 remain finitely bounded, which is an important efficiency/liveness guarantee,
  1473 
  1210 we shall first develop a few preparatory properties and definitions to 
  1474 Now we are ready to control the sizes of
  1211 make the process of proving that a breeze.
  1475 $r_1 \cdot r_2 \backslash s$, $r^* \backslash s$.
  1212 
       
  1213 We define rewriting relations for $\rrexp$s, which allows us to do the 
       
  1214 same trick as we did for the correctness proof,
       
  1215 but this time we will have stronger equalities established.
       
  1216 
       
  1217 
       
  1218 \section{Estimating the Closed Forms' sizes}
       
  1219 
       
  1220 		The closed form $f\; (g\; (\sum rs)) $ is controlled
       
  1221 		by the size of $rs'$, where every element in $rs'$ is distinct, and each element can be described by som e inductive sub-structures (for example when $r = r_1 \cdot r_2$ then $rs'$ will be solely comprised of $r_1 \backslash s'$ and $r_2 \backslash s''$, $s'$ and $s''$ being sub-strings of $s$), which will have a numeric uppder bound according to inductive hypothesis, which controls $r \backslash s$.
       
  1222 %-----------------------------------
       
  1223 %	SECTION 2
       
  1224 %-----------------------------------
       
  1225 
       
  1226 \begin{theorem}
  1476 \begin{theorem}
  1227 For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$
  1477 For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$
  1228 \end{theorem}
  1478 \end{theorem}
  1229 
  1479 \noindent
  1230 \noindent
  1480 \begin{proof}
  1231 \begin{proof}
  1481 We prove this by induction on $r$. The base cases for $\RZERO$,
  1232 We prove this by induction on $r$. The base cases for $\AZERO$,
  1482 $\RONE $ and $\RCHAR{c}$ are straightforward. 
  1233 $\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
  1483 In the sequence $r_1 \cdot r_2$ case,
  1234 for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
  1484 the inductive hypotheses state $\exists N_1. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N_1$ and
  1235 hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
  1485 $\exists N_2. \forall s. \; \llbracket \rderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
  1236 $\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
       
  1237 %
  1486 %
  1238 \begin{center}
  1487 \begin{center}
  1239 \begin{tabular}{lcll}
  1488 \begin{tabular}{lcll}
  1240 & & $ \llbracket   \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
  1489 & & $ \llbracket   \rderssimp{r_1\cdot r_2 }{s} \rrbracket_r $\\
  1241 & $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
  1490 & $ = $ & $\llbracket \rsimp{(\sum(r_1 \backslash s \cdot r_2 \; \;  :: \; \; 
  1242     [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
  1491 \map \; (\lambda s'. r_2\backslash_{rsimp} s') (\vsuf{s}{r})))} \rrbracket_r $ & (1) \\
  1243 & $\leq$ &
  1492 & $\leq$ & $\llbracket \rsimp{(\sum(r_1 \backslash s \cdot r_2 ::
  1244     $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
  1493 \map \; (\lambda s'. r_2\backslash_{rsimp} s') (\vsuf{s}{r})))} \rrbracket_r $ & (2) \\
  1245     [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
  1494 									 
  1246 & $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
       
  1247              \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
       
  1248 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
       
  1249       \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
       
  1250 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
       
  1251 \end{tabular}
  1495 \end{tabular}
  1252 \end{center}
  1496 \end{center}
  1253 
  1497 
  1254 
  1498 \end{proof}
  1255 \noindent
  1499 
  1256 where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
  1500 \noindent
       
  1501 where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\rderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
  1257 The reason why we could write the derivatives of a sequence this way is described in section 2.
  1502 The reason why we could write the derivatives of a sequence this way is described in section 2.
  1258 The term (2) is used to control (1). 
  1503 The term (2) is used to control (1). 
  1259 That is because one can obtain an overall
  1504 That is because one can obtain an overall
  1260 smaller regex list
  1505 smaller regex list
  1261 by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
  1506 by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
  1262 Section 3 is dedicated to its proof.
  1507 Section 3 is dedicated to its proof.
  1263 In (3) we know that  $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is 
  1508 In (3) we know that  $\llbracket \ASEQ{bs}{(\rderssimp{ r_1}{s}}{r_2}\rrbracket$ is 
  1264 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
  1509 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
  1265 than $N_2$. The list length after  $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
  1510 than $N_2$. The list length after  $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
  1266 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
  1511 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
  1267 We reason similarly for  $\STAR$.\medskip
  1512 We reason similarly for  $\STAR$.\medskip
  1268 \end{proof}
  1513 
       
  1514 %-----------------------------------
       
  1515 %	SECTION 2
       
  1516 %-----------------------------------
       
  1517 
       
  1518 
       
  1519 %----------------------------------------------------------------------------------------
       
  1520 %	SECTION 3
       
  1521 %----------------------------------------------------------------------------------------
       
  1522 
       
  1523 
       
  1524 \subsection{A Closed Form for the Sequence Regular Expression}
       
  1525 \noindent
       
  1526 
       
  1527 Before we get to the proof that says the intermediate result of our lexer will
       
  1528 remain finitely bounded, which is an important efficiency/liveness guarantee,
       
  1529 we shall first develop a few preparatory properties and definitions to 
       
  1530 make the process of proving that a breeze.
       
  1531 
       
  1532 We define rewriting relations for $\rrexp$s, which allows us to do the 
       
  1533 same trick as we did for the correctness proof,
       
  1534 but this time we will have stronger equalities established.
       
  1535 
       
  1536 
  1269 
  1537 
  1270 What guarantee does this bound give us?
  1538 What guarantee does this bound give us?
  1271 
  1539 
  1272 Whatever the regex is, it will not grow indefinitely.
  1540 Whatever the regex is, it will not grow indefinitely.
  1273 Take our previous example $(a + aa)^*$ as an example:
  1541 Take our previous example $(a + aa)^*$ as an example: