lexing: comparison ChengsongTanPhdThesis/Chapters/Finite.tex

equal deleted inserted replaced

-:812e5d112f49
+:671a83abccf3
 		reduce the size of a regular expression, not adding to it.
 \end{itemize}
 \section{Step One: Closed Forms}
 		We transform the function application $\rderssimp{r}{s}$
-		into an equivalent form $f\; (g \; (\sum rs))$.
+		into an equivalent
+		form $f\; (g \; (\sum rs))$.
 		The functions $f$ and $g$ can be anything from $\flts$, $\distinctBy$ and other helper functions from $\bsimp{\_}$.
 		This way we get a different but equivalent way of expressing : $r\backslash s = f \; (g\; (\sum rs))$, we call the
 		right hand side the "closed form" of $r\backslash s$.
+\begin{quote}\it
+	Claim: For regular expressions $r_1 \cdot r_2$, we claim that
+	\begin{center}
+		$ \rderssimp{r_1 \cdot r_2}{s} =
+	\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$
+	\end{center}
+\end{quote}
+\noindent
+We explain in detail how we reached those claims.
 \subsection{Basic Properties needed for Closed Forms}
 \subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully}
 The $\textit{rdistinct}$ function, as its name suggests, will
 remove duplicates in an \emph{r}$\textit{rexp}$ list,
 \noindent
 The harder closed forms are the sequence and star ones.
 Before we go on to obtain them, some preliminary definitions
 are needed to make proof statements concise.
-\section{"Closed Forms" of regular expressions' derivatives w.r.t strings}
+\section{"Closed Forms" of Sequence Regular Expressions}
-We note that the different ways in which regular expressions are
+The problem of obataining a closed-form for sequence regular expression
-nested do not matter under $\rsimp{}$:
+is constructing $(r_1 \cdot r_2) \backslash_r s$
-\begin{center}
+if we are only allowed to use a combination of $r_1 \backslash s''$
-	To be proven:\\
+and  $r_2 \backslash s''$ , where $s''$ is from $s$.
-	$\rsimp{r} \stackrel{?}{=} \rsimp{r'}$ if $r = \sum [r_1, r_2, r_3, \ldots]$
+First let's look at a series of derivatives steps on a sequence
-and $r' =(\ldots ((r_1 + r_2) + r_3) + \ldots)$
+regular expression, assuming that each time the first
-\end{center}
+component of the sequence is always nullable):
-\noindent
+\begin{center}
+$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c}  \quad   r_1  \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
+$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c''   \longrightarrow_{\backslash c''} \quad
+\ldots$
+\end{center}
+Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as
+a giant alternative taking a list of terms
+$[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$,
+where the head of the list is always the term
+representing a match involving only $r_1$, and the tail of the list consisting of
+terms of the shape $r_2 \backslash_r s''$, $s''$ being a suffix of $s$.
 This intuition is also echoed by IndianPaper, where they gave
 a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$:
 \begin{center}
-	$((r_1 \cdot r_2) \backslash_r c_1) \stackrel{\backslash_r c_2}{=}
+	\begin{tabular}{c}
-	((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\rnullable \; r_1),\; r_2 \backslash_r c_1))
+	$(r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) \myequiv$\\
-	\stackrel{\backslash_r c_2}{=}
+	\rule{0pt}{3ex} $((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\rnullable \; r_1) \; r_2 \backslash_r c_1)) \backslash_r (c_2 :: \ldots c_n)
-	((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2)
+	\myequiv$\\
+	\rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable \; r_1) \; r_2 \backslash_r c_1c_2))
+	+ (\delta \ (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)) \backslash_r (c_3 \ldots c_n)
+	$
+	\end{tabular}
+\end{center}
+\noindent
+The equality in above should be interpretated
+as language equivalence.
+The $\delta$ function works similarly to that of
+a Kronecker delta function:
+\[ \delta \; b\; r\]
+will produce $r$
+if $b$ evaluates to true,
+and $\RZERO$ otherwise.
+Note that their formulation
+\[
+	((r_1 \backslash_r \, c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2)
 	+ (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)
-	$
+\]
-\end{center}
+does not faithfully
-\noindent
+represent what the intermediate derivatives would actually look like
-The $\delta \; b \; r$ function above turns the entire term into $\RZERO$
+when one or more intermediate results $r_1 \backslash s' \cdot r_2$ are not
-if the boolean condition $b$ evaluates to false, and outputs $r$ otherwise.
+nullable in the head of the sequence.
-The equality in their derivation steps should be interpretated
+For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable,
-as language equivalence. To pin down the intuition into rigorous terms,
+the regular expression would not look like
-$\sflat{}$ is used to enable the transformation from
+\[
+	(r_1 \backslash_r c_1c_2 + \RZERO ) + \RZERO,
+\]
+but actually $r_1 \backslash_r c_1c_2$, the redundant $\RZERO$s will not be created in the
+first place.
+In a closed-form one would want to take into account this
+and generate the list of
+regular expressions $r_2 \backslash_r s''$ with
+string pairs $(s', s'')$ where $s'@s'' = s$ and
+$r_1 \backslash s'$ nullable.
+We denote the list consisting of such
+strings $s''$ as $\vsuf{s}{r_1}$.
+The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
+\begin{center}
+\begin{tabular}{lcl}
+$\vsuf{[]}{\_} $ & $=$ &  $[]$\\
+$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
+&& $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) })  $
+\end{tabular}
+\end{center}
+\noindent
+The list is sorted in the order $r_2\backslash s''$
+appears in $(r_1\cdot r_2)\backslash s$.
+In essence, $\vsuf{\_}{\_}$ is doing a
+"virtual derivative" of $r_1 \cdot r_2$, but instead of producing
+the entire result $(r_1 \cdot r_2) \backslash s$,
+it only stores all the strings $s''$ such that $r_2 \backslash s''$
+are occurring terms in $(r_1\cdot r_2)\backslash s$.
+To make the closed form representation
+more straightforward,
+the flattetning function $\sflat{\_}$ is used to enable the transformation from
 a left-associative nested sequence of alternatives into
 a flattened list:
-$r = \sum [r_1, r_2, r_3, \ldots] \stackrel{\sflat{}}{\rightarrow}
+\[
-(\ldots ((r_1 + r_2) + r_3) + \ldots)$
+\sum [r_1, r_2, r_3, \ldots] \stackrel{\sflat{\_}}{\rightarrow}
-The definitions $\sflat{}$, $\sflataux{}$ are given below.
+(\ldots ((r_1 + r_2) + r_3) + \ldots)
+\]
+\noindent
+The definitions $\sflat{\_}$, $\sflataux{\_}$ are given below.
 \begin{center}
 \begin{tabular}{ccc}
 $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
 $\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
 $\sflataux r$ & $=$ & $ [r]$
 	 $\sflat{(\sum r :: rs)}$ & $=$ & $\sum (\sflataux{r} @ rs)$\\
 	 $\sflat{\sum []}$ & $ = $ & $ \sum []$\\
 $\sflat r$ & $=$ & $ r$
 \end{tabular}
 \end{center}
-The intuition behind $\sflataux{\_}$ is to break up nested regexes
+\noindent
+$\sflataux{\_}$ breaks up nested alternative regexes
 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
-into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
+into a "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
 It will return the singleton list $[r]$ otherwise.
 $\sflat{\_}$ works the same  as $\sflataux{\_}$, except that it keeps
 the output type a regular expression, not a list.
-$\sflataux{\_}$  and $\sflat{\_}$ is only recursive in terms of the
+$\sflataux{\_}$  and $\sflat{\_}$ are only recursive on the
-first element of the list of children of
+first element of the list.
-an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for  dealing with the regular expression
-$r_1 \cdot r_2 \backslash s$.
+With $\sflataux{}$ a preliminary to the closed form can be stated,
-\subsection{The $\textit{vsuf}$ Function}
+where the derivative of $r_1 \cdot r_2 \backslash s$ can be
-Note that $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2)
+flattened into a list whose head and tail meet the description
-	+ (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)$ does not faithfully
+we gave earlier.
-represent what would the intermediate derivatives would actually look like.
+\begin{lemma}\label{seqSfau0}
-For example, when $r_1$ and $r_1 \backslash_r c_1$ are nullable,
+	$\sflataux{\rders{(r_1 \cdot r_2) \backslash s }} = (r_1 \backslash_r s) \cdot r_2
-the regular expression would not look like $(r_1 \backslash_r c_1c_2 + \RZERO ) + \RZERO$,
+	:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$
-but actually $r_1\backslash_r c_1c_2$, the redundant $\RZERO$s will not be created in the
+\end{lemma}
-first place.
+\begin{proof}
-In a closed-form one would want to take into account this
+	By an induction on the string $s$, where the inductive cases
-and generate the list of string pairs $(s', s'')$ where $s'@s'' = s$ such that
+	are split as $[]$ and $xs @ [x]$.
-only $r_1 \backslash s'$ nullable.
+	Note the key identify holds:
-With $\sflat{\_}$ and $\sflataux{\_}$ we make
+	\[
-precise what  "closed forms" we have for the sequence derivatives and their simplifications,
+		\map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\;
-in other words, how can we construct $(r_1 \cdot r_2) \backslash s$
+		\map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash \_) \; (\vsuf{xs}{r_1}))
-and $\bderssimp{(r_1\cdot r_2)}{s}$,
+	\]
-if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)
+		=
-and  $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$  ranges over
+	\[
-the substring of $s$?
+		\map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1})
-First let's look at a series of derivatives steps on a sequence
+	\]
-regular expression,  (assuming) that each time the first
+	This enables the inductive case to go through.
-component of the sequence is always nullable):
+\end{proof}
-\begin{center}
+\noindent
+Note that this lemma does $\mathbf{not}$ depend on any
-$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c}  \quad   r_1  \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
+specific definitions we used,
-$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c''   \longrightarrow_{\backslash c''} \quad
+allowing people investigating derivatives to get an alternative
-\ldots$
+view of what $r_1 \cdot r_2$ is.
-\end{center}
+Now we are able to use this for the intuition that
-%TODO: cite indian paper
+the different ways in which regular expressions are
-Indianpaper have  come up with a slightly more formal way of putting the above process:
+nested do not matter under $\rsimp{}$:
 \begin{center}
-$r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) +
+	$\rsimp{r} \stackrel{?}{\sequal} \rsimp{r'}$ if $r = \sum [r_1, r_2, r_3, \ldots]$
-\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots
+and $r' =(\ldots ((r_1 + r_2) + r_3) + \ldots)$
-+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$
+\end{center}
-\end{center}
+Simply wrap with $\sum$ constructor and add
-where  $\delta(b, r)$ will produce $r$
+simplifications to both sides of \ref{seqSfau0}
-if $b$ evaluates to true,
+and one gets
-and $\ZERO$ otherwise.
+\begin{corollary}\label{seqClosedFormGeneral}
+	$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }
-But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical
+	=\rsimp{(\sum (  (r_1 \backslash s) \cdot r_2 ::
-equivalence. To make this intuition useful
+		\map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$
-for a formal proof, we need something
+\end{corollary}
-stronger than language equivalence.
+Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}),
-With the help of $\sflat{\_}$ we can state the equation in Indianpaper
-more rigorously:
-\begin{lemma}
-$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
-\end{lemma}
-The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
-\begin{center}
-\begin{tabular}{lcl}
-$\vsuf{[]}{\_} $ & $=$ &  $[]$\\
-$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
-&& $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) })  $
-\end{tabular}
-\end{center}
-It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable,
-and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in
-the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.
-In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing
-the entire result of  $(r_1 \cdot r_2) \backslash s$,
-it only stores all the $s''$ such that $r_2 \backslash s''$  are occurring terms in $(r_1\cdot r_2)\backslash s$.
-With this we can also add simplifications to both sides and get
-\begin{lemma}
-$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
-\end{lemma}
-Together with the idempotency property of $\rsimp{}$\ref{rsimpIdem},
-%TODO: state the idempotency property of rsimp
 it is possible to convert the above lemma to obtain a "closed form"
 for  derivatives nested with simplification:
+\begin{lemma}\label{seqClosedForm}
+	$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 )
+	:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$
+\end{lemma}
+\begin{proof}
+	By a case analysis of string $s$.
+	When $s$ is empty list, the rewrite is straightforward.
+	When $s$ is a list, one could use the corollary \ref{seqSfau0},
+	and lemma \ref{Simpders} to rewrite the left-hand-side.
+\end{proof}
+As a corollary for this closed form, one can estimate the size
+of the sequence derivative $r_1 \cdot r_2 \backslash_r s$ using
+an easier-to-handle expression:
+\begin{corollary}\label{seqEstimate1}
+	\begin{center}
+	$\llbracket \rderssimp{(r_1 \cdot r_2)}{s} \rrbracket_r = \llbracket \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 )
+	:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))} \rrbracket_r$
+	\end{center}
+\end{corollary}
+\noindent
+\subsection{Closed Forms for Star Regular Expressions}
+We use a similar technique as $r_1 \cdot r_2$ case,
+generating
+all possible sub-strings $s'$ of $s$
+such that $r\backslash s' \cdot r^*$ will appear
+as a term in $(r^*) \backslash s$.
+The first function we define is a single-step
+updating function $\starupdate$, which takes three arguments as input:
+the new character $c$ to take derivative with,
+the regular expression
+$r$ directly under the star $r^*$, and the
+list of strings $sSet$ for the derivative $r^* \backslash s$
+up til this point
+such that $(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$
+(the equality is not exact, more on this later).
+\begin{center}
+	\begin{tabular}{lcl}
+		$\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
+		$\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\
+						     & & $\textit{if} \;
+						     (\rnullable \; (\rders \; r \; s))$ \\
+						     & & $\textit{then} \;\; (s @ [c]) :: [c] :: (
+						     \starupdate \; c \; r \; Ss)$ \\
+						     & & $\textit{else} \;\; (s @ [c]) :: (
+						     \starupdate \; c \; r \; Ss)$
+	\end{tabular}
+\end{center}
+\noindent
+As a generalisation from characters to strings,
+$\starupdates$ takes a string instead of a character
+as the first input argument, and is otherwise the same
+as $\starupdate$.
+\begin{center}
+	\begin{tabular}{lcl}
+		$\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\
+		$\starupdates \; (c :: cs) \; r \; Ss$ &  $=$ &  $\starupdates \; cs \; r \; (
+		\starupdate \; c \; r \; Ss)$
+	\end{tabular}
+\end{center}
+\noindent
+For the star regular expression,
+its derivatives can be seen as  a nested gigantic
+alternative similar to that of sequence regular expression's derivatives,
+and therefore need
+to be ``straightened out" as well.
+The function for this would be $\hflat{}$ and $\hflataux{}$.
+\begin{center}
+	\begin{tabular}{lcl}
+		$\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
+		$\hflataux{r}$ & $\dn$ & $[r]$
+	\end{tabular}
+\end{center}
+\begin{center}
+	\begin{tabular}{lcl}
+		$\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\
+		$\hflat{r}$ & $\dn$ & $r$
+	\end{tabular}
+\end{center}
+\noindent
+%MAYBE TODO: introduce createdByStar
+We first introduce an inductive property
+for $\starupdate$ and $\hflataux{\_}$,
+it says if we do derivatives of $r^*$
+with a string that starts with $c$,
+then flatten it out,
+we obtain a list
+of the shape $\sum_{s' \in sSet} (r\backslash_r s') \cdot r^*$,
+where $sSet = \starupdates \; s \; r \; [[c]]$.
+\begin{lemma}\label{starHfauInduct}
+	$\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} =
+	\map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \;
+	(\starupdates \; s \; r_0 \; [[c]])$
+\end{lemma}
+\begin{proof}
+	By an induction on $s$, the inductive cases
+	being $[]$ and $s@[c]$.
+\end{proof}
+\noindent
+Here is a corollary that states the lemma in
+a more intuitive way:
+\begin{corollary}
+	$\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot
+	(r^*))\; (\starupdates \; c\; r\; [[c]])$
+\end{corollary}
+\noindent
+Note that this is also agnostic of the simplification
+function we defined, and is therefore of more general interest.
+Now adding the $\rsimp{}$ bit for closed forms,
+we have
 \begin{lemma}
-$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
+	$a :: rs \grewrites \hflataux{a} @ rs$
 \end{lemma}
-And now the reason we have (1) in section 1 is clear:
+\noindent
-$\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$,
+giving us
-is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$
+\begin{lemma}\label{cbsHfauRsimpeq1}
-as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$.
+	$\rsimp{a+b} = \rsimp{(\sum \hflataux{a} @ \hflataux{b})}$.
+\end{lemma}
-%----------------------------------------------------------------------------------------
+\noindent
-%	SECTION 3
+This yields
-%----------------------------------------------------------------------------------------
+\begin{lemma}\label{hfauRsimpeq2}
+$\rsimp{r} = \rsimp{(\sum \hflataux{r})}$
-\section{interaction between $\distinctWith$ and $\flts$}
+\end{lemma}
-Note that it is not immediately obvious that
+\noindent
-\begin{center}
+Together with the rewriting relation
-$\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket   \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket  $.
+\begin{lemma}\label{starClosedForm6Hrewrites}
-\end{center}
+	$\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss
+	\scfrewrites
+	 \map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$
+\end{lemma}
+\noindent
+We obtain the closed form for star regular expression:
+\begin{lemma}\label{starClosedForm}
+	$\rderssimp{r^*}{c::s} =
+	\rsimp{
+		(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \;
+			(\starupdates \; s\; r \; [[c]])
+		      )
+		)
+	      }
+	$
+\end{lemma}
+\begin{proof}
+	By an induction on $s$.
+	The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, and \ref{hfauRsimpeq2}
+	are used.
+\end{proof}
+\section{Estimating the Closed Forms' sizes}
+We now summarize the closed forms below:
+\begin{itemize}
+	\item
+	$\rderssimp{(\sum rs)}{s} \sequal
+	 \sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
+	\item
+	$\rderssimp{(r_1 \cdot r_2)}{s} \sequal \sum ((r_1 \backslash s) \cdot r_2 )
+	:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1}))$
+	\item
+	$\rderssimp{r^*}{c::s} =
+	\rsimp{
+		(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \;
+			(\starupdates \; s\; r \; [[c]])
+		      )
+		)
+	      }
+	$
+\end{itemize}
+\noindent
+The closed forms on the left-hand-side
+are all of the same shape: $\rsimp{ (\sum rs)} $.
+Such regular expression will be bounded by the size of $\sum rs'$,
+where every element in $rs'$ is distinct, and each element
+can be described by some inductive sub-structures
+(for example when $r = r_1 \cdot r_2$ then $rs'$
+will be solely comprised of $r_1 \backslash s'$
+and $r_2 \backslash s''$, $s'$ and $s''$ being
+sub-strings of $s$).
+which will each have a size uppder bound
+according to inductive hypothesis, which controls $r \backslash s$.
+We elaborate the above reasoning by a series of lemmas
+below, where straightforward proofs are omitted.
+\begin{lemma}
+	If $\forall r \in rs. \rsize{r} $ is less than or equal to $N$,
+	and $\textit{length} \; rs$ is less than or equal to $l$,
+	then $\rsize{\sum rs}$ is less than or equal to $l*N + 1$.
+\end{lemma}
+\noindent
+If we define all regular expressions with size no
+more than $N$ as $\sizeNregex \; N$:
+\[
+	\sizeNregex \; N \dn  \{r \mid \rsize{r} \leq N \}
+\]
+Then such set is finite:
+\begin{lemma}\label{finiteSizeN}
+	$\textit{isFinite}\; (\sizeNregex \; N)$
+\end{lemma}
+\begin{proof}
+	By overestimating the set $\sizeNregex \; N + 1$
+	using union of sets like
+	$\{r_1 \cdot r_2 \mid r_1 \in A
+		\text{and}
+		r_2 \in A\}
+	$ where $A = \sizeNregex \; N$.
+\end{proof}
+\noindent
+From this we get a corollary that
+if forall $r \in rs$, $\rsize{r} \leq N$, then the output of
+$\rdistinct{rs}{\varnothing}$ is a list of regular
+expressions of finite size depending on $N$ only.
+\begin{corollary}
+	Assumes that for all $r \in rs. \rsize{r} \leq N$,
+	and the cardinality of $\sizeNregex \; N$ is $c_N$
+	then$\rsize{\rdistinct{rs}{\varnothing}} \leq c*N$.
+\end{corollary}
+\noindent
+We have proven that the output of $\rdistinct{rs'}{\varnothing}$
+is bounded by a constant $c_N$ depending only on $N$,
+provided that each of $rs'$'s element
+is bounded by $N$.
+We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.
+We show how $\rdistinct$ and $\rflts$
+in the simplification function together is at least as
+good as $\rdistinct{}{}$ alone.
+\begin{lemma}\label{interactionFltsDB}
+	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r
+	\leq
+	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
+\end{lemma}
+\noindent
 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of
 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$.
+Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$:
-\subsection{A Closed Form for the Sequence Regular Expression}
+\begin{lemma}\label{altsSimpControl}
-\begin{quote}\it
+	$\rsize{\rsimp{\sum rs}} \leq \rsize{\rdistinct{rs}{\varnothing}}+ 1$
-	Claim: For regular expressions $r_1 \cdot r_2$, we claim that
+\end{lemma}
-	\begin{center}
+\begin{proof}
-		$ \rderssimp{r_1 \cdot r_2}{s} =
+	By using \ref{interactionFltsDB}.
-	\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$
+\end{proof}
-	\end{center}
+\noindent
-\end{quote}
+which says that the size of regular expression
-\noindent
+is always smaller if we apply the full simplification
+rather than just one component ($\rdistinct{}{}$).
-Before we get to the proof that says the intermediate result of our lexer will
-remain finitely bounded, which is an important efficiency/liveness guarantee,
-we shall first develop a few preparatory properties and definitions to
+Now we are ready to control the sizes of
-make the process of proving that a breeze.
+$r_1 \cdot r_2 \backslash s$, $r^* \backslash s$.
-We define rewriting relations for $\rrexp$s, which allows us to do the
-same trick as we did for the correctness proof,
-but this time we will have stronger equalities established.
-\section{Estimating the Closed Forms' sizes}
-		The closed form $f\; (g\; (\sum rs)) $ is controlled
-		by the size of $rs'$, where every element in $rs'$ is distinct, and each element can be described by som e inductive sub-structures (for example when $r = r_1 \cdot r_2$ then $rs'$ will be solely comprised of $r_1 \backslash s'$ and $r_2 \backslash s''$, $s'$ and $s''$ being sub-strings of $s$), which will have a numeric uppder bound according to inductive hypothesis, which controls $r \backslash s$.
-%-----------------------------------
-%	SECTION 2
-%-----------------------------------
 \begin{theorem}
-For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$
+For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$
 \end{theorem}
+\noindent
-\noindent
+\begin{proof}
-\begin{proof}
+We prove this by induction on $r$. The base cases for $\RZERO$,
-We prove this by induction on $r$. The base cases for $\AZERO$,
+$\RONE $ and $\RCHAR{c}$ are straightforward.
-$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
+In the sequence $r_1 \cdot r_2$ case,
-for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
+the inductive hypotheses state $\exists N_1. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N_1$ and
-hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
+$\exists N_2. \forall s. \; \llbracket \rderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
-$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
 %
 \begin{center}
 \begin{tabular}{lcll}
-& & $ \llbracket   \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
+& & $ \llbracket   \rderssimp{r_1\cdot r_2 }{s} \rrbracket_r $\\
-& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
+& $ = $ & $\llbracket \rsimp{(\sum(r_1 \backslash s \cdot r_2 \; \;  :: \; \;
-[\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
+\map \; (\lambda s'. r_2\backslash_{rsimp} s') (\vsuf{s}{r})))} \rrbracket_r $ & (1) \\
-& $\leq$ &
+& $\leq$ & $\llbracket \rsimp{(\sum(r_1 \backslash s \cdot r_2 ::
-$\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
+\map \; (\lambda s'. r_2\backslash_{rsimp} s') (\vsuf{s}{r})))} \rrbracket_r $ & (2) \\
-[$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
-& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
-\llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
-& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
-\llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
-& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
 \end{tabular}
 \end{center}
+\end{proof}
-\noindent
-where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
+\noindent
+where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\rderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
 The reason why we could write the derivatives of a sequence this way is described in section 2.
 The term (2) is used to control (1).
 That is because one can obtain an overall
 smaller regex list
 by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
 Section 3 is dedicated to its proof.
-In (3) we know that  $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is
+In (3) we know that  $\llbracket \ASEQ{bs}{(\rderssimp{ r_1}{s}}{r_2}\rrbracket$ is
 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
 than $N_2$. The list length after  $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
 We reason similarly for  $\STAR$.\medskip
-\end{proof}
+%-----------------------------------
+%	SECTION 2
+%-----------------------------------
+%----------------------------------------------------------------------------------------
+%	SECTION 3
+%----------------------------------------------------------------------------------------
+\subsection{A Closed Form for the Sequence Regular Expression}
+\noindent
+Before we get to the proof that says the intermediate result of our lexer will
+remain finitely bounded, which is an important efficiency/liveness guarantee,
+we shall first develop a few preparatory properties and definitions to
+make the process of proving that a breeze.
+We define rewriting relations for $\rrexp$s, which allows us to do the
+same trick as we did for the correctness proof,
+but this time we will have stronger equalities established.
 What guarantee does this bound give us?
 Whatever the regex is, it will not grow indefinitely.
 Take our previous example $(a + aa)^*$ as an example:

changeset 558	671a83abccf3
parent 557	812e5d112f49
child 559	9d18f3eac484