--- a/ChengsongTanPhdThesis/Chapters/Finite.tex Fri Jul 01 13:02:15 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Mon Jul 04 12:27:13 2022 +0100
@@ -265,10 +265,21 @@
\section{Step One: Closed Forms}
We transform the function application $\rderssimp{r}{s}$
- into an equivalent form $f\; (g \; (\sum rs))$.
+ into an equivalent
+ form $f\; (g \; (\sum rs))$.
The functions $f$ and $g$ can be anything from $\flts$, $\distinctBy$ and other helper functions from $\bsimp{\_}$.
This way we get a different but equivalent way of expressing : $r\backslash s = f \; (g\; (\sum rs))$, we call the
right hand side the "closed form" of $r\backslash s$.
+
+\begin{quote}\it
+ Claim: For regular expressions $r_1 \cdot r_2$, we claim that
+ \begin{center}
+ $ \rderssimp{r_1 \cdot r_2}{s} =
+ \rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$
+ \end{center}
+\end{quote}
+\noindent
+We explain in detail how we reached those claims.
\subsection{Basic Properties needed for Closed Forms}
\subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully}
@@ -1045,37 +1056,100 @@
Before we go on to obtain them, some preliminary definitions
are needed to make proof statements concise.
-\section{"Closed Forms" of regular expressions' derivatives w.r.t strings}
-We note that the different ways in which regular expressions are
-nested do not matter under $\rsimp{}$:
+\section{"Closed Forms" of Sequence Regular Expressions}
+The problem of obataining a closed-form for sequence regular expression
+is constructing $(r_1 \cdot r_2) \backslash_r s$
+if we are only allowed to use a combination of $r_1 \backslash s''$
+and $r_2 \backslash s''$ , where $s''$ is from $s$.
+First let's look at a series of derivatives steps on a sequence
+regular expression, assuming that each time the first
+component of the sequence is always nullable):
\begin{center}
- To be proven:\\
- $\rsimp{r} \stackrel{?}{=} \rsimp{r'}$ if $r = \sum [r_1, r_2, r_3, \ldots]$
-and $r' =(\ldots ((r_1 + r_2) + r_3) + \ldots)$
+
+$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
+$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad
+ \ldots$
+
\end{center}
-\noindent
+Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as
+a giant alternative taking a list of terms
+$[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$,
+where the head of the list is always the term
+representing a match involving only $r_1$, and the tail of the list consisting of
+terms of the shape $r_2 \backslash_r s''$, $s''$ being a suffix of $s$.
This intuition is also echoed by IndianPaper, where they gave
a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$:
\begin{center}
- $((r_1 \cdot r_2) \backslash_r c_1) \stackrel{\backslash_r c_2}{=}
- ((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\rnullable \; r_1),\; r_2 \backslash_r c_1))
- \stackrel{\backslash_r c_2}{=}
- ((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2)
- + (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)
+ \begin{tabular}{c}
+ $(r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) \myequiv$\\
+ \rule{0pt}{3ex} $((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\rnullable \; r_1) \; r_2 \backslash_r c_1)) \backslash_r (c_2 :: \ldots c_n)
+ \myequiv$\\
+ \rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable \; r_1) \; r_2 \backslash_r c_1c_2))
+ + (\delta \ (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)) \backslash_r (c_3 \ldots c_n)
$
+ \end{tabular}
\end{center}
\noindent
-The $\delta \; b \; r$ function above turns the entire term into $\RZERO$
-if the boolean condition $b$ evaluates to false, and outputs $r$ otherwise.
-The equality in their derivation steps should be interpretated
-as language equivalence. To pin down the intuition into rigorous terms,
-$\sflat{}$ is used to enable the transformation from
+The equality in above should be interpretated
+as language equivalence.
+The $\delta$ function works similarly to that of
+a Kronecker delta function:
+\[ \delta \; b\; r\]
+will produce $r$
+if $b$ evaluates to true,
+and $\RZERO$ otherwise.
+Note that their formulation
+\[
+ ((r_1 \backslash_r \, c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2)
+ + (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)
+\]
+does not faithfully
+represent what the intermediate derivatives would actually look like
+when one or more intermediate results $r_1 \backslash s' \cdot r_2$ are not
+nullable in the head of the sequence.
+For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable,
+the regular expression would not look like
+\[
+ (r_1 \backslash_r c_1c_2 + \RZERO ) + \RZERO,
+\]
+but actually $r_1 \backslash_r c_1c_2$, the redundant $\RZERO$s will not be created in the
+first place.
+In a closed-form one would want to take into account this
+and generate the list of
+regular expressions $r_2 \backslash_r s''$ with
+string pairs $(s', s'')$ where $s'@s'' = s$ and
+$r_1 \backslash s'$ nullable.
+We denote the list consisting of such
+strings $s''$ as $\vsuf{s}{r_1}$.
+
+The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
+\begin{center}
+\begin{tabular}{lcl}
+$\vsuf{[]}{\_} $ & $=$ & $[]$\\
+$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
+ && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $
+\end{tabular}
+\end{center}
+\noindent
+The list is sorted in the order $r_2\backslash s''$
+appears in $(r_1\cdot r_2)\backslash s$.
+In essence, $\vsuf{\_}{\_}$ is doing a
+"virtual derivative" of $r_1 \cdot r_2$, but instead of producing
+the entire result $(r_1 \cdot r_2) \backslash s$,
+it only stores all the strings $s''$ such that $r_2 \backslash s''$
+are occurring terms in $(r_1\cdot r_2)\backslash s$.
+
+To make the closed form representation
+more straightforward,
+the flattetning function $\sflat{\_}$ is used to enable the transformation from
a left-associative nested sequence of alternatives into
a flattened list:
-$r = \sum [r_1, r_2, r_3, \ldots] \stackrel{\sflat{}}{\rightarrow}
-(\ldots ((r_1 + r_2) + r_3) + \ldots)$
-The definitions $\sflat{}$, $\sflataux{}$ are given below.
-
+\[
+ \sum [r_1, r_2, r_3, \ldots] \stackrel{\sflat{\_}}{\rightarrow}
+(\ldots ((r_1 + r_2) + r_3) + \ldots)
+\]
+\noindent
+The definitions $\sflat{\_}$, $\sflataux{\_}$ are given below.
\begin{center}
\begin{tabular}{ccc}
$\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
@@ -1091,118 +1165,363 @@
$\sflat r$ & $=$ & $ r$
\end{tabular}
\end{center}
-The intuition behind $\sflataux{\_}$ is to break up nested regexes
+\noindent
+$\sflataux{\_}$ breaks up nested alternative regexes
of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
-into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
+into a "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
It will return the singleton list $[r]$ otherwise.
$\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps
the output type a regular expression, not a list.
-$\sflataux{\_}$ and $\sflat{\_}$ is only recursive in terms of the
- first element of the list of children of
-an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for dealing with the regular expression
-$r_1 \cdot r_2 \backslash s$.
-\subsection{The $\textit{vsuf}$ Function}
-Note that $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2)
- + (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)$ does not faithfully
-represent what would the intermediate derivatives would actually look like.
-For example, when $r_1$ and $r_1 \backslash_r c_1$ are nullable,
-the regular expression would not look like $(r_1 \backslash_r c_1c_2 + \RZERO ) + \RZERO$,
-but actually $r_1\backslash_r c_1c_2$, the redundant $\RZERO$s will not be created in the
-first place.
-In a closed-form one would want to take into account this
-and generate the list of string pairs $(s', s'')$ where $s'@s'' = s$ such that
-only $r_1 \backslash s'$ nullable.
-With $\sflat{\_}$ and $\sflataux{\_}$ we make
-precise what "closed forms" we have for the sequence derivatives and their simplifications,
-in other words, how can we construct $(r_1 \cdot r_2) \backslash s$
-and $\bderssimp{(r_1\cdot r_2)}{s}$,
-if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)
-and $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$ ranges over
-the substring of $s$?
-First let's look at a series of derivatives steps on a sequence
-regular expression, (assuming) that each time the first
-component of the sequence is always nullable):
-\begin{center}
+$\sflataux{\_}$ and $\sflat{\_}$ are only recursive on the
+first element of the list.
+
+With $\sflataux{}$ a preliminary to the closed form can be stated,
+where the derivative of $r_1 \cdot r_2 \backslash s$ can be
+flattened into a list whose head and tail meet the description
+we gave earlier.
+\begin{lemma}\label{seqSfau0}
+ $\sflataux{\rders{(r_1 \cdot r_2) \backslash s }} = (r_1 \backslash_r s) \cdot r_2
+ :: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$
+\end{lemma}
+\begin{proof}
+ By an induction on the string $s$, where the inductive cases
+ are split as $[]$ and $xs @ [x]$.
+ Note the key identify holds:
+ \[
+ \map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\;
+ \map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash \_) \; (\vsuf{xs}{r_1}))
+ \]
+ =
+ \[
+ \map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1})
+ \]
+ This enables the inductive case to go through.
+\end{proof}
+\noindent
+Note that this lemma does $\mathbf{not}$ depend on any
+specific definitions we used,
+allowing people investigating derivatives to get an alternative
+view of what $r_1 \cdot r_2$ is.
-$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
-$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad
- \ldots$
-
-\end{center}
-%TODO: cite indian paper
-Indianpaper have come up with a slightly more formal way of putting the above process:
+Now we are able to use this for the intuition that
+the different ways in which regular expressions are
+nested do not matter under $\rsimp{}$:
\begin{center}
-$r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) +
-\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots
-+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$
+ $\rsimp{r} \stackrel{?}{\sequal} \rsimp{r'}$ if $r = \sum [r_1, r_2, r_3, \ldots]$
+and $r' =(\ldots ((r_1 + r_2) + r_3) + \ldots)$
\end{center}
-where $\delta(b, r)$ will produce $r$
-if $b$ evaluates to true,
-and $\ZERO$ otherwise.
+Simply wrap with $\sum$ constructor and add
+simplifications to both sides of \ref{seqSfau0}
+and one gets
+\begin{corollary}\label{seqClosedFormGeneral}
+ $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }
+ =\rsimp{(\sum ( (r_1 \backslash s) \cdot r_2 ::
+ \map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$
+\end{corollary}
+Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}),
+it is possible to convert the above lemma to obtain a "closed form"
+for derivatives nested with simplification:
+\begin{lemma}\label{seqClosedForm}
+ $\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 )
+ :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$
+\end{lemma}
+\begin{proof}
+ By a case analysis of string $s$.
+ When $s$ is empty list, the rewrite is straightforward.
+ When $s$ is a list, one could use the corollary \ref{seqSfau0},
+ and lemma \ref{Simpders} to rewrite the left-hand-side.
+\end{proof}
+As a corollary for this closed form, one can estimate the size
+of the sequence derivative $r_1 \cdot r_2 \backslash_r s$ using
+an easier-to-handle expression:
+\begin{corollary}\label{seqEstimate1}
+ \begin{center}
- But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical
- equivalence. To make this intuition useful
- for a formal proof, we need something
-stronger than language equivalence.
-With the help of $\sflat{\_}$ we can state the equation in Indianpaper
-more rigorously:
-\begin{lemma}
-$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
-\end{lemma}
-
-The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
+ $\llbracket \rderssimp{(r_1 \cdot r_2)}{s} \rrbracket_r = \llbracket \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 )
+ :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))} \rrbracket_r$
+
+ \end{center}
+\end{corollary}
+\noindent
+\subsection{Closed Forms for Star Regular Expressions}
+We use a similar technique as $r_1 \cdot r_2$ case,
+generating
+all possible sub-strings $s'$ of $s$
+such that $r\backslash s' \cdot r^*$ will appear
+as a term in $(r^*) \backslash s$.
+The first function we define is a single-step
+updating function $\starupdate$, which takes three arguments as input:
+the new character $c$ to take derivative with,
+the regular expression
+$r$ directly under the star $r^*$, and the
+list of strings $sSet$ for the derivative $r^* \backslash s$
+up til this point
+such that $(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$
+(the equality is not exact, more on this later).
+\begin{center}
+ \begin{tabular}{lcl}
+ $\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
+ $\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\
+ & & $\textit{if} \;
+ (\rnullable \; (\rders \; r \; s))$ \\
+ & & $\textit{then} \;\; (s @ [c]) :: [c] :: (
+ \starupdate \; c \; r \; Ss)$ \\
+ & & $\textit{else} \;\; (s @ [c]) :: (
+ \starupdate \; c \; r \; Ss)$
+ \end{tabular}
+\end{center}
+\noindent
+As a generalisation from characters to strings,
+$\starupdates$ takes a string instead of a character
+as the first input argument, and is otherwise the same
+as $\starupdate$.
+\begin{center}
+ \begin{tabular}{lcl}
+ $\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\
+ $\starupdates \; (c :: cs) \; r \; Ss$ & $=$ & $\starupdates \; cs \; r \; (
+ \starupdate \; c \; r \; Ss)$
+ \end{tabular}
+\end{center}
+\noindent
+For the star regular expression,
+its derivatives can be seen as a nested gigantic
+alternative similar to that of sequence regular expression's derivatives,
+and therefore need
+to be ``straightened out" as well.
+The function for this would be $\hflat{}$ and $\hflataux{}$.
+\begin{center}
+ \begin{tabular}{lcl}
+ $\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
+ $\hflataux{r}$ & $\dn$ & $[r]$
+ \end{tabular}
+\end{center}
\begin{center}
-\begin{tabular}{lcl}
-$\vsuf{[]}{\_} $ & $=$ & $[]$\\
-$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
- && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $
-\end{tabular}
-\end{center}
-It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable,
-and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in
-the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.
-In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing
-the entire result of $(r_1 \cdot r_2) \backslash s$,
-it only stores all the $s''$ such that $r_2 \backslash s''$ are occurring terms in $(r_1\cdot r_2)\backslash s$.
+ \begin{tabular}{lcl}
+ $\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\
+ $\hflat{r}$ & $\dn$ & $r$
+ \end{tabular}
+\end{center}
+\noindent
+%MAYBE TODO: introduce createdByStar
+We first introduce an inductive property
+for $\starupdate$ and $\hflataux{\_}$,
+it says if we do derivatives of $r^*$
+with a string that starts with $c$,
+then flatten it out,
+we obtain a list
+of the shape $\sum_{s' \in sSet} (r\backslash_r s') \cdot r^*$,
+where $sSet = \starupdates \; s \; r \; [[c]]$.
+\begin{lemma}\label{starHfauInduct}
+ $\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} =
+ \map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \;
+ (\starupdates \; s \; r_0 \; [[c]])$
+\end{lemma}
+\begin{proof}
+ By an induction on $s$, the inductive cases
+ being $[]$ and $s@[c]$.
+\end{proof}
+\noindent
+Here is a corollary that states the lemma in
+a more intuitive way:
+\begin{corollary}
+ $\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot
+ (r^*))\; (\starupdates \; c\; r\; [[c]])$
+\end{corollary}
+\noindent
+Note that this is also agnostic of the simplification
+function we defined, and is therefore of more general interest.
+
+Now adding the $\rsimp{}$ bit for closed forms,
+we have
+\begin{lemma}
+ $a :: rs \grewrites \hflataux{a} @ rs$
+\end{lemma}
+\noindent
+giving us
+\begin{lemma}\label{cbsHfauRsimpeq1}
+ $\rsimp{a+b} = \rsimp{(\sum \hflataux{a} @ \hflataux{b})}$.
+\end{lemma}
+\noindent
+This yields
+\begin{lemma}\label{hfauRsimpeq2}
+$\rsimp{r} = \rsimp{(\sum \hflataux{r})}$
+\end{lemma}
+\noindent
+Together with the rewriting relation
+\begin{lemma}\label{starClosedForm6Hrewrites}
+ $\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss
+ \scfrewrites
+ \map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$
+\end{lemma}
+\noindent
+We obtain the closed form for star regular expression:
+\begin{lemma}\label{starClosedForm}
+ $\rderssimp{r^*}{c::s} =
+ \rsimp{
+ (\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \;
+ (\starupdates \; s\; r \; [[c]])
+ )
+ )
+ }
+ $
+\end{lemma}
+\begin{proof}
+ By an induction on $s$.
+ The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, and \ref{hfauRsimpeq2}
+ are used.
+\end{proof}
+\section{Estimating the Closed Forms' sizes}
+We now summarize the closed forms below:
+\begin{itemize}
+ \item
+ $\rderssimp{(\sum rs)}{s} \sequal
+ \sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
+ \item
+ $\rderssimp{(r_1 \cdot r_2)}{s} \sequal \sum ((r_1 \backslash s) \cdot r_2 )
+ :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1}))$
+ \item
+
+ $\rderssimp{r^*}{c::s} =
+ \rsimp{
+ (\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \;
+ (\starupdates \; s\; r \; [[c]])
+ )
+ )
+ }
+ $
+\end{itemize}
+\noindent
+The closed forms on the left-hand-side
+are all of the same shape: $\rsimp{ (\sum rs)} $.
+Such regular expression will be bounded by the size of $\sum rs'$,
+where every element in $rs'$ is distinct, and each element
+can be described by some inductive sub-structures
+(for example when $r = r_1 \cdot r_2$ then $rs'$
+will be solely comprised of $r_1 \backslash s'$
+and $r_2 \backslash s''$, $s'$ and $s''$ being
+sub-strings of $s$).
+which will each have a size uppder bound
+according to inductive hypothesis, which controls $r \backslash s$.
-With this we can also add simplifications to both sides and get
+We elaborate the above reasoning by a series of lemmas
+below, where straightforward proofs are omitted.
\begin{lemma}
-$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
+ If $\forall r \in rs. \rsize{r} $ is less than or equal to $N$,
+ and $\textit{length} \; rs$ is less than or equal to $l$,
+ then $\rsize{\sum rs}$ is less than or equal to $l*N + 1$.
+\end{lemma}
+\noindent
+If we define all regular expressions with size no
+more than $N$ as $\sizeNregex \; N$:
+\[
+ \sizeNregex \; N \dn \{r \mid \rsize{r} \leq N \}
+\]
+Then such set is finite:
+\begin{lemma}\label{finiteSizeN}
+ $\textit{isFinite}\; (\sizeNregex \; N)$
+\end{lemma}
+\begin{proof}
+ By overestimating the set $\sizeNregex \; N + 1$
+ using union of sets like
+ $\{r_1 \cdot r_2 \mid r_1 \in A
+ \text{and}
+ r_2 \in A\}
+ $ where $A = \sizeNregex \; N$.
+\end{proof}
+\noindent
+From this we get a corollary that
+if forall $r \in rs$, $\rsize{r} \leq N$, then the output of
+$\rdistinct{rs}{\varnothing}$ is a list of regular
+expressions of finite size depending on $N$ only.
+\begin{corollary}
+ Assumes that for all $r \in rs. \rsize{r} \leq N$,
+ and the cardinality of $\sizeNregex \; N$ is $c_N$
+ then$\rsize{\rdistinct{rs}{\varnothing}} \leq c*N$.
+\end{corollary}
+\noindent
+We have proven that the output of $\rdistinct{rs'}{\varnothing}$
+is bounded by a constant $c_N$ depending only on $N$,
+provided that each of $rs'$'s element
+is bounded by $N$.
+We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.
+
+We show how $\rdistinct$ and $\rflts$
+in the simplification function together is at least as
+good as $\rdistinct{}{}$ alone.
+\begin{lemma}\label{interactionFltsDB}
+ $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r
+ \leq
+ \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $.
\end{lemma}
-Together with the idempotency property of $\rsimp{}$\ref{rsimpIdem},
-%TODO: state the idempotency property of rsimp
-it is possible to convert the above lemma to obtain a "closed form"
-for derivatives nested with simplification:
-\begin{lemma}
-$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
+\noindent
+The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of
+duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$.
+
+Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$:
+\begin{lemma}\label{altsSimpControl}
+ $\rsize{\rsimp{\sum rs}} \leq \rsize{\rdistinct{rs}{\varnothing}}+ 1$
\end{lemma}
-And now the reason we have (1) in section 1 is clear:
-$\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$,
-is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$
- as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$.
+\begin{proof}
+ By using \ref{interactionFltsDB}.
+\end{proof}
+\noindent
+which says that the size of regular expression
+is always smaller if we apply the full simplification
+rather than just one component ($\rdistinct{}{}$).
+
+
+Now we are ready to control the sizes of
+$r_1 \cdot r_2 \backslash s$, $r^* \backslash s$.
+\begin{theorem}
+For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$
+\end{theorem}
+\noindent
+\begin{proof}
+We prove this by induction on $r$. The base cases for $\RZERO$,
+$\RONE $ and $\RCHAR{c}$ are straightforward.
+In the sequence $r_1 \cdot r_2$ case,
+the inductive hypotheses state $\exists N_1. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N_1$ and
+$\exists N_2. \forall s. \; \llbracket \rderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
+%
+\begin{center}
+\begin{tabular}{lcll}
+& & $ \llbracket \rderssimp{r_1\cdot r_2 }{s} \rrbracket_r $\\
+& $ = $ & $\llbracket \rsimp{(\sum(r_1 \backslash s \cdot r_2 \; \; :: \; \;
+\map \; (\lambda s'. r_2\backslash_{rsimp} s') (\vsuf{s}{r})))} \rrbracket_r $ & (1) \\
+& $\leq$ & $\llbracket \rsimp{(\sum(r_1 \backslash s \cdot r_2 ::
+\map \; (\lambda s'. r_2\backslash_{rsimp} s') (\vsuf{s}{r})))} \rrbracket_r $ & (2) \\
+
+\end{tabular}
+\end{center}
+
+\end{proof}
+
+\noindent
+where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\rderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
+The reason why we could write the derivatives of a sequence this way is described in section 2.
+The term (2) is used to control (1).
+That is because one can obtain an overall
+smaller regex list
+by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
+Section 3 is dedicated to its proof.
+In (3) we know that $\llbracket \ASEQ{bs}{(\rderssimp{ r_1}{s}}{r_2}\rrbracket$ is
+bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
+than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
+for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
+We reason similarly for $\STAR$.\medskip
+
+%-----------------------------------
+% SECTION 2
+%-----------------------------------
+
%----------------------------------------------------------------------------------------
% SECTION 3
%----------------------------------------------------------------------------------------
-\section{interaction between $\distinctWith$ and $\flts$}
-Note that it is not immediately obvious that
-\begin{center}
-$\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket $.
-\end{center}
-The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of
-duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$.
-
\subsection{A Closed Form for the Sequence Regular Expression}
-\begin{quote}\it
- Claim: For regular expressions $r_1 \cdot r_2$, we claim that
- \begin{center}
- $ \rderssimp{r_1 \cdot r_2}{s} =
- \rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$
- \end{center}
-\end{quote}
\noindent
Before we get to the proof that says the intermediate result of our lexer will
@@ -1215,57 +1534,6 @@
but this time we will have stronger equalities established.
-\section{Estimating the Closed Forms' sizes}
-
- The closed form $f\; (g\; (\sum rs)) $ is controlled
- by the size of $rs'$, where every element in $rs'$ is distinct, and each element can be described by som e inductive sub-structures (for example when $r = r_1 \cdot r_2$ then $rs'$ will be solely comprised of $r_1 \backslash s'$ and $r_2 \backslash s''$, $s'$ and $s''$ being sub-strings of $s$), which will have a numeric uppder bound according to inductive hypothesis, which controls $r \backslash s$.
-%-----------------------------------
-% SECTION 2
-%-----------------------------------
-
-\begin{theorem}
-For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$
-\end{theorem}
-
-\noindent
-\begin{proof}
-We prove this by induction on $r$. The base cases for $\AZERO$,
-$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
-for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
-hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
-$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
-%
-\begin{center}
-\begin{tabular}{lcll}
-& & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
-& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
- [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
-& $\leq$ &
- $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
- [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
-& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
- \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
-& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
- \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
-& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
-\end{tabular}
-\end{center}
-
-
-\noindent
-where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
-The reason why we could write the derivatives of a sequence this way is described in section 2.
-The term (2) is used to control (1).
-That is because one can obtain an overall
-smaller regex list
-by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
-Section 3 is dedicated to its proof.
-In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is
-bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
-than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
-for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
-We reason similarly for $\STAR$.\medskip
-\end{proof}
What guarantee does this bound give us?