--- a/ChengsongTanPhdThesis/Chapters/Finite.tex Sat Oct 01 12:06:46 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Mon Oct 03 02:08:49 2022 +0100
@@ -402,7 +402,8 @@
\end{center}
\noindent
We do not define an r-regular expression version of $\blexersimp$,
-as our proof does not involve its use.
+as our proof does not involve its use
+(and there is no bitcode to decode into a lexical value).
Everything about the size of annotated regular expressions
can be calculated via the size of r-regular expressions:
\begin{lemma}\label{sizeRelations}
@@ -433,7 +434,7 @@
\end{center}
Unless stated otherwise in the rest of this
chapter all regular expressions without
-bitcodes are seen as $\rrexp$s.
+bitcodes are seen as r-regular expressions ($\rrexp$s).
For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$,
we use the notation $r_1 + r_2$
for brevity.
@@ -476,10 +477,10 @@
%\end{quote}
%\noindent
%We explain in detail how we reached those claims.
-\subsection{Basic Properties needed for Closed Forms}
+\subsection{The Idea Behind Closed Forms}
If we attempt to prove
\begin{center}
- $\forall r. \; \exists N_r.\;\; s.t. \llbracket r\backslash_{bsimps} s \rrbracket_r \leq N_r$
+ $\forall r. \; \exists N_r.\;\; s.t. \llbracket r\backslash_{rsimps} s \rrbracket_r \leq N_r$
\end{center}
using a naive induction on the structure of $r$,
then we are stuck at the inductive cases such as
@@ -487,51 +488,89 @@
The inductive hypotheses are:
\begin{center}
1: $\text{for } r_1, \text{there exists } N_{r_1}.\;\; s.t.
- \;\;\forall s. \llbracket r_1 \backslash_{bsimps} s \rrbracket_r \leq N_{r_1}. $\\
+ \;\;\forall s. \llbracket r_1 \backslash_{rsimps} s \rrbracket_r \leq N_{r_1}. $\\
2: $\text{for } r_2, \text{there exists } N_{r_2}.\;\; s.t.
- \;\; \forall s. \llbracket r_2 \backslash_{bsimps} s \rrbracket_r \leq N_{r_2}. $
+ \;\; \forall s. \llbracket r_2 \backslash_{rsimps} s \rrbracket_r \leq N_{r_2}. $
\end{center}
The inductive step to prove would be
\begin{center}
$\text{there exists } N_{r_1\cdot r_2}. \;\; s.t. \forall s.
- \llbracket (r_1 \cdot r_2) \backslash_{bsimps} s \rrbracket_r \leq N_{r_1\cdot r_2}.$
+ \llbracket (r_1 \cdot r_2) \backslash_{rsimps} s \rrbracket_r \leq N_{r_1\cdot r_2}.$
\end{center}
The problem is that it is not clear what
-$(r_1\cdot r_2) \backslash_{bsimps} s$ looks like,
+$(r_1\cdot r_2) \backslash_{rsimps} s$ looks like,
and therefore $N_{r_1}$ and $N_{r_2}$ in the
inductive hypotheses cannot be directly used.
+We have already seen that $(r_1 \cdot r_2)\backslash s$
+and $(r^*)\backslash s$ can grow in a wild way.
+The point is that they will be equivalent to a list of
+terms $\sum rs$, where each term in $rs$ will
+be made of $r_1 \backslash s' $, $r_2\backslash s'$,
+and $r \backslash s'$ with $s' \in \textit{SubString} \; s$.
+The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$
+in the simplification which saves $rs$ from growing indefinitely.
+
+Based on this idea, we sketch a proof by first showing the equality (where
+$f$ and $g$ are functions that do not increase the size of the input)
+\begin{center}
+$(r_1 \cdot r_2)\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$,
+\end{center}
+and then show the right-hand-side can be finitely bounded.
+We call the right-hand-side the
+\emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$.
+We will flesh out the proof sketch in the next section.
+
+\section{Details of Closed Forms and Bounds}
+In this section we introduce in detail
+how the closed forms are obtained for regular expressions'
+derivatives and how they are bounded.
+We start by proving some basic identities
+involving the simplification functions for r-regular expressions.
+After that we use these identities to establish the
+closed forms we need.
+Finally, we prove the functions such as $\flts$
+will keep the size non-increasing.
+Putting this together with a general bound
+on the finiteness of distinct regular expressions
+smaller than a certain size, we obtain a bound on
+the closed forms.
+%$r_1\cdot r_2$, $r^*$ and $\sum rs$.
-The next steps involves getting a closed form for
-$\rderssimp{r}{s}$ and then obtaining
-an estimate for the closed form.
+
+\subsection{Some Basic Identities}
+
+
\subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully}
The $\textit{rdistinct}$ function, as its name suggests, will
-remove duplicates in an \emph{r}$\textit{rexp}$ list,
-according to the accumulator
-and leave only one of each different element in a list:
+remove duplicates in an r-regular expression list.
+It will also correctly exclude any elements that
+is intially in the accumulator set.
\begin{lemma}\label{rdistinctDoesTheJob}
- The function $\textit{rdistinct}$ satisfies the following
- properties:
+ %The function $\textit{rdistinct}$ satisfies the following
+ %properties:
+ Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its
+ recursive definition here, its Isabelle counterpart would be $\textit{distinct}$.}
+ readily defined
+ for testing
+ whether a list's elements are all unique. Then the following
+ properties about $\textit{rdistinct}$ hold:
\begin{itemize}
\item
If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$.
\item
- If list $rs'$ is the result of $\rdistinct{rs}{acc}$,
- then $\textit{isDistinct} \; rs'$.
+ %If list $rs'$ is the result of $\rdistinct{rs}{acc}$,
+ $\textit{isDistinct} \;\;\; (\rdistinct{rs}{acc})$.
\item
- $\rdistinct{rs}{acc} = rs - acc$
+ $\textit{set} \; (\rdistinct{rs}{acc})
+ = (textit{set} \; rs) - acc$
\end{itemize}
\end{lemma}
\noindent
-The predicate $\textit{isDistinct}$ is for testing
-whether a list's elements are all unique. It is defined
-recursively on the structure of a regular expression,
-and we omit the precise definition here.
\begin{proof}
The first part is by an induction on $rs$.
The second and third part can be proven by using the
- induction rules of $\rdistinct{\_}{\_}$.
+ inductive cases of $\textit{rdistinct}$.
\end{proof}
@@ -541,7 +580,7 @@
list $rs$ whose elements are all from the accumulator, and then call $\rdistinct{\_}{\_}$
on the resulting list, the output will be as if we had called $\rdistinct{\_}{\_}$
without the prepending of $rs$:
-\begin{lemma}
+\begin{lemma}\label{rdistinctConcat}
The elements appearing in the accumulator will always be removed.
More precisely,
\begin{itemize}
@@ -549,13 +588,13 @@
If $rs \subseteq rset$, then
$\rdistinct{rs@rsa }{acc} = \rdistinct{rsa }{acc}$.
\item
- Furthermore, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$,
+ More generally, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$,
then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{rs'}{rset}$
\end{itemize}
\end{lemma}
\begin{proof}
- By induction on $rs$.
+ By induction on $rs$ and using \ref{rdistinctDoesTheJob}.
\end{proof}
\noindent
On the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated,
@@ -964,9 +1003,10 @@
By induction on the lists involved.
\end{proof}
\noindent
-Similarly,
-we introduce the equality for $\sum$ when certain child regular expressions
-are $\sum$ themselves:
+The above allows us to prove
+two similar equalities (which are a bit more involved).
+It says that we could flatten out the elements
+before simplification and still get the same result.
\begin{lemma}\label{simpFlatten3}
One can flatten the inside $\sum$ of a $\sum$ if it is being
simplified. Concretely,
@@ -1279,7 +1319,11 @@
Before we go on to obtain them, some preliminary definitions
are needed to make proof statements concise.
-\section{"Closed Forms" of Sequence Regular Expressions}
+
+
+
+\subsection{Closed Forms}
+\subsubsection{Closed Form for Sequence Regular Expressions}
The problem of obataining a closed-form for sequence regular expression
is constructing $(r_1 \cdot r_2) \backslash_r s$
if we are only allowed to use a combination of $r_1 \backslash s''$
@@ -1466,7 +1510,7 @@
\end{center}
\end{corollary}
\noindent
-\subsection{Closed Forms for Star Regular Expressions}
+\subsubsection{Closed Forms for Star Regular Expressions}
We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then
the property of the $\distinct$ function.
@@ -1683,7 +1727,13 @@
The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, and \ref{hfauRsimpeq2}
are used.
\end{proof}
-\section{Estimating the Closed Forms' sizes}
+
+
+
+
+
+
+\subsection{Estimating the Closed Forms' sizes}
We now summarize the closed forms below:
\begin{itemize}
\item
@@ -1758,15 +1808,94 @@
is bounded by $N$.
We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.
-We show how $\rdistinct$ and $\rflts$
-in the simplification function together is at least as
-good as $\rdistinct{}{}$ alone.
-\begin{lemma}\label{interactionFltsDB}
+We show that $\rdistinct$ and $\rflts$
+working together is at least as
+good as $\rdistinct{}{}$ alone, which can be written as
+\begin{center}
+ $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r
+ \leq
+ \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $.
+\end{center}
+We need this so that we know the outcome of our real
+simplification is better than or equal to a rough estimate,
+and therefore can be bounded by that estimate.
+This is a bit harder to establish compared with proving
+$\textit{flts}$ does not make a list larger (which can
+be proven using routine induction):
+\begin{center}
+ $\llbracket \textit{rflts}\; rs \rrbracket_r \leq
+ \llbracket \textit{rs} \rrbracket_r$
+\end{center}
+We cannot simply prove how each helper function
+reduces the size and then put them together:
+From
+\begin{center}
+$\llbracket \textit{rflts}\; rs \rrbracket_r \leq
+ \llbracket \; \textit{rs} \rrbracket_r$
+\end{center}
+and
+\begin{center}
+ $\llbracket \textit{rdistinct} \; rs \; \varnothing \leq
+ \llbracket rs \rrbracket_r$
+\end{center}
+one cannot imply
+\begin{center}
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r
\leq
\llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $.
+\end{center}
+What we can imply is that
+\begin{center}
+ $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r
+ \leq
+ \llbracket rs \rrbracket_r$
+\end{center}
+but this estimate is too rough and $\llbracket rs \rrbracket_r$ is unbounded.
+The way we
+get through this is by first proving a more general lemma
+(so that the inductive case goes through):
+\begin{lemma}\label{fltsSizeReductionAlts}
+ If we have three accumulator sets:
+ $noalts\_set$, $alts\_set$ and $corr\_set$,
+ satisfying:
+ \begin{itemize}
+ \item
+ $\forall r \in noalts\_set. \; \nexists xs.\; r = \sum xs$
+ \item
+ $\forall r \in alts\_set. \; \exists xs. \; r = \sum xs
+ \; \textit{and} \; set \; xs \subseteq corr\_set$
+ \end{itemize}
+ then we have that
+ \begin{center}
+ \begin{tabular}{lcl}
+ $\llbracket (\textit{rdistinct} \; (\textit{rflts} \; as) \;
+ (noalts\_set \cup corr\_set)) \rrbracket_r$ & $\leq$ &\\
+ $\llbracket (\textit{rdistinct} \; as \; (noalts\_set \cup alts\_set \cup
+ \{ \ZERO \} )) \rrbracket_r$ & & \\
+ \end{tabular}
+ \end{center}
+ holds.
\end{lemma}
\noindent
+We need to split the accumulator into two parts: the part
+which contains alternative regular expressions ($alts\_set$), and
+the part without any of them($noalts\_set$).
+The set $corr\_set$ is the corresponding set
+of $alts\_set$ with all elements under the $\sum$ constructor
+spilled out.
+\begin{proof}
+ By induction on the list $as$. We make use of lemma \ref{rdistinctConcat}.
+\end{proof}
+By setting all three sets to the empty set, one gets the desired size estimate:
+\begin{corollary}\label{interactionFltsDB}
+ $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r
+ \leq
+ \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $.
+\end{corollary}
+\begin{proof}
+ By using the lemma \ref{fltsSizeReductionAlts}.
+\end{proof}
+\noindent
The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of
duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$.
@@ -1778,13 +1907,10 @@
By using \ref{interactionFltsDB}.
\end{proof}
\noindent
-which says that the size of regular expression
-is always smaller if we apply the full simplification
-rather than just one component ($\rdistinct{}{}$).
-
-
-Now we are ready to control the sizes of
-$r_1 \cdot r_2 \backslash s$, $r^* \backslash s$.
+This is a key lemma in establishing the bounds on all the
+closed forms.
+With this we are now ready to control the sizes of
+$(r_1 \cdot r_2 )\backslash s$, $r^* \backslash s$.
\begin{theorem}
For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$
\end{theorem}
@@ -1893,6 +2019,10 @@
\end{proof}
\noindent
+
+
+
+
%-----------------------------------
% SECTION 2
%-----------------------------------
@@ -2062,53 +2192,54 @@
%----------------------------------------------------------------------------------------
% SECTION ALTS CLOSED FORM
%----------------------------------------------------------------------------------------
-\section{A Closed Form for \textit{ALTS}}
-Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
-
-
-There are a few key steps, one of these steps is
-
-
-
-One might want to prove this by something a simple statement like:
-
-For this to hold we want the $\textit{distinct}$ function to pick up
-the elements before and after derivatives correctly:
-$r \in rset \equiv (rder x r) \in (rder x rset)$.
-which essentially requires that the function $\backslash$ is an injective mapping.
-
-Unfortunately the function $\backslash c$ is not an injective mapping.
+%\section{A Closed Form for \textit{ALTS}}
+%Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
+%
+%
+%There are a few key steps, one of these steps is
+%
+%
+%
+%One might want to prove this by something a simple statement like:
+%
+%For this to hold we want the $\textit{distinct}$ function to pick up
+%the elements before and after derivatives correctly:
+%$r \in rset \equiv (rder x r) \in (rder x rset)$.
+%which essentially requires that the function $\backslash$ is an injective mapping.
+%
+%Unfortunately the function $\backslash c$ is not an injective mapping.
+%
+%\subsection{function $\backslash c$ is not injective (1-to-1)}
+%\begin{center}
+% The derivative $w.r.t$ character $c$ is not one-to-one.
+% Formally,
+% $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
+%\end{center}
+%This property is trivially true for the
+%character regex example:
+%\begin{center}
+% $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
+%\end{center}
+%But apart from the cases where the derivative
+%output is $\ZERO$, are there non-trivial results
+%of derivatives which contain strings?
+%The answer is yes.
+%For example,
+%\begin{center}
+% Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
+% where $a$ is not nullable.\\
+% $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
+% $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
+%\end{center}
+%We start with two syntactically different regular expressions,
+%and end up with the same derivative result.
+%This is not surprising as we have such
+%equality as below in the style of Arden's lemma:\\
+%\begin{center}
+% $L(A^*B) = L(A\cdot A^* \cdot B + B)$
+%\end{center}
-\subsection{function $\backslash c$ is not injective (1-to-1)}
-\begin{center}
- The derivative $w.r.t$ character $c$ is not one-to-one.
- Formally,
- $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
-\end{center}
-This property is trivially true for the
-character regex example:
-\begin{center}
- $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
-\end{center}
-But apart from the cases where the derivative
-output is $\ZERO$, are there non-trivial results
-of derivatives which contain strings?
-The answer is yes.
-For example,
-\begin{center}
- Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
- where $a$ is not nullable.\\
- $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
- $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
-\end{center}
-We start with two syntactically different regular expressions,
-and end up with the same derivative result.
-This is not surprising as we have such
-equality as below in the style of Arden's lemma:\\
-\begin{center}
- $L(A^*B) = L(A\cdot A^* \cdot B + B)$
-\end{center}
-
+\section{Further Improvements to the Bound}
There are two problems with this finiteness result, though.
\begin{itemize}
\item