diff -r 37b6fd310a16 -r 61139fdddae0 ChengsongTanPhdThesis/Chapters/Finite.tex --- a/ChengsongTanPhdThesis/Chapters/Finite.tex Sat Oct 01 12:06:46 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Mon Oct 03 02:08:49 2022 +0100 @@ -402,7 +402,8 @@ \end{center} \noindent We do not define an r-regular expression version of $\blexersimp$, -as our proof does not involve its use. +as our proof does not involve its use +(and there is no bitcode to decode into a lexical value). Everything about the size of annotated regular expressions can be calculated via the size of r-regular expressions: \begin{lemma}\label{sizeRelations} @@ -433,7 +434,7 @@ \end{center} Unless stated otherwise in the rest of this chapter all regular expressions without -bitcodes are seen as $\rrexp$s. +bitcodes are seen as r-regular expressions ($\rrexp$s). For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$, we use the notation $r_1 + r_2$ for brevity. @@ -476,10 +477,10 @@ %\end{quote} %\noindent %We explain in detail how we reached those claims. -\subsection{Basic Properties needed for Closed Forms} +\subsection{The Idea Behind Closed Forms} If we attempt to prove \begin{center} - $\forall r. \; \exists N_r.\;\; s.t. \llbracket r\backslash_{bsimps} s \rrbracket_r \leq N_r$ + $\forall r. \; \exists N_r.\;\; s.t. \llbracket r\backslash_{rsimps} s \rrbracket_r \leq N_r$ \end{center} using a naive induction on the structure of $r$, then we are stuck at the inductive cases such as @@ -487,51 +488,89 @@ The inductive hypotheses are: \begin{center} 1: $\text{for } r_1, \text{there exists } N_{r_1}.\;\; s.t. - \;\;\forall s. \llbracket r_1 \backslash_{bsimps} s \rrbracket_r \leq N_{r_1}. $\\ + \;\;\forall s. \llbracket r_1 \backslash_{rsimps} s \rrbracket_r \leq N_{r_1}. $\\ 2: $\text{for } r_2, \text{there exists } N_{r_2}.\;\; s.t. - \;\; \forall s. \llbracket r_2 \backslash_{bsimps} s \rrbracket_r \leq N_{r_2}. $ + \;\; \forall s. \llbracket r_2 \backslash_{rsimps} s \rrbracket_r \leq N_{r_2}. $ \end{center} The inductive step to prove would be \begin{center} $\text{there exists } N_{r_1\cdot r_2}. \;\; s.t. \forall s. - \llbracket (r_1 \cdot r_2) \backslash_{bsimps} s \rrbracket_r \leq N_{r_1\cdot r_2}.$ + \llbracket (r_1 \cdot r_2) \backslash_{rsimps} s \rrbracket_r \leq N_{r_1\cdot r_2}.$ \end{center} The problem is that it is not clear what -$(r_1\cdot r_2) \backslash_{bsimps} s$ looks like, +$(r_1\cdot r_2) \backslash_{rsimps} s$ looks like, and therefore $N_{r_1}$ and $N_{r_2}$ in the inductive hypotheses cannot be directly used. +We have already seen that $(r_1 \cdot r_2)\backslash s$ +and $(r^*)\backslash s$ can grow in a wild way. +The point is that they will be equivalent to a list of +terms $\sum rs$, where each term in $rs$ will +be made of $r_1 \backslash s' $, $r_2\backslash s'$, +and $r \backslash s'$ with $s' \in \textit{SubString} \; s$. +The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$ +in the simplification which saves $rs$ from growing indefinitely. + +Based on this idea, we sketch a proof by first showing the equality (where +$f$ and $g$ are functions that do not increase the size of the input) +\begin{center} +$(r_1 \cdot r_2)\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$, +\end{center} +and then show the right-hand-side can be finitely bounded. +We call the right-hand-side the +\emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$. +We will flesh out the proof sketch in the next section. + +\section{Details of Closed Forms and Bounds} +In this section we introduce in detail +how the closed forms are obtained for regular expressions' +derivatives and how they are bounded. +We start by proving some basic identities +involving the simplification functions for r-regular expressions. +After that we use these identities to establish the +closed forms we need. +Finally, we prove the functions such as $\flts$ +will keep the size non-increasing. +Putting this together with a general bound +on the finiteness of distinct regular expressions +smaller than a certain size, we obtain a bound on +the closed forms. +%$r_1\cdot r_2$, $r^*$ and $\sum rs$. -The next steps involves getting a closed form for -$\rderssimp{r}{s}$ and then obtaining -an estimate for the closed form. + +\subsection{Some Basic Identities} + + \subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully} The $\textit{rdistinct}$ function, as its name suggests, will -remove duplicates in an \emph{r}$\textit{rexp}$ list, -according to the accumulator -and leave only one of each different element in a list: +remove duplicates in an r-regular expression list. +It will also correctly exclude any elements that +is intially in the accumulator set. \begin{lemma}\label{rdistinctDoesTheJob} - The function $\textit{rdistinct}$ satisfies the following - properties: + %The function $\textit{rdistinct}$ satisfies the following + %properties: + Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its + recursive definition here, its Isabelle counterpart would be $\textit{distinct}$.} + readily defined + for testing + whether a list's elements are all unique. Then the following + properties about $\textit{rdistinct}$ hold: \begin{itemize} \item If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$. \item - If list $rs'$ is the result of $\rdistinct{rs}{acc}$, - then $\textit{isDistinct} \; rs'$. + %If list $rs'$ is the result of $\rdistinct{rs}{acc}$, + $\textit{isDistinct} \;\;\; (\rdistinct{rs}{acc})$. \item - $\rdistinct{rs}{acc} = rs - acc$ + $\textit{set} \; (\rdistinct{rs}{acc}) + = (textit{set} \; rs) - acc$ \end{itemize} \end{lemma} \noindent -The predicate $\textit{isDistinct}$ is for testing -whether a list's elements are all unique. It is defined -recursively on the structure of a regular expression, -and we omit the precise definition here. \begin{proof} The first part is by an induction on $rs$. The second and third part can be proven by using the - induction rules of $\rdistinct{\_}{\_}$. + inductive cases of $\textit{rdistinct}$. \end{proof} @@ -541,7 +580,7 @@ list $rs$ whose elements are all from the accumulator, and then call $\rdistinct{\_}{\_}$ on the resulting list, the output will be as if we had called $\rdistinct{\_}{\_}$ without the prepending of $rs$: -\begin{lemma} +\begin{lemma}\label{rdistinctConcat} The elements appearing in the accumulator will always be removed. More precisely, \begin{itemize} @@ -549,13 +588,13 @@ If $rs \subseteq rset$, then $\rdistinct{rs@rsa }{acc} = \rdistinct{rsa }{acc}$. \item - Furthermore, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$, + More generally, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$, then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{rs'}{rset}$ \end{itemize} \end{lemma} \begin{proof} - By induction on $rs$. + By induction on $rs$ and using \ref{rdistinctDoesTheJob}. \end{proof} \noindent On the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated, @@ -964,9 +1003,10 @@ By induction on the lists involved. \end{proof} \noindent -Similarly, -we introduce the equality for $\sum$ when certain child regular expressions -are $\sum$ themselves: +The above allows us to prove +two similar equalities (which are a bit more involved). +It says that we could flatten out the elements +before simplification and still get the same result. \begin{lemma}\label{simpFlatten3} One can flatten the inside $\sum$ of a $\sum$ if it is being simplified. Concretely, @@ -1279,7 +1319,11 @@ Before we go on to obtain them, some preliminary definitions are needed to make proof statements concise. -\section{"Closed Forms" of Sequence Regular Expressions} + + + +\subsection{Closed Forms} +\subsubsection{Closed Form for Sequence Regular Expressions} The problem of obataining a closed-form for sequence regular expression is constructing $(r_1 \cdot r_2) \backslash_r s$ if we are only allowed to use a combination of $r_1 \backslash s''$ @@ -1466,7 +1510,7 @@ \end{center} \end{corollary} \noindent -\subsection{Closed Forms for Star Regular Expressions} +\subsubsection{Closed Forms for Star Regular Expressions} We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then the property of the $\distinct$ function. @@ -1683,7 +1727,13 @@ The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, and \ref{hfauRsimpeq2} are used. \end{proof} -\section{Estimating the Closed Forms' sizes} + + + + + + +\subsection{Estimating the Closed Forms' sizes} We now summarize the closed forms below: \begin{itemize} \item @@ -1758,15 +1808,94 @@ is bounded by $N$. We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$. -We show how $\rdistinct$ and $\rflts$ -in the simplification function together is at least as -good as $\rdistinct{}{}$ alone. -\begin{lemma}\label{interactionFltsDB} +We show that $\rdistinct$ and $\rflts$ +working together is at least as +good as $\rdistinct{}{}$ alone, which can be written as +\begin{center} + $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r + \leq + \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. +\end{center} +We need this so that we know the outcome of our real +simplification is better than or equal to a rough estimate, +and therefore can be bounded by that estimate. +This is a bit harder to establish compared with proving +$\textit{flts}$ does not make a list larger (which can +be proven using routine induction): +\begin{center} + $\llbracket \textit{rflts}\; rs \rrbracket_r \leq + \llbracket \textit{rs} \rrbracket_r$ +\end{center} +We cannot simply prove how each helper function +reduces the size and then put them together: +From +\begin{center} +$\llbracket \textit{rflts}\; rs \rrbracket_r \leq + \llbracket \; \textit{rs} \rrbracket_r$ +\end{center} +and +\begin{center} + $\llbracket \textit{rdistinct} \; rs \; \varnothing \leq + \llbracket rs \rrbracket_r$ +\end{center} +one cannot imply +\begin{center} $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r \leq \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. +\end{center} +What we can imply is that +\begin{center} + $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r + \leq + \llbracket rs \rrbracket_r$ +\end{center} +but this estimate is too rough and $\llbracket rs \rrbracket_r$ is unbounded. +The way we +get through this is by first proving a more general lemma +(so that the inductive case goes through): +\begin{lemma}\label{fltsSizeReductionAlts} + If we have three accumulator sets: + $noalts\_set$, $alts\_set$ and $corr\_set$, + satisfying: + \begin{itemize} + \item + $\forall r \in noalts\_set. \; \nexists xs.\; r = \sum xs$ + \item + $\forall r \in alts\_set. \; \exists xs. \; r = \sum xs + \; \textit{and} \; set \; xs \subseteq corr\_set$ + \end{itemize} + then we have that + \begin{center} + \begin{tabular}{lcl} + $\llbracket (\textit{rdistinct} \; (\textit{rflts} \; as) \; + (noalts\_set \cup corr\_set)) \rrbracket_r$ & $\leq$ &\\ + $\llbracket (\textit{rdistinct} \; as \; (noalts\_set \cup alts\_set \cup + \{ \ZERO \} )) \rrbracket_r$ & & \\ + \end{tabular} + \end{center} + holds. \end{lemma} \noindent +We need to split the accumulator into two parts: the part +which contains alternative regular expressions ($alts\_set$), and +the part without any of them($noalts\_set$). +The set $corr\_set$ is the corresponding set +of $alts\_set$ with all elements under the $\sum$ constructor +spilled out. +\begin{proof} + By induction on the list $as$. We make use of lemma \ref{rdistinctConcat}. +\end{proof} +By setting all three sets to the empty set, one gets the desired size estimate: +\begin{corollary}\label{interactionFltsDB} + $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r + \leq + \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. +\end{corollary} +\begin{proof} + By using the lemma \ref{fltsSizeReductionAlts}. +\end{proof} +\noindent The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. @@ -1778,13 +1907,10 @@ By using \ref{interactionFltsDB}. \end{proof} \noindent -which says that the size of regular expression -is always smaller if we apply the full simplification -rather than just one component ($\rdistinct{}{}$). - - -Now we are ready to control the sizes of -$r_1 \cdot r_2 \backslash s$, $r^* \backslash s$. +This is a key lemma in establishing the bounds on all the +closed forms. +With this we are now ready to control the sizes of +$(r_1 \cdot r_2 )\backslash s$, $r^* \backslash s$. \begin{theorem} For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$ \end{theorem} @@ -1893,6 +2019,10 @@ \end{proof} \noindent + + + + %----------------------------------- % SECTION 2 %----------------------------------- @@ -2062,53 +2192,54 @@ %---------------------------------------------------------------------------------------- % SECTION ALTS CLOSED FORM %---------------------------------------------------------------------------------------- -\section{A Closed Form for \textit{ALTS}} -Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. - - -There are a few key steps, one of these steps is - - - -One might want to prove this by something a simple statement like: - -For this to hold we want the $\textit{distinct}$ function to pick up -the elements before and after derivatives correctly: -$r \in rset \equiv (rder x r) \in (rder x rset)$. -which essentially requires that the function $\backslash$ is an injective mapping. - -Unfortunately the function $\backslash c$ is not an injective mapping. +%\section{A Closed Form for \textit{ALTS}} +%Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. +% +% +%There are a few key steps, one of these steps is +% +% +% +%One might want to prove this by something a simple statement like: +% +%For this to hold we want the $\textit{distinct}$ function to pick up +%the elements before and after derivatives correctly: +%$r \in rset \equiv (rder x r) \in (rder x rset)$. +%which essentially requires that the function $\backslash$ is an injective mapping. +% +%Unfortunately the function $\backslash c$ is not an injective mapping. +% +%\subsection{function $\backslash c$ is not injective (1-to-1)} +%\begin{center} +% The derivative $w.r.t$ character $c$ is not one-to-one. +% Formally, +% $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ +%\end{center} +%This property is trivially true for the +%character regex example: +%\begin{center} +% $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ +%\end{center} +%But apart from the cases where the derivative +%output is $\ZERO$, are there non-trivial results +%of derivatives which contain strings? +%The answer is yes. +%For example, +%\begin{center} +% Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ +% where $a$ is not nullable.\\ +% $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ +% $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ +%\end{center} +%We start with two syntactically different regular expressions, +%and end up with the same derivative result. +%This is not surprising as we have such +%equality as below in the style of Arden's lemma:\\ +%\begin{center} +% $L(A^*B) = L(A\cdot A^* \cdot B + B)$ +%\end{center} -\subsection{function $\backslash c$ is not injective (1-to-1)} -\begin{center} - The derivative $w.r.t$ character $c$ is not one-to-one. - Formally, - $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ -\end{center} -This property is trivially true for the -character regex example: -\begin{center} - $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ -\end{center} -But apart from the cases where the derivative -output is $\ZERO$, are there non-trivial results -of derivatives which contain strings? -The answer is yes. -For example, -\begin{center} - Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ - where $a$ is not nullable.\\ - $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ - $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ -\end{center} -We start with two syntactically different regular expressions, -and end up with the same derivative result. -This is not surprising as we have such -equality as below in the style of Arden's lemma:\\ -\begin{center} - $L(A^*B) = L(A\cdot A^* \cdot B + B)$ -\end{center} - +\section{Further Improvements to the Bound} There are two problems with this finiteness result, though. \begin{itemize} \item