lexing: comparison ChengsongTanPhdThesis/Chapters/Finite.tex

-:37b6fd310a16
+:61139fdddae0
 $r \backslash_{rsimps} [\,] $ & $\dn$ & $r$
 	\end{tabular}
 \end{center}
 \noindent
 We do not define an r-regular expression version of $\blexersimp$,
-as our proof does not involve its use.
+as our proof does not involve its use
+(and there is no bitcode to decode into a lexical value).
 Everything about the size of annotated regular expressions
 can be calculated via the size of r-regular expressions:
 \begin{lemma}\label{sizeRelations}
 	The following equalities hold:
 	\begin{itemize}
 	implies
 	$\quad \llbracket a \backslash_{bsimps} s \rrbracket \leq N_r$.
 \end{center}
 Unless stated otherwise in the rest of this
 chapter all regular expressions without
-bitcodes are seen as $\rrexp$s.
+bitcodes are seen as r-regular expressions ($\rrexp$s).
 For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$,
 we use the notation $r_1 + r_2$
 for brevity.
 %		\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$
 %	\end{center}
 %\end{quote}
 %\noindent
 %We explain in detail how we reached those claims.
-\subsection{Basic Properties needed for Closed Forms}
+\subsection{The Idea Behind Closed Forms}
 If we attempt to prove
 \begin{center}
-	$\forall r. \; \exists N_r.\;\; s.t. \llbracket r\backslash_{bsimps} s \rrbracket_r \leq N_r$
+	$\forall r. \; \exists N_r.\;\; s.t. \llbracket r\backslash_{rsimps} s \rrbracket_r \leq N_r$
 \end{center}
 using a naive induction on the structure of $r$,
 then we are stuck at the inductive cases such as
 $r_1\cdot r_2$.
 The inductive hypotheses are:
 \begin{center}
 	1: $\text{for } r_1, \text{there exists } N_{r_1}.\;\; s.t.
-	\;\;\forall s.  \llbracket r_1 \backslash_{bsimps} s \rrbracket_r \leq N_{r_1}. $\\
+	\;\;\forall s.  \llbracket r_1 \backslash_{rsimps} s \rrbracket_r \leq N_{r_1}. $\\
 	2: $\text{for } r_2, \text{there exists } N_{r_2}.\;\; s.t.
-	\;\; \forall s. \llbracket r_2 \backslash_{bsimps} s \rrbracket_r \leq N_{r_2}. $
+	\;\; \forall s. \llbracket r_2 \backslash_{rsimps} s \rrbracket_r \leq N_{r_2}. $
 \end{center}
 The inductive step to prove would be
 \begin{center}
 	$\text{there exists } N_{r_1\cdot r_2}. \;\; s.t. \forall s.
-	\llbracket (r_1 \cdot r_2) \backslash_{bsimps} s \rrbracket_r \leq N_{r_1\cdot r_2}.$
+	\llbracket (r_1 \cdot r_2) \backslash_{rsimps} s \rrbracket_r \leq N_{r_1\cdot r_2}.$
 \end{center}
 The problem is that it is not clear what
-$(r_1\cdot r_2) \backslash_{bsimps} s$ looks like,
+$(r_1\cdot r_2) \backslash_{rsimps} s$ looks like,
 and therefore $N_{r_1}$ and $N_{r_2}$ in the
 inductive hypotheses cannot be directly used.
+We have already seen that $(r_1 \cdot r_2)\backslash s$
+and $(r^*)\backslash s$ can grow in a wild way.
-The next steps involves getting a closed form for
+The point is that they will be equivalent to a list of
-$\rderssimp{r}{s}$ and then obtaining
+terms $\sum rs$, where each term in $rs$ will
-an estimate for the closed form.
+be made of $r_1 \backslash s' $, $r_2\backslash s'$,
+and $r \backslash s'$ with $s' \in \textit{SubString} \; s$.
+The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$
+in the simplification which saves $rs$ from growing indefinitely.
+Based on this idea, we sketch a proof by first showing the equality (where
+$f$ and $g$ are functions that do not increase the size of the input)
+\begin{center}
+$(r_1 \cdot r_2)\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$,
+\end{center}
+and then showing that the right-hand side can be finitely bounded.
+We call the right-hand-side the
+\emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$.
+We will flesh out the proof sketch in the next section.
+\section{Details of Closed Forms and Bounds}
+In this section we introduce in detail
+how the closed forms are obtained for regular expressions'
+derivatives and how they are bounded.
+We start by proving some basic identities
+involving the simplification functions for r-regular expressions.
+After that we use these identities to establish the
+closed forms we need.
+Finally, we prove that functions such as $\flts$
+do not increase the size.
+Putting this together with a general bound
+on the number of distinct regular expressions
+smaller than a certain size, we obtain a bound on
+the closed forms.
+%$r_1\cdot r_2$, $r^*$ and $\sum rs$.
+\subsection{Some Basic Identities}
 \subsubsection{$\textit{rdistinct}$ Deduplicates Successfully}
 The $\textit{rdistinct}$ function, as its name suggests, will
-remove duplicates in an \emph{r}$\textit{rexp}$ list,
+remove duplicates in an r-regular expression list.
-according to the accumulator
+It will also correctly exclude any elements that
-and leave only one of each different element in a list:
+are initially in the accumulator set.
 \begin{lemma}\label{rdistinctDoesTheJob}
-	The function $\textit{rdistinct}$ satisfies the following
+	%The function $\textit{rdistinct}$ satisfies the following
-	properties:
+	%properties:
+	Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its
+	recursive definition here; its Isabelle counterpart is $\textit{distinct}$.}
+	readily defined
+	for testing
+	whether a list's elements are all unique. Then the following
+	properties about $\textit{rdistinct}$ hold:
 	\begin{itemize}
 		\item
 			If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$.
 		\item
-			If list $rs'$ is the result of $\rdistinct{rs}{acc}$,
+			%If list $rs'$ is the result of $\rdistinct{rs}{acc}$,
-			then $\textit{isDistinct} \; rs'$.
+			$\textit{isDistinct} \;\;\; (\rdistinct{rs}{acc})$.
 		\item
-			$\rdistinct{rs}{acc} = rs - acc$
+			$\textit{set} \; (\rdistinct{rs}{acc})
+			= (\textit{set} \; rs) - acc$
 	\end{itemize}
 \end{lemma}
 \noindent
-The predicate $\textit{isDistinct}$ is for testing
-whether a list's elements are all unique. It is defined
-recursively on the structure of a regular expression,
-and we omit the precise definition here.
 \begin{proof}
 	The first part is by an induction on $rs$.
 	The second and third part can be proven by using the
-	induction rules of $\rdistinct{\_}{\_}$.
+	inductive cases of $\textit{rdistinct}$.
 \end{proof}
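The three properties of lemma \ref{rdistinctDoesTheJob} can be checked on a small example. The following is only a minimal Python sketch, not the Isabelle definition: it assumes that $\textit{rdistinct}$ walks the list from left to right, skips every element already in the accumulator, and adds each kept element to the accumulator. Regular expressions are modelled by arbitrary hashable values (plain strings here), which is an assumption made for illustration.

```python
# Hypothetical model of rdistinct: elements are any hashable values
# standing in for r-regular expressions (an assumption for illustration).
def rdistinct(rs, acc):
    """Skip elements already in acc; keep only the first occurrence of the rest."""
    seen = set(acc)          # copy, so the caller's accumulator is not mutated
    out = []
    for r in rs:
        if r not in seen:
            out.append(r)
            seen.add(r)
    return out


rs = ["a", "b", "a", "c", "b", "d"]
acc = {"b"}
result = rdistinct(rs, acc)

# Property 1: no element of the accumulator survives.
assert all(a not in result for a in acc)
# Property 2: the output contains no duplicates (isDistinct).
assert len(result) == len(set(result))
# Property 3: set (rdistinct rs acc) = (set rs) - acc.
assert set(result) == set(rs) - acc
```

The sketch only tests one input and is no substitute for the inductive proof above.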
 \noindent
 Since $\rdistinct{\_}{\_}$ cancels out all regular expression terms
 that are in the accumulator, prepending to a list $rs_a$ an arbitrary
 list $rs$ whose elements are all from the accumulator, and then calling $\rdistinct{\_}{\_}$
 on the resulting list, gives the same output as calling $\rdistinct{\_}{\_}$
 without the prepending of $rs$:
-\begin{lemma}
+\begin{lemma}\label{rdistinctConcat}
 	The elements appearing in the accumulator will always be removed.
 	More precisely,
 	\begin{itemize}
 		\item
 			If $rs \subseteq rset$, then
 			$\rdistinct{rs@rsa }{rset} = \rdistinct{rsa }{rset}$.
 		\item
-			Furthermore, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$,
+			More generally, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$,
 			then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{rs'}{rset}$
 	\end{itemize}
 \end{lemma}
 \begin{proof}
-	By induction on $rs$.
+	By induction on $rs$ and using lemma \ref{rdistinctDoesTheJob}.
 \end{proof}
 \noindent
 On the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated,
 then expanding the accumulator to include that element will not cause the output list to change:
 \begin{lemma}
 \end{lemma}
 \begin{proof}
 	By induction on the lists involved.
 \end{proof}
 \noindent
-Similarly,
+The above allows us to prove
-we introduce the equality for $\sum$ when certain child regular expressions
+two similar equalities (which are a bit more involved).
-are $\sum$ themselves:
+They say that we can flatten out the elements
+before simplification and still get the same result.
 \begin{lemma}\label{simpFlatten3}
 	One can flatten the inside $\sum$ of a $\sum$ if it is being
 	simplified. Concretely,
 	\begin{itemize}
 		\item
 \noindent
 The harder closed forms are the sequence and star ones.
 Before we go on to obtain them, some preliminary definitions
 are needed to make proof statements concise.
-\section{"Closed Forms" of Sequence Regular Expressions}
+\subsection{Closed Forms}
+\subsubsection{Closed Form for Sequence Regular Expressions}
 The problem of obtaining a closed form for sequence regular expressions
 is constructing $(r_1 \cdot r_2) \backslash_r s$
 if we are only allowed to use a combination of $r_1 \backslash s''$
 and  $r_2 \backslash s''$ , where $s''$ is from $s$.
 First let's look at a series of derivative steps on a sequence
 		:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))} \rrbracket_r$
 	\end{center}
 \end{corollary}
 \noindent
-\subsection{Closed Forms for Star Regular Expressions}
+\subsubsection{Closed Forms for Star Regular Expressions}
 We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
 the ``closed form'' of $(r_1 \cdot r_2) \backslash s$ and then
 the property of the $\distinct$ function.
 Now we try to get a bound on $r^* \backslash s$ as well.
 Again, we first look at how a star's derivatives evolve, if they grow maximally:
 \begin{proof}
 	By an induction on $s$.
 	The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, and \ref{hfauRsimpeq2}
 	are used.
 \end{proof}
-\section{Estimating the Closed Forms' sizes}
+\subsection{Estimating the Closed Forms' Sizes}
 We now summarize the closed forms below:
 \begin{itemize}
 	\item
 		$\rderssimp{(\sum rs)}{s} \sequal
 		\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$
 is bounded by a constant $c_N$ depending only on $N$,
 provided that each of $rs'$'s element
 is bounded by $N$.
 We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$.
-We show how $\rdistinct$ and $\rflts$
+We show that $\rdistinct$ and $\rflts$
-in the simplification function together is at least as
+working together are at least as
-good as $\rdistinct{}{}$ alone.
+good as $\rdistinct{}{}$ alone, which can be written as
-\begin{lemma}\label{interactionFltsDB}
+\begin{center}
 	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r
 	\leq
 	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
-\end{lemma}
+\end{center}
+We need this so that we know the outcome of our real
+simplification is better than or equal to a rough estimate,
+and therefore can be bounded by that estimate.
+This is a bit harder to establish compared with proving
+$\textit{flts}$ does not make a list larger (which can
+be proven using routine induction):
+\begin{center}
+	$\llbracket  \textit{rflts}\; rs \rrbracket_r \leq
+	\llbracket  \textit{rs} \rrbracket_r$
+\end{center}
+We cannot simply prove how each helper function
+reduces the size and then put them together:
+From
+\begin{center}
+$\llbracket  \textit{rflts}\; rs \rrbracket_r \leq
+	\llbracket \; \textit{rs} \rrbracket_r$
+\end{center}
+and
+\begin{center}
+$\llbracket  \textit{rdistinct} \; rs \; \varnothing \rrbracket_r \leq
+\llbracket rs \rrbracket_r$
+\end{center}
+one cannot infer
+\begin{center}
+	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r
+	\leq
+	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
+\end{center}
+What we can infer is that
+\begin{center}
+	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r
+	\leq
+	\llbracket rs \rrbracket_r$
+\end{center}
+but this estimate is too rough and $\llbracket rs \rrbracket_r$	is unbounded.
+We get around this
+by first proving a more general lemma
+(so that the inductive case goes through):
+\begin{lemma}\label{fltsSizeReductionAlts}
+	If we have three accumulator sets:
+	$noalts\_set$, $alts\_set$ and $corr\_set$,
+	satisfying:
+	\begin{itemize}
+		\item
+			$\forall r \in noalts\_set. \; \nexists xs.\; r = \sum  xs$
+		\item
+			$\forall r \in alts\_set. \; \exists xs. \; r = \sum xs
+			\; \textit{and} \; set \; xs \subseteq corr\_set$
+	\end{itemize}
+	then we have that
+	\begin{center}
+	\begin{tabular}{lcl}
+	$\llbracket  (\textit{rdistinct} \; (\textit{rflts} \; as) \;
+	(noalts\_set \cup corr\_set)) \rrbracket_r$ & $\leq$ &\\
+						    $\llbracket  (\textit{rdistinct} \; as \; (noalts\_set \cup alts\_set \cup
+	\{ \ZERO \} )) \rrbracket_r$ & & \\
+	\end{tabular}
+	\end{center}
+		holds.
+\end{lemma}
+\noindent
+We need to split the accumulator into two parts: the part
+which contains alternative regular expressions ($alts\_set$), and
+the part without any of them ($noalts\_set$).
+The set $corr\_set$ is the corresponding set
+of $alts\_set$ with all elements under the $\sum$ constructor
+spilled out.
+\begin{proof}
+	By induction on the list $as$. We make use of lemma \ref{rdistinctConcat}.
+\end{proof}
+By setting all three sets to the empty set, one gets the desired size estimate:
+\begin{corollary}\label{interactionFltsDB}
+	$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r
+	\leq
+	\llbracket \rdistinct{rs}{\varnothing}  \rrbracket_r  $.
+\end{corollary}
+\begin{proof}
+	By using lemma \ref{fltsSizeReductionAlts}.
+\end{proof}
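The following Python sketch illustrates, on one concrete input, both corollary \ref{interactionFltsDB} and the reason why the two separate size inequalities do not compose: deduplication is not monotone with respect to list size. Everything in it is an assumption made for illustration. Regular expressions are encoded as tuples, $\textit{rflts}$ is assumed to drop $\ZERO$ and to flatten one level of $\sum$, $\textit{rdistinct}$ is assumed to keep first occurrences outside the accumulator, and the size of a list is assumed to be the sum of the node counts of its elements.

```python
# Hypothetical encoding of r-regular expressions as tuples (illustration only):
#   ("ZERO",)   ("CHAR", c)   ("ALTS", (r1, ..., rn))
ZERO = ("ZERO",)


def char(c):
    return ("CHAR", c)


def alts(*rs):
    return ("ALTS", tuple(rs))


def rsize(r):
    """Node count: 1 for a leaf, 1 + sizes of the children for an alternative."""
    if r[0] == "ALTS":
        return 1 + sum(rsize(x) for x in r[1])
    return 1


def list_size(rs):
    """Assumed size of a list: the sum of the sizes of its elements."""
    return sum(rsize(r) for r in rs)


def rflts(rs):
    """Assumed behaviour: drop ZERO and splice in the children of each ALTS (one level)."""
    out = []
    for r in rs:
        if r == ZERO:
            continue
        if r[0] == "ALTS":
            out.extend(r[1])
        else:
            out.append(r)
    return out


def rdistinct(rs, acc):
    """Skip elements already in acc; keep only the first occurrence of the rest."""
    seen = set(acc)
    out = []
    for r in rs:
        if r not in seen:
            out.append(r)
            seen.add(r)
    return out


a, b = char("a"), char("b")
rs = [alts(a, b), a, alts(a, b), ZERO, b]

# The corollary on this input: flattening first never hurts.
lhs = list_size(rdistinct(rflts(rs), set()))   # [a, b]
rhs = list_size(rdistinct(rs, set()))          # [a + b, a, ZERO, b]
assert lhs <= rhs

# Why the two separate inequalities do not compose:
# a smaller list can have a larger deduplicated size.
xs, ys = [a, b], [a, a, a]
assert list_size(xs) <= list_size(ys)
assert list_size(rdistinct(xs, set())) > list_size(rdistinct(ys, set()))
```

The last two assertions show that knowing $\llbracket \textit{rflts}\; rs \rrbracket_r \leq \llbracket rs \rrbracket_r$ alone says nothing about the sizes after deduplication, which is why the generalised lemma with three accumulator sets is needed.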
 \noindent
 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least as many
 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$.
 Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$:
 \end{lemma}
 \begin{proof}
 	By using \ref{interactionFltsDB}.
 \end{proof}
 \noindent
-which says that the size of regular expression
+This is a key lemma in establishing the bounds on all the
-is always smaller if we apply the full simplification
+closed forms.
-rather than just one component ($\rdistinct{}{}$).
+With this we are now ready to control the sizes of
+$(r_1 \cdot r_2 )\backslash s$, $r^* \backslash s$.
-Now we are ready to control the sizes of
-$r_1 \cdot r_2 \backslash s$, $r^* \backslash s$.
 \begin{theorem}
 	For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$
 \end{theorem}
 \noindent
 \begin{proof}
 \end{corollary}
 \begin{proof}
 	By \ref{sizeRelations}.
 \end{proof}
 \noindent
 %-----------------------------------
 %	SECTION 2
 %-----------------------------------
 %----------------------------------------------------------------------------------------
 %	SECTION ALTS CLOSED FORM
 %----------------------------------------------------------------------------------------
-\section{A Closed Form for \textit{ALTS}}
+%\section{A Closed Form for \textit{ALTS}}
-Now we prove that  $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
+%Now we prove that  $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
+%
+%
-There are a few key steps, one of these steps is
+%There are a few key steps, one of these steps is
+%
+%
+%
-One might want to prove this by something a simple statement like:
+%One might want to prove this by something a simple statement like:
+%
-For this to hold we want the $\textit{distinct}$ function to pick up
+%For this to hold we want the $\textit{distinct}$ function to pick up
-the elements before and after derivatives correctly:
+%the elements before and after derivatives correctly:
-$r \in rset \equiv (rder x r) \in (rder x rset)$.
+%$r \in rset \equiv (rder x r) \in (rder x rset)$.
-which essentially requires that the function $\backslash$ is an injective mapping.
+%which essentially requires that the function $\backslash$ is an injective mapping.
+%
-Unfortunately the function $\backslash c$ is not an injective mapping.
+%Unfortunately the function $\backslash c$ is not an injective mapping.
+%
-\subsection{function $\backslash c$ is not injective (1-to-1)}
+%\subsection{function $\backslash c$ is not injective (1-to-1)}
-\begin{center}
+%\begin{center}
-	The derivative $w.r.t$ character $c$ is not one-to-one.
+%	The derivative $w.r.t$ character $c$ is not one-to-one.
-	Formally,
+%	Formally,
-	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
+%	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
-\end{center}
+%\end{center}
-This property is trivially true for the
+%This property is trivially true for the
-character regex example:
+%character regex example:
-\begin{center}
+%\begin{center}
-	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
+%	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
-\end{center}
+%\end{center}
-But apart from the cases where the derivative
+%But apart from the cases where the derivative
-output is $\ZERO$, are there non-trivial results
+%output is $\ZERO$, are there non-trivial results
-of derivatives which contain strings?
+%of derivatives which contain strings?
-The answer is yes.
+%The answer is yes.
-For example,
+%For example,
-\begin{center}
+%\begin{center}
-	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
+%	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
-	where $a$ is not nullable.\\
+%	where $a$ is not nullable.\\
-	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
+%	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
-	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
+%	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
-\end{center}
+%\end{center}
-We start with two syntactically different regular expressions,
+%We start with two syntactically different regular expressions,
-and end up with the same derivative result.
+%and end up with the same derivative result.
-This is not surprising as we have such
+%This is not surprising as we have such
-equality as below in the style of Arden's lemma:\\
+%equality as below in the style of Arden's lemma:\\
-\begin{center}
+%\begin{center}
-	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
+%	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
-\end{center}
+%\end{center}
+\section{Further Improvements to the Bound}
 There are two problems with this finiteness result, though.
 \begin{itemize}
 	\item
 		First, it is not yet a direct formalisation of our lexer's complexity,
 		as a complexity proof would require looking into

changeset 609	61139fdddae0
parent 601	ce4e5151a836
child 610	d028c662a3df