lexing: comparison ChengsongTanPhdThesis/Chapters/Finite.tex

equal deleted inserted replaced

-:8ffa28fce271
+:b797c9a709d9
 		The bound is universal for a given regular expression,
 		which is an advantage over work which
 		only gives empirical evidence on
 		some test cases (see Verbatim++ work\cite{Verbatimpp}).
 \end{itemize}
+\noindent
+We then extend our $\blexersimp$
+to support bounded repetitions ($r^{\{n\}}$).
+We update our formalisation of
+the correctness and finiteness properties to
+include this new construct.
+we can out-compete other verified lexers such as
+Verbatim++ on bounded regular expressions.
 In the next section we describe in more detail
 what the finite bound means in our algorithm
 and why the size of the internal data structures of
 a typical derivative-based lexer such as
 Sulzmann and Lu's need formal treatment.
 \section{Formalising About Size}
 \noindent
 In our lexer ($\blexersimp$),
 we take an annotated regular expression as input,
 %-----------------------------------
 %	SECTION 2
 %-----------------------------------
+\section{Bounded Repetitions}
+We have promised in chapter \ref{Introduction}
+that our lexing algorithm can potentially be extended
+to handle bounded repetitions
+in natural and elegant ways.
+Now we fulfill our promise by adding support for
+the ``exactly-$n$-times'' bounded regular expression $r^{\{n\}}$.
+We add clauses in our derivatives-based lexing algorithms (with simplifications)
+introduced in chapter \ref{Bitcoded2}.
+\subsection{Augmented Definitions}
+There are a number of definitions that need to be augmented.
+The most notable one would be the POSIX rules for $r^{\{n\}}$:
+\begin{center}
+	\begin{mathpar}
+		\inferrule{\forall v \in vs_1. \vdash v:r \land
+		|v| \neq []\\ \forall v \in vs_2. \vdash v:r \land |v| = []\\
+		\textit{length} \; (vs_1 @ vs_2) = n}{\textit{Stars} \;
+		(vs_1 @ vs_2) : r^{\{n\}} }
+	\end{mathpar}
+\end{center}
+As Ausaf had pointed out \cite{Ausaf},
+sometimes empty iterations have to be taken to get
+a match with exactly $n$ repetitions,
+and hence the $vs_2$ part.
+Another important definition would be the size:
+\begin{center}
+	\begin{tabular}{lcl}
+		$\llbracket r^{\{n\}} \rrbracket_r$ & $\dn$ &
+		$\llbracket r \rrbracket_r + n$\\
+	\end{tabular}
+\end{center}
+\noindent
+Arguably we should use $\log \; n$ for the size because
+the number of digits increase logarithmically w.r.t $n$.
+For simplicity we choose to add the counter directly to the size.
+The derivative w.r.t a bounded regular expression
+is given as
+\begin{center}
+	\begin{tabular}{lcl}
+		$r^{\{n\}} \backslash_r c$ & $\dn$ &
+		$r\backslash_r c \cdot r^{\{n-1\}} \;\; \textit{if} \; n \geq 1$\\
+					   & & $\RZERO \;\quad \quad\quad \quad
+					   \textit{otherwise}$\\
+	\end{tabular}
+\end{center}
+\noindent
+For brevity, we sometimes use NTIMES to refer to bounded
+regular expressions.
+The $\mkeps$ function clause for NTIMES would be
+\begin{center}
+	\begin{tabular}{lcl}
+		$\mkeps \; r^{\{n\}} $ & $\dn$ & $\Stars \;
+		(\textit{replicate} \; n\; (\mkeps \; r))$
+	\end{tabular}
+\end{center}
+\noindent
+The injection looks like
+\begin{center}
+	\begin{tabular}{lcl}
+		$\inj \; r^{\{n\}} \; c\; (\Seq \;v \; (\Stars \; vs)) $ &
+		$\dn$ & $\Stars \;
+		((\inj \; r \;c \;v ) :: vs)$
+	\end{tabular}
+\end{center}
+\noindent
+\subsection{Proofs for the Augmented Lexing Algorithm}
+We need to maintain two proofs with the additional $r^{\{n\}}$
+construct: the
+correctness proof in chapter \ref{Bitcoded2},
+and the finiteness proof in chapter \ref{Finite}.
+\subsubsection{Correctness Proof Augmentation}
+The correctness of $\textit{lexer}$ and $\textit{blexer}$ with bounded repetitions
+have been proven by Ausaf and Urban\cite{AusafDyckhoffUrban2016}.
+As they have commented, once the definitions are in place,
+the proofs given for the basic regular expressions will extend to
+bounded regular expressions, and there are no ``surprises''.
+We confirm this point because the correctness theorem would also
+extend without surprise to $\blexersimp$.
+The rewrite rules such as $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$ and so on
+do not need to be changed,
+and only a few lemmas such as lemma \ref{fltsPreserves} need to be adjusted to
+add one more line which can be solved by sledgehammer
+to solve the $r^{\{n\}}$ inductive case.
+\subsubsection{Finiteness Proof Augmentation}
+The bounded repetitions are
+very similar to stars, and therefore the treatment
+is similar, with minor changes to handle some slight complications
+when the counter reaches 0.
+The exponential growth is similar:
+\begin{center}
+	\begin{tabular}{ll}
+		$r^{\{n\}} $ & $\longrightarrow_{\backslash c}$\\
+		$(r\backslash c)  \cdot
+		r^{\{n - 1\}}*$ & $\longrightarrow_{\backslash c'}$\\
+		\\
+		$r \backslash cc'  \cdot r^{\{n - 2\}}* +
+		r \backslash c' \cdot r^{\{n - 1\}}*$ &
+		$\longrightarrow_{\backslash c''}$\\
+		\\
+		$(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* +
+		r \backslash c''\cdot r^{\{n-1\}}) +
+		(r \backslash c'c'' \cdot r^{\{n-2\}}* +
+		r \backslash c'' \cdot r^{\{n-1\}}*)$ &
+		$\longrightarrow_{\backslash c'''}$ \\
+		\\
+		$\ldots$\\
+	\end{tabular}
+\end{center}
+Again, we assume that $r\backslash c$, $r \backslash cc'$ and so on
+are all nullable.
+The flattened list of terms for $r^{\{n\}} \backslash_{rs} s$
+\begin{center}
+	$[r_1 \backslash cc'c'' \cdot r^{\{n-3\}}*,\;
+	r \backslash c''\cdot r^{\{n-1\}}, \;
+	r \backslash c'c'' \cdot r^{\{n-2\}}*, \;
+	r \backslash c'' \cdot r^{\{n-1\}}*,\; \ldots ]$
+\end{center}
+that comes from
+\begin{center}
+		$(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* +
+		r \backslash c''\cdot r^{\{n-1\}}) +
+		(r \backslash c'c'' \cdot r^{\{n-2\}}* +
+		r \backslash c'' \cdot r^{\{n-1\}}*)+ \ldots$
+\end{center}
+are made of sequences with different tails, where the counters
+might differ.
+The observation for maintaining the bound is that
+these counters never exceed $n$, the original
+counter. With the number of counters staying finite,
+$\rDistinct$ will deduplicate and keep the list finite.
+We introduce this idea as a lemma once we describe all
+the necessary helper functions.
+Similar to the star case, we want
+\begin{center}
+	$\rderssimp{r^{\{n\}}}{s} = \rsimp{\sum rs}$.
+\end{center}
+where $rs$
+shall be in the form of
+$\map \; f \; Ss$, where $f$ is a function and
+$Ss$ a list of objects to act on.
+For star, the object's datatype is string.
+The list of strings $Ss$
+is generated using functions
+$\starupdate$ and $\starupdates$.
+The function that takes a string and returns a regular expression
+is the anonymous function $
+(\lambda s'. \; r\backslash s' \cdot r^{\{m\}})$.
+In the NTIMES setting,
+the $\starupdate$ and $\starupdates$ functions are replaced by
+$\textit{nupdate}$ and $\textit{nupdates}$:
+\begin{center}
+	\begin{tabular}{lcl}
+		$\nupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
+		$\nupdate \; c \; r \;
+		(\Some \; (s, \; n + 1) \; :: \; Ss)$ & $\dn$ & %\\
+						     $\textit{if} \;
+						     (\rnullable \; (r \backslash_{rs} s))$ \\
+						     & & $\;\;\textit{then}
+						     \;\; \Some \; (s @ [c], n + 1) :: \Some \; ([c], n) :: (
+						     \nupdate \; c \; r \; Ss)$ \\
+						     & & $\textit{else} \;\; \Some \; (s @ [c], n+1) :: (
+						     \nupdate \; c \; r \; Ss)$\\
+		$\nupdate \; c \; r \; (\textit{None} :: Ss)$ & $\dn$ &
+		$(\None :: \nupdate  \; c \; r \; Ss)$\\
+							      & & \\
+	%\end{tabular}
+%\end{center}
+%\begin{center}
+	%\begin{tabular}{lcl}
+		$\nupdates \; [] \; r \; Ss$ & $\dn$ & $Ss$\\
+		$\nupdates \; (c :: cs) \; r \; Ss$ &  $\dn$ &  $\nupdates \; cs \; r \; (
+		\nupdate \; c \; r \; Ss)$
+	\end{tabular}
+\end{center}
+\noindent
+which take into account when a subterm
+\begin{center}
+	$r \backslash_s s \cdot r^{\{n\}}$
+\end{center}
+counter $n$
+is 0, and therefore expands to
+\begin{center}
+$r \backslash_s (s@[c]) \cdot r^{\{n\}} \;+
+\; \ZERO$
+\end{center}
+after taking a derivative.
+The object now has type
+\begin{center}
+$\textit{option} \;(\textit{string}, \textit{nat})$
+\end{center}
+and therefore the function for converting such an option into
+a regular expression term is called $\opterm$:
+\begin{center}
+	\begin{tabular}{lcl}
+	$\opterm \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
+				 & & $\;\;\Some \; (s, n) \Rightarrow
+				 (r\backslash_{rs} s)\cdot r^{\{n\}}$\\
+				 & & $\;\;\None  \Rightarrow
+				 \ZERO$\\
+	\end{tabular}
+\end{center}
+\noindent
+Put together, the list $\map \; f \; Ss$ is instantiated as
+\begin{center}
+	$\map \; (\opterm \; r) \; (\nupdates \; s \; r \;
+	[\Some \; ([c], n)])$.
+\end{center}
+For the closed form to be bounded, we would like
+simplification to be applied to each term in the list.
+Therefore we introduce some variants of $\opterm$,
+which help conveniently express the rewriting steps
+needed in the closed form proof.
+\begin{center}
+	\begin{tabular}{lcl}
+	$\optermOsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
+				 & & $\;\;\Some \; (s, n) \Rightarrow
+				 \textit{rsimp} \; ((r\backslash_{rs} s)\cdot r^{\{n\}})$\\
+				 & & $\;\;\None  \Rightarrow
+				 \ZERO$\\
+				 \\
+	$\optermosimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
+				 & & $\;\;\Some \; (s, n) \Rightarrow
+				 (\textit{rsimp} \; (r\backslash_{rs} s))
+				 \cdot r^{\{n\}}$\\
+				 & & $\;\;\None  \Rightarrow
+				 \ZERO$\\
+				 \\
+	$\optermsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
+				 & & $\;\;\Some \; (s, n) \Rightarrow
+				 (r\backslash_{rsimps} s)\cdot r^{\{n\}}$\\
+				 & & $\;\;\None  \Rightarrow
+				 \ZERO$\\
+	\end{tabular}
+\end{center}
+For a list of
+$\textit{option} \;(\textit{string}, \textit{nat})$ elements,
+we define the highest power for it recursively:
+\begin{center}
+	\begin{tabular}{lcl}
+		$\hpa \; [] \; n $ & $\dn$ & $n$\\
+		$\hpa \; (\None :: os) \; n $ &  $\dn$ &  $\hpa \; os \; n$\\
+		$\hpa \; (\Some \; (s, n) :: os) \; m$ & $\dn$ &
+		$\hpa \;os \; (\textit{max} \; n\; m)$\\
+		\\
+		$\hpower \; rs $ & $\dn$ & $\hpa \; rs \; 0$\\
+	\end{tabular}
+\end{center}
+Now the intuition that an NTIMES regular expression's power
+does not increase can be easily expressed as
+\begin{lemma}\label{nupdatesMono2}
+	$\hpower \; (\nupdates \;s \; r \; [\Some \; ([c], n)]) \leq n$
+\end{lemma}
+\begin{proof}
+	Note that the power is non-increasing after a $\nupdate$ application:
+	\begin{center}
+		$\hpa \;\; (\nupdate \; c \; r \; Ss)\;\; m \leq
+		 \hpa\; \; Ss \; m$.
+	 \end{center}
+	 This is also the case for $\nupdates$:
+	\begin{center}
+		$\hpa \;\; (\nupdates \; s \; r \; Ss)\;\; m \leq
+		 \hpa\; \; Ss \; m$.
+	 \end{center}
+	 Therefore we have that
+	 \begin{center}
+		 $\hpower \;\; (\nupdates \; s \; r \; Ss) \leq
+		  \hpower \;\; Ss$
+	 \end{center}
+	 which leads to the lemma being proven.
+\end{proof}
+We also define the inductive rules for
+the shape of derivatives of the NTIMES regular expressions:\\[-3em]
+\begin{center}
+	\begin{mathpar}
+		\inferrule{\mbox{}}{\cbn \;\ZERO}
+		\inferrule{\mbox{}}{\cbn \; \; r_a \cdot (r^{\{n\}})}
+		\inferrule{\cbn \; r_1 \;\; \; \cbn \; r_2}{\cbn \; r_1 + r_2}
+		\inferrule{\cbn \; r}{\cbn \; r + \ZERO}
+	\end{mathpar}
+\end{center}
+\noindent
+A derivative of NTIMES fits into the shape described by $\cbn$:
+\begin{lemma}\label{ntimesDersCbn}
+	$\cbn \; ((r' \cdot r^{\{n\}}) \backslash_{rs} s)$ holds.
+\end{lemma}
+\begin{proof}
+	By a reverse induction on $s$.
+	For the inductive case, note that if $\cbn \; r$ holds,
+	then $\cbn \; (r\backslash_r c)$ holds.
+\end{proof}
+\noindent
+In addition, for $\cbn$-shaped regular expressioins, one can flatten
+them:
+\begin{lemma}\label{ntimesHfauPushin}
+	If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} =
+	\textit{concat} \; (\map \; \hflataux{\map \; (\_\backslash_r c) \;
+	(\hflataux{r})})$
+\end{lemma}
+\begin{proof}
+	By an induction on the inductive cases of $\cbn$.
+\end{proof}
+\noindent
+This time we do not need to define the flattening functions for NTIMES only,
+because $\hflat{\_}$ and $\hflataux{\_}$ work on NTIMES already.
+\begin{lemma}\label{ntimesHfauInduct}
+$\hflataux{( (r\backslash_r c) \cdot r^{\{n\}}) \backslash_{rsimps} s} =
+\map \; (\opterm \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$
+\end{lemma}
+\begin{proof}
+	By a reverse induction on $s$.
+	The lemmas \ref{ntimesHfauPushin} and \ref{ntimesDersCbn} are used.
+\end{proof}
+\noindent
+We have a recursive property for NTIMES with $\nupdate$
+similar to that for STAR,
+and one for $\nupdates $ as well:
+\begin{lemma}\label{nupdateInduct1}
+	\mbox{}
+	\begin{itemize}
+		\item
+			\begin{center}
+	 $\textit{concat} \; (\map \; (\hflataux{\_} \circ (
+	\opterm \; r)) \; Ss) = \map \; (\opterm \; r) \; (\nupdate \;
+	c \; r \; Ss)$\\
+	\end{center}
+	holds.
+\item
+	\begin{center}
+	 $\textit{concat} \; (\map \; \hflataux{\_}\;
+	\map \; (\_\backslash_r x) \;
+		(\map \; (\opterm \; r) \; (\nupdates \; xs \; r \; Ss)))$\\
+		$=$\\
+	$\map \; (\opterm \; r) \; (\nupdates \;(xs@[x]) \; r\;Ss)$
+	\end{center}
+	holds.
+	\end{itemize}
+\end{lemma}
+\begin{proof}
+	(i) is by an induction on $Ss$.
+	(ii) is by an induction on $xs$.
+\end{proof}
+\noindent
+The $\nString$ predicate is defined for conveniently
+expressing that there are no empty strings in the
+$\Some \;(s, n)$ elements generated by $\nupdate$:
+\begin{center}
+	\begin{tabular}{lcl}
+		$\nString \; \None$  & $\dn$ & $ \textit{true}$\\
+		$\nString \; (\Some \; ([], n))$ & $\dn$ & $ \textit{false}$\\
+		$\nString \; (\Some \; (c::s, n))$  & $\dn$ & $ \textit{true}$\\
+	\end{tabular}
+\end{center}
+\begin{lemma}\label{nupdatesNonempty}
+	If for all elements $o \in \textit{set} \; Ss$,
+	$\nString \; o$ holds, the we have that
+	for all elements $o' \in \textit{set} \; (\nupdates \; s \; r \; Ss)$,
+	$\nString \; o'$ holds.
+\end{lemma}
+\begin{proof}
+	By an induction on $s$, where $Ss$ is set to vary over all possible values.
+\end{proof}
+\noindent
+\begin{lemma}\label{ntimesClosedFormsSteps}
+	The following list of equalities or rewriting relations hold:\\
+	(i) $r^{\{n+1\}} \backslash_{rsimps} (c::s) =
+	\textit{rsimp} \; (\sum (\map \; (\opterm \;r \;\_) \; (\nupdates \;
+	s \; r \; [\Some \; ([c], n)])))$\\
+	(ii)
+	\begin{center}
+	$\sum (\map \; (\opterm \; r) \; (\nupdates \; s \; r \; [
+	\Some \; ([c], n)]))$ \\ $ \sequal$\\
+	 $\sum (\map \; (\textit{rsimp} \circ (\opterm \; r))\; (\nupdates \;
+	 s\;r \; [\Some \; ([c], n)]))$\\
+	\end{center}
+	(iii)
+	\begin{center}
+	$\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \;
+	([c], n)]))$\\
+	$\sequal$\\
+	 $\sum \;(\map \; (\optermsimp r) \; (\nupdates \; s \; r \; [\Some \;
+	([c], n)])) $\\
+	\end{center}
+	(iv)
+	\begin{center}
+	$\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \;
+	([c], n)])) $ \\ $\sequal$\\
+	 $\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \;
+	([c], n)])) $\\
+	\end{center}
+	(v)
+	\begin{center}
+	 $\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \;
+	 ([c], n)])) $ \\ $\sequal$\\
+	  $\sum \; (\map \; (\textit{rsimp} \circ (\opterm \; r)) \;
+	  (\nupdates \; s \; r \; [\Some \; ([c], n)]))$
+	\end{center}
+\end{lemma}
+\begin{proof}
+	Routine.
+	(iii) and (iv) make use of the fact that all the strings $s$
+	inside $\Some \; (s, m)$ which are elements of the list
+	$\nupdates \; s\;r\;[\Some\; ([c], n)]$ are non-empty,
+	which is from lemma \ref{nupdatesNonempty}.
+	Once the string in $o = \Some \; (s, n)$ is
+	nonempty, $\optermsimp \; r \;o$,
+	$\optermosimp \; r \; o$ and $\optermosimp \; \; o$ are guaranteed
+	to be equal.
+	(v) uses \ref{nupdateInduct1}.
+\end{proof}
+\noindent
+Now we are ready to present the closed form for NTIMES:
+\begin{theorem}\label{ntimesClosedForm}
+	The derivative of $r^{\{n+1\}}$ can be described as an alternative
+	containing a list
+	of terms:\\
+	$r^{\{n+1\}} \backslash_{rsimps} (c::s) = \textit{rsimp} \; (
+	\sum (\map \; (\optermsimp \; r) \; (\nupdates \; s \; r \;
+	[\Some \; ([c], n)])))$
+\end{theorem}
+\begin{proof}
+	By the rewriting steps described in lemma \ref{ntimesClosedFormsSteps}.
+\end{proof}
+\noindent
+The key observation for bounding this closed form
+is that the counter on $r^{\{n\}}$ will
+only decrement during derivatives:
+\begin{lemma}\label{nupdatesNLeqN}
+	For an element $o$ in $\textit{set} \; (\nupdates \; s \; r \;
+	[\Some \; ([c], n)])$, either $o = \None$, or $o = \Some
+	\; (s', m)$ for some string $s'$ and number $m \leq n$.
+\end{lemma}
+\noindent
+The proof is routine and therefore omitted.
+This allows us to say what kind of terms
+are in the list $\textit{set} \; (\map \; (\optermsimp \; r) \; (
+\nupdates \; s \; r \; [\Some \; ([c], n)]))$:
+only $\ZERO_r$s or a sequence with the tail an $r^{\{m\}}$
+with a small $m$:
+\begin{lemma}\label{ntimesClosedFormListElemShape}
+	For any element $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; (
+	\nupdates \; s \; r \; [\Some \; ([c], n)]))$,
+	we have that $r'$ is either $\ZERO$ or $r \backslash_{rsimps} s' \cdot
+	r^{\{m\}}$ for some string $s'$ and number $m \leq n$.
+\end{lemma}
+\begin{proof}
+	Using lemma \ref{nupdatesNLeqN}.
+\end{proof}
+\begin{theorem}\label{ntimesClosedFormBounded}
+	Assuming that for any string $s$, $\llbracket r \backslash_{rsimps} s
+	\rrbracket_r \leq N$ holds, then we have that\\
+	$\llbracket r^{\{n+1\}} \backslash_{rsimps} s \rrbracket_r \leq
+	\textit{max} \; (c_N+1)* (N + \llbracket r^{\{n\}} \rrbracket+1)$,
+	where $c_N = \textit{card} \; (\textit{sizeNregex} \; (
+	N + \llbracket r^{\{n\}} \rrbracket_r+1))$.
+\end{theorem}
+\begin{proof}
+We have that for all regular expressions $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; (
+	\nupdates \; s \; r \; [\Some \; ([c], n)]))$,
+	$r'$'s size is less than or equal to $N + \llbracket r^{\{n\}}
+	\rrbracket_r + 1$
+because $r'$ can only be either a $\ZERO$ or $r \backslash_{rsimps} s' \cdot
+r^{\{m\}}$ for some string $s'$ and number
+$m \leq n$ (lemma \ref{ntimesClosedFormListElemShape}).
+In addition, we know that the list
+$\map \; (\optermsimp \; r) \; (
+\nupdates \; s \; r \; [\Some \; ([c], n)])$'s size is at most
+$c_N = \textit{card} \;
+(\sizeNregex \; ((N + \llbracket r^{\{n\}} \rrbracket) + 1))$.
+This gives us $\llbracket r \backslash_{rsimps} \;s \rrbracket_r
+\leq N * c_N$.
+\end{proof}
+We aim to formalise the correctness and size bound
+for constructs like $r^{\{\ldots n\}}$, $r^{\{n \ldots\}}$
+and so on, which is still work in progress.
+They should more or less follow the same recipe described in this section.
+Once we know about how to deal with them recursively using suitable auxiliary
+definitions, we are able to routinely establish the proofs.
 %----------------------------------------------------------------------------------------
 %	SECTION 3
 %----------------------------------------------------------------------------------------

changeset 625	b797c9a709d9
parent 624	8ffa28fce271
child 638	dd9dde2d902b