--- a/ChengsongTanPhdThesis/Chapters/Finite.tex Sat Nov 12 21:34:40 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Thu Nov 17 23:13:57 2022 +0000
@@ -45,12 +45,24 @@
only gives empirical evidence on
some test cases (see Verbatim++ work\cite{Verbatimpp}).
\end{itemize}
+\noindent
+We then extend our $\blexersimp$
+to support bounded repetitions ($r^{\{n\}}$).
+We update our formalisation of
+the correctness and finiteness properties to
+include this new construct.
+we can out-compete other verified lexers such as
+Verbatim++ on bounded regular expressions.
+
In the next section we describe in more detail
what the finite bound means in our algorithm
and why the size of the internal data structures of
a typical derivative-based lexer such as
Sulzmann and Lu's need formal treatment.
+
+
+
\section{Formalising About Size}
\noindent
In our lexer ($\blexersimp$),
@@ -2592,6 +2604,507 @@
% SECTION 2
%-----------------------------------
+\section{Bounded Repetitions}
+We have promised in chapter \ref{Introduction}
+that our lexing algorithm can potentially be extended
+to handle bounded repetitions
+in natural and elegant ways.
+Now we fulfill our promise by adding support for
+the ``exactly-$n$-times'' bounded regular expression $r^{\{n\}}$.
+We add clauses in our derivatives-based lexing algorithms (with simplifications)
+introduced in chapter \ref{Bitcoded2}.
+
+\subsection{Augmented Definitions}
+There are a number of definitions that need to be augmented.
+The most notable one would be the POSIX rules for $r^{\{n\}}$:
+\begin{center}
+ \begin{mathpar}
+ \inferrule{\forall v \in vs_1. \vdash v:r \land
+ |v| \neq []\\ \forall v \in vs_2. \vdash v:r \land |v| = []\\
+ \textit{length} \; (vs_1 @ vs_2) = n}{\textit{Stars} \;
+ (vs_1 @ vs_2) : r^{\{n\}} }
+ \end{mathpar}
+\end{center}
+As Ausaf had pointed out \cite{Ausaf},
+sometimes empty iterations have to be taken to get
+a match with exactly $n$ repetitions,
+and hence the $vs_2$ part.
+
+Another important definition would be the size:
+\begin{center}
+ \begin{tabular}{lcl}
+ $\llbracket r^{\{n\}} \rrbracket_r$ & $\dn$ &
+ $\llbracket r \rrbracket_r + n$\\
+ \end{tabular}
+\end{center}
+\noindent
+Arguably we should use $\log \; n$ for the size because
+the number of digits increase logarithmically w.r.t $n$.
+For simplicity we choose to add the counter directly to the size.
+
+The derivative w.r.t a bounded regular expression
+is given as
+\begin{center}
+ \begin{tabular}{lcl}
+ $r^{\{n\}} \backslash_r c$ & $\dn$ &
+ $r\backslash_r c \cdot r^{\{n-1\}} \;\; \textit{if} \; n \geq 1$\\
+ & & $\RZERO \;\quad \quad\quad \quad
+ \textit{otherwise}$\\
+ \end{tabular}
+\end{center}
+\noindent
+For brevity, we sometimes use NTIMES to refer to bounded
+regular expressions.
+The $\mkeps$ function clause for NTIMES would be
+\begin{center}
+ \begin{tabular}{lcl}
+ $\mkeps \; r^{\{n\}} $ & $\dn$ & $\Stars \;
+ (\textit{replicate} \; n\; (\mkeps \; r))$
+ \end{tabular}
+\end{center}
+\noindent
+The injection looks like
+\begin{center}
+ \begin{tabular}{lcl}
+ $\inj \; r^{\{n\}} \; c\; (\Seq \;v \; (\Stars \; vs)) $ &
+ $\dn$ & $\Stars \;
+ ((\inj \; r \;c \;v ) :: vs)$
+ \end{tabular}
+\end{center}
+\noindent
+
+
+\subsection{Proofs for the Augmented Lexing Algorithm}
+We need to maintain two proofs with the additional $r^{\{n\}}$
+construct: the
+correctness proof in chapter \ref{Bitcoded2},
+and the finiteness proof in chapter \ref{Finite}.
+
+\subsubsection{Correctness Proof Augmentation}
+The correctness of $\textit{lexer}$ and $\textit{blexer}$ with bounded repetitions
+have been proven by Ausaf and Urban\cite{AusafDyckhoffUrban2016}.
+As they have commented, once the definitions are in place,
+the proofs given for the basic regular expressions will extend to
+bounded regular expressions, and there are no ``surprises''.
+We confirm this point because the correctness theorem would also
+extend without surprise to $\blexersimp$.
+The rewrite rules such as $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$ and so on
+do not need to be changed,
+and only a few lemmas such as lemma \ref{fltsPreserves} need to be adjusted to
+add one more line which can be solved by sledgehammer
+to solve the $r^{\{n\}}$ inductive case.
+
+
+\subsubsection{Finiteness Proof Augmentation}
+The bounded repetitions are
+very similar to stars, and therefore the treatment
+is similar, with minor changes to handle some slight complications
+when the counter reaches 0.
+The exponential growth is similar:
+\begin{center}
+ \begin{tabular}{ll}
+ $r^{\{n\}} $ & $\longrightarrow_{\backslash c}$\\
+ $(r\backslash c) \cdot
+ r^{\{n - 1\}}*$ & $\longrightarrow_{\backslash c'}$\\
+ \\
+ $r \backslash cc' \cdot r^{\{n - 2\}}* +
+ r \backslash c' \cdot r^{\{n - 1\}}*$ &
+ $\longrightarrow_{\backslash c''}$\\
+ \\
+ $(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* +
+ r \backslash c''\cdot r^{\{n-1\}}) +
+ (r \backslash c'c'' \cdot r^{\{n-2\}}* +
+ r \backslash c'' \cdot r^{\{n-1\}}*)$ &
+ $\longrightarrow_{\backslash c'''}$ \\
+ \\
+ $\ldots$\\
+ \end{tabular}
+\end{center}
+Again, we assume that $r\backslash c$, $r \backslash cc'$ and so on
+are all nullable.
+The flattened list of terms for $r^{\{n\}} \backslash_{rs} s$
+\begin{center}
+ $[r_1 \backslash cc'c'' \cdot r^{\{n-3\}}*,\;
+ r \backslash c''\cdot r^{\{n-1\}}, \;
+ r \backslash c'c'' \cdot r^{\{n-2\}}*, \;
+ r \backslash c'' \cdot r^{\{n-1\}}*,\; \ldots ]$
+\end{center}
+that comes from
+\begin{center}
+ $(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* +
+ r \backslash c''\cdot r^{\{n-1\}}) +
+ (r \backslash c'c'' \cdot r^{\{n-2\}}* +
+ r \backslash c'' \cdot r^{\{n-1\}}*)+ \ldots$
+\end{center}
+are made of sequences with different tails, where the counters
+might differ.
+The observation for maintaining the bound is that
+these counters never exceed $n$, the original
+counter. With the number of counters staying finite,
+$\rDistinct$ will deduplicate and keep the list finite.
+We introduce this idea as a lemma once we describe all
+the necessary helper functions.
+
+Similar to the star case, we want
+\begin{center}
+ $\rderssimp{r^{\{n\}}}{s} = \rsimp{\sum rs}$.
+\end{center}
+where $rs$
+shall be in the form of
+$\map \; f \; Ss$, where $f$ is a function and
+$Ss$ a list of objects to act on.
+For star, the object's datatype is string.
+The list of strings $Ss$
+is generated using functions
+$\starupdate$ and $\starupdates$.
+The function that takes a string and returns a regular expression
+is the anonymous function $
+(\lambda s'. \; r\backslash s' \cdot r^{\{m\}})$.
+In the NTIMES setting,
+the $\starupdate$ and $\starupdates$ functions are replaced by
+$\textit{nupdate}$ and $\textit{nupdates}$:
+\begin{center}
+ \begin{tabular}{lcl}
+ $\nupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
+ $\nupdate \; c \; r \;
+ (\Some \; (s, \; n + 1) \; :: \; Ss)$ & $\dn$ & %\\
+ $\textit{if} \;
+ (\rnullable \; (r \backslash_{rs} s))$ \\
+ & & $\;\;\textit{then}
+ \;\; \Some \; (s @ [c], n + 1) :: \Some \; ([c], n) :: (
+ \nupdate \; c \; r \; Ss)$ \\
+ & & $\textit{else} \;\; \Some \; (s @ [c], n+1) :: (
+ \nupdate \; c \; r \; Ss)$\\
+ $\nupdate \; c \; r \; (\textit{None} :: Ss)$ & $\dn$ &
+ $(\None :: \nupdate \; c \; r \; Ss)$\\
+ & & \\
+ %\end{tabular}
+%\end{center}
+%\begin{center}
+ %\begin{tabular}{lcl}
+ $\nupdates \; [] \; r \; Ss$ & $\dn$ & $Ss$\\
+ $\nupdates \; (c :: cs) \; r \; Ss$ & $\dn$ & $\nupdates \; cs \; r \; (
+ \nupdate \; c \; r \; Ss)$
+ \end{tabular}
+\end{center}
+\noindent
+which take into account when a subterm
+\begin{center}
+ $r \backslash_s s \cdot r^{\{n\}}$
+\end{center}
+counter $n$
+is 0, and therefore expands to
+\begin{center}
+$r \backslash_s (s@[c]) \cdot r^{\{n\}} \;+
+\; \ZERO$
+\end{center}
+after taking a derivative.
+The object now has type
+\begin{center}
+$\textit{option} \;(\textit{string}, \textit{nat})$
+\end{center}
+and therefore the function for converting such an option into
+a regular expression term is called $\opterm$:
+
+\begin{center}
+ \begin{tabular}{lcl}
+ $\opterm \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
+ & & $\;\;\Some \; (s, n) \Rightarrow
+ (r\backslash_{rs} s)\cdot r^{\{n\}}$\\
+ & & $\;\;\None \Rightarrow
+ \ZERO$\\
+ \end{tabular}
+\end{center}
+\noindent
+Put together, the list $\map \; f \; Ss$ is instantiated as
+\begin{center}
+ $\map \; (\opterm \; r) \; (\nupdates \; s \; r \;
+ [\Some \; ([c], n)])$.
+\end{center}
+For the closed form to be bounded, we would like
+simplification to be applied to each term in the list.
+Therefore we introduce some variants of $\opterm$,
+which help conveniently express the rewriting steps
+needed in the closed form proof.
+\begin{center}
+ \begin{tabular}{lcl}
+ $\optermOsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
+ & & $\;\;\Some \; (s, n) \Rightarrow
+ \textit{rsimp} \; ((r\backslash_{rs} s)\cdot r^{\{n\}})$\\
+ & & $\;\;\None \Rightarrow
+ \ZERO$\\
+ \\
+ $\optermosimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
+ & & $\;\;\Some \; (s, n) \Rightarrow
+ (\textit{rsimp} \; (r\backslash_{rs} s))
+ \cdot r^{\{n\}}$\\
+ & & $\;\;\None \Rightarrow
+ \ZERO$\\
+ \\
+ $\optermsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
+ & & $\;\;\Some \; (s, n) \Rightarrow
+ (r\backslash_{rsimps} s)\cdot r^{\{n\}}$\\
+ & & $\;\;\None \Rightarrow
+ \ZERO$\\
+ \end{tabular}
+\end{center}
+
+
+For a list of
+$\textit{option} \;(\textit{string}, \textit{nat})$ elements,
+we define the highest power for it recursively:
+\begin{center}
+ \begin{tabular}{lcl}
+ $\hpa \; [] \; n $ & $\dn$ & $n$\\
+ $\hpa \; (\None :: os) \; n $ & $\dn$ & $\hpa \; os \; n$\\
+ $\hpa \; (\Some \; (s, n) :: os) \; m$ & $\dn$ &
+ $\hpa \;os \; (\textit{max} \; n\; m)$\\
+ \\
+ $\hpower \; rs $ & $\dn$ & $\hpa \; rs \; 0$\\
+ \end{tabular}
+\end{center}
+
+Now the intuition that an NTIMES regular expression's power
+does not increase can be easily expressed as
+\begin{lemma}\label{nupdatesMono2}
+ $\hpower \; (\nupdates \;s \; r \; [\Some \; ([c], n)]) \leq n$
+\end{lemma}
+\begin{proof}
+ Note that the power is non-increasing after a $\nupdate$ application:
+ \begin{center}
+ $\hpa \;\; (\nupdate \; c \; r \; Ss)\;\; m \leq
+ \hpa\; \; Ss \; m$.
+ \end{center}
+ This is also the case for $\nupdates$:
+ \begin{center}
+ $\hpa \;\; (\nupdates \; s \; r \; Ss)\;\; m \leq
+ \hpa\; \; Ss \; m$.
+ \end{center}
+ Therefore we have that
+ \begin{center}
+ $\hpower \;\; (\nupdates \; s \; r \; Ss) \leq
+ \hpower \;\; Ss$
+ \end{center}
+ which leads to the lemma being proven.
+
+ \end{proof}
+
+
+We also define the inductive rules for
+the shape of derivatives of the NTIMES regular expressions:\\[-3em]
+\begin{center}
+ \begin{mathpar}
+ \inferrule{\mbox{}}{\cbn \;\ZERO}
+
+ \inferrule{\mbox{}}{\cbn \; \; r_a \cdot (r^{\{n\}})}
+
+ \inferrule{\cbn \; r_1 \;\; \; \cbn \; r_2}{\cbn \; r_1 + r_2}
+
+ \inferrule{\cbn \; r}{\cbn \; r + \ZERO}
+ \end{mathpar}
+\end{center}
+\noindent
+A derivative of NTIMES fits into the shape described by $\cbn$:
+\begin{lemma}\label{ntimesDersCbn}
+ $\cbn \; ((r' \cdot r^{\{n\}}) \backslash_{rs} s)$ holds.
+\end{lemma}
+\begin{proof}
+ By a reverse induction on $s$.
+ For the inductive case, note that if $\cbn \; r$ holds,
+ then $\cbn \; (r\backslash_r c)$ holds.
+\end{proof}
+\noindent
+In addition, for $\cbn$-shaped regular expressioins, one can flatten
+them:
+\begin{lemma}\label{ntimesHfauPushin}
+ If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} =
+ \textit{concat} \; (\map \; \hflataux{\map \; (\_\backslash_r c) \;
+ (\hflataux{r})})$
+\end{lemma}
+\begin{proof}
+ By an induction on the inductive cases of $\cbn$.
+\end{proof}
+\noindent
+This time we do not need to define the flattening functions for NTIMES only,
+because $\hflat{\_}$ and $\hflataux{\_}$ work on NTIMES already.
+\begin{lemma}\label{ntimesHfauInduct}
+$\hflataux{( (r\backslash_r c) \cdot r^{\{n\}}) \backslash_{rsimps} s} =
+ \map \; (\opterm \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$
+\end{lemma}
+\begin{proof}
+ By a reverse induction on $s$.
+ The lemmas \ref{ntimesHfauPushin} and \ref{ntimesDersCbn} are used.
+\end{proof}
+\noindent
+We have a recursive property for NTIMES with $\nupdate$
+similar to that for STAR,
+and one for $\nupdates $ as well:
+\begin{lemma}\label{nupdateInduct1}
+ \mbox{}
+ \begin{itemize}
+ \item
+ \begin{center}
+ $\textit{concat} \; (\map \; (\hflataux{\_} \circ (
+ \opterm \; r)) \; Ss) = \map \; (\opterm \; r) \; (\nupdate \;
+ c \; r \; Ss)$\\
+ \end{center}
+ holds.
+\item
+ \begin{center}
+ $\textit{concat} \; (\map \; \hflataux{\_}\;
+ \map \; (\_\backslash_r x) \;
+ (\map \; (\opterm \; r) \; (\nupdates \; xs \; r \; Ss)))$\\
+ $=$\\
+ $\map \; (\opterm \; r) \; (\nupdates \;(xs@[x]) \; r\;Ss)$
+ \end{center}
+ holds.
+ \end{itemize}
+\end{lemma}
+\begin{proof}
+ (i) is by an induction on $Ss$.
+ (ii) is by an induction on $xs$.
+\end{proof}
+\noindent
+The $\nString$ predicate is defined for conveniently
+expressing that there are no empty strings in the
+$\Some \;(s, n)$ elements generated by $\nupdate$:
+\begin{center}
+ \begin{tabular}{lcl}
+ $\nString \; \None$ & $\dn$ & $ \textit{true}$\\
+ $\nString \; (\Some \; ([], n))$ & $\dn$ & $ \textit{false}$\\
+ $\nString \; (\Some \; (c::s, n))$ & $\dn$ & $ \textit{true}$\\
+ \end{tabular}
+\end{center}
+\begin{lemma}\label{nupdatesNonempty}
+ If for all elements $o \in \textit{set} \; Ss$,
+ $\nString \; o$ holds, the we have that
+ for all elements $o' \in \textit{set} \; (\nupdates \; s \; r \; Ss)$,
+ $\nString \; o'$ holds.
+\end{lemma}
+\begin{proof}
+ By an induction on $s$, where $Ss$ is set to vary over all possible values.
+\end{proof}
+
+\noindent
+
+\begin{lemma}\label{ntimesClosedFormsSteps}
+ The following list of equalities or rewriting relations hold:\\
+ (i) $r^{\{n+1\}} \backslash_{rsimps} (c::s) =
+ \textit{rsimp} \; (\sum (\map \; (\opterm \;r \;\_) \; (\nupdates \;
+ s \; r \; [\Some \; ([c], n)])))$\\
+ (ii)
+ \begin{center}
+ $\sum (\map \; (\opterm \; r) \; (\nupdates \; s \; r \; [
+ \Some \; ([c], n)]))$ \\ $ \sequal$\\
+ $\sum (\map \; (\textit{rsimp} \circ (\opterm \; r))\; (\nupdates \;
+ s\;r \; [\Some \; ([c], n)]))$\\
+ \end{center}
+ (iii)
+ \begin{center}
+ $\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \;
+ ([c], n)]))$\\
+ $\sequal$\\
+ $\sum \;(\map \; (\optermsimp r) \; (\nupdates \; s \; r \; [\Some \;
+ ([c], n)])) $\\
+ \end{center}
+ (iv)
+ \begin{center}
+ $\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \;
+ ([c], n)])) $ \\ $\sequal$\\
+ $\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \;
+ ([c], n)])) $\\
+ \end{center}
+ (v)
+ \begin{center}
+ $\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \;
+ ([c], n)])) $ \\ $\sequal$\\
+ $\sum \; (\map \; (\textit{rsimp} \circ (\opterm \; r)) \;
+ (\nupdates \; s \; r \; [\Some \; ([c], n)]))$
+ \end{center}
+\end{lemma}
+\begin{proof}
+ Routine.
+ (iii) and (iv) make use of the fact that all the strings $s$
+ inside $\Some \; (s, m)$ which are elements of the list
+ $\nupdates \; s\;r\;[\Some\; ([c], n)]$ are non-empty,
+ which is from lemma \ref{nupdatesNonempty}.
+ Once the string in $o = \Some \; (s, n)$ is
+ nonempty, $\optermsimp \; r \;o$,
+ $\optermosimp \; r \; o$ and $\optermosimp \; \; o$ are guaranteed
+ to be equal.
+ (v) uses \ref{nupdateInduct1}.
+\end{proof}
+\noindent
+Now we are ready to present the closed form for NTIMES:
+\begin{theorem}\label{ntimesClosedForm}
+ The derivative of $r^{\{n+1\}}$ can be described as an alternative
+ containing a list
+ of terms:\\
+ $r^{\{n+1\}} \backslash_{rsimps} (c::s) = \textit{rsimp} \; (
+ \sum (\map \; (\optermsimp \; r) \; (\nupdates \; s \; r \;
+ [\Some \; ([c], n)])))$
+\end{theorem}
+\begin{proof}
+ By the rewriting steps described in lemma \ref{ntimesClosedFormsSteps}.
+\end{proof}
+\noindent
+The key observation for bounding this closed form
+is that the counter on $r^{\{n\}}$ will
+only decrement during derivatives:
+\begin{lemma}\label{nupdatesNLeqN}
+ For an element $o$ in $\textit{set} \; (\nupdates \; s \; r \;
+ [\Some \; ([c], n)])$, either $o = \None$, or $o = \Some
+ \; (s', m)$ for some string $s'$ and number $m \leq n$.
+\end{lemma}
+\noindent
+The proof is routine and therefore omitted.
+This allows us to say what kind of terms
+are in the list $\textit{set} \; (\map \; (\optermsimp \; r) \; (
+\nupdates \; s \; r \; [\Some \; ([c], n)]))$:
+only $\ZERO_r$s or a sequence with the tail an $r^{\{m\}}$
+with a small $m$:
+\begin{lemma}\label{ntimesClosedFormListElemShape}
+ For any element $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; (
+ \nupdates \; s \; r \; [\Some \; ([c], n)]))$,
+ we have that $r'$ is either $\ZERO$ or $r \backslash_{rsimps} s' \cdot
+ r^{\{m\}}$ for some string $s'$ and number $m \leq n$.
+\end{lemma}
+\begin{proof}
+ Using lemma \ref{nupdatesNLeqN}.
+\end{proof}
+
+\begin{theorem}\label{ntimesClosedFormBounded}
+ Assuming that for any string $s$, $\llbracket r \backslash_{rsimps} s
+ \rrbracket_r \leq N$ holds, then we have that\\
+ $\llbracket r^{\{n+1\}} \backslash_{rsimps} s \rrbracket_r \leq
+ \textit{max} \; (c_N+1)* (N + \llbracket r^{\{n\}} \rrbracket+1)$,
+ where $c_N = \textit{card} \; (\textit{sizeNregex} \; (
+ N + \llbracket r^{\{n\}} \rrbracket_r+1))$.
+\end{theorem}
+\begin{proof}
+We have that for all regular expressions $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; (
+ \nupdates \; s \; r \; [\Some \; ([c], n)]))$,
+ $r'$'s size is less than or equal to $N + \llbracket r^{\{n\}}
+ \rrbracket_r + 1$
+because $r'$ can only be either a $\ZERO$ or $r \backslash_{rsimps} s' \cdot
+r^{\{m\}}$ for some string $s'$ and number
+$m \leq n$ (lemma \ref{ntimesClosedFormListElemShape}).
+In addition, we know that the list
+$\map \; (\optermsimp \; r) \; (
+\nupdates \; s \; r \; [\Some \; ([c], n)])$'s size is at most
+$c_N = \textit{card} \;
+(\sizeNregex \; ((N + \llbracket r^{\{n\}} \rrbracket) + 1))$.
+This gives us $\llbracket r \backslash_{rsimps} \;s \rrbracket_r
+\leq N * c_N$.
+\end{proof}
+
+We aim to formalise the correctness and size bound
+for constructs like $r^{\{\ldots n\}}$, $r^{\{n \ldots\}}$
+and so on, which is still work in progress.
+They should more or less follow the same recipe described in this section.
+Once we know about how to deal with them recursively using suitable auxiliary
+definitions, we are able to routinely establish the proofs.
+
%----------------------------------------------------------------------------------------
% SECTION 3