--- a/ChengsongTanPhdThesis/Chapters/Cubic.tex Sat Nov 12 21:34:40 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Cubic.tex Thu Nov 17 23:13:57 2022 +0000
@@ -8,10 +8,14 @@
\lstset{style=myScalastyle}
-This chapter is a ``miscellaneous''
-chapter which records various
-extensions to our $\blexersimp$'s formalisations.\\
-Firstly we present further improvements
+This chapter is a ``work-in-progress''
+chapter which records
+extensions to our $\blexersimp$.
+We intend to formalise this part, which
+we have not been able to finish due to time constraints of the PhD.
+Nevertheless, we outline the ideas we intend to use for the proof.
+
+We present further improvements
made to our lexer algorithm $\blexersimp$.
We devise a stronger simplification algorithm,
called $\bsimpStrong$, which can prune away
@@ -35,13 +39,6 @@
by exploring the connection between the internal
data structure of our $\blexerStrong$ and
Animirov's partial derivatives.\\
-Secondly, we extend our $\blexersimp$
-to support bounded repetitions ($r^{\{n\}}$).
-We update our formalisation of
-the correctness and finiteness properties to
-include this new construct.
-we can out-compete other verified lexers such as
-Verbatim++ on bounded regular expressions.
%We also present the idempotency property proof
%of $\bsimp$, which leverages the idempotency proof of $\rsimp$.
%This reinforces our claim that the fixpoint construction
@@ -595,507 +592,6 @@
% SECTION 2
%----------------------------------------------------------------------------------------
-\section{Bounded Repetitions}
-We have promised in chapter \ref{Introduction}
-that our lexing algorithm can potentially be extended
-to handle bounded repetitions
-in natural and elegant ways.
-Now we fulfill our promise by adding support for
-the ``exactly-$n$-times'' bounded regular expression $r^{\{n\}}$.
-We add clauses in our derivatives-based lexing algorithms (with simplifications)
-introduced in chapter \ref{Bitcoded2}.
-
-\subsection{Augmented Definitions}
-There are a number of definitions that need to be augmented.
-The most notable one would be the POSIX rules for $r^{\{n\}}$:
-\begin{center}
- \begin{mathpar}
- \inferrule{\forall v \in vs_1. \vdash v:r \land
- |v| \neq []\\ \forall v \in vs_2. \vdash v:r \land |v| = []\\
- \textit{length} \; (vs_1 @ vs_2) = n}{\textit{Stars} \;
- (vs_1 @ vs_2) : r^{\{n\}} }
- \end{mathpar}
-\end{center}
-As Ausaf had pointed out \cite{Ausaf},
-sometimes empty iterations have to be taken to get
-a match with exactly $n$ repetitions,
-and hence the $vs_2$ part.
-
-Another important definition would be the size:
-\begin{center}
- \begin{tabular}{lcl}
- $\llbracket r^{\{n\}} \rrbracket_r$ & $\dn$ &
- $\llbracket r \rrbracket_r + n$\\
- \end{tabular}
-\end{center}
-\noindent
-Arguably we should use $\log \; n$ for the size because
-the number of digits increase logarithmically w.r.t $n$.
-For simplicity we choose to add the counter directly to the size.
-
-The derivative w.r.t a bounded regular expression
-is given as
-\begin{center}
- \begin{tabular}{lcl}
- $r^{\{n\}} \backslash_r c$ & $\dn$ &
- $r\backslash_r c \cdot r^{\{n-1\}} \;\; \textit{if} \; n \geq 1$\\
- & & $\RZERO \;\quad \quad\quad \quad
- \textit{otherwise}$\\
- \end{tabular}
-\end{center}
-\noindent
-For brevity, we sometimes use NTIMES to refer to bounded
-regular expressions.
-The $\mkeps$ function clause for NTIMES would be
-\begin{center}
- \begin{tabular}{lcl}
- $\mkeps \; r^{\{n\}} $ & $\dn$ & $\Stars \;
- (\textit{replicate} \; n\; (\mkeps \; r))$
- \end{tabular}
-\end{center}
-\noindent
-The injection looks like
-\begin{center}
- \begin{tabular}{lcl}
- $\inj \; r^{\{n\}} \; c\; (\Seq \;v \; (\Stars \; vs)) $ &
- $\dn$ & $\Stars \;
- ((\inj \; r \;c \;v ) :: vs)$
- \end{tabular}
-\end{center}
-\noindent
-
-
-\subsection{Proofs for the Augmented Lexing Algorithm}
-We need to maintain two proofs with the additional $r^{\{n\}}$
-construct: the
-correctness proof in chapter \ref{Bitcoded2},
-and the finiteness proof in chapter \ref{Finite}.
-
-\subsubsection{Correctness Proof Augmentation}
-The correctness of $\textit{lexer}$ and $\textit{blexer}$ with bounded repetitions
-have been proven by Ausaf and Urban\cite{AusafDyckhoffUrban2016}.
-As they have commented, once the definitions are in place,
-the proofs given for the basic regular expressions will extend to
-bounded regular expressions, and there are no ``surprises''.
-We confirm this point because the correctness theorem would also
-extend without surprise to $\blexersimp$.
-The rewrite rules such as $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$ and so on
-do not need to be changed,
-and only a few lemmas such as lemma \ref{fltsPreserves} need to be adjusted to
-add one more line which can be solved by sledgehammer
-to solve the $r^{\{n\}}$ inductive case.
-
-
-\subsubsection{Finiteness Proof Augmentation}
-The bounded repetitions are
-very similar to stars, and therefore the treatment
-is similar, with minor changes to handle some slight complications
-when the counter reaches 0.
-The exponential growth is similar:
-\begin{center}
- \begin{tabular}{ll}
- $r^{\{n\}} $ & $\longrightarrow_{\backslash c}$\\
- $(r\backslash c) \cdot
- r^{\{n - 1\}}*$ & $\longrightarrow_{\backslash c'}$\\
- \\
- $r \backslash cc' \cdot r^{\{n - 2\}}* +
- r \backslash c' \cdot r^{\{n - 1\}}*$ &
- $\longrightarrow_{\backslash c''}$\\
- \\
- $(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* +
- r \backslash c''\cdot r^{\{n-1\}}) +
- (r \backslash c'c'' \cdot r^{\{n-2\}}* +
- r \backslash c'' \cdot r^{\{n-1\}}*)$ &
- $\longrightarrow_{\backslash c'''}$ \\
- \\
- $\ldots$\\
- \end{tabular}
-\end{center}
-Again, we assume that $r\backslash c$, $r \backslash cc'$ and so on
-are all nullable.
-The flattened list of terms for $r^{\{n\}} \backslash_{rs} s$
-\begin{center}
- $[r_1 \backslash cc'c'' \cdot r^{\{n-3\}}*,\;
- r \backslash c''\cdot r^{\{n-1\}}, \;
- r \backslash c'c'' \cdot r^{\{n-2\}}*, \;
- r \backslash c'' \cdot r^{\{n-1\}}*,\; \ldots ]$
-\end{center}
-that comes from
-\begin{center}
- $(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* +
- r \backslash c''\cdot r^{\{n-1\}}) +
- (r \backslash c'c'' \cdot r^{\{n-2\}}* +
- r \backslash c'' \cdot r^{\{n-1\}}*)+ \ldots$
-\end{center}
-are made of sequences with different tails, where the counters
-might differ.
-The observation for maintaining the bound is that
-these counters never exceed $n$, the original
-counter. With the number of counters staying finite,
-$\rDistinct$ will deduplicate and keep the list finite.
-We introduce this idea as a lemma once we describe all
-the necessary helper functions.
-
-Similar to the star case, we want
-\begin{center}
- $\rderssimp{r^{\{n\}}}{s} = \rsimp{\sum rs}$.
-\end{center}
-where $rs$
-shall be in the form of
-$\map \; f \; Ss$, where $f$ is a function and
-$Ss$ a list of objects to act on.
-For star, the object's datatype is string.
-The list of strings $Ss$
-is generated using functions
-$\starupdate$ and $\starupdates$.
-The function that takes a string and returns a regular expression
-is the anonymous function $
-(\lambda s'. \; r\backslash s' \cdot r^{\{m\}})$.
-In the NTIMES setting,
-the $\starupdate$ and $\starupdates$ functions are replaced by
-$\textit{nupdate}$ and $\textit{nupdates}$:
-\begin{center}
- \begin{tabular}{lcl}
- $\nupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
- $\nupdate \; c \; r \;
- (\Some \; (s, \; n + 1) \; :: \; Ss)$ & $\dn$ & %\\
- $\textit{if} \;
- (\rnullable \; (r \backslash_{rs} s))$ \\
- & & $\;\;\textit{then}
- \;\; \Some \; (s @ [c], n + 1) :: \Some \; ([c], n) :: (
- \nupdate \; c \; r \; Ss)$ \\
- & & $\textit{else} \;\; \Some \; (s @ [c], n+1) :: (
- \nupdate \; c \; r \; Ss)$\\
- $\nupdate \; c \; r \; (\textit{None} :: Ss)$ & $\dn$ &
- $(\None :: \nupdate \; c \; r \; Ss)$\\
- & & \\
- %\end{tabular}
-%\end{center}
-%\begin{center}
- %\begin{tabular}{lcl}
- $\nupdates \; [] \; r \; Ss$ & $\dn$ & $Ss$\\
- $\nupdates \; (c :: cs) \; r \; Ss$ & $\dn$ & $\nupdates \; cs \; r \; (
- \nupdate \; c \; r \; Ss)$
- \end{tabular}
-\end{center}
-\noindent
-which take into account when a subterm
-\begin{center}
- $r \backslash_s s \cdot r^{\{n\}}$
-\end{center}
-counter $n$
-is 0, and therefore expands to
-\begin{center}
-$r \backslash_s (s@[c]) \cdot r^{\{n\}} \;+
-\; \ZERO$
-\end{center}
-after taking a derivative.
-The object now has type
-\begin{center}
-$\textit{option} \;(\textit{string}, \textit{nat})$
-\end{center}
-and therefore the function for converting such an option into
-a regular expression term is called $\opterm$:
-
-\begin{center}
- \begin{tabular}{lcl}
- $\opterm \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
- & & $\;\;\Some \; (s, n) \Rightarrow
- (r\backslash_{rs} s)\cdot r^{\{n\}}$\\
- & & $\;\;\None \Rightarrow
- \ZERO$\\
- \end{tabular}
-\end{center}
-\noindent
-Put together, the list $\map \; f \; Ss$ is instantiated as
-\begin{center}
- $\map \; (\opterm \; r) \; (\nupdates \; s \; r \;
- [\Some \; ([c], n)])$.
-\end{center}
-For the closed form to be bounded, we would like
-simplification to be applied to each term in the list.
-Therefore we introduce some variants of $\opterm$,
-which help conveniently express the rewriting steps
-needed in the closed form proof.
-\begin{center}
- \begin{tabular}{lcl}
- $\optermOsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
- & & $\;\;\Some \; (s, n) \Rightarrow
- \textit{rsimp} \; ((r\backslash_{rs} s)\cdot r^{\{n\}})$\\
- & & $\;\;\None \Rightarrow
- \ZERO$\\
- \\
- $\optermosimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
- & & $\;\;\Some \; (s, n) \Rightarrow
- (\textit{rsimp} \; (r\backslash_{rs} s))
- \cdot r^{\{n\}}$\\
- & & $\;\;\None \Rightarrow
- \ZERO$\\
- \\
- $\optermsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
- & & $\;\;\Some \; (s, n) \Rightarrow
- (r\backslash_{rsimps} s)\cdot r^{\{n\}}$\\
- & & $\;\;\None \Rightarrow
- \ZERO$\\
- \end{tabular}
-\end{center}
-
-
-For a list of
-$\textit{option} \;(\textit{string}, \textit{nat})$ elements,
-we define the highest power for it recursively:
-\begin{center}
- \begin{tabular}{lcl}
- $\hpa \; [] \; n $ & $\dn$ & $n$\\
- $\hpa \; (\None :: os) \; n $ & $\dn$ & $\hpa \; os \; n$\\
- $\hpa \; (\Some \; (s, n) :: os) \; m$ & $\dn$ &
- $\hpa \;os \; (\textit{max} \; n\; m)$\\
- \\
- $\hpower \; rs $ & $\dn$ & $\hpa \; rs \; 0$\\
- \end{tabular}
-\end{center}
-
-Now the intuition that an NTIMES regular expression's power
-does not increase can be easily expressed as
-\begin{lemma}\label{nupdatesMono2}
- $\hpower \; (\nupdates \;s \; r \; [\Some \; ([c], n)]) \leq n$
-\end{lemma}
-\begin{proof}
- Note that the power is non-increasing after a $\nupdate$ application:
- \begin{center}
- $\hpa \;\; (\nupdate \; c \; r \; Ss)\;\; m \leq
- \hpa\; \; Ss \; m$.
- \end{center}
- This is also the case for $\nupdates$:
- \begin{center}
- $\hpa \;\; (\nupdates \; s \; r \; Ss)\;\; m \leq
- \hpa\; \; Ss \; m$.
- \end{center}
- Therefore we have that
- \begin{center}
- $\hpower \;\; (\nupdates \; s \; r \; Ss) \leq
- \hpower \;\; Ss$
- \end{center}
- which leads to the lemma being proven.
-
- \end{proof}
-
-
-We also define the inductive rules for
-the shape of derivatives of the NTIMES regular expressions:\\[-3em]
-\begin{center}
- \begin{mathpar}
- \inferrule{\mbox{}}{\cbn \;\ZERO}
-
- \inferrule{\mbox{}}{\cbn \; \; r_a \cdot (r^{\{n\}})}
-
- \inferrule{\cbn \; r_1 \;\; \; \cbn \; r_2}{\cbn \; r_1 + r_2}
-
- \inferrule{\cbn \; r}{\cbn \; r + \ZERO}
- \end{mathpar}
-\end{center}
-\noindent
-A derivative of NTIMES fits into the shape described by $\cbn$:
-\begin{lemma}\label{ntimesDersCbn}
- $\cbn \; ((r' \cdot r^{\{n\}}) \backslash_{rs} s)$ holds.
-\end{lemma}
-\begin{proof}
- By a reverse induction on $s$.
- For the inductive case, note that if $\cbn \; r$ holds,
- then $\cbn \; (r\backslash_r c)$ holds.
-\end{proof}
-\noindent
-In addition, for $\cbn$-shaped regular expressioins, one can flatten
-them:
-\begin{lemma}\label{ntimesHfauPushin}
- If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} =
- \textit{concat} \; (\map \; \hflataux{\map \; (\_\backslash_r c) \;
- (\hflataux{r})})$
-\end{lemma}
-\begin{proof}
- By an induction on the inductive cases of $\cbn$.
-\end{proof}
-\noindent
-This time we do not need to define the flattening functions for NTIMES only,
-because $\hflat{\_}$ and $\hflataux{\_}$ work on NTIMES already.
-\begin{lemma}\label{ntimesHfauInduct}
-$\hflataux{( (r\backslash_r c) \cdot r^{\{n\}}) \backslash_{rsimps} s} =
- \map \; (\opterm \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$
-\end{lemma}
-\begin{proof}
- By a reverse induction on $s$.
- The lemmas \ref{ntimesHfauPushin} and \ref{ntimesDersCbn} are used.
-\end{proof}
-\noindent
-We have a recursive property for NTIMES with $\nupdate$
-similar to that for STAR,
-and one for $\nupdates $ as well:
-\begin{lemma}\label{nupdateInduct1}
- \mbox{}
- \begin{itemize}
- \item
- \begin{center}
- $\textit{concat} \; (\map \; (\hflataux{\_} \circ (
- \opterm \; r)) \; Ss) = \map \; (\opterm \; r) \; (\nupdate \;
- c \; r \; Ss)$\\
- \end{center}
- holds.
-\item
- \begin{center}
- $\textit{concat} \; (\map \; \hflataux{\_}\;
- \map \; (\_\backslash_r x) \;
- (\map \; (\opterm \; r) \; (\nupdates \; xs \; r \; Ss)))$\\
- $=$\\
- $\map \; (\opterm \; r) \; (\nupdates \;(xs@[x]) \; r\;Ss)$
- \end{center}
- holds.
- \end{itemize}
-\end{lemma}
-\begin{proof}
- (i) is by an induction on $Ss$.
- (ii) is by an induction on $xs$.
-\end{proof}
-\noindent
-The $\nString$ predicate is defined for conveniently
-expressing that there are no empty strings in the
-$\Some \;(s, n)$ elements generated by $\nupdate$:
-\begin{center}
- \begin{tabular}{lcl}
- $\nString \; \None$ & $\dn$ & $ \textit{true}$\\
- $\nString \; (\Some \; ([], n))$ & $\dn$ & $ \textit{false}$\\
- $\nString \; (\Some \; (c::s, n))$ & $\dn$ & $ \textit{true}$\\
- \end{tabular}
-\end{center}
-\begin{lemma}\label{nupdatesNonempty}
- If for all elements $o \in \textit{set} \; Ss$,
- $\nString \; o$ holds, the we have that
- for all elements $o' \in \textit{set} \; (\nupdates \; s \; r \; Ss)$,
- $\nString \; o'$ holds.
-\end{lemma}
-\begin{proof}
- By an induction on $s$, where $Ss$ is set to vary over all possible values.
-\end{proof}
-
-\noindent
-
-\begin{lemma}\label{ntimesClosedFormsSteps}
- The following list of equalities or rewriting relations hold:\\
- (i) $r^{\{n+1\}} \backslash_{rsimps} (c::s) =
- \textit{rsimp} \; (\sum (\map \; (\opterm \;r \;\_) \; (\nupdates \;
- s \; r \; [\Some \; ([c], n)])))$\\
- (ii)
- \begin{center}
- $\sum (\map \; (\opterm \; r) \; (\nupdates \; s \; r \; [
- \Some \; ([c], n)]))$ \\ $ \sequal$\\
- $\sum (\map \; (\textit{rsimp} \circ (\opterm \; r))\; (\nupdates \;
- s\;r \; [\Some \; ([c], n)]))$\\
- \end{center}
- (iii)
- \begin{center}
- $\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \;
- ([c], n)]))$\\
- $\sequal$\\
- $\sum \;(\map \; (\optermsimp r) \; (\nupdates \; s \; r \; [\Some \;
- ([c], n)])) $\\
- \end{center}
- (iv)
- \begin{center}
- $\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \;
- ([c], n)])) $ \\ $\sequal$\\
- $\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \;
- ([c], n)])) $\\
- \end{center}
- (v)
- \begin{center}
- $\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \;
- ([c], n)])) $ \\ $\sequal$\\
- $\sum \; (\map \; (\textit{rsimp} \circ (\opterm \; r)) \;
- (\nupdates \; s \; r \; [\Some \; ([c], n)]))$
- \end{center}
-\end{lemma}
-\begin{proof}
- Routine.
- (iii) and (iv) make use of the fact that all the strings $s$
- inside $\Some \; (s, m)$ which are elements of the list
- $\nupdates \; s\;r\;[\Some\; ([c], n)]$ are non-empty,
- which is from lemma \ref{nupdatesNonempty}.
- Once the string in $o = \Some \; (s, n)$ is
- nonempty, $\optermsimp \; r \;o$,
- $\optermosimp \; r \; o$ and $\optermosimp \; \; o$ are guaranteed
- to be equal.
- (v) uses \ref{nupdateInduct1}.
-\end{proof}
-\noindent
-Now we are ready to present the closed form for NTIMES:
-\begin{theorem}\label{ntimesClosedForm}
- The derivative of $r^{\{n+1\}}$ can be described as an alternative
- containing a list
- of terms:\\
- $r^{\{n+1\}} \backslash_{rsimps} (c::s) = \textit{rsimp} \; (
- \sum (\map \; (\optermsimp \; r) \; (\nupdates \; s \; r \;
- [\Some \; ([c], n)])))$
-\end{theorem}
-\begin{proof}
- By the rewriting steps described in lemma \ref{ntimesClosedFormsSteps}.
-\end{proof}
-\noindent
-The key observation for bounding this closed form
-is that the counter on $r^{\{n\}}$ will
-only decrement during derivatives:
-\begin{lemma}\label{nupdatesNLeqN}
- For an element $o$ in $\textit{set} \; (\nupdates \; s \; r \;
- [\Some \; ([c], n)])$, either $o = \None$, or $o = \Some
- \; (s', m)$ for some string $s'$ and number $m \leq n$.
-\end{lemma}
-\noindent
-The proof is routine and therefore omitted.
-This allows us to say what kind of terms
-are in the list $\textit{set} \; (\map \; (\optermsimp \; r) \; (
-\nupdates \; s \; r \; [\Some \; ([c], n)]))$:
-only $\ZERO_r$s or a sequence with the tail an $r^{\{m\}}$
-with a small $m$:
-\begin{lemma}\label{ntimesClosedFormListElemShape}
- For any element $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; (
- \nupdates \; s \; r \; [\Some \; ([c], n)]))$,
- we have that $r'$ is either $\ZERO$ or $r \backslash_{rsimps} s' \cdot
- r^{\{m\}}$ for some string $s'$ and number $m \leq n$.
-\end{lemma}
-\begin{proof}
- Using lemma \ref{nupdatesNLeqN}.
-\end{proof}
-
-\begin{theorem}\label{ntimesClosedFormBounded}
- Assuming that for any string $s$, $\llbracket r \backslash_{rsimps} s
- \rrbracket_r \leq N$ holds, then we have that\\
- $\llbracket r^{\{n+1\}} \backslash_{rsimps} s \rrbracket_r \leq
- \textit{max} \; (c_N+1)* (N + \llbracket r^{\{n\}} \rrbracket+1)$,
- where $c_N = \textit{card} \; (\textit{sizeNregex} \; (
- N + \llbracket r^{\{n\}} \rrbracket_r+1))$.
-\end{theorem}
-\begin{proof}
-We have that for all regular expressions $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; (
- \nupdates \; s \; r \; [\Some \; ([c], n)]))$,
- $r'$'s size is less than or equal to $N + \llbracket r^{\{n\}}
- \rrbracket_r + 1$
-because $r'$ can only be either a $\ZERO$ or $r \backslash_{rsimps} s' \cdot
-r^{\{m\}}$ for some string $s'$ and number
-$m \leq n$ (lemma \ref{ntimesClosedFormListElemShape}).
-In addition, we know that the list
-$\map \; (\optermsimp \; r) \; (
-\nupdates \; s \; r \; [\Some \; ([c], n)])$'s size is at most
-$c_N = \textit{card} \;
-(\sizeNregex \; ((N + \llbracket r^{\{n\}} \rrbracket) + 1))$.
-This gives us $\llbracket r \backslash_{rsimps} \;s \rrbracket_r
-\leq N * c_N$.
-\end{proof}
-
-We aim to formalise the correctness and size bound
-for constructs like $r^{\{\ldots n\}}$, $r^{\{n \ldots\}}$
-and so on, which is still work in progress.
-They should more or less follow the same recipe described in this section.
-Once we know about how to deal with them recursively using suitable auxiliary
-definitions, we are able to routinely establish the proofs.
-
%The closed form for them looks like:
%%\begin{center}