diff -r 8ffa28fce271 -r b797c9a709d9 ChengsongTanPhdThesis/Chapters/Finite.tex --- a/ChengsongTanPhdThesis/Chapters/Finite.tex Sat Nov 12 21:34:40 2022 +0000 +++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Thu Nov 17 23:13:57 2022 +0000 @@ -45,12 +45,24 @@ only gives empirical evidence on some test cases (see Verbatim++ work\cite{Verbatimpp}). \end{itemize} +\noindent +We then extend our $\blexersimp$ +to support bounded repetitions ($r^{\{n\}}$). +We update our formalisation of +the correctness and finiteness properties to +include this new construct. +we can out-compete other verified lexers such as +Verbatim++ on bounded regular expressions. + In the next section we describe in more detail what the finite bound means in our algorithm and why the size of the internal data structures of a typical derivative-based lexer such as Sulzmann and Lu's need formal treatment. + + + \section{Formalising About Size} \noindent In our lexer ($\blexersimp$), @@ -2592,6 +2604,507 @@ % SECTION 2 %----------------------------------- +\section{Bounded Repetitions} +We have promised in chapter \ref{Introduction} +that our lexing algorithm can potentially be extended +to handle bounded repetitions +in natural and elegant ways. +Now we fulfill our promise by adding support for +the ``exactly-$n$-times'' bounded regular expression $r^{\{n\}}$. +We add clauses in our derivatives-based lexing algorithms (with simplifications) +introduced in chapter \ref{Bitcoded2}. + +\subsection{Augmented Definitions} +There are a number of definitions that need to be augmented. +The most notable one would be the POSIX rules for $r^{\{n\}}$: +\begin{center} + \begin{mathpar} + \inferrule{\forall v \in vs_1. \vdash v:r \land + |v| \neq []\\ \forall v \in vs_2. \vdash v:r \land |v| = []\\ + \textit{length} \; (vs_1 @ vs_2) = n}{\textit{Stars} \; + (vs_1 @ vs_2) : r^{\{n\}} } + \end{mathpar} +\end{center} +As Ausaf had pointed out \cite{Ausaf}, +sometimes empty iterations have to be taken to get +a match with exactly $n$ repetitions, +and hence the $vs_2$ part. + +Another important definition would be the size: +\begin{center} + \begin{tabular}{lcl} + $\llbracket r^{\{n\}} \rrbracket_r$ & $\dn$ & + $\llbracket r \rrbracket_r + n$\\ + \end{tabular} +\end{center} +\noindent +Arguably we should use $\log \; n$ for the size because +the number of digits increase logarithmically w.r.t $n$. +For simplicity we choose to add the counter directly to the size. + +The derivative w.r.t a bounded regular expression +is given as +\begin{center} + \begin{tabular}{lcl} + $r^{\{n\}} \backslash_r c$ & $\dn$ & + $r\backslash_r c \cdot r^{\{n-1\}} \;\; \textit{if} \; n \geq 1$\\ + & & $\RZERO \;\quad \quad\quad \quad + \textit{otherwise}$\\ + \end{tabular} +\end{center} +\noindent +For brevity, we sometimes use NTIMES to refer to bounded +regular expressions. +The $\mkeps$ function clause for NTIMES would be +\begin{center} + \begin{tabular}{lcl} + $\mkeps \; r^{\{n\}} $ & $\dn$ & $\Stars \; + (\textit{replicate} \; n\; (\mkeps \; r))$ + \end{tabular} +\end{center} +\noindent +The injection looks like +\begin{center} + \begin{tabular}{lcl} + $\inj \; r^{\{n\}} \; c\; (\Seq \;v \; (\Stars \; vs)) $ & + $\dn$ & $\Stars \; + ((\inj \; r \;c \;v ) :: vs)$ + \end{tabular} +\end{center} +\noindent + + +\subsection{Proofs for the Augmented Lexing Algorithm} +We need to maintain two proofs with the additional $r^{\{n\}}$ +construct: the +correctness proof in chapter \ref{Bitcoded2}, +and the finiteness proof in chapter \ref{Finite}. + +\subsubsection{Correctness Proof Augmentation} +The correctness of $\textit{lexer}$ and $\textit{blexer}$ with bounded repetitions +have been proven by Ausaf and Urban\cite{AusafDyckhoffUrban2016}. +As they have commented, once the definitions are in place, +the proofs given for the basic regular expressions will extend to +bounded regular expressions, and there are no ``surprises''. +We confirm this point because the correctness theorem would also +extend without surprise to $\blexersimp$. +The rewrite rules such as $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$ and so on +do not need to be changed, +and only a few lemmas such as lemma \ref{fltsPreserves} need to be adjusted to +add one more line which can be solved by sledgehammer +to solve the $r^{\{n\}}$ inductive case. + + +\subsubsection{Finiteness Proof Augmentation} +The bounded repetitions are +very similar to stars, and therefore the treatment +is similar, with minor changes to handle some slight complications +when the counter reaches 0. +The exponential growth is similar: +\begin{center} + \begin{tabular}{ll} + $r^{\{n\}} $ & $\longrightarrow_{\backslash c}$\\ + $(r\backslash c) \cdot + r^{\{n - 1\}}*$ & $\longrightarrow_{\backslash c'}$\\ + \\ + $r \backslash cc' \cdot r^{\{n - 2\}}* + + r \backslash c' \cdot r^{\{n - 1\}}*$ & + $\longrightarrow_{\backslash c''}$\\ + \\ + $(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* + + r \backslash c''\cdot r^{\{n-1\}}) + + (r \backslash c'c'' \cdot r^{\{n-2\}}* + + r \backslash c'' \cdot r^{\{n-1\}}*)$ & + $\longrightarrow_{\backslash c'''}$ \\ + \\ + $\ldots$\\ + \end{tabular} +\end{center} +Again, we assume that $r\backslash c$, $r \backslash cc'$ and so on +are all nullable. +The flattened list of terms for $r^{\{n\}} \backslash_{rs} s$ +\begin{center} + $[r_1 \backslash cc'c'' \cdot r^{\{n-3\}}*,\; + r \backslash c''\cdot r^{\{n-1\}}, \; + r \backslash c'c'' \cdot r^{\{n-2\}}*, \; + r \backslash c'' \cdot r^{\{n-1\}}*,\; \ldots ]$ +\end{center} +that comes from +\begin{center} + $(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* + + r \backslash c''\cdot r^{\{n-1\}}) + + (r \backslash c'c'' \cdot r^{\{n-2\}}* + + r \backslash c'' \cdot r^{\{n-1\}}*)+ \ldots$ +\end{center} +are made of sequences with different tails, where the counters +might differ. +The observation for maintaining the bound is that +these counters never exceed $n$, the original +counter. With the number of counters staying finite, +$\rDistinct$ will deduplicate and keep the list finite. +We introduce this idea as a lemma once we describe all +the necessary helper functions. + +Similar to the star case, we want +\begin{center} + $\rderssimp{r^{\{n\}}}{s} = \rsimp{\sum rs}$. +\end{center} +where $rs$ +shall be in the form of +$\map \; f \; Ss$, where $f$ is a function and +$Ss$ a list of objects to act on. +For star, the object's datatype is string. +The list of strings $Ss$ +is generated using functions +$\starupdate$ and $\starupdates$. +The function that takes a string and returns a regular expression +is the anonymous function $ +(\lambda s'. \; r\backslash s' \cdot r^{\{m\}})$. +In the NTIMES setting, +the $\starupdate$ and $\starupdates$ functions are replaced by +$\textit{nupdate}$ and $\textit{nupdates}$: +\begin{center} + \begin{tabular}{lcl} + $\nupdate \; c \; r \; [] $ & $\dn$ & $[]$\\ + $\nupdate \; c \; r \; + (\Some \; (s, \; n + 1) \; :: \; Ss)$ & $\dn$ & %\\ + $\textit{if} \; + (\rnullable \; (r \backslash_{rs} s))$ \\ + & & $\;\;\textit{then} + \;\; \Some \; (s @ [c], n + 1) :: \Some \; ([c], n) :: ( + \nupdate \; c \; r \; Ss)$ \\ + & & $\textit{else} \;\; \Some \; (s @ [c], n+1) :: ( + \nupdate \; c \; r \; Ss)$\\ + $\nupdate \; c \; r \; (\textit{None} :: Ss)$ & $\dn$ & + $(\None :: \nupdate \; c \; r \; Ss)$\\ + & & \\ + %\end{tabular} +%\end{center} +%\begin{center} + %\begin{tabular}{lcl} + $\nupdates \; [] \; r \; Ss$ & $\dn$ & $Ss$\\ + $\nupdates \; (c :: cs) \; r \; Ss$ & $\dn$ & $\nupdates \; cs \; r \; ( + \nupdate \; c \; r \; Ss)$ + \end{tabular} +\end{center} +\noindent +which take into account when a subterm +\begin{center} + $r \backslash_s s \cdot r^{\{n\}}$ +\end{center} +counter $n$ +is 0, and therefore expands to +\begin{center} +$r \backslash_s (s@[c]) \cdot r^{\{n\}} \;+ +\; \ZERO$ +\end{center} +after taking a derivative. +The object now has type +\begin{center} +$\textit{option} \;(\textit{string}, \textit{nat})$ +\end{center} +and therefore the function for converting such an option into +a regular expression term is called $\opterm$: + +\begin{center} + \begin{tabular}{lcl} + $\opterm \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ + & & $\;\;\Some \; (s, n) \Rightarrow + (r\backslash_{rs} s)\cdot r^{\{n\}}$\\ + & & $\;\;\None \Rightarrow + \ZERO$\\ + \end{tabular} +\end{center} +\noindent +Put together, the list $\map \; f \; Ss$ is instantiated as +\begin{center} + $\map \; (\opterm \; r) \; (\nupdates \; s \; r \; + [\Some \; ([c], n)])$. +\end{center} +For the closed form to be bounded, we would like +simplification to be applied to each term in the list. +Therefore we introduce some variants of $\opterm$, +which help conveniently express the rewriting steps +needed in the closed form proof. +\begin{center} + \begin{tabular}{lcl} + $\optermOsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ + & & $\;\;\Some \; (s, n) \Rightarrow + \textit{rsimp} \; ((r\backslash_{rs} s)\cdot r^{\{n\}})$\\ + & & $\;\;\None \Rightarrow + \ZERO$\\ + \\ + $\optermosimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ + & & $\;\;\Some \; (s, n) \Rightarrow + (\textit{rsimp} \; (r\backslash_{rs} s)) + \cdot r^{\{n\}}$\\ + & & $\;\;\None \Rightarrow + \ZERO$\\ + \\ + $\optermsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ + & & $\;\;\Some \; (s, n) \Rightarrow + (r\backslash_{rsimps} s)\cdot r^{\{n\}}$\\ + & & $\;\;\None \Rightarrow + \ZERO$\\ + \end{tabular} +\end{center} + + +For a list of +$\textit{option} \;(\textit{string}, \textit{nat})$ elements, +we define the highest power for it recursively: +\begin{center} + \begin{tabular}{lcl} + $\hpa \; [] \; n $ & $\dn$ & $n$\\ + $\hpa \; (\None :: os) \; n $ & $\dn$ & $\hpa \; os \; n$\\ + $\hpa \; (\Some \; (s, n) :: os) \; m$ & $\dn$ & + $\hpa \;os \; (\textit{max} \; n\; m)$\\ + \\ + $\hpower \; rs $ & $\dn$ & $\hpa \; rs \; 0$\\ + \end{tabular} +\end{center} + +Now the intuition that an NTIMES regular expression's power +does not increase can be easily expressed as +\begin{lemma}\label{nupdatesMono2} + $\hpower \; (\nupdates \;s \; r \; [\Some \; ([c], n)]) \leq n$ +\end{lemma} +\begin{proof} + Note that the power is non-increasing after a $\nupdate$ application: + \begin{center} + $\hpa \;\; (\nupdate \; c \; r \; Ss)\;\; m \leq + \hpa\; \; Ss \; m$. + \end{center} + This is also the case for $\nupdates$: + \begin{center} + $\hpa \;\; (\nupdates \; s \; r \; Ss)\;\; m \leq + \hpa\; \; Ss \; m$. + \end{center} + Therefore we have that + \begin{center} + $\hpower \;\; (\nupdates \; s \; r \; Ss) \leq + \hpower \;\; Ss$ + \end{center} + which leads to the lemma being proven. + + \end{proof} + + +We also define the inductive rules for +the shape of derivatives of the NTIMES regular expressions:\\[-3em] +\begin{center} + \begin{mathpar} + \inferrule{\mbox{}}{\cbn \;\ZERO} + + \inferrule{\mbox{}}{\cbn \; \; r_a \cdot (r^{\{n\}})} + + \inferrule{\cbn \; r_1 \;\; \; \cbn \; r_2}{\cbn \; r_1 + r_2} + + \inferrule{\cbn \; r}{\cbn \; r + \ZERO} + \end{mathpar} +\end{center} +\noindent +A derivative of NTIMES fits into the shape described by $\cbn$: +\begin{lemma}\label{ntimesDersCbn} + $\cbn \; ((r' \cdot r^{\{n\}}) \backslash_{rs} s)$ holds. +\end{lemma} +\begin{proof} + By a reverse induction on $s$. + For the inductive case, note that if $\cbn \; r$ holds, + then $\cbn \; (r\backslash_r c)$ holds. +\end{proof} +\noindent +In addition, for $\cbn$-shaped regular expressioins, one can flatten +them: +\begin{lemma}\label{ntimesHfauPushin} + If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} = + \textit{concat} \; (\map \; \hflataux{\map \; (\_\backslash_r c) \; + (\hflataux{r})})$ +\end{lemma} +\begin{proof} + By an induction on the inductive cases of $\cbn$. +\end{proof} +\noindent +This time we do not need to define the flattening functions for NTIMES only, +because $\hflat{\_}$ and $\hflataux{\_}$ work on NTIMES already. +\begin{lemma}\label{ntimesHfauInduct} +$\hflataux{( (r\backslash_r c) \cdot r^{\{n\}}) \backslash_{rsimps} s} = + \map \; (\opterm \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$ +\end{lemma} +\begin{proof} + By a reverse induction on $s$. + The lemmas \ref{ntimesHfauPushin} and \ref{ntimesDersCbn} are used. +\end{proof} +\noindent +We have a recursive property for NTIMES with $\nupdate$ +similar to that for STAR, +and one for $\nupdates $ as well: +\begin{lemma}\label{nupdateInduct1} + \mbox{} + \begin{itemize} + \item + \begin{center} + $\textit{concat} \; (\map \; (\hflataux{\_} \circ ( + \opterm \; r)) \; Ss) = \map \; (\opterm \; r) \; (\nupdate \; + c \; r \; Ss)$\\ + \end{center} + holds. +\item + \begin{center} + $\textit{concat} \; (\map \; \hflataux{\_}\; + \map \; (\_\backslash_r x) \; + (\map \; (\opterm \; r) \; (\nupdates \; xs \; r \; Ss)))$\\ + $=$\\ + $\map \; (\opterm \; r) \; (\nupdates \;(xs@[x]) \; r\;Ss)$ + \end{center} + holds. + \end{itemize} +\end{lemma} +\begin{proof} + (i) is by an induction on $Ss$. + (ii) is by an induction on $xs$. +\end{proof} +\noindent +The $\nString$ predicate is defined for conveniently +expressing that there are no empty strings in the +$\Some \;(s, n)$ elements generated by $\nupdate$: +\begin{center} + \begin{tabular}{lcl} + $\nString \; \None$ & $\dn$ & $ \textit{true}$\\ + $\nString \; (\Some \; ([], n))$ & $\dn$ & $ \textit{false}$\\ + $\nString \; (\Some \; (c::s, n))$ & $\dn$ & $ \textit{true}$\\ + \end{tabular} +\end{center} +\begin{lemma}\label{nupdatesNonempty} + If for all elements $o \in \textit{set} \; Ss$, + $\nString \; o$ holds, the we have that + for all elements $o' \in \textit{set} \; (\nupdates \; s \; r \; Ss)$, + $\nString \; o'$ holds. +\end{lemma} +\begin{proof} + By an induction on $s$, where $Ss$ is set to vary over all possible values. +\end{proof} + +\noindent + +\begin{lemma}\label{ntimesClosedFormsSteps} + The following list of equalities or rewriting relations hold:\\ + (i) $r^{\{n+1\}} \backslash_{rsimps} (c::s) = + \textit{rsimp} \; (\sum (\map \; (\opterm \;r \;\_) \; (\nupdates \; + s \; r \; [\Some \; ([c], n)])))$\\ + (ii) + \begin{center} + $\sum (\map \; (\opterm \; r) \; (\nupdates \; s \; r \; [ + \Some \; ([c], n)]))$ \\ $ \sequal$\\ + $\sum (\map \; (\textit{rsimp} \circ (\opterm \; r))\; (\nupdates \; + s\;r \; [\Some \; ([c], n)]))$\\ + \end{center} + (iii) + \begin{center} + $\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \; + ([c], n)]))$\\ + $\sequal$\\ + $\sum \;(\map \; (\optermsimp r) \; (\nupdates \; s \; r \; [\Some \; + ([c], n)])) $\\ + \end{center} + (iv) + \begin{center} + $\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \; + ([c], n)])) $ \\ $\sequal$\\ + $\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \; + ([c], n)])) $\\ + \end{center} + (v) + \begin{center} + $\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \; + ([c], n)])) $ \\ $\sequal$\\ + $\sum \; (\map \; (\textit{rsimp} \circ (\opterm \; r)) \; + (\nupdates \; s \; r \; [\Some \; ([c], n)]))$ + \end{center} +\end{lemma} +\begin{proof} + Routine. + (iii) and (iv) make use of the fact that all the strings $s$ + inside $\Some \; (s, m)$ which are elements of the list + $\nupdates \; s\;r\;[\Some\; ([c], n)]$ are non-empty, + which is from lemma \ref{nupdatesNonempty}. + Once the string in $o = \Some \; (s, n)$ is + nonempty, $\optermsimp \; r \;o$, + $\optermosimp \; r \; o$ and $\optermosimp \; \; o$ are guaranteed + to be equal. + (v) uses \ref{nupdateInduct1}. +\end{proof} +\noindent +Now we are ready to present the closed form for NTIMES: +\begin{theorem}\label{ntimesClosedForm} + The derivative of $r^{\{n+1\}}$ can be described as an alternative + containing a list + of terms:\\ + $r^{\{n+1\}} \backslash_{rsimps} (c::s) = \textit{rsimp} \; ( + \sum (\map \; (\optermsimp \; r) \; (\nupdates \; s \; r \; + [\Some \; ([c], n)])))$ +\end{theorem} +\begin{proof} + By the rewriting steps described in lemma \ref{ntimesClosedFormsSteps}. +\end{proof} +\noindent +The key observation for bounding this closed form +is that the counter on $r^{\{n\}}$ will +only decrement during derivatives: +\begin{lemma}\label{nupdatesNLeqN} + For an element $o$ in $\textit{set} \; (\nupdates \; s \; r \; + [\Some \; ([c], n)])$, either $o = \None$, or $o = \Some + \; (s', m)$ for some string $s'$ and number $m \leq n$. +\end{lemma} +\noindent +The proof is routine and therefore omitted. +This allows us to say what kind of terms +are in the list $\textit{set} \; (\map \; (\optermsimp \; r) \; ( +\nupdates \; s \; r \; [\Some \; ([c], n)]))$: +only $\ZERO_r$s or a sequence with the tail an $r^{\{m\}}$ +with a small $m$: +\begin{lemma}\label{ntimesClosedFormListElemShape} + For any element $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; ( + \nupdates \; s \; r \; [\Some \; ([c], n)]))$, + we have that $r'$ is either $\ZERO$ or $r \backslash_{rsimps} s' \cdot + r^{\{m\}}$ for some string $s'$ and number $m \leq n$. +\end{lemma} +\begin{proof} + Using lemma \ref{nupdatesNLeqN}. +\end{proof} + +\begin{theorem}\label{ntimesClosedFormBounded} + Assuming that for any string $s$, $\llbracket r \backslash_{rsimps} s + \rrbracket_r \leq N$ holds, then we have that\\ + $\llbracket r^{\{n+1\}} \backslash_{rsimps} s \rrbracket_r \leq + \textit{max} \; (c_N+1)* (N + \llbracket r^{\{n\}} \rrbracket+1)$, + where $c_N = \textit{card} \; (\textit{sizeNregex} \; ( + N + \llbracket r^{\{n\}} \rrbracket_r+1))$. +\end{theorem} +\begin{proof} +We have that for all regular expressions $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; ( + \nupdates \; s \; r \; [\Some \; ([c], n)]))$, + $r'$'s size is less than or equal to $N + \llbracket r^{\{n\}} + \rrbracket_r + 1$ +because $r'$ can only be either a $\ZERO$ or $r \backslash_{rsimps} s' \cdot +r^{\{m\}}$ for some string $s'$ and number +$m \leq n$ (lemma \ref{ntimesClosedFormListElemShape}). +In addition, we know that the list +$\map \; (\optermsimp \; r) \; ( +\nupdates \; s \; r \; [\Some \; ([c], n)])$'s size is at most +$c_N = \textit{card} \; +(\sizeNregex \; ((N + \llbracket r^{\{n\}} \rrbracket) + 1))$. +This gives us $\llbracket r \backslash_{rsimps} \;s \rrbracket_r +\leq N * c_N$. +\end{proof} + +We aim to formalise the correctness and size bound +for constructs like $r^{\{\ldots n\}}$, $r^{\{n \ldots\}}$ +and so on, which is still work in progress. +They should more or less follow the same recipe described in this section. +Once we know about how to deal with them recursively using suitable auxiliary +definitions, we are able to routinely establish the proofs. + %---------------------------------------------------------------------------------------- % SECTION 3