diff -r 8ffa28fce271 -r b797c9a709d9 ChengsongTanPhdThesis/Chapters/Cubic.tex --- a/ChengsongTanPhdThesis/Chapters/Cubic.tex Sat Nov 12 21:34:40 2022 +0000 +++ b/ChengsongTanPhdThesis/Chapters/Cubic.tex Thu Nov 17 23:13:57 2022 +0000 @@ -8,10 +8,14 @@ \lstset{style=myScalastyle} -This chapter is a ``miscellaneous'' -chapter which records various -extensions to our $\blexersimp$'s formalisations.\\ -Firstly we present further improvements +This chapter is a ``work-in-progress'' +chapter which records +extensions to our $\blexersimp$. +We intend to formalise this part, which +we have not been able to finish due to time constraints of the PhD. +Nevertheless, we outline the ideas we intend to use for the proof. + +We present further improvements made to our lexer algorithm $\blexersimp$. We devise a stronger simplification algorithm, called $\bsimpStrong$, which can prune away @@ -35,13 +39,6 @@ by exploring the connection between the internal data structure of our $\blexerStrong$ and Animirov's partial derivatives.\\ -Secondly, we extend our $\blexersimp$ -to support bounded repetitions ($r^{\{n\}}$). -We update our formalisation of -the correctness and finiteness properties to -include this new construct. -we can out-compete other verified lexers such as -Verbatim++ on bounded regular expressions. %We also present the idempotency property proof %of $\bsimp$, which leverages the idempotency proof of $\rsimp$. %This reinforces our claim that the fixpoint construction @@ -595,507 +592,6 @@ % SECTION 2 %---------------------------------------------------------------------------------------- -\section{Bounded Repetitions} -We have promised in chapter \ref{Introduction} -that our lexing algorithm can potentially be extended -to handle bounded repetitions -in natural and elegant ways. -Now we fulfill our promise by adding support for -the ``exactly-$n$-times'' bounded regular expression $r^{\{n\}}$. -We add clauses in our derivatives-based lexing algorithms (with simplifications) -introduced in chapter \ref{Bitcoded2}. - -\subsection{Augmented Definitions} -There are a number of definitions that need to be augmented. -The most notable one would be the POSIX rules for $r^{\{n\}}$: -\begin{center} - \begin{mathpar} - \inferrule{\forall v \in vs_1. \vdash v:r \land - |v| \neq []\\ \forall v \in vs_2. \vdash v:r \land |v| = []\\ - \textit{length} \; (vs_1 @ vs_2) = n}{\textit{Stars} \; - (vs_1 @ vs_2) : r^{\{n\}} } - \end{mathpar} -\end{center} -As Ausaf had pointed out \cite{Ausaf}, -sometimes empty iterations have to be taken to get -a match with exactly $n$ repetitions, -and hence the $vs_2$ part. - -Another important definition would be the size: -\begin{center} - \begin{tabular}{lcl} - $\llbracket r^{\{n\}} \rrbracket_r$ & $\dn$ & - $\llbracket r \rrbracket_r + n$\\ - \end{tabular} -\end{center} -\noindent -Arguably we should use $\log \; n$ for the size because -the number of digits increase logarithmically w.r.t $n$. -For simplicity we choose to add the counter directly to the size. - -The derivative w.r.t a bounded regular expression -is given as -\begin{center} - \begin{tabular}{lcl} - $r^{\{n\}} \backslash_r c$ & $\dn$ & - $r\backslash_r c \cdot r^{\{n-1\}} \;\; \textit{if} \; n \geq 1$\\ - & & $\RZERO \;\quad \quad\quad \quad - \textit{otherwise}$\\ - \end{tabular} -\end{center} -\noindent -For brevity, we sometimes use NTIMES to refer to bounded -regular expressions. -The $\mkeps$ function clause for NTIMES would be -\begin{center} - \begin{tabular}{lcl} - $\mkeps \; r^{\{n\}} $ & $\dn$ & $\Stars \; - (\textit{replicate} \; n\; (\mkeps \; r))$ - \end{tabular} -\end{center} -\noindent -The injection looks like -\begin{center} - \begin{tabular}{lcl} - $\inj \; r^{\{n\}} \; c\; (\Seq \;v \; (\Stars \; vs)) $ & - $\dn$ & $\Stars \; - ((\inj \; r \;c \;v ) :: vs)$ - \end{tabular} -\end{center} -\noindent - - -\subsection{Proofs for the Augmented Lexing Algorithm} -We need to maintain two proofs with the additional $r^{\{n\}}$ -construct: the -correctness proof in chapter \ref{Bitcoded2}, -and the finiteness proof in chapter \ref{Finite}. - -\subsubsection{Correctness Proof Augmentation} -The correctness of $\textit{lexer}$ and $\textit{blexer}$ with bounded repetitions -have been proven by Ausaf and Urban\cite{AusafDyckhoffUrban2016}. -As they have commented, once the definitions are in place, -the proofs given for the basic regular expressions will extend to -bounded regular expressions, and there are no ``surprises''. -We confirm this point because the correctness theorem would also -extend without surprise to $\blexersimp$. -The rewrite rules such as $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$ and so on -do not need to be changed, -and only a few lemmas such as lemma \ref{fltsPreserves} need to be adjusted to -add one more line which can be solved by sledgehammer -to solve the $r^{\{n\}}$ inductive case. - - -\subsubsection{Finiteness Proof Augmentation} -The bounded repetitions are -very similar to stars, and therefore the treatment -is similar, with minor changes to handle some slight complications -when the counter reaches 0. -The exponential growth is similar: -\begin{center} - \begin{tabular}{ll} - $r^{\{n\}} $ & $\longrightarrow_{\backslash c}$\\ - $(r\backslash c) \cdot - r^{\{n - 1\}}*$ & $\longrightarrow_{\backslash c'}$\\ - \\ - $r \backslash cc' \cdot r^{\{n - 2\}}* + - r \backslash c' \cdot r^{\{n - 1\}}*$ & - $\longrightarrow_{\backslash c''}$\\ - \\ - $(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* + - r \backslash c''\cdot r^{\{n-1\}}) + - (r \backslash c'c'' \cdot r^{\{n-2\}}* + - r \backslash c'' \cdot r^{\{n-1\}}*)$ & - $\longrightarrow_{\backslash c'''}$ \\ - \\ - $\ldots$\\ - \end{tabular} -\end{center} -Again, we assume that $r\backslash c$, $r \backslash cc'$ and so on -are all nullable. -The flattened list of terms for $r^{\{n\}} \backslash_{rs} s$ -\begin{center} - $[r_1 \backslash cc'c'' \cdot r^{\{n-3\}}*,\; - r \backslash c''\cdot r^{\{n-1\}}, \; - r \backslash c'c'' \cdot r^{\{n-2\}}*, \; - r \backslash c'' \cdot r^{\{n-1\}}*,\; \ldots ]$ -\end{center} -that comes from -\begin{center} - $(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* + - r \backslash c''\cdot r^{\{n-1\}}) + - (r \backslash c'c'' \cdot r^{\{n-2\}}* + - r \backslash c'' \cdot r^{\{n-1\}}*)+ \ldots$ -\end{center} -are made of sequences with different tails, where the counters -might differ. -The observation for maintaining the bound is that -these counters never exceed $n$, the original -counter. With the number of counters staying finite, -$\rDistinct$ will deduplicate and keep the list finite. -We introduce this idea as a lemma once we describe all -the necessary helper functions. - -Similar to the star case, we want -\begin{center} - $\rderssimp{r^{\{n\}}}{s} = \rsimp{\sum rs}$. -\end{center} -where $rs$ -shall be in the form of -$\map \; f \; Ss$, where $f$ is a function and -$Ss$ a list of objects to act on. -For star, the object's datatype is string. -The list of strings $Ss$ -is generated using functions -$\starupdate$ and $\starupdates$. -The function that takes a string and returns a regular expression -is the anonymous function $ -(\lambda s'. \; r\backslash s' \cdot r^{\{m\}})$. -In the NTIMES setting, -the $\starupdate$ and $\starupdates$ functions are replaced by -$\textit{nupdate}$ and $\textit{nupdates}$: -\begin{center} - \begin{tabular}{lcl} - $\nupdate \; c \; r \; [] $ & $\dn$ & $[]$\\ - $\nupdate \; c \; r \; - (\Some \; (s, \; n + 1) \; :: \; Ss)$ & $\dn$ & %\\ - $\textit{if} \; - (\rnullable \; (r \backslash_{rs} s))$ \\ - & & $\;\;\textit{then} - \;\; \Some \; (s @ [c], n + 1) :: \Some \; ([c], n) :: ( - \nupdate \; c \; r \; Ss)$ \\ - & & $\textit{else} \;\; \Some \; (s @ [c], n+1) :: ( - \nupdate \; c \; r \; Ss)$\\ - $\nupdate \; c \; r \; (\textit{None} :: Ss)$ & $\dn$ & - $(\None :: \nupdate \; c \; r \; Ss)$\\ - & & \\ - %\end{tabular} -%\end{center} -%\begin{center} - %\begin{tabular}{lcl} - $\nupdates \; [] \; r \; Ss$ & $\dn$ & $Ss$\\ - $\nupdates \; (c :: cs) \; r \; Ss$ & $\dn$ & $\nupdates \; cs \; r \; ( - \nupdate \; c \; r \; Ss)$ - \end{tabular} -\end{center} -\noindent -which take into account when a subterm -\begin{center} - $r \backslash_s s \cdot r^{\{n\}}$ -\end{center} -counter $n$ -is 0, and therefore expands to -\begin{center} -$r \backslash_s (s@[c]) \cdot r^{\{n\}} \;+ -\; \ZERO$ -\end{center} -after taking a derivative. -The object now has type -\begin{center} -$\textit{option} \;(\textit{string}, \textit{nat})$ -\end{center} -and therefore the function for converting such an option into -a regular expression term is called $\opterm$: - -\begin{center} - \begin{tabular}{lcl} - $\opterm \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ - & & $\;\;\Some \; (s, n) \Rightarrow - (r\backslash_{rs} s)\cdot r^{\{n\}}$\\ - & & $\;\;\None \Rightarrow - \ZERO$\\ - \end{tabular} -\end{center} -\noindent -Put together, the list $\map \; f \; Ss$ is instantiated as -\begin{center} - $\map \; (\opterm \; r) \; (\nupdates \; s \; r \; - [\Some \; ([c], n)])$. -\end{center} -For the closed form to be bounded, we would like -simplification to be applied to each term in the list. -Therefore we introduce some variants of $\opterm$, -which help conveniently express the rewriting steps -needed in the closed form proof. -\begin{center} - \begin{tabular}{lcl} - $\optermOsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ - & & $\;\;\Some \; (s, n) \Rightarrow - \textit{rsimp} \; ((r\backslash_{rs} s)\cdot r^{\{n\}})$\\ - & & $\;\;\None \Rightarrow - \ZERO$\\ - \\ - $\optermosimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ - & & $\;\;\Some \; (s, n) \Rightarrow - (\textit{rsimp} \; (r\backslash_{rs} s)) - \cdot r^{\{n\}}$\\ - & & $\;\;\None \Rightarrow - \ZERO$\\ - \\ - $\optermsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ - & & $\;\;\Some \; (s, n) \Rightarrow - (r\backslash_{rsimps} s)\cdot r^{\{n\}}$\\ - & & $\;\;\None \Rightarrow - \ZERO$\\ - \end{tabular} -\end{center} - - -For a list of -$\textit{option} \;(\textit{string}, \textit{nat})$ elements, -we define the highest power for it recursively: -\begin{center} - \begin{tabular}{lcl} - $\hpa \; [] \; n $ & $\dn$ & $n$\\ - $\hpa \; (\None :: os) \; n $ & $\dn$ & $\hpa \; os \; n$\\ - $\hpa \; (\Some \; (s, n) :: os) \; m$ & $\dn$ & - $\hpa \;os \; (\textit{max} \; n\; m)$\\ - \\ - $\hpower \; rs $ & $\dn$ & $\hpa \; rs \; 0$\\ - \end{tabular} -\end{center} - -Now the intuition that an NTIMES regular expression's power -does not increase can be easily expressed as -\begin{lemma}\label{nupdatesMono2} - $\hpower \; (\nupdates \;s \; r \; [\Some \; ([c], n)]) \leq n$ -\end{lemma} -\begin{proof} - Note that the power is non-increasing after a $\nupdate$ application: - \begin{center} - $\hpa \;\; (\nupdate \; c \; r \; Ss)\;\; m \leq - \hpa\; \; Ss \; m$. - \end{center} - This is also the case for $\nupdates$: - \begin{center} - $\hpa \;\; (\nupdates \; s \; r \; Ss)\;\; m \leq - \hpa\; \; Ss \; m$. - \end{center} - Therefore we have that - \begin{center} - $\hpower \;\; (\nupdates \; s \; r \; Ss) \leq - \hpower \;\; Ss$ - \end{center} - which leads to the lemma being proven. - - \end{proof} - - -We also define the inductive rules for -the shape of derivatives of the NTIMES regular expressions:\\[-3em] -\begin{center} - \begin{mathpar} - \inferrule{\mbox{}}{\cbn \;\ZERO} - - \inferrule{\mbox{}}{\cbn \; \; r_a \cdot (r^{\{n\}})} - - \inferrule{\cbn \; r_1 \;\; \; \cbn \; r_2}{\cbn \; r_1 + r_2} - - \inferrule{\cbn \; r}{\cbn \; r + \ZERO} - \end{mathpar} -\end{center} -\noindent -A derivative of NTIMES fits into the shape described by $\cbn$: -\begin{lemma}\label{ntimesDersCbn} - $\cbn \; ((r' \cdot r^{\{n\}}) \backslash_{rs} s)$ holds. -\end{lemma} -\begin{proof} - By a reverse induction on $s$. - For the inductive case, note that if $\cbn \; r$ holds, - then $\cbn \; (r\backslash_r c)$ holds. -\end{proof} -\noindent -In addition, for $\cbn$-shaped regular expressioins, one can flatten -them: -\begin{lemma}\label{ntimesHfauPushin} - If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} = - \textit{concat} \; (\map \; \hflataux{\map \; (\_\backslash_r c) \; - (\hflataux{r})})$ -\end{lemma} -\begin{proof} - By an induction on the inductive cases of $\cbn$. -\end{proof} -\noindent -This time we do not need to define the flattening functions for NTIMES only, -because $\hflat{\_}$ and $\hflataux{\_}$ work on NTIMES already. -\begin{lemma}\label{ntimesHfauInduct} -$\hflataux{( (r\backslash_r c) \cdot r^{\{n\}}) \backslash_{rsimps} s} = - \map \; (\opterm \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$ -\end{lemma} -\begin{proof} - By a reverse induction on $s$. - The lemmas \ref{ntimesHfauPushin} and \ref{ntimesDersCbn} are used. -\end{proof} -\noindent -We have a recursive property for NTIMES with $\nupdate$ -similar to that for STAR, -and one for $\nupdates $ as well: -\begin{lemma}\label{nupdateInduct1} - \mbox{} - \begin{itemize} - \item - \begin{center} - $\textit{concat} \; (\map \; (\hflataux{\_} \circ ( - \opterm \; r)) \; Ss) = \map \; (\opterm \; r) \; (\nupdate \; - c \; r \; Ss)$\\ - \end{center} - holds. -\item - \begin{center} - $\textit{concat} \; (\map \; \hflataux{\_}\; - \map \; (\_\backslash_r x) \; - (\map \; (\opterm \; r) \; (\nupdates \; xs \; r \; Ss)))$\\ - $=$\\ - $\map \; (\opterm \; r) \; (\nupdates \;(xs@[x]) \; r\;Ss)$ - \end{center} - holds. - \end{itemize} -\end{lemma} -\begin{proof} - (i) is by an induction on $Ss$. - (ii) is by an induction on $xs$. -\end{proof} -\noindent -The $\nString$ predicate is defined for conveniently -expressing that there are no empty strings in the -$\Some \;(s, n)$ elements generated by $\nupdate$: -\begin{center} - \begin{tabular}{lcl} - $\nString \; \None$ & $\dn$ & $ \textit{true}$\\ - $\nString \; (\Some \; ([], n))$ & $\dn$ & $ \textit{false}$\\ - $\nString \; (\Some \; (c::s, n))$ & $\dn$ & $ \textit{true}$\\ - \end{tabular} -\end{center} -\begin{lemma}\label{nupdatesNonempty} - If for all elements $o \in \textit{set} \; Ss$, - $\nString \; o$ holds, the we have that - for all elements $o' \in \textit{set} \; (\nupdates \; s \; r \; Ss)$, - $\nString \; o'$ holds. -\end{lemma} -\begin{proof} - By an induction on $s$, where $Ss$ is set to vary over all possible values. -\end{proof} - -\noindent - -\begin{lemma}\label{ntimesClosedFormsSteps} - The following list of equalities or rewriting relations hold:\\ - (i) $r^{\{n+1\}} \backslash_{rsimps} (c::s) = - \textit{rsimp} \; (\sum (\map \; (\opterm \;r \;\_) \; (\nupdates \; - s \; r \; [\Some \; ([c], n)])))$\\ - (ii) - \begin{center} - $\sum (\map \; (\opterm \; r) \; (\nupdates \; s \; r \; [ - \Some \; ([c], n)]))$ \\ $ \sequal$\\ - $\sum (\map \; (\textit{rsimp} \circ (\opterm \; r))\; (\nupdates \; - s\;r \; [\Some \; ([c], n)]))$\\ - \end{center} - (iii) - \begin{center} - $\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \; - ([c], n)]))$\\ - $\sequal$\\ - $\sum \;(\map \; (\optermsimp r) \; (\nupdates \; s \; r \; [\Some \; - ([c], n)])) $\\ - \end{center} - (iv) - \begin{center} - $\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \; - ([c], n)])) $ \\ $\sequal$\\ - $\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \; - ([c], n)])) $\\ - \end{center} - (v) - \begin{center} - $\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \; - ([c], n)])) $ \\ $\sequal$\\ - $\sum \; (\map \; (\textit{rsimp} \circ (\opterm \; r)) \; - (\nupdates \; s \; r \; [\Some \; ([c], n)]))$ - \end{center} -\end{lemma} -\begin{proof} - Routine. - (iii) and (iv) make use of the fact that all the strings $s$ - inside $\Some \; (s, m)$ which are elements of the list - $\nupdates \; s\;r\;[\Some\; ([c], n)]$ are non-empty, - which is from lemma \ref{nupdatesNonempty}. - Once the string in $o = \Some \; (s, n)$ is - nonempty, $\optermsimp \; r \;o$, - $\optermosimp \; r \; o$ and $\optermosimp \; \; o$ are guaranteed - to be equal. - (v) uses \ref{nupdateInduct1}. -\end{proof} -\noindent -Now we are ready to present the closed form for NTIMES: -\begin{theorem}\label{ntimesClosedForm} - The derivative of $r^{\{n+1\}}$ can be described as an alternative - containing a list - of terms:\\ - $r^{\{n+1\}} \backslash_{rsimps} (c::s) = \textit{rsimp} \; ( - \sum (\map \; (\optermsimp \; r) \; (\nupdates \; s \; r \; - [\Some \; ([c], n)])))$ -\end{theorem} -\begin{proof} - By the rewriting steps described in lemma \ref{ntimesClosedFormsSteps}. -\end{proof} -\noindent -The key observation for bounding this closed form -is that the counter on $r^{\{n\}}$ will -only decrement during derivatives: -\begin{lemma}\label{nupdatesNLeqN} - For an element $o$ in $\textit{set} \; (\nupdates \; s \; r \; - [\Some \; ([c], n)])$, either $o = \None$, or $o = \Some - \; (s', m)$ for some string $s'$ and number $m \leq n$. -\end{lemma} -\noindent -The proof is routine and therefore omitted. -This allows us to say what kind of terms -are in the list $\textit{set} \; (\map \; (\optermsimp \; r) \; ( -\nupdates \; s \; r \; [\Some \; ([c], n)]))$: -only $\ZERO_r$s or a sequence with the tail an $r^{\{m\}}$ -with a small $m$: -\begin{lemma}\label{ntimesClosedFormListElemShape} - For any element $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; ( - \nupdates \; s \; r \; [\Some \; ([c], n)]))$, - we have that $r'$ is either $\ZERO$ or $r \backslash_{rsimps} s' \cdot - r^{\{m\}}$ for some string $s'$ and number $m \leq n$. -\end{lemma} -\begin{proof} - Using lemma \ref{nupdatesNLeqN}. -\end{proof} - -\begin{theorem}\label{ntimesClosedFormBounded} - Assuming that for any string $s$, $\llbracket r \backslash_{rsimps} s - \rrbracket_r \leq N$ holds, then we have that\\ - $\llbracket r^{\{n+1\}} \backslash_{rsimps} s \rrbracket_r \leq - \textit{max} \; (c_N+1)* (N + \llbracket r^{\{n\}} \rrbracket+1)$, - where $c_N = \textit{card} \; (\textit{sizeNregex} \; ( - N + \llbracket r^{\{n\}} \rrbracket_r+1))$. -\end{theorem} -\begin{proof} -We have that for all regular expressions $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; ( - \nupdates \; s \; r \; [\Some \; ([c], n)]))$, - $r'$'s size is less than or equal to $N + \llbracket r^{\{n\}} - \rrbracket_r + 1$ -because $r'$ can only be either a $\ZERO$ or $r \backslash_{rsimps} s' \cdot -r^{\{m\}}$ for some string $s'$ and number -$m \leq n$ (lemma \ref{ntimesClosedFormListElemShape}). -In addition, we know that the list -$\map \; (\optermsimp \; r) \; ( -\nupdates \; s \; r \; [\Some \; ([c], n)])$'s size is at most -$c_N = \textit{card} \; -(\sizeNregex \; ((N + \llbracket r^{\{n\}} \rrbracket) + 1))$. -This gives us $\llbracket r \backslash_{rsimps} \;s \rrbracket_r -\leq N * c_N$. -\end{proof} - -We aim to formalise the correctness and size bound -for constructs like $r^{\{\ldots n\}}$, $r^{\{n \ldots\}}$ -and so on, which is still work in progress. -They should more or less follow the same recipe described in this section. -Once we know about how to deal with them recursively using suitable auxiliary -definitions, we are able to routinely establish the proofs. - %The closed form for them looks like: %%\begin{center}