--- a/ChengsongTanPhdThesis/Chapters/Cubic.tex Sat Nov 12 21:34:40 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Cubic.tex Thu Nov 17 23:13:57 2022 +0000
@@ -8,10 +8,14 @@
\lstset{style=myScalastyle}
-This chapter is a ``miscellaneous''
-chapter which records various
-extensions to our $\blexersimp$'s formalisations.\\
-Firstly we present further improvements
+This chapter is a ``work-in-progress''
+chapter which records
+extensions to our $\blexersimp$.
+We intend to formalise this part, which
+we have not been able to finish due to time constraints of the PhD.
+Nevertheless, we outline the ideas we intend to use for the proof.
+
+We present further improvements
made to our lexer algorithm $\blexersimp$.
We devise a stronger simplification algorithm,
called $\bsimpStrong$, which can prune away
@@ -35,13 +39,6 @@
by exploring the connection between the internal
data structure of our $\blexerStrong$ and
Animirov's partial derivatives.\\
-Secondly, we extend our $\blexersimp$
-to support bounded repetitions ($r^{\{n\}}$).
-We update our formalisation of
-the correctness and finiteness properties to
-include this new construct.
-we can out-compete other verified lexers such as
-Verbatim++ on bounded regular expressions.
%We also present the idempotency property proof
%of $\bsimp$, which leverages the idempotency proof of $\rsimp$.
%This reinforces our claim that the fixpoint construction
@@ -595,507 +592,6 @@
% SECTION 2
%----------------------------------------------------------------------------------------
-\section{Bounded Repetitions}
-We have promised in chapter \ref{Introduction}
-that our lexing algorithm can potentially be extended
-to handle bounded repetitions
-in natural and elegant ways.
-Now we fulfill our promise by adding support for
-the ``exactly-$n$-times'' bounded regular expression $r^{\{n\}}$.
-We add clauses in our derivatives-based lexing algorithms (with simplifications)
-introduced in chapter \ref{Bitcoded2}.
-
-\subsection{Augmented Definitions}
-There are a number of definitions that need to be augmented.
-The most notable one would be the POSIX rules for $r^{\{n\}}$:
-\begin{center}
- \begin{mathpar}
- \inferrule{\forall v \in vs_1. \vdash v:r \land
- |v| \neq []\\ \forall v \in vs_2. \vdash v:r \land |v| = []\\
- \textit{length} \; (vs_1 @ vs_2) = n}{\textit{Stars} \;
- (vs_1 @ vs_2) : r^{\{n\}} }
- \end{mathpar}
-\end{center}
-As Ausaf had pointed out \cite{Ausaf},
-sometimes empty iterations have to be taken to get
-a match with exactly $n$ repetitions,
-and hence the $vs_2$ part.
-
-Another important definition would be the size:
-\begin{center}
- \begin{tabular}{lcl}
- $\llbracket r^{\{n\}} \rrbracket_r$ & $\dn$ &
- $\llbracket r \rrbracket_r + n$\\
- \end{tabular}
-\end{center}
-\noindent
-Arguably we should use $\log \; n$ for the size because
-the number of digits increase logarithmically w.r.t $n$.
-For simplicity we choose to add the counter directly to the size.
-
-The derivative w.r.t a bounded regular expression
-is given as
-\begin{center}
- \begin{tabular}{lcl}
- $r^{\{n\}} \backslash_r c$ & $\dn$ &
- $r\backslash_r c \cdot r^{\{n-1\}} \;\; \textit{if} \; n \geq 1$\\
- & & $\RZERO \;\quad \quad\quad \quad
- \textit{otherwise}$\\
- \end{tabular}
-\end{center}
-\noindent
-For brevity, we sometimes use NTIMES to refer to bounded
-regular expressions.
-The $\mkeps$ function clause for NTIMES would be
-\begin{center}
- \begin{tabular}{lcl}
- $\mkeps \; r^{\{n\}} $ & $\dn$ & $\Stars \;
- (\textit{replicate} \; n\; (\mkeps \; r))$
- \end{tabular}
-\end{center}
-\noindent
-The injection looks like
-\begin{center}
- \begin{tabular}{lcl}
- $\inj \; r^{\{n\}} \; c\; (\Seq \;v \; (\Stars \; vs)) $ &
- $\dn$ & $\Stars \;
- ((\inj \; r \;c \;v ) :: vs)$
- \end{tabular}
-\end{center}
-\noindent
-
-
-\subsection{Proofs for the Augmented Lexing Algorithm}
-We need to maintain two proofs with the additional $r^{\{n\}}$
-construct: the
-correctness proof in chapter \ref{Bitcoded2},
-and the finiteness proof in chapter \ref{Finite}.
-
-\subsubsection{Correctness Proof Augmentation}
-The correctness of $\textit{lexer}$ and $\textit{blexer}$ with bounded repetitions
-have been proven by Ausaf and Urban\cite{AusafDyckhoffUrban2016}.
-As they have commented, once the definitions are in place,
-the proofs given for the basic regular expressions will extend to
-bounded regular expressions, and there are no ``surprises''.
-We confirm this point because the correctness theorem would also
-extend without surprise to $\blexersimp$.
-The rewrite rules such as $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$ and so on
-do not need to be changed,
-and only a few lemmas such as lemma \ref{fltsPreserves} need to be adjusted to
-add one more line which can be solved by sledgehammer
-to solve the $r^{\{n\}}$ inductive case.
-
-
-\subsubsection{Finiteness Proof Augmentation}
-The bounded repetitions are
-very similar to stars, and therefore the treatment
-is similar, with minor changes to handle some slight complications
-when the counter reaches 0.
-The exponential growth is similar:
-\begin{center}
- \begin{tabular}{ll}
- $r^{\{n\}} $ & $\longrightarrow_{\backslash c}$\\
- $(r\backslash c) \cdot
- r^{\{n - 1\}}*$ & $\longrightarrow_{\backslash c'}$\\
- \\
- $r \backslash cc' \cdot r^{\{n - 2\}}* +
- r \backslash c' \cdot r^{\{n - 1\}}*$ &
- $\longrightarrow_{\backslash c''}$\\
- \\
- $(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* +
- r \backslash c''\cdot r^{\{n-1\}}) +
- (r \backslash c'c'' \cdot r^{\{n-2\}}* +
- r \backslash c'' \cdot r^{\{n-1\}}*)$ &
- $\longrightarrow_{\backslash c'''}$ \\
- \\
- $\ldots$\\
- \end{tabular}
-\end{center}
-Again, we assume that $r\backslash c$, $r \backslash cc'$ and so on
-are all nullable.
-The flattened list of terms for $r^{\{n\}} \backslash_{rs} s$
-\begin{center}
- $[r_1 \backslash cc'c'' \cdot r^{\{n-3\}}*,\;
- r \backslash c''\cdot r^{\{n-1\}}, \;
- r \backslash c'c'' \cdot r^{\{n-2\}}*, \;
- r \backslash c'' \cdot r^{\{n-1\}}*,\; \ldots ]$
-\end{center}
-that comes from
-\begin{center}
- $(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* +
- r \backslash c''\cdot r^{\{n-1\}}) +
- (r \backslash c'c'' \cdot r^{\{n-2\}}* +
- r \backslash c'' \cdot r^{\{n-1\}}*)+ \ldots$
-\end{center}
-are made of sequences with different tails, where the counters
-might differ.
-The observation for maintaining the bound is that
-these counters never exceed $n$, the original
-counter. With the number of counters staying finite,
-$\rDistinct$ will deduplicate and keep the list finite.
-We introduce this idea as a lemma once we describe all
-the necessary helper functions.
-
-Similar to the star case, we want
-\begin{center}
- $\rderssimp{r^{\{n\}}}{s} = \rsimp{\sum rs}$.
-\end{center}
-where $rs$
-shall be in the form of
-$\map \; f \; Ss$, where $f$ is a function and
-$Ss$ a list of objects to act on.
-For star, the object's datatype is string.
-The list of strings $Ss$
-is generated using functions
-$\starupdate$ and $\starupdates$.
-The function that takes a string and returns a regular expression
-is the anonymous function $
-(\lambda s'. \; r\backslash s' \cdot r^{\{m\}})$.
-In the NTIMES setting,
-the $\starupdate$ and $\starupdates$ functions are replaced by
-$\textit{nupdate}$ and $\textit{nupdates}$:
-\begin{center}
- \begin{tabular}{lcl}
- $\nupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
- $\nupdate \; c \; r \;
- (\Some \; (s, \; n + 1) \; :: \; Ss)$ & $\dn$ & %\\
- $\textit{if} \;
- (\rnullable \; (r \backslash_{rs} s))$ \\
- & & $\;\;\textit{then}
- \;\; \Some \; (s @ [c], n + 1) :: \Some \; ([c], n) :: (
- \nupdate \; c \; r \; Ss)$ \\
- & & $\textit{else} \;\; \Some \; (s @ [c], n+1) :: (
- \nupdate \; c \; r \; Ss)$\\
- $\nupdate \; c \; r \; (\textit{None} :: Ss)$ & $\dn$ &
- $(\None :: \nupdate \; c \; r \; Ss)$\\
- & & \\
- %\end{tabular}
-%\end{center}
-%\begin{center}
- %\begin{tabular}{lcl}
- $\nupdates \; [] \; r \; Ss$ & $\dn$ & $Ss$\\
- $\nupdates \; (c :: cs) \; r \; Ss$ & $\dn$ & $\nupdates \; cs \; r \; (
- \nupdate \; c \; r \; Ss)$
- \end{tabular}
-\end{center}
-\noindent
-which take into account when a subterm
-\begin{center}
- $r \backslash_s s \cdot r^{\{n\}}$
-\end{center}
-counter $n$
-is 0, and therefore expands to
-\begin{center}
-$r \backslash_s (s@[c]) \cdot r^{\{n\}} \;+
-\; \ZERO$
-\end{center}
-after taking a derivative.
-The object now has type
-\begin{center}
-$\textit{option} \;(\textit{string}, \textit{nat})$
-\end{center}
-and therefore the function for converting such an option into
-a regular expression term is called $\opterm$:
-
-\begin{center}
- \begin{tabular}{lcl}
- $\opterm \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
- & & $\;\;\Some \; (s, n) \Rightarrow
- (r\backslash_{rs} s)\cdot r^{\{n\}}$\\
- & & $\;\;\None \Rightarrow
- \ZERO$\\
- \end{tabular}
-\end{center}
-\noindent
-Put together, the list $\map \; f \; Ss$ is instantiated as
-\begin{center}
- $\map \; (\opterm \; r) \; (\nupdates \; s \; r \;
- [\Some \; ([c], n)])$.
-\end{center}
-For the closed form to be bounded, we would like
-simplification to be applied to each term in the list.
-Therefore we introduce some variants of $\opterm$,
-which help conveniently express the rewriting steps
-needed in the closed form proof.
-\begin{center}
- \begin{tabular}{lcl}
- $\optermOsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
- & & $\;\;\Some \; (s, n) \Rightarrow
- \textit{rsimp} \; ((r\backslash_{rs} s)\cdot r^{\{n\}})$\\
- & & $\;\;\None \Rightarrow
- \ZERO$\\
- \\
- $\optermosimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
- & & $\;\;\Some \; (s, n) \Rightarrow
- (\textit{rsimp} \; (r\backslash_{rs} s))
- \cdot r^{\{n\}}$\\
- & & $\;\;\None \Rightarrow
- \ZERO$\\
- \\
- $\optermsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
- & & $\;\;\Some \; (s, n) \Rightarrow
- (r\backslash_{rsimps} s)\cdot r^{\{n\}}$\\
- & & $\;\;\None \Rightarrow
- \ZERO$\\
- \end{tabular}
-\end{center}
-
-
-For a list of
-$\textit{option} \;(\textit{string}, \textit{nat})$ elements,
-we define the highest power for it recursively:
-\begin{center}
- \begin{tabular}{lcl}
- $\hpa \; [] \; n $ & $\dn$ & $n$\\
- $\hpa \; (\None :: os) \; n $ & $\dn$ & $\hpa \; os \; n$\\
- $\hpa \; (\Some \; (s, n) :: os) \; m$ & $\dn$ &
- $\hpa \;os \; (\textit{max} \; n\; m)$\\
- \\
- $\hpower \; rs $ & $\dn$ & $\hpa \; rs \; 0$\\
- \end{tabular}
-\end{center}
-
-Now the intuition that an NTIMES regular expression's power
-does not increase can be easily expressed as
-\begin{lemma}\label{nupdatesMono2}
- $\hpower \; (\nupdates \;s \; r \; [\Some \; ([c], n)]) \leq n$
-\end{lemma}
-\begin{proof}
- Note that the power is non-increasing after a $\nupdate$ application:
- \begin{center}
- $\hpa \;\; (\nupdate \; c \; r \; Ss)\;\; m \leq
- \hpa\; \; Ss \; m$.
- \end{center}
- This is also the case for $\nupdates$:
- \begin{center}
- $\hpa \;\; (\nupdates \; s \; r \; Ss)\;\; m \leq
- \hpa\; \; Ss \; m$.
- \end{center}
- Therefore we have that
- \begin{center}
- $\hpower \;\; (\nupdates \; s \; r \; Ss) \leq
- \hpower \;\; Ss$
- \end{center}
- which leads to the lemma being proven.
-
- \end{proof}
-
-
-We also define the inductive rules for
-the shape of derivatives of the NTIMES regular expressions:\\[-3em]
-\begin{center}
- \begin{mathpar}
- \inferrule{\mbox{}}{\cbn \;\ZERO}
-
- \inferrule{\mbox{}}{\cbn \; \; r_a \cdot (r^{\{n\}})}
-
- \inferrule{\cbn \; r_1 \;\; \; \cbn \; r_2}{\cbn \; r_1 + r_2}
-
- \inferrule{\cbn \; r}{\cbn \; r + \ZERO}
- \end{mathpar}
-\end{center}
-\noindent
-A derivative of NTIMES fits into the shape described by $\cbn$:
-\begin{lemma}\label{ntimesDersCbn}
- $\cbn \; ((r' \cdot r^{\{n\}}) \backslash_{rs} s)$ holds.
-\end{lemma}
-\begin{proof}
- By a reverse induction on $s$.
- For the inductive case, note that if $\cbn \; r$ holds,
- then $\cbn \; (r\backslash_r c)$ holds.
-\end{proof}
-\noindent
-In addition, for $\cbn$-shaped regular expressioins, one can flatten
-them:
-\begin{lemma}\label{ntimesHfauPushin}
- If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} =
- \textit{concat} \; (\map \; \hflataux{\map \; (\_\backslash_r c) \;
- (\hflataux{r})})$
-\end{lemma}
-\begin{proof}
- By an induction on the inductive cases of $\cbn$.
-\end{proof}
-\noindent
-This time we do not need to define the flattening functions for NTIMES only,
-because $\hflat{\_}$ and $\hflataux{\_}$ work on NTIMES already.
-\begin{lemma}\label{ntimesHfauInduct}
-$\hflataux{( (r\backslash_r c) \cdot r^{\{n\}}) \backslash_{rsimps} s} =
- \map \; (\opterm \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$
-\end{lemma}
-\begin{proof}
- By a reverse induction on $s$.
- The lemmas \ref{ntimesHfauPushin} and \ref{ntimesDersCbn} are used.
-\end{proof}
-\noindent
-We have a recursive property for NTIMES with $\nupdate$
-similar to that for STAR,
-and one for $\nupdates $ as well:
-\begin{lemma}\label{nupdateInduct1}
- \mbox{}
- \begin{itemize}
- \item
- \begin{center}
- $\textit{concat} \; (\map \; (\hflataux{\_} \circ (
- \opterm \; r)) \; Ss) = \map \; (\opterm \; r) \; (\nupdate \;
- c \; r \; Ss)$\\
- \end{center}
- holds.
-\item
- \begin{center}
- $\textit{concat} \; (\map \; \hflataux{\_}\;
- \map \; (\_\backslash_r x) \;
- (\map \; (\opterm \; r) \; (\nupdates \; xs \; r \; Ss)))$\\
- $=$\\
- $\map \; (\opterm \; r) \; (\nupdates \;(xs@[x]) \; r\;Ss)$
- \end{center}
- holds.
- \end{itemize}
-\end{lemma}
-\begin{proof}
- (i) is by an induction on $Ss$.
- (ii) is by an induction on $xs$.
-\end{proof}
-\noindent
-The $\nString$ predicate is defined for conveniently
-expressing that there are no empty strings in the
-$\Some \;(s, n)$ elements generated by $\nupdate$:
-\begin{center}
- \begin{tabular}{lcl}
- $\nString \; \None$ & $\dn$ & $ \textit{true}$\\
- $\nString \; (\Some \; ([], n))$ & $\dn$ & $ \textit{false}$\\
- $\nString \; (\Some \; (c::s, n))$ & $\dn$ & $ \textit{true}$\\
- \end{tabular}
-\end{center}
-\begin{lemma}\label{nupdatesNonempty}
- If for all elements $o \in \textit{set} \; Ss$,
- $\nString \; o$ holds, the we have that
- for all elements $o' \in \textit{set} \; (\nupdates \; s \; r \; Ss)$,
- $\nString \; o'$ holds.
-\end{lemma}
-\begin{proof}
- By an induction on $s$, where $Ss$ is set to vary over all possible values.
-\end{proof}
-
-\noindent
-
-\begin{lemma}\label{ntimesClosedFormsSteps}
- The following list of equalities or rewriting relations hold:\\
- (i) $r^{\{n+1\}} \backslash_{rsimps} (c::s) =
- \textit{rsimp} \; (\sum (\map \; (\opterm \;r \;\_) \; (\nupdates \;
- s \; r \; [\Some \; ([c], n)])))$\\
- (ii)
- \begin{center}
- $\sum (\map \; (\opterm \; r) \; (\nupdates \; s \; r \; [
- \Some \; ([c], n)]))$ \\ $ \sequal$\\
- $\sum (\map \; (\textit{rsimp} \circ (\opterm \; r))\; (\nupdates \;
- s\;r \; [\Some \; ([c], n)]))$\\
- \end{center}
- (iii)
- \begin{center}
- $\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \;
- ([c], n)]))$\\
- $\sequal$\\
- $\sum \;(\map \; (\optermsimp r) \; (\nupdates \; s \; r \; [\Some \;
- ([c], n)])) $\\
- \end{center}
- (iv)
- \begin{center}
- $\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \;
- ([c], n)])) $ \\ $\sequal$\\
- $\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \;
- ([c], n)])) $\\
- \end{center}
- (v)
- \begin{center}
- $\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \;
- ([c], n)])) $ \\ $\sequal$\\
- $\sum \; (\map \; (\textit{rsimp} \circ (\opterm \; r)) \;
- (\nupdates \; s \; r \; [\Some \; ([c], n)]))$
- \end{center}
-\end{lemma}
-\begin{proof}
- Routine.
- (iii) and (iv) make use of the fact that all the strings $s$
- inside $\Some \; (s, m)$ which are elements of the list
- $\nupdates \; s\;r\;[\Some\; ([c], n)]$ are non-empty,
- which is from lemma \ref{nupdatesNonempty}.
- Once the string in $o = \Some \; (s, n)$ is
- nonempty, $\optermsimp \; r \;o$,
- $\optermosimp \; r \; o$ and $\optermosimp \; \; o$ are guaranteed
- to be equal.
- (v) uses \ref{nupdateInduct1}.
-\end{proof}
-\noindent
-Now we are ready to present the closed form for NTIMES:
-\begin{theorem}\label{ntimesClosedForm}
- The derivative of $r^{\{n+1\}}$ can be described as an alternative
- containing a list
- of terms:\\
- $r^{\{n+1\}} \backslash_{rsimps} (c::s) = \textit{rsimp} \; (
- \sum (\map \; (\optermsimp \; r) \; (\nupdates \; s \; r \;
- [\Some \; ([c], n)])))$
-\end{theorem}
-\begin{proof}
- By the rewriting steps described in lemma \ref{ntimesClosedFormsSteps}.
-\end{proof}
-\noindent
-The key observation for bounding this closed form
-is that the counter on $r^{\{n\}}$ will
-only decrement during derivatives:
-\begin{lemma}\label{nupdatesNLeqN}
- For an element $o$ in $\textit{set} \; (\nupdates \; s \; r \;
- [\Some \; ([c], n)])$, either $o = \None$, or $o = \Some
- \; (s', m)$ for some string $s'$ and number $m \leq n$.
-\end{lemma}
-\noindent
-The proof is routine and therefore omitted.
-This allows us to say what kind of terms
-are in the list $\textit{set} \; (\map \; (\optermsimp \; r) \; (
-\nupdates \; s \; r \; [\Some \; ([c], n)]))$:
-only $\ZERO_r$s or a sequence with the tail an $r^{\{m\}}$
-with a small $m$:
-\begin{lemma}\label{ntimesClosedFormListElemShape}
- For any element $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; (
- \nupdates \; s \; r \; [\Some \; ([c], n)]))$,
- we have that $r'$ is either $\ZERO$ or $r \backslash_{rsimps} s' \cdot
- r^{\{m\}}$ for some string $s'$ and number $m \leq n$.
-\end{lemma}
-\begin{proof}
- Using lemma \ref{nupdatesNLeqN}.
-\end{proof}
-
-\begin{theorem}\label{ntimesClosedFormBounded}
- Assuming that for any string $s$, $\llbracket r \backslash_{rsimps} s
- \rrbracket_r \leq N$ holds, then we have that\\
- $\llbracket r^{\{n+1\}} \backslash_{rsimps} s \rrbracket_r \leq
- \textit{max} \; (c_N+1)* (N + \llbracket r^{\{n\}} \rrbracket+1)$,
- where $c_N = \textit{card} \; (\textit{sizeNregex} \; (
- N + \llbracket r^{\{n\}} \rrbracket_r+1))$.
-\end{theorem}
-\begin{proof}
-We have that for all regular expressions $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; (
- \nupdates \; s \; r \; [\Some \; ([c], n)]))$,
- $r'$'s size is less than or equal to $N + \llbracket r^{\{n\}}
- \rrbracket_r + 1$
-because $r'$ can only be either a $\ZERO$ or $r \backslash_{rsimps} s' \cdot
-r^{\{m\}}$ for some string $s'$ and number
-$m \leq n$ (lemma \ref{ntimesClosedFormListElemShape}).
-In addition, we know that the list
-$\map \; (\optermsimp \; r) \; (
-\nupdates \; s \; r \; [\Some \; ([c], n)])$'s size is at most
-$c_N = \textit{card} \;
-(\sizeNregex \; ((N + \llbracket r^{\{n\}} \rrbracket) + 1))$.
-This gives us $\llbracket r \backslash_{rsimps} \;s \rrbracket_r
-\leq N * c_N$.
-\end{proof}
-
-We aim to formalise the correctness and size bound
-for constructs like $r^{\{\ldots n\}}$, $r^{\{n \ldots\}}$
-and so on, which is still work in progress.
-They should more or less follow the same recipe described in this section.
-Once we know about how to deal with them recursively using suitable auxiliary
-definitions, we are able to routinely establish the proofs.
-
%The closed form for them looks like:
%%\begin{center}
--- a/ChengsongTanPhdThesis/Chapters/Finite.tex Sat Nov 12 21:34:40 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Thu Nov 17 23:13:57 2022 +0000
@@ -45,12 +45,24 @@
only gives empirical evidence on
some test cases (see Verbatim++ work\cite{Verbatimpp}).
\end{itemize}
+\noindent
+We then extend our $\blexersimp$
+to support bounded repetitions ($r^{\{n\}}$).
+We update our formalisation of
+the correctness and finiteness properties to
+include this new construct.
+we can out-compete other verified lexers such as
+Verbatim++ on bounded regular expressions.
+
In the next section we describe in more detail
what the finite bound means in our algorithm
and why the size of the internal data structures of
a typical derivative-based lexer such as
Sulzmann and Lu's need formal treatment.
+
+
+
\section{Formalising About Size}
\noindent
In our lexer ($\blexersimp$),
@@ -2592,6 +2604,507 @@
% SECTION 2
%-----------------------------------
+\section{Bounded Repetitions}
+We have promised in chapter \ref{Introduction}
+that our lexing algorithm can potentially be extended
+to handle bounded repetitions
+in natural and elegant ways.
+Now we fulfill our promise by adding support for
+the ``exactly-$n$-times'' bounded regular expression $r^{\{n\}}$.
+We add clauses in our derivatives-based lexing algorithms (with simplifications)
+introduced in chapter \ref{Bitcoded2}.
+
+\subsection{Augmented Definitions}
+There are a number of definitions that need to be augmented.
+The most notable one would be the POSIX rules for $r^{\{n\}}$:
+\begin{center}
+ \begin{mathpar}
+ \inferrule{\forall v \in vs_1. \vdash v:r \land
+ |v| \neq []\\ \forall v \in vs_2. \vdash v:r \land |v| = []\\
+ \textit{length} \; (vs_1 @ vs_2) = n}{\textit{Stars} \;
+ (vs_1 @ vs_2) : r^{\{n\}} }
+ \end{mathpar}
+\end{center}
+As Ausaf had pointed out \cite{Ausaf},
+sometimes empty iterations have to be taken to get
+a match with exactly $n$ repetitions,
+and hence the $vs_2$ part.
+
+Another important definition would be the size:
+\begin{center}
+ \begin{tabular}{lcl}
+ $\llbracket r^{\{n\}} \rrbracket_r$ & $\dn$ &
+ $\llbracket r \rrbracket_r + n$\\
+ \end{tabular}
+\end{center}
+\noindent
+Arguably we should use $\log \; n$ for the size because
+the number of digits increase logarithmically w.r.t $n$.
+For simplicity we choose to add the counter directly to the size.
+
+The derivative w.r.t a bounded regular expression
+is given as
+\begin{center}
+ \begin{tabular}{lcl}
+ $r^{\{n\}} \backslash_r c$ & $\dn$ &
+ $r\backslash_r c \cdot r^{\{n-1\}} \;\; \textit{if} \; n \geq 1$\\
+ & & $\RZERO \;\quad \quad\quad \quad
+ \textit{otherwise}$\\
+ \end{tabular}
+\end{center}
+\noindent
+For brevity, we sometimes use NTIMES to refer to bounded
+regular expressions.
+The $\mkeps$ function clause for NTIMES would be
+\begin{center}
+ \begin{tabular}{lcl}
+ $\mkeps \; r^{\{n\}} $ & $\dn$ & $\Stars \;
+ (\textit{replicate} \; n\; (\mkeps \; r))$
+ \end{tabular}
+\end{center}
+\noindent
+The injection looks like
+\begin{center}
+ \begin{tabular}{lcl}
+ $\inj \; r^{\{n\}} \; c\; (\Seq \;v \; (\Stars \; vs)) $ &
+ $\dn$ & $\Stars \;
+ ((\inj \; r \;c \;v ) :: vs)$
+ \end{tabular}
+\end{center}
+\noindent
+
+
+\subsection{Proofs for the Augmented Lexing Algorithm}
+We need to maintain two proofs with the additional $r^{\{n\}}$
+construct: the
+correctness proof in chapter \ref{Bitcoded2},
+and the finiteness proof in chapter \ref{Finite}.
+
+\subsubsection{Correctness Proof Augmentation}
+The correctness of $\textit{lexer}$ and $\textit{blexer}$ with bounded repetitions
+have been proven by Ausaf and Urban\cite{AusafDyckhoffUrban2016}.
+As they have commented, once the definitions are in place,
+the proofs given for the basic regular expressions will extend to
+bounded regular expressions, and there are no ``surprises''.
+We confirm this point because the correctness theorem would also
+extend without surprise to $\blexersimp$.
+The rewrite rules such as $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$ and so on
+do not need to be changed,
+and only a few lemmas such as lemma \ref{fltsPreserves} need to be adjusted to
+add one more line which can be solved by sledgehammer
+to solve the $r^{\{n\}}$ inductive case.
+
+
+\subsubsection{Finiteness Proof Augmentation}
+The bounded repetitions are
+very similar to stars, and therefore the treatment
+is similar, with minor changes to handle some slight complications
+when the counter reaches 0.
+The exponential growth is similar:
+\begin{center}
+ \begin{tabular}{ll}
+ $r^{\{n\}} $ & $\longrightarrow_{\backslash c}$\\
+ $(r\backslash c) \cdot
+ r^{\{n - 1\}}*$ & $\longrightarrow_{\backslash c'}$\\
+ \\
+ $r \backslash cc' \cdot r^{\{n - 2\}}* +
+ r \backslash c' \cdot r^{\{n - 1\}}*$ &
+ $\longrightarrow_{\backslash c''}$\\
+ \\
+ $(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* +
+ r \backslash c''\cdot r^{\{n-1\}}) +
+ (r \backslash c'c'' \cdot r^{\{n-2\}}* +
+ r \backslash c'' \cdot r^{\{n-1\}}*)$ &
+ $\longrightarrow_{\backslash c'''}$ \\
+ \\
+ $\ldots$\\
+ \end{tabular}
+\end{center}
+Again, we assume that $r\backslash c$, $r \backslash cc'$ and so on
+are all nullable.
+The flattened list of terms for $r^{\{n\}} \backslash_{rs} s$
+\begin{center}
+ $[r_1 \backslash cc'c'' \cdot r^{\{n-3\}}*,\;
+ r \backslash c''\cdot r^{\{n-1\}}, \;
+ r \backslash c'c'' \cdot r^{\{n-2\}}*, \;
+ r \backslash c'' \cdot r^{\{n-1\}}*,\; \ldots ]$
+\end{center}
+that comes from
+\begin{center}
+ $(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* +
+ r \backslash c''\cdot r^{\{n-1\}}) +
+ (r \backslash c'c'' \cdot r^{\{n-2\}}* +
+ r \backslash c'' \cdot r^{\{n-1\}}*)+ \ldots$
+\end{center}
+are made of sequences with different tails, where the counters
+might differ.
+The observation for maintaining the bound is that
+these counters never exceed $n$, the original
+counter. With the number of counters staying finite,
+$\rDistinct$ will deduplicate and keep the list finite.
+We introduce this idea as a lemma once we describe all
+the necessary helper functions.
+
+Similar to the star case, we want
+\begin{center}
+ $\rderssimp{r^{\{n\}}}{s} = \rsimp{\sum rs}$.
+\end{center}
+where $rs$
+shall be in the form of
+$\map \; f \; Ss$, where $f$ is a function and
+$Ss$ a list of objects to act on.
+For star, the object's datatype is string.
+The list of strings $Ss$
+is generated using functions
+$\starupdate$ and $\starupdates$.
+The function that takes a string and returns a regular expression
+is the anonymous function $
+(\lambda s'. \; r\backslash s' \cdot r^{\{m\}})$.
+In the NTIMES setting,
+the $\starupdate$ and $\starupdates$ functions are replaced by
+$\textit{nupdate}$ and $\textit{nupdates}$:
+\begin{center}
+ \begin{tabular}{lcl}
+ $\nupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
+ $\nupdate \; c \; r \;
+ (\Some \; (s, \; n + 1) \; :: \; Ss)$ & $\dn$ & %\\
+ $\textit{if} \;
+ (\rnullable \; (r \backslash_{rs} s))$ \\
+ & & $\;\;\textit{then}
+ \;\; \Some \; (s @ [c], n + 1) :: \Some \; ([c], n) :: (
+ \nupdate \; c \; r \; Ss)$ \\
+ & & $\textit{else} \;\; \Some \; (s @ [c], n+1) :: (
+ \nupdate \; c \; r \; Ss)$\\
+ $\nupdate \; c \; r \; (\textit{None} :: Ss)$ & $\dn$ &
+ $(\None :: \nupdate \; c \; r \; Ss)$\\
+ & & \\
+ %\end{tabular}
+%\end{center}
+%\begin{center}
+ %\begin{tabular}{lcl}
+ $\nupdates \; [] \; r \; Ss$ & $\dn$ & $Ss$\\
+ $\nupdates \; (c :: cs) \; r \; Ss$ & $\dn$ & $\nupdates \; cs \; r \; (
+ \nupdate \; c \; r \; Ss)$
+ \end{tabular}
+\end{center}
+\noindent
+which take into account when a subterm
+\begin{center}
+ $r \backslash_s s \cdot r^{\{n\}}$
+\end{center}
+counter $n$
+is 0, and therefore expands to
+\begin{center}
+$r \backslash_s (s@[c]) \cdot r^{\{n\}} \;+
+\; \ZERO$
+\end{center}
+after taking a derivative.
+The object now has type
+\begin{center}
+$\textit{option} \;(\textit{string}, \textit{nat})$
+\end{center}
+and therefore the function for converting such an option into
+a regular expression term is called $\opterm$:
+
+\begin{center}
+ \begin{tabular}{lcl}
+ $\opterm \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
+ & & $\;\;\Some \; (s, n) \Rightarrow
+ (r\backslash_{rs} s)\cdot r^{\{n\}}$\\
+ & & $\;\;\None \Rightarrow
+ \ZERO$\\
+ \end{tabular}
+\end{center}
+\noindent
+Put together, the list $\map \; f \; Ss$ is instantiated as
+\begin{center}
+ $\map \; (\opterm \; r) \; (\nupdates \; s \; r \;
+ [\Some \; ([c], n)])$.
+\end{center}
+For the closed form to be bounded, we would like
+simplification to be applied to each term in the list.
+Therefore we introduce some variants of $\opterm$,
+which help conveniently express the rewriting steps
+needed in the closed form proof.
+\begin{center}
+ \begin{tabular}{lcl}
+ $\optermOsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
+ & & $\;\;\Some \; (s, n) \Rightarrow
+ \textit{rsimp} \; ((r\backslash_{rs} s)\cdot r^{\{n\}})$\\
+ & & $\;\;\None \Rightarrow
+ \ZERO$\\
+ \\
+ $\optermosimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
+ & & $\;\;\Some \; (s, n) \Rightarrow
+ (\textit{rsimp} \; (r\backslash_{rs} s))
+ \cdot r^{\{n\}}$\\
+ & & $\;\;\None \Rightarrow
+ \ZERO$\\
+ \\
+ $\optermsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
+ & & $\;\;\Some \; (s, n) \Rightarrow
+ (r\backslash_{rsimps} s)\cdot r^{\{n\}}$\\
+ & & $\;\;\None \Rightarrow
+ \ZERO$\\
+ \end{tabular}
+\end{center}
+
+
+For a list of
+$\textit{option} \;(\textit{string}, \textit{nat})$ elements,
+we define the highest power for it recursively:
+\begin{center}
+ \begin{tabular}{lcl}
+ $\hpa \; [] \; n $ & $\dn$ & $n$\\
+ $\hpa \; (\None :: os) \; n $ & $\dn$ & $\hpa \; os \; n$\\
+ $\hpa \; (\Some \; (s, n) :: os) \; m$ & $\dn$ &
+ $\hpa \;os \; (\textit{max} \; n\; m)$\\
+ \\
+ $\hpower \; rs $ & $\dn$ & $\hpa \; rs \; 0$\\
+ \end{tabular}
+\end{center}
+
+Now the intuition that an NTIMES regular expression's power
+does not increase can be easily expressed as
+\begin{lemma}\label{nupdatesMono2}
+ $\hpower \; (\nupdates \;s \; r \; [\Some \; ([c], n)]) \leq n$
+\end{lemma}
+\begin{proof}
+ Note that the power is non-increasing after a $\nupdate$ application:
+ \begin{center}
+ $\hpa \;\; (\nupdate \; c \; r \; Ss)\;\; m \leq
+ \hpa\; \; Ss \; m$.
+ \end{center}
+ This is also the case for $\nupdates$:
+ \begin{center}
+ $\hpa \;\; (\nupdates \; s \; r \; Ss)\;\; m \leq
+ \hpa\; \; Ss \; m$.
+ \end{center}
+ Therefore we have that
+ \begin{center}
+ $\hpower \;\; (\nupdates \; s \; r \; Ss) \leq
+ \hpower \;\; Ss$
+ \end{center}
+ which leads to the lemma being proven.
+
+ \end{proof}
+
+
+We also define the inductive rules for
+the shape of derivatives of the NTIMES regular expressions:\\[-3em]
+\begin{center}
+ \begin{mathpar}
+ \inferrule{\mbox{}}{\cbn \;\ZERO}
+
+ \inferrule{\mbox{}}{\cbn \; \; r_a \cdot (r^{\{n\}})}
+
+ \inferrule{\cbn \; r_1 \;\; \; \cbn \; r_2}{\cbn \; r_1 + r_2}
+
+ \inferrule{\cbn \; r}{\cbn \; r + \ZERO}
+ \end{mathpar}
+\end{center}
+\noindent
+A derivative of NTIMES fits into the shape described by $\cbn$:
+\begin{lemma}\label{ntimesDersCbn}
+ $\cbn \; ((r' \cdot r^{\{n\}}) \backslash_{rs} s)$ holds.
+\end{lemma}
+\begin{proof}
+ By a reverse induction on $s$.
+ For the inductive case, note that if $\cbn \; r$ holds,
+ then $\cbn \; (r\backslash_r c)$ holds.
+\end{proof}
+\noindent
+In addition, for $\cbn$-shaped regular expressioins, one can flatten
+them:
+\begin{lemma}\label{ntimesHfauPushin}
+ If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} =
+ \textit{concat} \; (\map \; \hflataux{\map \; (\_\backslash_r c) \;
+ (\hflataux{r})})$
+\end{lemma}
+\begin{proof}
+ By an induction on the inductive cases of $\cbn$.
+\end{proof}
+\noindent
+This time we do not need to define the flattening functions for NTIMES only,
+because $\hflat{\_}$ and $\hflataux{\_}$ work on NTIMES already.
+\begin{lemma}\label{ntimesHfauInduct}
+$\hflataux{( (r\backslash_r c) \cdot r^{\{n\}}) \backslash_{rsimps} s} =
+ \map \; (\opterm \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$
+\end{lemma}
+\begin{proof}
+ By a reverse induction on $s$.
+ The lemmas \ref{ntimesHfauPushin} and \ref{ntimesDersCbn} are used.
+\end{proof}
+\noindent
+We have a recursive property for NTIMES with $\nupdate$
+similar to that for STAR,
+and one for $\nupdates $ as well:
+\begin{lemma}\label{nupdateInduct1}
+ \mbox{}
+ \begin{itemize}
+ \item
+ \begin{center}
+ $\textit{concat} \; (\map \; (\hflataux{\_} \circ (
+ \opterm \; r)) \; Ss) = \map \; (\opterm \; r) \; (\nupdate \;
+ c \; r \; Ss)$\\
+ \end{center}
+ holds.
+\item
+ \begin{center}
+ $\textit{concat} \; (\map \; \hflataux{\_}\;
+ \map \; (\_\backslash_r x) \;
+ (\map \; (\opterm \; r) \; (\nupdates \; xs \; r \; Ss)))$\\
+ $=$\\
+ $\map \; (\opterm \; r) \; (\nupdates \;(xs@[x]) \; r\;Ss)$
+ \end{center}
+ holds.
+ \end{itemize}
+\end{lemma}
+\begin{proof}
+ (i) is by an induction on $Ss$.
+ (ii) is by an induction on $xs$.
+\end{proof}
+\noindent
+The $\nString$ predicate is defined for conveniently
+expressing that there are no empty strings in the
+$\Some \;(s, n)$ elements generated by $\nupdate$:
+\begin{center}
+ \begin{tabular}{lcl}
+ $\nString \; \None$ & $\dn$ & $ \textit{true}$\\
+ $\nString \; (\Some \; ([], n))$ & $\dn$ & $ \textit{false}$\\
+ $\nString \; (\Some \; (c::s, n))$ & $\dn$ & $ \textit{true}$\\
+ \end{tabular}
+\end{center}
+\begin{lemma}\label{nupdatesNonempty}
+ If for all elements $o \in \textit{set} \; Ss$,
+ $\nString \; o$ holds, the we have that
+ for all elements $o' \in \textit{set} \; (\nupdates \; s \; r \; Ss)$,
+ $\nString \; o'$ holds.
+\end{lemma}
+\begin{proof}
+ By an induction on $s$, where $Ss$ is set to vary over all possible values.
+\end{proof}
+
+\noindent
+
+\begin{lemma}\label{ntimesClosedFormsSteps}
+ The following list of equalities or rewriting relations hold:\\
+ (i) $r^{\{n+1\}} \backslash_{rsimps} (c::s) =
+ \textit{rsimp} \; (\sum (\map \; (\opterm \;r \;\_) \; (\nupdates \;
+ s \; r \; [\Some \; ([c], n)])))$\\
+ (ii)
+ \begin{center}
+ $\sum (\map \; (\opterm \; r) \; (\nupdates \; s \; r \; [
+ \Some \; ([c], n)]))$ \\ $ \sequal$\\
+ $\sum (\map \; (\textit{rsimp} \circ (\opterm \; r))\; (\nupdates \;
+ s\;r \; [\Some \; ([c], n)]))$\\
+ \end{center}
+ (iii)
+ \begin{center}
+ $\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \;
+ ([c], n)]))$\\
+ $\sequal$\\
+ $\sum \;(\map \; (\optermsimp r) \; (\nupdates \; s \; r \; [\Some \;
+ ([c], n)])) $\\
+ \end{center}
+ (iv)
+ \begin{center}
+ $\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \;
+ ([c], n)])) $ \\ $\sequal$\\
+ $\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \;
+ ([c], n)])) $\\
+ \end{center}
+ (v)
+ \begin{center}
+ $\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \;
+ ([c], n)])) $ \\ $\sequal$\\
+ $\sum \; (\map \; (\textit{rsimp} \circ (\opterm \; r)) \;
+ (\nupdates \; s \; r \; [\Some \; ([c], n)]))$
+ \end{center}
+\end{lemma}
+\begin{proof}
+ Routine.
+ (iii) and (iv) make use of the fact that all the strings $s$
+ inside $\Some \; (s, m)$ which are elements of the list
+ $\nupdates \; s\;r\;[\Some\; ([c], n)]$ are non-empty,
+ which is from lemma \ref{nupdatesNonempty}.
+ Once the string in $o = \Some \; (s, n)$ is
+ nonempty, $\optermsimp \; r \;o$,
+ $\optermosimp \; r \; o$ and $\optermosimp \; \; o$ are guaranteed
+ to be equal.
+ (v) uses \ref{nupdateInduct1}.
+\end{proof}
+\noindent
+Now we are ready to present the closed form for NTIMES:
+\begin{theorem}\label{ntimesClosedForm}
+ The derivative of $r^{\{n+1\}}$ can be described as an alternative
+ containing a list
+ of terms:\\
+ $r^{\{n+1\}} \backslash_{rsimps} (c::s) = \textit{rsimp} \; (
+ \sum (\map \; (\optermsimp \; r) \; (\nupdates \; s \; r \;
+ [\Some \; ([c], n)])))$
+\end{theorem}
+\begin{proof}
+ By the rewriting steps described in lemma \ref{ntimesClosedFormsSteps}.
+\end{proof}
+\noindent
+The key observation for bounding this closed form
+is that the counter on $r^{\{n\}}$ will
+only decrement during derivatives:
+\begin{lemma}\label{nupdatesNLeqN}
+ For an element $o$ in $\textit{set} \; (\nupdates \; s \; r \;
+ [\Some \; ([c], n)])$, either $o = \None$, or $o = \Some
+ \; (s', m)$ for some string $s'$ and number $m \leq n$.
+\end{lemma}
+\noindent
+The proof is routine and therefore omitted.
+This allows us to say what kind of terms
+are in the list $\textit{set} \; (\map \; (\optermsimp \; r) \; (
+\nupdates \; s \; r \; [\Some \; ([c], n)]))$:
+only $\ZERO_r$s or a sequence with the tail an $r^{\{m\}}$
+with a small $m$:
+\begin{lemma}\label{ntimesClosedFormListElemShape}
+ For any element $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; (
+ \nupdates \; s \; r \; [\Some \; ([c], n)]))$,
+ we have that $r'$ is either $\ZERO$ or $r \backslash_{rsimps} s' \cdot
+ r^{\{m\}}$ for some string $s'$ and number $m \leq n$.
+\end{lemma}
+\begin{proof}
+ Using lemma \ref{nupdatesNLeqN}.
+\end{proof}
+
+\begin{theorem}\label{ntimesClosedFormBounded}
+ Assuming that for any string $s$, $\llbracket r \backslash_{rsimps} s
+ \rrbracket_r \leq N$ holds, then we have that\\
+ $\llbracket r^{\{n+1\}} \backslash_{rsimps} s \rrbracket_r \leq
+ \textit{max} \; (c_N+1)* (N + \llbracket r^{\{n\}} \rrbracket+1)$,
+ where $c_N = \textit{card} \; (\textit{sizeNregex} \; (
+ N + \llbracket r^{\{n\}} \rrbracket_r+1))$.
+\end{theorem}
+\begin{proof}
+We have that for all regular expressions $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; (
+ \nupdates \; s \; r \; [\Some \; ([c], n)]))$,
+ $r'$'s size is less than or equal to $N + \llbracket r^{\{n\}}
+ \rrbracket_r + 1$
+because $r'$ can only be either a $\ZERO$ or $r \backslash_{rsimps} s' \cdot
+r^{\{m\}}$ for some string $s'$ and number
+$m \leq n$ (lemma \ref{ntimesClosedFormListElemShape}).
+In addition, we know that the list
+$\map \; (\optermsimp \; r) \; (
+\nupdates \; s \; r \; [\Some \; ([c], n)])$'s size is at most
+$c_N = \textit{card} \;
+(\sizeNregex \; ((N + \llbracket r^{\{n\}} \rrbracket) + 1))$.
+This gives us $\llbracket r \backslash_{rsimps} \;s \rrbracket_r
+\leq N * c_N$.
+\end{proof}
+
+We aim to formalise the correctness and size bound
+for constructs like $r^{\{\ldots n\}}$, $r^{\{n \ldots\}}$
+and so on, which is still work in progress.
+They should more or less follow the same recipe described in this section.
+Once we know about how to deal with them recursively using suitable auxiliary
+definitions, we are able to routinely establish the proofs.
+
%----------------------------------------------------------------------------------------
% SECTION 3
--- a/ChengsongTanPhdThesis/Chapters/RelatedWork.tex Sat Nov 12 21:34:40 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/RelatedWork.tex Thu Nov 17 23:13:57 2022 +0000
@@ -9,11 +9,16 @@
\section{Work on Back-References}
-We introduced back-references
+We introduced regular expressions with back-references
in chapter \ref{Introduction}.
-This is a quite deep problem,
-with theoretical work on them being
-fairly recent.
+We adopt the common practice of calling them rewbrs (Regular Expressions
+With Back References) in this section for brevity.
+It has been shown by Aho \cite{Aho1990}
+that the k-vertex cover problem can be reduced
+to the problem of rewbrs matching problem, and is therefore NP-complete.
+Given the depth of the problem, the
+theoretical work on them is
+fairly recent.
Campaneu et al. studied regexes with back-references
in the context of formal languages theory in
@@ -34,7 +39,8 @@
great power to certain programming tasks but are difficult to harness.
An interesting question would then be
whether we can add restrictions to them,
-so that they become expressive enough, but not too powerful.
+so that they become expressive enough for practical use such
+as html files, but not too powerful.
Freydenberger and Schmid \cite{FREYDENBERGER20191}
introduced the notion of deterministic
regular expressions with back-references to achieve
@@ -44,6 +50,7 @@
Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
investigated the time complexity of different variants
of back-references.
+We are not aware of any work that uses derivatives on back-references.
See \cite{BERGLUND2022} for a survey
of these works and comparison between different
@@ -60,7 +67,7 @@
Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
\subsection{Static Analysis of Evil Regex Patterns}
- When a regular expression does not behave as intended,
+When a regular expression does not behave as intended,
people usually try to rewrite the regex to some equivalent form
or they try to avoid the possibly problematic patterns completely,
for which many false positives exist\parencite{Davis18}.
@@ -103,3 +110,86 @@
does not grow indefinitely),
then they cannot claim with confidence having solved the problem
of catastrophic backtracking.
+
+
+\section{Derivatives and Zippers}
+Zipper is a data structure designed to focus on
+and navigate between local parts of a tree.
+The idea is that often operations on a large tree only deal with
+local regions each time.
+It was first formally described by Huet \cite{HuetZipper}.
+Typical applications of zippers involve text editor buffers
+and proof system databases.
+In our setting, the idea is to compactify the representation
+of derivatives with zippers, thereby making our algorithm faster.
+We draw our inspirations from several works on parsing, derivatives
+and zippers.
+
+Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
+\cite{Zippy2020}.
+They adopted zippers to improve the speed, and argued that the runtime
+complexity of their algorithm was linear with respect to the input string.
+
+The idea of using Brzozowski derivatives on general context-free
+parsing was first implemented
+by Might et al. \ref{Might2011}.
+They used memoization and fixpoint construction to eliminate infinite recursion,
+which is a major problem for using derivatives on context-free grammars.
+The initial version was quite slow----exponential on the size of the input.
+Adams et al. \ref{Adams2016} improved the speed and argued that their version
+was cubic with respect to the input.
+Darragh and Adams \cite{10.1145/3408990} further improved the performance
+by using zippers in an innovative way--their zippers had multiple focuses
+instead of just one in a traditional zipper to handle ambiguity.
+Their algorithm was not formalised, though.
+
+We aim to integrate the zipper data structure into our lexer.
+The idea is very simple: using a zipper with multiple focuses
+just like Darragh did, we could represent
+\[
+ x\cdot r + y \cdot r + \ldots
+\]
+as
+\[
+ (x+y + \ldots) \cdot r.
+\]
+This would greatly reduce the amount of space needed
+when we have many terms like $x\cdot r$.
+
+into the current lexer.
+This allows the lexer to not traverse the entire
+derivative tree generated
+in every intermediate step, but only a small segment
+that is currently active.
+
+Some initial results have been obtained, but more care needs to be taken to make sure
+that the lexer output conforms to the POSIX standard. Formalising correctness of
+such a lexer using zippers will probably require using an imperative
+framework with separation logic as it involves
+mutable data structures, which is also interesting to look into.
+
+To further optimise the algorithm,
+I plan to add a ``deduplicate'' function that tells
+whether two regular expressions denote the same
+language using
+an efficient and verified equivalence checker.
+In this way, the internal data structures can be pruned more aggressively,
+leading to better simplifications and ultimately faster algorithms.
+
+
+Some initial results
+We first give a brief introduction to what zippers are,
+and other works
+that apply zippers to derivatives
+When dealing with large trees, it would be a waste to
+traverse the entire tree if
+the operation only
+involves a small fraction of it.
+The idea is to put the focus on that subtree, turning other parts
+of the tree into a context
+
+
+One observation about our derivative-based lexing algorithm is that
+the derivative operation sometimes traverses the entire regular expression
+unnecessarily:
+
--- a/ChengsongTanPhdThesis/example.bib Sat Nov 12 21:34:40 2022 +0000
+++ b/ChengsongTanPhdThesis/example.bib Thu Nov 17 23:13:57 2022 +0000
@@ -233,9 +233,93 @@
bdsk-url-1 = {https://www.sciencedirect.com/science/article/pii/S0304397509001789},
bdsk-url-2 = {https://doi.org/10.1016/j.tcs.2009.02.022}}
+
+@incollection{Aho1990,
+title = {CHAPTER 5 - Algorithms for Finding Patterns in Strings},
+editor = {JAN {VAN LEEUWEN}},
+booktitle = {Algorithms and Complexity},
+publisher = {Elsevier},
+address = {Amsterdam},
+pages = {255-300},
+year = {1990},
+series = {Handbook of Theoretical Computer Science},
+isbn = {978-0-444-88071-0},
+doi = {https://doi.org/10.1016/B978-0-444-88071-0.50010-2},
+url = {https://www.sciencedirect.com/science/article/pii/B9780444880710500102},
+author = {A.V.~Aho},
+abstract = {Publisher Summary
+This chapter discusses the algorithms for solving string-matching problems that have proven useful for text-editing and text-processing applications. String pattern matching is an important problem that occurs in many areas of science and information processing. In computing, it occurs naturally as part of data processing, text editing, term rewriting, lexical analysis, and information retrieval. Many text editors and programming languages have facilities for matching strings. In biology, string-matching problems arise in the analysis of nucleic acids and protein sequences, and in the investigation of molecular phylogeny. String matching is also one of the central and most widely studied problems in theoretical computer science. The simplest form of the problem is to locate an occurrence of a keyword as a substring in a sequence of characters, which is called the input string. For example, the input string queueing contains the keyword ueuei as a substring. Even for this problem, several innovative, theoretically interesting algorithms have been devised that run significantly faster than the obvious brute-force method.}
+}
+
%% back-references--------------------
%% -------------------------------------
+
+
+%%----------------------------------------------------------------
+%%----------------------------------------zippers
+@article{10.1145/3408990,
+author = {Darragh, Pierce and Adams, Michael D.},
+title = {Parsing with Zippers (Functional Pearl)},
+year = {2020},
+issue_date = {August 2020},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+volume = {4},
+number = {ICFP},
+url = {https://doi.org/10.1145/3408990},
+doi = {10.1145/3408990},
+abstract = {Parsing with Derivatives (PwD) is an elegant approach to parsing context-free grammars (CFGs). It takes the equational theory behind Brzozowski's derivative for regular expressions and augments that theory with laziness, memoization, and fixed points. The result is a simple parser for arbitrary CFGs. Although recent work improved the performance of PwD, it remains inefficient due to the algorithm repeatedly traversing some parts of the grammar. In this functional pearl, we show how to avoid this inefficiency by suspending the state of the traversal in a zipper. When subsequent derivatives are taken, we can resume the traversal from where we left off without retraversing already traversed parts of the grammar. However, the original zipper is designed for use with trees, and we want to parse CFGs. CFGs can include shared regions, cycles, and choices between alternates, which makes them incompatible with the traditional tree model for zippers. This paper develops a generalization of zippers to properly handle these additional features. Just as PwD generalized Brzozowski's derivatives from regular expressions to CFGs, we generalize Huet's zippers from trees to CFGs. Abstract The resulting parsing algorithm is concise and efficient: it takes only 31 lines of OCaml code to implement the derivative function but performs 6,500 times faster than the original PwD and 3.24 times faster than the optimized implementation of PwD.},
+journal = {Proc. ACM Program. Lang.},
+month = {aug},
+articleno = {108},
+numpages = {28},
+keywords = {Derivatives, Zippers, Parsing with Derivatives, Parsing}
+}
+
+
+
+@article{Edelmann:287059,
+ title = {Efficient Parsing with Derivatives and Zippers},
+ author = {Edelmann, Romain},
+ institution = {IINFCOM},
+ publisher = {EPFL},
+ address = {Lausanne},
+ pages = {246},
+ year = {2021},
+ abstract = {Parsing is the process that enables a computer system to make sense of raw data. Parsing is common to almost all computer systems: It is involved every time sequential data is read and elaborated into structured data. The theory of parsing usually focuses on the binary recognition aspect of parsing and eschews this essential data-elaboration aspect. In this thesis, I present a declarative framework for value-aware parsing that explicitly integrates data elaboration.
+Within the framework of the thesis, I present parsing algorithms that are based on the concept of Brzozowski's derivatives. Derivative-based parsing algorithms present several advantages: they are elegant, amenable to formal reasoning, and easy to implement. Unfortunately, the performance of these algorithms in practice is often not competitive with other approaches. In this thesis, I show a general technique inspired by Huet's Zipper to greatly enhance the performance of derivative-based algorithms, and I do so without compromising their elegance, amenability to formal reasoning, or ease of implementation.
+First, I present a technique for building efficient tokenisers that is based on Brzozowski's derivatives and Huet's zipper and that does not require the usual burdensome explicit conversion to automata. I prove the technique is correct in Coq and present SILEX, a Scala lexing library based on the technique. I demonstrate that the approach is competitive with state-of-the-art solutions.
+Then, I present a characterisation of LL(1) languages based on the concept of should-not-follow sets. I present an algorithm for parsing LL(1) languages with derivatives and zippers. I show a formal proof of the algorithm's correctness and prove its worst-case linear-time complexity. I show how the LL(1) parsing with derivatives and zippers algorithm corresponds to the traditional LL(1) parsing algorithm.
+I then present SCALL1ON, a Scala parsing combinators library for LL(1) languages that incorporates the LL(1) parsing with derivatives and zippers algorithm. I present an expressive and familiar combinator-based interface for describing LL(1) languages. I present techniques that help precisely locate LL(1) conflicts in user code. I discuss several advantages of the parsing with derivatives approach within the context of a parsing library. I also present SCALL1ON's enumeration and pretty-printing features and discuss their implementation. Through a series of benchmarks, I demonstrate the good performance and practicality of the approach. Finally, I present how to adapt the LL(1) parsing with derivatives and zippers algorithm to support arbitrary context-free languages. I show how the adapted algorithm corresponds to general parsing algorithms, such as Earley's parsing algorithm.},
+ url = {http://infoscience.epfl.ch/record/287059},
+ doi = {10.5075/epfl-thesis-7357},
+}
+
+@inproceedings{Zippy2020,
+author = {Edelmann, Romain and Hamza, Jad and Kun\v{c}ak, Viktor},
+title = {Zippy LL(1) Parsing with Derivatives},
+year = {2020},
+isbn = {9781450376136},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/3385412.3385992},
+doi = {10.1145/3385412.3385992},
+abstract = {In this paper, we present an efficient, functional, and formally verified parsing algorithm for LL(1) context-free expressions based on the concept of derivatives of formal languages. Parsing with derivatives is an elegant parsing technique, which, in the general case, suffers from cubic worst-case time complexity and slow performance in practice. We specialise the parsing with derivatives algorithm to LL(1) context-free expressions, where alternatives can be chosen given a single token of lookahead. We formalise the notion of LL(1) expressions and show how to efficiently check the LL(1) property. Next, we present a novel linear-time parsing with derivatives algorithm for LL(1) expressions operating on a zipper-inspired data structure. We prove the algorithm correct in Coq and present an implementation as a part of Scallion, a parser combinators framework in Scala with enumeration and pretty printing capabilities.},
+booktitle = {Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation},
+pages = {1036–1051},
+numpages = {16},
+keywords = {Parsing, Zipper, Formal proof, LL(1), Derivatives},
+location = {London, UK},
+series = {PLDI 2020}
+}
+
+
+
+%%----------------------------------------zippers
+%%----------------------------------------------------------------
+
+
@article{fowler2003,
title={An interpretation of the POSIX regex standard},
author={Fowler, Glenn},