--- a/ChengsongTanPhdThesis/Chapters/Finite.tex Mon Jul 10 00:44:45 2023 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Mon Jul 10 01:33:45 2023 +0100
@@ -6,7 +6,7 @@
% In Chapter 4 \ref{Chapter4} we give the second guarantee
%of our bitcoded algorithm, that is a finite bound on the size of any
%regex's derivatives.
-
+%(this is chapter 5 now)
In this chapter we give a bound in terms of the size of
the calculated derivatives:
@@ -14,7 +14,7 @@
our algorithm $\blexersimp$'s derivatives
are finitely bounded
by a constant that only depends on $a$.
-Formally we show that there exists an $N_a$ such that
+Formally we show that there exists an integer $N_a$ such that
\begin{center}
$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
\end{center}
@@ -25,16 +25,47 @@
tree structure (its recursive definition is given on the next page).
We believe this size bound
is important in the context of POSIX lexing because
+\marginpar{Addressing Georg's comment: ``how does this relate to backtracking?''}
\begin{itemize}
\item
It is a stepping stone towards the goal
of eliminating ``catastrophic backtracking''.
- If the internal data structures used by our algorithm
- grows beyond a finite bound, then clearly
- the algorithm (which traverses these structures) will
- be slow.
- The next step is to refine the bound $N_a$ so that it
- is not just finite but polynomial in $\llbracket a\rrbracket$.
+ The derivative-based lexing algorithm avoids backtracking
+ through a trade-off between space and time.
+ Backtracking algorithms
+ save the alternatives not yet explored on a stack while following
+ one possible path of matching. Catastrophic backtracking typically
+ occurs when the number of steps grows exponentially with respect
+ to the input. In other words, the runtime is $O((c_r)^n)$ in the
+ length $n$ of the input string, where the base $c_r$ of the
+ exponent is determined by the regular expression $r$.
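+ A classic example is the regular expression $(a^*)^* \cdot b$
+ matched against strings of the form
+ \begin{center}
+ $\underbrace{a\ldots a}_{n}$:
+ \end{center}
+ a backtracking matcher can distribute the $a$'s between the inner
+ and the outer Kleene star in exponentially many ways, all of which
+ have to be tried and rejected before the overall match fails, since
+ no $b$ is present in the input.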
+ %so that they
+ %can be traversed in the future in a DFS manner,
+ %different matchings are stored as sub-expressions
+ %in a regular expression derivative.
+ The derivative-based approach instead stores these alternatives as
+ sub-expressions inside the derivative and traverses them during
+ subsequent derivative steps. If we denote the size
+ of the intermediate derivatives by $S_{r,n}$ (where the subscripts
+ $r$ and $n$ indicate that $S$ depends on both), then the runtime of
+ derivative-based approaches is $O(S_{r,n} \cdot n)$.
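+ This bound can be read off a sketch of the plain derivative-based
+ matcher (the name $\textit{match}$ is only illustrative; the
+ derivative operation $\backslash$ and $\nullable$ are as defined in
+ earlier chapters):
+ %% a minimal sketch for illustration; not the bitcoded lexer $\blexersimp$
+ \begin{center}
+ \begin{tabular}{lcl}
+ $\textit{match}\;r\;[]$ & $\dn$ & $\nullable\;r$\\
+ $\textit{match}\;r\;(c\!::\!s)$ & $\dn$ & $\textit{match}\;(r\backslash c)\;s$\\
+ \end{tabular}
+ \end{center}
+ The matcher walks through the input once, taking one derivative per
+ character; each of the $n$ steps costs time proportional to the size
+ of the current derivative, which is what $S_{r,n}$ measures.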
+ We observe that if $S_{r,n}$ keeps growing with $n$ (for example
+ exponentially fast), then this
+ is just as bad as catastrophic backtracking.
+ Our finiteness result establishes an upper bound $C$ for $S_{r,n}$
+ that is independent of the input length $n$ (though it does depend
+ on the regular expression), so that the complexity of the algorithm
+ can be seen as linear ($O(C \cdot n)$).
+ Even though the $C$ obtained in our current work is large, it is a
+ constant rather than a number that keeps increasing with the input
+ length $n$.
+ More importantly, this constant $C$ can potentially
+ be reduced as we optimize our simplification procedure.
+ %and showing the potential
+ %improvements can be by the notion of partial derivatives.
+
+ %If the internal data structures used by our algorithm
+ %grows beyond a finite bound, then clearly
+ %the algorithm (which traverses these structures) will
+ %be slow.
+ %The next step is to refine the bound $N_a$ so that it
+ %is not just finite but polynomial in $\llbracket a\rrbracket$.
\item
Having the finite bound formalised
gives us higher confidence that