addresses Gerog "N_r meaning and relation with backtracking?" comment
authorChengsong
Mon, 10 Jul 2023 01:33:45 +0100
changeset 660 eddc4eaba7c4
parent 659 2e05f04ed6b3
child 661 71502e4d8691
ChengsongTanPhdThesis/Chapters/Finite.tex
--- a/ChengsongTanPhdThesis/Chapters/Finite.tex	Mon Jul 10 00:44:45 2023 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex	Mon Jul 10 01:33:45 2023 +0100
@@ -6,7 +6,7 @@
 %  In Chapter 4 \ref{Chapter4} we give the second guarantee
 %of our bitcoded algorithm, that is a finite bound on the size of any 
 %regex's derivatives. 
-
+%(this is chapter 5 now)
 
 In this chapter we give a bound in terms of the size of 
 the calculated derivatives: 
@@ -14,7 +14,7 @@
 our algorithm $\blexersimp$'s derivatives
 are finitely bounded
 by a constant that only depends on $a$.
-Formally we show that there exists an $N_a$ such that
+Formally we show that there exists an integer $N_a$ such that
 \begin{center}
 	$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
 \end{center}
@@ -25,16 +25,47 @@
 tree structure (its recursive definition is given on the next page).
 We believe this size bound
 is important in the context of POSIX lexing because 
+\marginpar{Addressing Gerog comment: "how does this relate to backtracking?"}
 \begin{itemize}
 	\item
 		It is a stepping stone towards the goal 
 		of eliminating ``catastrophic backtracking''. 
-		If the internal data structures used by our algorithm
-		grows beyond a finite bound, then clearly 
-		the algorithm (which traverses these structures) will
-		be slow.
-		The next step is to refine the bound $N_a$ so that it
-		is not just finite but polynomial in $\llbracket a\rrbracket$.
+		The derivative-based lexing algorithm avoids backtracking
+		by trading space for time.
+		Backtracking algorithms
+		save the alternative possibilities on a stack while exploring one
+		possible matching path. Catastrophic backtracking typically occurs
+		when the number of steps increases exponentially with respect
+		to the input. In other words, the runtime is $O((c_r)^n)$ in the
+		length $n$ of the input string, where the base $c_r$ of the
+		exponent is determined by the regular expression $r$.
+		Derivatives save these possibilities as sub-expressions
+		and traverse them when taking future derivatives. If we denote the
+		size of the intermediate derivatives by $S_{r,n}$ (where the
+		subscripts $r,n$ indicate that $S$ depends on both), then the
+		runtime of derivative-based approaches is $O(S_{r,n} * n)$.
+		We observe that if $S_{r,n}$ keeps growing with $n$ (for example
+		exponentially fast), then this
+		is just as bad as catastrophic backtracking.
+		Our finiteness bound establishes a constant upper bound $C$ on $S_{r,n}$,
+		so that the complexity of the algorithm can be seen as linear ($O(C * n)$).
+		Even though $C$ is still large in our current work, it is a constant
+		rather than an ever-increasing number with respect to the input
+		length $n$.
+		More importantly, this constant $C$ can potentially
+		be shrunk as we optimize our simplification procedure.
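The trade-off described above can be illustrated with a minimal sketch of derivative-based matching. This is not the thesis's formalised bitcoded lexer $\blexersimp$; the class names and the crude `simp` function below are our own illustrative choices, and the simplification shown is far weaker than the one formalised in the thesis. The point is only that alternatives are kept as sub-expressions of the derivative (rather than on a backtracking stack), and that a simplification step is what keeps $S_{r,n}$ from growing.

```python
# Sketch of Brzozowski-derivative matching (illustrative only, not the
# thesis's bitcoded algorithm). Alternatives are stored as Alt nodes in
# the derivative itself; simp() keeps derivative sizes small.

class Re: pass
class Zero(Re): pass            # matches nothing
class One(Re): pass             # matches only the empty string
class Chr(Re):
    def __init__(self, c): self.c = c
class Alt(Re):
    def __init__(self, r1, r2): self.r1, self.r2 = r1, r2
class Seq(Re):
    def __init__(self, r1, r2): self.r1, self.r2 = r1, r2
class Star(Re):
    def __init__(self, r): self.r = r

def nullable(r):
    # can r match the empty string?
    if isinstance(r, (Zero, Chr)): return False
    if isinstance(r, (One, Star)): return True
    if isinstance(r, Alt): return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq): return nullable(r.r1) and nullable(r.r2)

def simp(r):
    # crude local simplification; this is what bounds the derivative
    # sizes (the S_{r,n} in the text)
    if isinstance(r, Alt):
        if isinstance(r.r1, Zero): return r.r2
        if isinstance(r.r2, Zero): return r.r1
    if isinstance(r, Seq):
        if isinstance(r.r1, Zero) or isinstance(r.r2, Zero): return Zero()
        if isinstance(r.r1, One): return r.r2
    return r

def der(c, r):
    # Brzozowski derivative of r with respect to character c
    if isinstance(r, (Zero, One)): return Zero()
    if isinstance(r, Chr): return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return simp(Alt(der(c, r.r1), der(c, r.r2)))
    if isinstance(r, Seq):
        d = simp(Seq(der(c, r.r1), r.r2))
        return simp(Alt(d, der(c, r.r2))) if nullable(r.r1) else d
    if isinstance(r, Star):
        return simp(Seq(der(c, r.r), r))

def matches(r, s):
    # one derivative per input character: O(S_{r,n} * n) overall
    for c in s:
        r = der(c, r)
    return nullable(r)
```

For instance, `(a|a)*b` on a long string of `a`s forces a backtracking matcher to explore exponentially many paths, while the derivative matcher performs one (simplified) derivative per character.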
 	\item
 		Having the finite bound formalised 
 		gives us higher confidence that