ChengsongTanPhdThesis/Chapters/Finite.tex

\label{Finite}
%  In Chapter 4 \ref{Chapter4} we give the second guarantee
%of our bitcoded algorithm, that is a finite bound on the size of any 
%regex's derivatives. 
%(this is chapter 5 now)

In this chapter we give a bound in terms of the size of
the calculated derivatives:
given an annotated regular expression $a$, the derivatives
computed by our algorithm $\blexersimp$ for any string $s$
are bounded in size
by a constant that depends only on $a$.
Formally we show that there exists an integer $N_a$ such that
\begin{center}
	$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
\end{center}
\noindent
where the size ($\llbracket \_ \rrbracket$) of
an annotated regular expression is defined
in terms of the number of nodes in its
tree structure (its recursive definition is given on the next page).
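As a quick illustration of this node-counting (we elide the bitcode
annotations here for readability), a regular expression such as
$(a + b)\cdot c$ has size
\begin{center}
	$\llbracket (a + b)\cdot c \rrbracket \;=\; 1 + \llbracket a + b \rrbracket + \llbracket c \rrbracket \;=\; 1 + (1 + \llbracket a \rrbracket + \llbracket b \rrbracket) + 1 \;=\; 5,$
\end{center}
\noindent
one node for the sequence, one for the alternative, and one for each
atomic sub-expression.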
We believe this size bound
is important in the context of POSIX lexing because
\marginpar{Addressing Georg's comment: ``how does this relate to backtracking?''}
\begin{itemize}
	\item
		It is a stepping stone towards the goal 
		of eliminating ``catastrophic backtracking''. 
		The derivative-based lexing algorithm avoids backtracking
		through a trade-off between space and time.
		Backtracking algorithms
		save alternative possibilities on a stack while exploring one
		possible matching path. Catastrophic backtracking typically
		occurs when the number of steps grows exponentially with
		respect to the input: the runtime is $O((c_r)^n)$ in the
		length $n$ of the input string, where the base $c_r$ of the
		exponent is determined by the regular expression $r$
		(a concrete instance is sketched at the end of this item).
		%so that they
		%can be traversed in the future in a DFS manner,
		%different matchings are stored as sub-expressions 
		%in a regular expression derivative.
		Derivatives instead save these possibilities as
		sub-expressions and traverse them during later derivative
		steps. If we denote the size of the intermediate derivatives
		by $S_{r,n}$ (where the subscripts $r,n$ indicate that $S$
		depends on them), then the runtime of derivative-based
		approaches is $O(S_{r,n} \cdot n)$.
		We observe that if $S_{r,n}$ keeps growing with $n$ (for
		example exponentially fast), then this
		is just as bad as catastrophic backtracking.
		Our finiteness bound seeks to establish a constant upper
		bound $C$ on $S_{r,n}$, so that the complexity of the
		algorithm can be seen as linear ($O(C \cdot n)$).
		Even though $C$ is still large in our current work, it is a
		constant rather than an ever-growing number with respect to
		the input length $n$.
		More importantly, this constant $C$ can potentially
		be shrunk as we optimise our simplification procedure.
		%and showing the potential
		%improvements can be by the notion of partial derivatives.

		%If the internal data structures used by our algorithm
		%grows beyond a finite bound, then clearly 
		%the algorithm (which traverses these structures) will
		%be slow.
		%The next step is to refine the bound $N_a$ so that it
		%is not just finite but polynomial in $\llbracket a\rrbracket$.
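
		To make the contrast concrete, below is a minimal sketch (in
		Scala, not part of our formalisation) that triggers
		catastrophic backtracking in the JVM's backtracking regex
		engine with the classic pattern $(a^*)^*\cdot b$; the
		pattern, input family and timing loop are chosen purely for
		illustration:
\begin{verbatim}
// Illustration only: matching (a*)*b against a^n with the JVM's
// backtracking engine; a^n can never match, so the matcher explores
// exponentially many ways of splitting the a's between the two stars.
import java.util.regex.Pattern

object CatastrophicDemo {
  def main(args: Array[String]): Unit = {
    val p = Pattern.compile("(a*)*b")
    for (n <- 16 to 28 by 4) {
      val input = "a" * n
      val t0    = System.nanoTime()
      val res   = p.matcher(input).matches()
      val secs  = (System.nanoTime() - t0) / 1e9
      println(f"n = $n%2d  matched = $res  time = $secs%.3f s")
    }
  }
}
\end{verbatim}
		\noindent
		On a typical JVM the reported times grow exponentially in
		$n$, matching the $O((c_r)^n)$ behaviour described above,
		whereas a derivative-based matcher processes the same inputs
		in time proportional to $n$ (times the size of the
		intermediate derivatives).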
	\item
		Having the finite bound formalised 
		gives us higher confidence that
		our simplification algorithm $\simp$ does not ``misbehave''
		like $\textit{simpSL}$ does.