# HG changeset patch
# User Chengsong
# Date 1688949225 -3600
# Node ID eddc4eaba7c4a9b148c9d21b9e3e68f8211b8e6d
# Parent  2e05f04ed6b34548dcc3b594fedabc6fad81c946
addresses Gerog "N_r meaning and relation with backtracking?" comment

diff -r 2e05f04ed6b3 -r eddc4eaba7c4 ChengsongTanPhdThesis/Chapters/Finite.tex
--- a/ChengsongTanPhdThesis/Chapters/Finite.tex	Mon Jul 10 00:44:45 2023 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex	Mon Jul 10 01:33:45 2023 +0100
@@ -6,7 +6,7 @@
 % In Chapter 4 \ref{Chapter4} we give the second guarantee
 %of our bitcoded algorithm, that is a finite bound on the size of any
 %regex's derivatives.
-
+%(this is chapter 5 now)
 In this chapter we give a bound in terms of the size of
 the calculated derivatives:
@@ -14,7 +14,7 @@
 our algorithm $\blexersimp$'s derivatives
 are finitely bounded
 by a constant that only depends on $a$.
-Formally we show that there exists an $N_a$ such that
+Formally we show that there exists an integer $N_a$ such that
 \begin{center}
 	$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
 \end{center}
@@ -25,16 +25,47 @@
 tree structure (its recursive definition is given in the next page).
 We believe this size bound is important in the context of POSIX lexing because
+\marginpar{Addressing Gerog comment: "how does this relate to backtracking?"}
 \begin{itemize}
 	\item
 		It is a stepping stone towards the goal
 		of eliminating ``catastrophic backtracking''.
-		If the internal data structures used by our algorithm
-		grows beyond a finite bound, then clearly
-		the algorithm (which traverses these structures) will
-		be slow.
-		The next step is to refine the bound $N_a$ so that it
-		is not just finite but polynomial in $\llbracket a\rrbracket$.
+		The derivative-based lexing algorithm avoids backtracking
+		by trading space for time:
+		backtracking algorithms
+		save other possibilities on a stack when exploring one possible
+		path of matching.
+		Catastrophic backtracking typically occurs
+		when the number of steps increases exponentially with respect
+		to the input. In other words, the runtime is $O((c_r)^n)$ in the input
+		string length $n$, where the base $c_r$ of the exponent is determined
+		by the regular expression $r$.
+		%so that they
+		%can be traversed in the future in a DFS manner,
+		%different matchings are stored as sub-expressions
+		%in a regular expression derivative.
+		Derivatives save these possibilities as sub-expressions
+		and traverse them during subsequent derivative steps. If we denote the size
+		of the intermediate derivatives by $S_{r,n}$ (where the subscripts
+		$r,n$ indicate that $S$ depends on them), then the runtime of
+		derivative-based approaches is $O(S_{r,n} * n)$.
+		We observe that if $S_{r,n}$ continuously grows with $n$ (for example
+		growing exponentially fast), then this
+		is just as bad as catastrophic backtracking.
+		Our finiteness bound seeks a constant upper bound $C$ for $S_{r,n}$,
+		so that the complexity of the algorithm can be seen as linear ($O(C * n)$).
+		Even if $C$ is still large in our current work, it is a constant
+		rather than an ever-increasing number with respect to the input length $n$.
+		More importantly, this constant $C$ can potentially
+		be shrunk as we optimize our simplification procedure.
+		%and showing the potential
+		%improvements can be by the notion of partial derivatives.
+
+		%If the internal data structures used by our algorithm
+		%grows beyond a finite bound, then clearly
+		%the algorithm (which traverses these structures) will
+		%be slow.
+		%The next step is to refine the bound $N_a$ so that it
+		%is not just finite but polynomial in $\llbracket a\rrbracket$.
 	\item
 		Having the finite bound formalised
 		gives us higher confidence that
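The space-for-time trade-off described in the added paragraph can be illustrated with a minimal sketch of plain (unbitcoded, unsimplified) Brzozowski-derivative matching. This is only an illustration of the general technique, not the thesis's $\blexersimp$: each input character rewrites the regular expression, so the work per step is proportional to the size of the intermediate derivative, i.e. the $S_{r,n}$ of the text, and the total runtime is $O(S_{r,n} * n)$.

```python
# Minimal sketch of derivative-based matching (plain Brzozowski derivatives,
# NOT the thesis's bitcoded blexersimp). Without simplification the
# intermediate derivatives S_{r,n} can keep growing -- the quantity the
# finiteness bound N_a is about.
from dataclasses import dataclass

class Re: pass

@dataclass(frozen=True)
class Zero(Re): pass            # matches no string

@dataclass(frozen=True)
class One(Re): pass             # matches only the empty string

@dataclass(frozen=True)
class Ch(Re):
    c: str                      # matches a single character

@dataclass(frozen=True)
class Alt(Re):
    r1: Re
    r2: Re                      # alternative r1 + r2

@dataclass(frozen=True)
class Seq(Re):
    r1: Re
    r2: Re                      # sequence r1 . r2

@dataclass(frozen=True)
class Star(Re):
    r: Re                       # Kleene star r*

def nullable(r: Re) -> bool:
    """Does r match the empty string?"""
    if isinstance(r, (One, Star)): return True
    if isinstance(r, Alt): return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq): return nullable(r.r1) and nullable(r.r2)
    return False                # Zero, Ch

def der(c: str, r: Re) -> Re:
    """Brzozowski derivative of r with respect to character c."""
    if isinstance(r, (Zero, One)): return Zero()
    if isinstance(r, Ch): return One() if r.c == c else Zero()
    if isinstance(r, Alt): return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        if nullable(r.r1):
            return Alt(Seq(der(c, r.r1), r.r2), der(c, r.r2))
        return Seq(der(c, r.r1), r.r2)
    if isinstance(r, Star): return Seq(der(c, r.r), r)
    raise ValueError(r)

def size(r: Re) -> int:
    """Node count of r's tree (the size measure of the text)."""
    if isinstance(r, (Alt, Seq)): return 1 + size(r.r1) + size(r.r2)
    if isinstance(r, Star): return 1 + size(r.r)
    return 1

def matches(r: Re, s: str) -> bool:
    """Take successive derivatives, then test nullability."""
    for c in s:
        r = der(c, r)           # work per step is proportional to size(r)
    return nullable(r)
```

For example, `matches(Seq(Ch('a'), Star(Ch('b'))), "abbb")` holds, and tracking `size(r)` across the loop shows the intermediate derivatives growing when no simplification is applied; the simplification inside $\blexersimp$ is what makes a constant bound $N_a$ on these sizes attainable.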