ChengsongTanPhdThesis/Chapters/Finite.tex

\label{Finite}
%  In Chapter 4 \ref{Chapter4} we give the second guarantee
%of our bitcoded algorithm, that is a finite bound on the size of any 
%regex's derivatives. 
%(this is chapter 5 now)

In this chapter we give a bound in terms of the size of
the calculated derivatives:
given an annotated regular expression $a$, the derivatives
computed by our algorithm $\blexersimp$ for any string $s$
are bounded in size
by a constant that depends only on $a$.
Formally we show that there exists an integer $N_a$ such that
\begin{center}
	$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
\end{center}
\noindent
where the size ($\llbracket \_ \rrbracket$) of
an annotated regular expression is defined
in terms of the number of nodes in its
tree structure (its recursive definition is given on the next page).
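As a quick illustration of this node-counting (we elide the bitcode
annotations here for readability), a regular expression such as
$(a + b)\cdot c$ has size
\begin{center}
	$\llbracket (a + b)\cdot c \rrbracket \;=\; 1 + \llbracket a + b \rrbracket + \llbracket c \rrbracket \;=\; 1 + (1 + \llbracket a \rrbracket + \llbracket b \rrbracket) + 1 \;=\; 5,$
\end{center}
\noindent
one node for the sequence, one for the alternative, and one for each
atomic sub-expression.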
We believe this size bound
is important in the context of POSIX lexing because
\marginpar{Addressing Georg's comment: ``how does this relate to backtracking?''}
\begin{itemize}
	\item
		It is a stepping stone towards the goal 
		of eliminating ``catastrophic backtracking''. 
		The derivative-based lexing algorithm avoids backtracking
		through a trade-off between space and time.
		Backtracking algorithms
		save alternative possibilities on a stack while exploring one
		possible matching path. Catastrophic backtracking typically
		occurs when the number of steps grows exponentially with
		respect to the input: the runtime is $O((c_r)^n)$ in the
		length $n$ of the input string, where the base $c_r$ of the
		exponent is determined by the regular expression $r$
		(a concrete instance is sketched at the end of this item).
		%so that they
		%can be traversed in the future in a DFS manner,
		%different matchings are stored as sub-expressions 
		%in a regular expression derivative.
		Derivatives instead save these possibilities as
		sub-expressions and traverse them during later derivative
		steps. If we denote the size of the intermediate derivatives
		by $S_{r,n}$ (where the subscripts $r,n$ indicate that $S$
		depends on them), then the runtime of derivative-based
		approaches is $O(S_{r,n} \cdot n)$.
		We observe that if $S_{r,n}$ keeps growing with $n$ (for
		example exponentially fast), then this
		is just as bad as catastrophic backtracking.
		Our finiteness bound seeks to establish a constant upper
		bound $C$ on $S_{r,n}$, so that the complexity of the
		algorithm can be seen as linear ($O(C \cdot n)$).
		Even though $C$ is still large in our current work, it is a
		constant rather than an ever-growing number with respect to
		the input length $n$.
		More importantly, this constant $C$ can potentially
		be shrunk as we optimise our simplification procedure.
		%and showing the potential
		%improvements can be by the notion of partial derivatives.

		%If the internal data structures used by our algorithm
		%grows beyond a finite bound, then clearly 
		%the algorithm (which traverses these structures) will
		%be slow.
		%The next step is to refine the bound $N_a$ so that it
		%is not just finite but polynomial in $\llbracket a\rrbracket$.
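
		To make the contrast concrete, below is a minimal sketch (in
		Scala, not part of our formalisation) that triggers
		catastrophic backtracking in the JVM's backtracking regex
		engine with the classic pattern $(a^*)^*\cdot b$; the
		pattern, input family and timing loop are chosen purely for
		illustration:
\begin{verbatim}
// Illustration only: matching (a*)*b against a^n with the JVM's
// backtracking engine; a^n can never match, so the matcher explores
// exponentially many ways of splitting the a's between the two stars.
import java.util.regex.Pattern

object CatastrophicDemo {
  def main(args: Array[String]): Unit = {
    val p = Pattern.compile("(a*)*b")
    for (n <- 16 to 28 by 4) {
      val input = "a" * n
      val t0    = System.nanoTime()
      val res   = p.matcher(input).matches()
      val secs  = (System.nanoTime() - t0) / 1e9
      println(f"n = $n%2d  matched = $res  time = $secs%.3f s")
    }
  }
}
\end{verbatim}
		\noindent
		On a typical JVM the reported times grow exponentially in
		$n$, matching the $O((c_r)^n)$ behaviour described above,
		whereas a derivative-based matcher processes the same inputs
		in time proportional to $n$ (times the size of the
		intermediate derivatives).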
	\item
		Having the finite bound formalised 
		gives us higher confidence that
		our simplification algorithm $\simp$ does not ``misbehave''
		like $\textit{simpSL}$ does.