ChengsongTanPhdThesis/Chapters/Finite.tex
changeset 590 988e92a70704
parent 577 f47fc4840579
child 591 b2d0de6aee18
%  In Chapter 4 \ref{Chapter4} we give the second guarantee
%of our bitcoded algorithm, that is a finite bound on the size of any 
%regex's derivatives. 

In this chapter we give a guarantee in terms of size:
given an annotated regular expression $a$, for any string $s$,
the size of our algorithm $\blexersimp$'s internal
annotated regular expression is finitely bounded
by a constant $N_a$ that depends only on $a$:
\begin{center}
$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
\end{center}
\noindent
where the size of an annotated regular expression is defined
in terms of the number of nodes in its tree structure:
\begin{center}
\begin{tabular}{ccc}
	$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
	$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
	$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
	$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
	$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\textit{sum}\; (\map \; (\llbracket \_ \rrbracket)\; as)   + 1$\\
	$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$.
\end{tabular}
\end{center}
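The size measure defined above can be sketched in a small Python model. The constructor names and the list-based bitcodes below are our own toy encoding, not the thesis' Isabelle definitions:

```python
# A toy Python model of the size measure on annotated regular expressions.
# Constructor names and list-based bitcodes are our own assumptions,
# not the thesis' Isabelle definitions.
from dataclasses import dataclass
from typing import List

class ARexp: pass

@dataclass
class AZero(ARexp): pass            # ZERO carries no bitcodes

@dataclass
class AOne(ARexp):
    bs: List[int]

@dataclass
class AChar(ARexp):
    bs: List[int]
    c: str

@dataclass
class ASeq(ARexp):
    bs: List[int]
    r1: ARexp
    r2: ARexp

@dataclass
class AAlts(ARexp):
    bs: List[int]
    rs: List[ARexp]

@dataclass
class AStar(ARexp):
    bs: List[int]
    r: ARexp

def size(a: ARexp) -> int:
    """Number of nodes in the tree; the bitcodes bs are never counted."""
    if isinstance(a, (AZero, AOne, AChar)):
        return 1
    if isinstance(a, ASeq):
        return size(a.r1) + size(a.r2) + 1
    if isinstance(a, AAlts):
        return sum(size(r) for r in a.rs) + 1
    if isinstance(a, AStar):
        return size(a.r) + 1
    raise ValueError("unknown constructor")
```

Note that $\ZERO$ also has size $1$, as in the table: only tree nodes are counted, never the attached bit sequences.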

We believe this size formalisation 
of the algorithm is important in our context, because 
\begin{itemize}
	\item
		It is a stepping stone towards an ``absence of catastrophic-backtracking''
		guarantee. The next step would be to refine the bound $N_a$ so that it
		is polynomial in $\llbracket a\rrbracket$.
	\item
		The size bound proof gives us a higher confidence that
		our simplification algorithm $\simp$ does not ``mis-behave''
		like $\simpsulz$ does.
		The bound is universal, which is an advantage over work which 
		only gives empirical evidence on some test cases.
\end{itemize}
\section{Formalising About Size}
\noindent
In our lexer $\blexersimp$,
the regular expression is repeatedly taken derivative of
and then simplified.
\begin{figure}[H]
\begin{tikzpicture}[scale=2,
                    every node/.style={minimum size=11mm},
		    ->,>=stealth',shorten >=1pt,auto,thick
		    ]
\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$};
\node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$};
\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};

\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=6mm]{$a_{1s}$};
\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp$};

\node (r2) [rectangle, draw=black, thick,  right=of r1s, minimum size = 12mm]{$a_2$};
\draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};

\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=6mm]{$a_{2s}$};
\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp$};

\node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=6mm]{$a_{ns}$};
\draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$};

\node (v) [circle, thick, draw, right=of rns, minimum size=6mm, right=1.7cm]{$v$};
\draw[->, line width=0.2mm](rns)--(v) node[above, midway] {\bmkeps} node [below, midway] {\decode};
\end{tikzpicture}
\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{simpShrinks}
\end{figure}
\noindent
Each time
a derivative is taken, a regular expression might grow a bit,
but simplification always takes care that 
it stays small.
This intuition is depicted by the relative size
change between the black and blue nodes:
after $\simp$ the node always shrinks.
Our proof says that all the blue nodes
stay below a size bound $N_a$ determined by $a$.

\noindent
Sulzmann and Lu assumed something similar about their algorithm,
though in fact their algorithm's size might be better depicted by the following graph:
\begin{figure}[H]
\begin{tikzpicture}[scale=2,
                    every node/.style={minimum size=11mm},
		    ->,>=stealth',shorten >=1pt,auto,thick
		    ]
\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$};
\node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$};
\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};

\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$a_{1s}$};
\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp'$};

\node (r2) [rectangle, draw=black, thick,  right=of r1s, minimum size = 17mm]{$a_2$};
\draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};

\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=14mm]{$a_{2s}$};
\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp'$};

\node (r3) [rectangle, draw = black, thick, right= of r2s, minimum size = 22mm]{$a_3$};
\draw[->,line width=0.2mm](r2s)--(r3) node[above,midway] {$\backslash c_3$};

\node (rns) [right = of r3, draw=blue, minimum size = 20mm]{$a_{3s}$};
\draw[->,line width=0.2mm] (r3)--(rns) node [above, midway] {$\simp'$};

\node (rnn) [right = of rns, minimum size = 1mm]{};
\draw[->, dashed] (rns)--(rnn) node [above, midway] {$\ldots$};

\end{tikzpicture}
\caption{Regular expression size change during Sulzmann and Lu's algorithm}\label{sulzShrinks}
\end{figure}
\noindent
That is, in certain cases their lexer
will have an indefinite size explosion, causing the running time 
of each derivative step to grow arbitrarily large (for example 
in \ref{SulzmannLuLexerTime}).
The reason they made this mistake was that
they tested the run time of their
lexer on particular examples such as $(a+b+ab)^*$
and generalised to all cases, a mistake which
cannot happen with our mechanised proof.\\
We give details of the proof in the next sections.

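The difference between the two graphs can be reproduced on the very example $(a+b+ab)^*$ with a toy model. The encoding below uses plain regular expressions without bitcodes, and its $\textit{simp}$ only removes $\ZERO$s, flattens nested alternatives and deduplicates; this is in the spirit of, but not identical to, our verified $\simp$:

```python
# Toy model (plain regular expressions, no bitcodes) reproducing the two
# graphs: with simplification the derivative sizes stay bounded, without
# it they explode.  'simp' is only in the spirit of the verified one.
ZERO, ONE = ('0',), ('1',)

def CHAR(c):      return ('c', c)
def SEQ(r1, r2):  return ('seq', r1, r2)
def ALTS(rs):     return ('alts', tuple(rs))
def STAR(r):      return ('star', r)

def nullable(r):
    tag = r[0]
    if tag == '1' or tag == 'star': return True
    if tag == '0' or tag == 'c':    return False
    if tag == 'seq':  return nullable(r[1]) and nullable(r[2])
    if tag == 'alts': return any(nullable(x) for x in r[1])

def der(c, r):
    tag = r[0]
    if tag in ('0', '1'): return ZERO
    if tag == 'c':        return ONE if r[1] == c else ZERO
    if tag == 'alts':     return ALTS([der(c, x) for x in r[1]])
    if tag == 'star':     return SEQ(der(c, r[1]), r)
    # sequence: extra alternative if the first part is nullable
    left = SEQ(der(c, r[1]), r[2])
    return ALTS([left, der(c, r[2])]) if nullable(r[1]) else left

def simp(r):
    tag = r[0]
    if tag == 'seq':
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ZERO or r2 == ZERO: return ZERO
        if r1 == ONE: return r2
        return SEQ(r1, r2)
    if tag == 'alts':
        out = []
        for x in map(simp, r[1]):
            for y in (x[1] if x[0] == 'alts' else (x,)):  # flatten
                if y != ZERO and y not in out:             # drop 0, dedup
                    out.append(y)
        if not out: return ZERO
        return out[0] if len(out) == 1 else ALTS(out)
    return r

def size(r):
    tag = r[0]
    if tag in ('0', '1', 'c'): return 1
    return 1 + sum(size(x) for x in (r[1] if tag == 'alts' else r[1:]))
```

Deriving $(a+b+ab)^*$ character by character over $abab\ldots$ with interleaved simplification keeps the size oscillating below a small constant, while the unsimplified derivatives grow with every step.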
\subsection{Overview of the Proof}
Here is a bird's eye view of how the proof of finiteness works,
which involves three steps:
\begin{center}
  \begin{tikzpicture}[scale=1,font=\bf,
                      ultra thick,draw=black!50,minimum height=18mm, 
                      minimum width=20mm,
                      top color=white,bottom color=black!20}]


		      \node (0) at (-5,0) 
			      [node, text width=1.8cm, text centered] 
			      {$\llbracket \bderssimp{a}{s} \rrbracket$};
		      \node (A) at (0,0) 
			      [node,text width=1.6cm,  text centered] 
			      {$\llbracket \rderssimp{r}{s} \rrbracket_r$};
		      \node (B) at (3,0) 
			      [node,text width=3.0cm, anchor=west, minimum width = 40mm] 
			      {$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$};
  \node (C) at (9.5,0) [node, minimum width=10mm] {$N_r$};
 
  \draw [->,line width=0.5mm] (0) -- 
	  node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$(r = a \downarrow_r)$} (A); 
  \draw [->,line width=0.5mm] (A) -- 
	  node [above,pos=0.35] {$\quad =\ldots=$} (B); 
  \draw [->,line width=0.5mm] (B) -- 
	  node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); 
  \end{tikzpicture}
\end{center}
\noindent
We explain the steps one by one:
\begin{itemize}
	\item
		We first introduce the operations such as 
		derivatives, simplification, size calculation, etc.
		associated with $\rrexp$s, which we have given
		a very brief introduction to in chapter \ref{Bitcoded2}.
	\item
		We have a set of equalities for this new datatype that enables one to
		rewrite $\rderssimp{r_1 \cdot r_2}{s}$ 
		and $\rderssimp{r^*}{s}$ (the most involved
		inductive cases)
		by a combination of derivatives of their 
		children regular expressions ($r_1$, $r_2$, and $r$, respectively),
		which we call the \emph{Closed Forms}
		of the derivatives.
	\item
		The Closed Forms of the regular expressions
		are controlled by terms that
		are easier to deal with.
		Using the inductive hypothesis, these terms
		are in turn bounded loosely
		by a large yet constant number.
\end{itemize}
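Schematically, the closed form for the sequence case collects the derivative of the first component together with derivatives of the second component taken after the ``nullable points'' of the first. As a sketch of the idea (not the exact statement proved later), writing $\sim$ for equality up to simplification:

```latex
\[
(r_1 \cdot r_2)\backslash s \;\sim\;
   (r_1\backslash s)\cdot r_2 \;+\;
   \sum\,\{\, r_2\backslash s''
        \;\mid\; s = s'\,@\,s'',\; s'' \neq [],\;
        \textit{nullable}\,(r_1\backslash s')\,\}
\]
```

The right-hand side is a single big alternative, which is exactly the shape that the later size estimates exploit.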
We give details of these steps in the next sections.
The first step is to use 
$\textit{rrexp}$s,
something simpler than
annotated regular expressions. 

\section{The $\textit{rrexp}$ Datatype and Its Size Functions}

We first recap a bit about the new datatype
we defined in \ref{rrexpDef},
called $\textit{rrexp}$s.
We chose $\rrexp$ over 
the basic regular expressions
because it is more straightforward to tweak 
into the shape we want 
compared with an annotated regular expression.
We want to prove the size property about annotated regular expressions.
The size is 
written $\llbracket r\rrbracket$; its intuitive definition is given below.
\noindent
We first note that $\llbracket \_ \rrbracket$
$\rrexp$ give the exact correspondence between an annotated regular expression
and its (r-)erased version:

This does not hold for plain $\rexp$s. 

These operations are 
almost identical to those of the annotated regular expressions,
except that no bitcodes are attached.
Of course, the bits which encode the lexing information would grow linearly with respect 
to the input, which should be taken into account when we wish to tackle the runtime complexity.
But at the current stage 
we can safely ignore them.
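The correspondence, as far as size is concerned, can be illustrated with a toy tuple encoding (our own hypothetical encoding, not the thesis' definitions): erasing the bitcodes leaves the node count unchanged, so $\llbracket a \rrbracket = \llbracket a \downarrow_r \rrbracket_r$.

```python
# Toy illustration: erasing bitcodes does not change the size measure.
# Encodings are our own: annotated nodes are (tag, bs, ...),
# plain (erased) nodes are (tag, ...).
def erase(a):
    tag = a[0]
    if tag == 'zero':                       # ZERO carries no bitcodes
        return ('zero',)
    if tag in ('one', 'char'):              # ('one', bs) / ('char', bs, c)
        return (tag,) + a[2:]
    if tag == 'seq':                        # ('seq', bs, a1, a2)
        return ('seq', erase(a[2]), erase(a[3]))
    if tag == 'alts':                       # ('alts', bs, (a1, ..., an))
        return ('alts', tuple(erase(x) for x in a[2]))
    if tag == 'star':                       # ('star', bs, a)
        return ('star', erase(a[2]))

def asize(a):                               # size of an annotated regex
    tag = a[0]
    if tag in ('zero', 'one', 'char'):
        return 1
    if tag == 'alts':
        return 1 + sum(asize(x) for x in a[2])
    return 1 + sum(asize(x) for x in a[2:])  # seq, star

def rsize(r):                               # size of a plain (erased) regex
    tag = r[0]
    if tag in ('zero', 'one', 'char'):
        return 1
    if tag == 'alts':
        return 1 + sum(rsize(x) for x in r[1])
    return 1 + sum(rsize(x) for x in r[1:])  # seq, star
```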
Similarly there is a size function for plain regular expressions:
equality as below in the style of Arden's lemma:\\
\begin{center}
	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
\end{center}

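This equality follows from unfolding the star once, since on the language level $A^* = A\cdot A^* + \ONE$:

```latex
\[
L(A^*B) \;=\; L\big((A\cdot A^* + \ONE)\cdot B\big)
        \;=\; L(A\cdot A^*\cdot B \;+\; B)
\]
```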
       
There are two problems with this finiteness result, though.
\begin{itemize}
	\item
		First, it is not yet a direct formalisation of our lexer's complexity,
		as a complexity proof would require looking into 
		the time it takes to execute {\bf all} the operations
		involved in the lexer (simp, collect, decode), not just the derivative.
	\item
		Second, the bound is not yet tight, and we seek to improve $N_a$ so that
		it is polynomial in $\llbracket a \rrbracket$.
\end{itemize}
Still, we believe this contribution is fruitful,
because
\begin{itemize}
	\item
		The size proof can serve as a cornerstone for a complexity
		formalisation.
		Derivative-taking is the most important phase of our lexer algorithm.
		Size properties about derivatives cover the majority of the algorithm
		and are therefore a good indication of the complexity of the entire program.
	\item
		The bound is already a strong indication that catastrophic
		backtracking is much less likely to occur in our $\blexersimp$
		algorithm.
		We refine $\blexersimp$ with $\blexerStrong$ in the next chapter
		so that the bound becomes polynomial.
\end{itemize}
%----------------------------------------------------------------------------------------
%	SECTION 4
%----------------------------------------------------------------------------------------
