ChengsongTanPhdThesis/Chapters/Finite.tex
changeset 577 f47fc4840579
parent 576 3e1b699696b6
child 590 988e92a70704
equal deleted inserted replaced
576:3e1b699696b6 577:f47fc4840579
     7 %of our bitcoded algorithm, that is a finite bound on the size of any 
     7 %of our bitcoded algorithm, that is a finite bound on the size of any 
     8 %regex's derivatives. 
     8 %regex's derivatives. 
     9 
     9 
    10 In this chapter we give a guarantee in terms of size: 
    10 In this chapter we give a guarantee in terms of size: 
    11 given an annotated regular expression $a$, for any string $s$ 
    11 given an annotated regular expression $a$, for any string $s$ 
    12 our algorithm's internal data structure 
    12 our algorithm's internal annotated regular expression 
    13 $\bderssimp{a}{s}$'s size
    13 size
    14 \begin{center}
    14 \begin{center}
    15 $\llbracket \bderssimp{a}{s} \rrbracket$
    15 $\llbracket \bderssimp{a}{s} \rrbracket$
    16 \end{center}
    16 \end{center}
    17 \noindent
    17 \noindent
    18 is finitely bounded
    18 is finitely bounded
    19 by a constant $N_a$ that only depends on $a$.
    19 by a constant $N_a$ that only depends on $a$,
       
    20 where the size of an annotated regular expression is defined
       
    21 in terms of the number of nodes in its tree structure:
       
    22 \begin{center}
       
    23 \begin{tabular}{ccc}
       
    24 	$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
       
    25 	$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
       
    26 	$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
       
    27 $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
       
    28 $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
       
    29 $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
       
    30 \end{tabular}
       
    31 \end{center}
    20 
    32 
    21 \section{Formalising About Size}
    33 \section{Formalising About Size}
       
    34 \noindent
    22 In our lexer $\blexersimp$,
    35 In our lexer $\blexersimp$,
    23 The regular expression is repeatedly being taken derivative of
    36 The regular expression is repeatedly being taken derivative of
    24 and then simplified.
    37 and then simplified.
    25 
       
    26 \begin{center}
    38 \begin{center}
    27 \begin{tikzpicture}[scale=2,node distance=1.4cm,
    39 \begin{tikzpicture}[scale=2,node distance=1.4cm,
    28                     every node/.style={minimum size=9mm}]
    40                     every node/.style={minimum size=9mm}]
    29 \node (r0)  {$r$};
    41 \node (r0)  {$r$};
    30 \node (r1) [rectangle, draw=black, thick, right=of r0]{$r_1$};
    42 \node (r1) [rectangle, draw=black, thick, right=of r0]{$r_1$};
    35 \draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};
    47 \draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};
    36 \node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=7mm]{$r_{2s}$};
    48 \node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=7mm]{$r_{2s}$};
    37 \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$simp$};
    49 \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$simp$};
    38 \node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=7mm]{$r_{ns}$};
    50 \node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=7mm]{$r_{ns}$};
    39 \draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$};
    51 \draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$};
    40 \node (v) [circle, draw = blue, thick, right=of rns,minimum size=7mm]{$v$};
    52 \node (v) [circle, draw = blue, thick, right=of rns, minimum size=7mm, right=2.7cm]{$v$};
    41 \draw[->, line width=0.2mm](rns)--(v) node[above, midway] {collect+decode};
    53 \draw[->, line width=0.2mm](rns)--(v) node[above, midway] {collect+decode};
    42 \end{tikzpicture}
    54 \end{tikzpicture}
    43 \end{center}
    55 \end{center}
    44 \noindent
    56 \noindent
    45 As illustrated in the picture, 
    57 As illustrated in the picture, 
    46 each time the derivative is taken derivative of, it grows,
    58 each time the regular expression 
    47 and when it is being simplified, it shrinks.
    59 is taken derivative of, it grows (the black nodes),
    48 The blue ones are the regular expressions after simplification,
    60 and after simplification, it shrinks (the blue nodes).
    49 which would be smaller than before.
    61 This intuition is depicted by the relative size
       
    62 difference between the black and blue nodes.
    50 We give a mechanised proof that after simplification 
    63 We give a mechanised proof that after simplification 
    51 the regular expression's size (the blue ones)
    64 the regular expression's size (the blue ones)
    52 $\bderssimp{a}{s}$ will never exceed a constant $N_a$ for
    65 $\bderssimp{a}{s}$ will never exceed a constant $N_a$ for
    53 a fixed $a$. 
    66 a fixed $a$. 
    54 While it is not yet a direct formalisation of our lexer's complexity,
    67 
    55 it is a stepping stone towards a complexity proof because
    68 There are two problems with this finiteness proof, though.
    56 if the data structures out lexer has to traverse is large, the program
    69 \begin{itemize}
    57 will certainly be slow.
    70 	\item
    58 
    71 First, It is not yet a direct formalisation of our lexer's complexity,
    59 The bound is not yet tight, and we seek to improve $N_a$ so that
    72 as a complexity proof would require looking into 
       
    73 the time it takes to execute {\bf all} the operations
       
    74 involved in the lexer (simp, collect, decode), not just the derivative.
       
    75 \item
       
    76 Second, the bound is not yet tight, and we seek to improve $N_a$ so that
    60 it is polynomial on $\llbracket a \rrbracket$.
    77 it is polynomial on $\llbracket a \rrbracket$.
    61 We believe a formalisation of the complexity-related properties
    78 \end{itemize}
    62 of the algorithm is important in our context, because we want to address
    79 Still, we believe this size formalisation 
    63 catastrophic backtracking, which is not a correctness issue but
    80 of the algorithm is important in our context, because 
    64 in essence a performance issue, a formal proof can give
    81 \begin{itemize}
    65 the strongest assurance that such issues cannot arise 
    82 	\item
    66 regardless of what the input is.
    83 		Derivatives are the most important phases of our lexer algorithm.
    67 This level of certainty cannot come from a pencil and paper proof or
    84 		Size properties about derivatives covers the majority of the algorithm
    68 eimpirical evidence.
    85 		and is therefore a good indication of complexity of the entire program.
    69 
    86 	\item
       
    87 		What the size bound proof does ensure is an absence of 
       
    88 		catastrophic backtracking, which is prevalent in regular expression engines
       
    89 		in popular programming languages like Java.
       
    90 		We prove catastrophic backtracking cannot happen for {\bf all} inputs,
       
    91 		which is an advantage over work which 
       
    92 		only gives empirical evidence on some test cases.
       
    93 \end{itemize}
    70 For example, Sulzmann and Lu's made claimed that their algorithm
    94 For example, Sulzmann and Lu's made claimed that their algorithm
    71 with bitcodes and simplifications can lex in linear time with respect
    95 with bitcodes and simplifications can lex in linear time with respect
    72 to the input string. This assumes that each
    96 to the input string. This assumes that each
    73 derivative operation takes constant time. 
    97 derivative operation takes constant time. 
    74 However, it turns out that on certain cases their lexer 
    98 However, it turns out that on certain cases their lexer 
    75 will have an indefinite size explosion, causing the running time 
    99 will have an indefinite size explosion, causing the running time 
    76 of each derivative step to grow arbitrarily large.
   100 of each derivative step to grow arbitrarily large.
    77 Here in our proof we state that such explosions cannot happen 
   101 Here in our proof we state that such explosions cannot happen 
    78 with our simplification function.
   102 with our simplification function.
    79 
   103 \subsection{Overview of the Proof}
    80 Here is a bird's eye view of how the proof of finiteness works: 
   104 Here is a bird's eye view of how the proof of finiteness works,
       
   105 which involves three steps:
    81 \begin{center}
   106 \begin{center}
    82   \begin{tikzpicture}[scale=1,font=\bf,
   107   \begin{tikzpicture}[scale=1,font=\bf,
    83                       node/.style={
   108                       node/.style={
    84                       rectangle,rounded corners=3mm,
   109                       rectangle,rounded corners=3mm,
    85                       ultra thick,draw=black!50,minimum height=18mm, 
   110                       ultra thick,draw=black!50,minimum height=18mm, 
    96   \draw [->,line width=0.5mm] (A) -- node [above,pos=0.35] {$\quad =\ldots=$} (B); 
   121   \draw [->,line width=0.5mm] (A) -- node [above,pos=0.35] {$\quad =\ldots=$} (B); 
    97   \draw [->,line width=0.5mm] (B) -- node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); 
   122   \draw [->,line width=0.5mm] (B) -- node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); 
    98   \end{tikzpicture}
   123   \end{tikzpicture}
    99 \end{center}
   124 \end{center}
   100 \noindent
   125 \noindent
   101 We explain the above steps one by one:
   126 We explain the steps one by one:
   102 \begin{itemize}
   127 \begin{itemize}
   103 \item
   128 \item
   104 We first use a new datatype, called $\textit{rrexp}$s, whose 
   129 We first define a new datatype
   105 inductive type definition and derivative and simplification operations are
   130 that is more straightforward to tweak 
       
   131 into the shape we want 
       
   132 compared with an annotated regular expression,
       
   133 called $\textit{rrexp}$s.
       
   134 Its inductive type definition and 
       
   135 derivative and simplification operations are
   106 almost identical to those of the annotated regular expressions,
   136 almost identical to those of the annotated regular expressions,
   107 except that no bitcodes are attached.
   137 except that no bitcodes are attached.
   108 This new datatype is more straightforward to tweak 
       
   109 compared with an annotated regular expression.
       
   110 \item
   138 \item
   111 We have a set of equalities for this new datatype that enables one to
   139 We have a set of equalities for this new datatype that enables one to
   112 rewrite $\rderssimp{r_1 \cdot r_2}{s}$ and $\rderssimp{r^*}{s}$ etc.
   140 rewrite $\rderssimp{r_1 \cdot r_2}{s}$ 
       
   141 and $\rderssimp{r^*}{s}$ (the most involved
       
   142 inductive cases)
   113 by a combinatioin of derivatives of their 
   143 by a combinatioin of derivatives of their 
   114 children regular expressions ($r_1$, $r_2$, and $r$, respectively).
   144 children regular expressions ($r_1$, $r_2$, and $r$, respectively),
   115 These equalities are chained together to get into a shape
   145 which we call the \emph{Closed Forms}
   116 that is very easy to estimate, which we call the \emph{Closed Forms}
       
   117 of the derivatives.
   146 of the derivatives.
   118 \item
   147 \item
   119 This closed form is controlled by terms that
   148 The Closed Forms of the regular expressions
   120 are easier to deal with, wich are in turn bounded loosely
   149 are controlled by terms that
       
   150 are easier to deal with.
       
   151 Using inductive hypothesis, these terms
       
   152 are in turn bounded loosely
   121 by a large yet constant number.
   153 by a large yet constant number.
   122 \end{itemize}
   154 \end{itemize}
   123 We give details of these steps in the next sections.
   155 We give details of these steps in the next sections.
   124 The first step is to use something simpler than annotated regular expressions. 
   156 The first step is to use 
   125 
   157 $\textit{rrexp}$s,
   126 \section{the $\mathbf{r}$-rexp datatype and the size functions}
   158 something simpler than
       
   159 annotated regular expressions. 
       
   160 
       
   161 \section{the $\textit{rrexp}$ Datatype and Its Size Functions}
   127 We want to prove the size property about annotated regular expressions.
   162 We want to prove the size property about annotated regular expressions.
   128 The size is 
   163 The size is 
   129 written $\llbracket r\rrbracket$, whose intuitive definition is as below
   164 written $\llbracket r\rrbracket$, whose intuitive definition is as below
   130 \begin{center}
       
   131 \begin{tabular}{ccc}
       
   132 	$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
       
   133 	$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
       
   134 	$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
       
   135 $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
       
   136 $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
       
   137 $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
       
   138 \end{tabular}
       
   139 \end{center}
       
   140 \noindent
   165 \noindent
   141 We first note that $\llbracket \_ \rrbracket$
   166 We first note that $\llbracket \_ \rrbracket$
   142 is unaware of bitcodes, since 
   167 is unaware of bitcodes, since 
   143 it only counts the number of nodes
   168 it only counts the number of nodes
   144 if we regard $r$ as a tree.
   169 if we regard $r$ as a tree.