7 %of our bitcoded algorithm, that is a finite bound on the size of any |
7 %of our bitcoded algorithm, that is a finite bound on the size of any |
8 %regex's derivatives. |
8 %regex's derivatives. |
9 |
9 |
10 In this chapter we give a guarantee in terms of size: |
10 In this chapter we give a guarantee in terms of size: |
11 given an annotated regular expression $a$, for any string $s$ |
11 given an annotated regular expression $a$, for any string $s$ |
12 our algorithm's internal data structure |
12 our algorithm's internal annotated regular expression |
13 $\bderssimp{a}{s}$'s size |
13 size |
14 \begin{center} |
14 \begin{center} |
15 $\llbracket \bderssimp{a}{s} \rrbracket$ |
15 $\llbracket \bderssimp{a}{s} \rrbracket$ |
16 \end{center} |
16 \end{center} |
17 \noindent |
17 \noindent |
18 is finitely bounded |
18 is finitely bounded |
19 by a constant $N_a$ that only depends on $a$. |
19 by a constant $N_a$ that only depends on $a$, |
|
20 where the size of an annotated regular expression is defined |
|
21 in terms of the number of nodes in its tree structure: |
|
22 \begin{center} |
|
23 \begin{tabular}{ccc} |
|
24 $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ |
|
25 $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ |
|
26 $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ |
|
27 $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ |
|
28 $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ |
|
29 $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$ |
|
30 \end{tabular} |
|
31 \end{center} |
20 |
32 |
21 \section{Formalising About Size} |
33 \section{Formalising About Size} |
|
34 \noindent |
22 In our lexer $\blexersimp$, |
35 In our lexer $\blexersimp$, |
23 The regular expression is repeatedly being taken derivative of |
36 The regular expression is repeatedly being taken derivative of |
24 and then simplified. |
37 and then simplified. |
25 |
|
26 \begin{center} |
38 \begin{center} |
27 \begin{tikzpicture}[scale=2,node distance=1.4cm, |
39 \begin{tikzpicture}[scale=2,node distance=1.4cm, |
28 every node/.style={minimum size=9mm}] |
40 every node/.style={minimum size=9mm}] |
29 \node (r0) {$r$}; |
41 \node (r0) {$r$}; |
30 \node (r1) [rectangle, draw=black, thick, right=of r0]{$r_1$}; |
42 \node (r1) [rectangle, draw=black, thick, right=of r0]{$r_1$}; |
35 \draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$}; |
47 \draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$}; |
36 \node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=7mm]{$r_{2s}$}; |
48 \node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=7mm]{$r_{2s}$}; |
37 \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$simp$}; |
49 \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$simp$}; |
38 \node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=7mm]{$r_{ns}$}; |
50 \node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=7mm]{$r_{ns}$}; |
39 \draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$}; |
51 \draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$}; |
40 \node (v) [circle, draw = blue, thick, right=of rns,minimum size=7mm]{$v$}; |
52 \node (v) [circle, draw = blue, thick, right=of rns, minimum size=7mm, right=2.7cm]{$v$}; |
41 \draw[->, line width=0.2mm](rns)--(v) node[above, midway] {collect+decode}; |
53 \draw[->, line width=0.2mm](rns)--(v) node[above, midway] {collect+decode}; |
42 \end{tikzpicture} |
54 \end{tikzpicture} |
43 \end{center} |
55 \end{center} |
44 \noindent |
56 \noindent |
45 As illustrated in the picture, |
57 As illustrated in the picture, |
46 each time the derivative is taken derivative of, it grows, |
58 each time the regular expression |
47 and when it is being simplified, it shrinks. |
59 is taken derivative of, it grows (the black nodes), |
48 The blue ones are the regular expressions after simplification, |
60 and after simplification, it shrinks (the blue nodes). |
49 which would be smaller than before. |
61 This intuition is depicted by the relative size |
|
62 difference between the black and blue nodes. |
50 We give a mechanised proof that after simplification |
63 We give a mechanised proof that after simplification |
51 the regular expression's size (the blue ones) |
64 the regular expression's size (the blue ones) |
52 $\bderssimp{a}{s}$ will never exceed a constant $N_a$ for |
65 $\bderssimp{a}{s}$ will never exceed a constant $N_a$ for |
53 a fixed $a$. |
66 a fixed $a$. |
54 While it is not yet a direct formalisation of our lexer's complexity, |
67 |
55 it is a stepping stone towards a complexity proof because |
68 There are two problems with this finiteness proof, though. |
56 if the data structures out lexer has to traverse is large, the program |
69 \begin{itemize} |
57 will certainly be slow. |
70 \item |
58 |
71 First, It is not yet a direct formalisation of our lexer's complexity, |
59 The bound is not yet tight, and we seek to improve $N_a$ so that |
72 as a complexity proof would require looking into |
|
73 the time it takes to execute {\bf all} the operations |
|
74 involved in the lexer (simp, collect, decode), not just the derivative. |
|
75 \item |
|
76 Second, the bound is not yet tight, and we seek to improve $N_a$ so that |
60 it is polynomial on $\llbracket a \rrbracket$. |
77 it is polynomial on $\llbracket a \rrbracket$. |
61 We believe a formalisation of the complexity-related properties |
78 \end{itemize} |
62 of the algorithm is important in our context, because we want to address |
79 Still, we believe this size formalisation |
63 catastrophic backtracking, which is not a correctness issue but |
80 of the algorithm is important in our context, because |
64 in essence a performance issue, a formal proof can give |
81 \begin{itemize} |
65 the strongest assurance that such issues cannot arise |
82 \item |
66 regardless of what the input is. |
83 Derivatives are the most important phases of our lexer algorithm. |
67 This level of certainty cannot come from a pencil and paper proof or |
84 Size properties about derivatives covers the majority of the algorithm |
68 eimpirical evidence. |
85 and is therefore a good indication of complexity of the entire program. |
69 |
86 \item |
|
87 What the size bound proof does ensure is an absence of |
|
88 catastrophic backtracking, which is prevalent in regular expression engines |
|
89 in popular programming languages like Java. |
|
90 We prove catastrophic backtracking cannot happen for {\bf all} inputs, |
|
91 which is an advantage over work which |
|
92 only gives empirical evidence on some test cases. |
|
93 \end{itemize} |
70 For example, Sulzmann and Lu's made claimed that their algorithm |
94 For example, Sulzmann and Lu's made claimed that their algorithm |
71 with bitcodes and simplifications can lex in linear time with respect |
95 with bitcodes and simplifications can lex in linear time with respect |
72 to the input string. This assumes that each |
96 to the input string. This assumes that each |
73 derivative operation takes constant time. |
97 derivative operation takes constant time. |
74 However, it turns out that on certain cases their lexer |
98 However, it turns out that on certain cases their lexer |
75 will have an indefinite size explosion, causing the running time |
99 will have an indefinite size explosion, causing the running time |
76 of each derivative step to grow arbitrarily large. |
100 of each derivative step to grow arbitrarily large. |
77 Here in our proof we state that such explosions cannot happen |
101 Here in our proof we state that such explosions cannot happen |
78 with our simplification function. |
102 with our simplification function. |
79 |
103 \subsection{Overview of the Proof} |
80 Here is a bird's eye view of how the proof of finiteness works: |
104 Here is a bird's eye view of how the proof of finiteness works, |
|
105 which involves three steps: |
81 \begin{center} |
106 \begin{center} |
82 \begin{tikzpicture}[scale=1,font=\bf, |
107 \begin{tikzpicture}[scale=1,font=\bf, |
83 node/.style={ |
108 node/.style={ |
84 rectangle,rounded corners=3mm, |
109 rectangle,rounded corners=3mm, |
85 ultra thick,draw=black!50,minimum height=18mm, |
110 ultra thick,draw=black!50,minimum height=18mm, |
96 \draw [->,line width=0.5mm] (A) -- node [above,pos=0.35] {$\quad =\ldots=$} (B); |
121 \draw [->,line width=0.5mm] (A) -- node [above,pos=0.35] {$\quad =\ldots=$} (B); |
97 \draw [->,line width=0.5mm] (B) -- node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); |
122 \draw [->,line width=0.5mm] (B) -- node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); |
98 \end{tikzpicture} |
123 \end{tikzpicture} |
99 \end{center} |
124 \end{center} |
100 \noindent |
125 \noindent |
101 We explain the above steps one by one: |
126 We explain the steps one by one: |
102 \begin{itemize} |
127 \begin{itemize} |
103 \item |
128 \item |
104 We first use a new datatype, called $\textit{rrexp}$s, whose |
129 We first define a new datatype |
105 inductive type definition and derivative and simplification operations are |
130 that is more straightforward to tweak |
|
131 into the shape we want |
|
132 compared with an annotated regular expression, |
|
133 called $\textit{rrexp}$s. |
|
134 Its inductive type definition and |
|
135 derivative and simplification operations are |
106 almost identical to those of the annotated regular expressions, |
136 almost identical to those of the annotated regular expressions, |
107 except that no bitcodes are attached. |
137 except that no bitcodes are attached. |
108 This new datatype is more straightforward to tweak |
|
109 compared with an annotated regular expression. |
|
110 \item |
138 \item |
111 We have a set of equalities for this new datatype that enables one to |
139 We have a set of equalities for this new datatype that enables one to |
112 rewrite $\rderssimp{r_1 \cdot r_2}{s}$ and $\rderssimp{r^*}{s}$ etc. |
140 rewrite $\rderssimp{r_1 \cdot r_2}{s}$ |
|
141 and $\rderssimp{r^*}{s}$ (the most involved |
|
142 inductive cases) |
113 by a combinatioin of derivatives of their |
143 by a combinatioin of derivatives of their |
114 children regular expressions ($r_1$, $r_2$, and $r$, respectively). |
144 children regular expressions ($r_1$, $r_2$, and $r$, respectively), |
115 These equalities are chained together to get into a shape |
145 which we call the \emph{Closed Forms} |
116 that is very easy to estimate, which we call the \emph{Closed Forms} |
|
117 of the derivatives. |
146 of the derivatives. |
118 \item |
147 \item |
119 This closed form is controlled by terms that |
148 The Closed Forms of the regular expressions |
120 are easier to deal with, wich are in turn bounded loosely |
149 are controlled by terms that |
|
150 are easier to deal with. |
|
151 Using inductive hypothesis, these terms |
|
152 are in turn bounded loosely |
121 by a large yet constant number. |
153 by a large yet constant number. |
122 \end{itemize} |
154 \end{itemize} |
123 We give details of these steps in the next sections. |
155 We give details of these steps in the next sections. |
124 The first step is to use something simpler than annotated regular expressions. |
156 The first step is to use |
125 |
157 $\textit{rrexp}$s, |
126 \section{the $\mathbf{r}$-rexp datatype and the size functions} |
158 something simpler than |
|
159 annotated regular expressions. |
|
160 |
|
161 \section{the $\textit{rrexp}$ Datatype and Its Size Functions} |
127 We want to prove the size property about annotated regular expressions. |
162 We want to prove the size property about annotated regular expressions. |
128 The size is |
163 The size is |
129 written $\llbracket r\rrbracket$, whose intuitive definition is as below |
164 written $\llbracket r\rrbracket$, whose intuitive definition is as below |
130 \begin{center} |
|
131 \begin{tabular}{ccc} |
|
132 $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ |
|
133 $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ |
|
134 $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ |
|
135 $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ |
|
136 $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ |
|
137 $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$ |
|
138 \end{tabular} |
|
139 \end{center} |
|
140 \noindent |
165 \noindent |
141 We first note that $\llbracket \_ \rrbracket$ |
166 We first note that $\llbracket \_ \rrbracket$ |
142 is unaware of bitcodes, since |
167 is unaware of bitcodes, since |
143 it only counts the number of nodes |
168 it only counts the number of nodes |
144 if we regard $r$ as a tree. |
169 if we regard $r$ as a tree. |