6 % In Chapter 4 \ref{Chapter4} we give the second guarantee |
6 % In Chapter 4 \ref{Chapter4} we give the second guarantee |
7 %of our bitcoded algorithm, that is a finite bound on the size of any |
7 %of our bitcoded algorithm, that is a finite bound on the size of any |
8 %regex's derivatives. |
8 %regex's derivatives. |
9 |
9 |
10 In this chapter we give a guarantee in terms of size: |
10 In this chapter we give a guarantee in terms of size: |
11 given an annotated regular expression $a$, for any string $s$ |
11 given an annotated regular expression $a$, for any string $s$ |
12 our algorithm's internal annotated regular expression |
12 our algorithm $\blexersimp$'s internal annotated regular expression |
13 size |
13 size is finitely bounded |
14 \begin{center} |
14 by a constant $N_a$ that only depends on $a$: |
15 $\llbracket \bderssimp{a}{s} \rrbracket$ |
15 \begin{center} |
16 \end{center} |
16 $\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$ |
17 \noindent |
17 \end{center} |
18 is finitely bounded |
18 \noindent |
19 by a constant $N_a$ that only depends on $a$, |
|
20 where the size of an annotated regular expression is defined |
19 where the size of an annotated regular expression is defined |
21 in terms of the number of nodes in its tree structure: |
20 in terms of the number of nodes in its tree structure: |
22 \begin{center} |
21 \begin{center} |
23 \begin{tabular}{ccc} |
22 \begin{tabular}{ccc} |
24 $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ |
23 $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ |
25 $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ |
24 $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ |
26 $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ |
25 $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ |
27 $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ |
26 $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ |
28 $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ |
27 $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ |
29 $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$ |
28 $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$. |
30 \end{tabular} |
29 \end{tabular} |
31 \end{center} |
30 \end{center} |
32 |
31 We believe this size formalisation |
|
32 of the algorithm is important in our context, because |
|
33 \begin{itemize} |
|
34 \item |
|
35 It is a stepping stone towards an ``absence of catastrophic-backtracking'' |
|
36 guarantee. The next step would be to refine the bound $N_a$ so that it |
|
37 is polynomial on $\llbracket a\rrbracket$. |
|
38 \item |
|
39 The size bound proof gives us a higher confidence that |
|
40 our simplification algorithm $\simp$ does not ``mis-behave'' |
|
41 like $\simpsulz$ does. |
|
42 The bound is universal, which is an advantage over work which |
|
43 only gives empirical evidence on some test cases. |
|
44 \end{itemize} |
33 \section{Formalising About Size} |
45 \section{Formalising About Size} |
34 \noindent |
46 \noindent |
35 In our lexer $\blexersimp$, |
47 In our lexer $\blexersimp$, |
36 The regular expression is repeatedly being taken derivative of |
48 The regular expression is repeatedly being taken derivative of |
37 and then simplified. |
49 and then simplified. |
38 \begin{center} |
50 \begin{figure}[H] |
39 \begin{tikzpicture}[scale=2,node distance=1.4cm, |
51 \begin{tikzpicture}[scale=2, |
40 every node/.style={minimum size=9mm}] |
52 every node/.style={minimum size=11mm}, |
41 \node (r0) {$r$}; |
53 ->,>=stealth',shorten >=1pt,auto,thick |
42 \node (r1) [rectangle, draw=black, thick, right=of r0]{$r_1$}; |
54 ] |
|
55 \node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$}; |
|
56 \node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$}; |
43 \draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$}; |
57 \draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$}; |
44 \node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$r_{1s}$}; |
58 |
45 \draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {simp}; |
59 \node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=6mm]{$a_{1s}$}; |
46 \node (r2) [rectangle, draw=black, thick, right=of r1s]{$r_2$}; |
60 \draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp$}; |
|
61 |
|
62 \node (r2) [rectangle, draw=black, thick, right=of r1s, minimum size = 12mm]{$a_2$}; |
47 \draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$}; |
63 \draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$}; |
48 \node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=7mm]{$r_{2s}$}; |
64 |
49 \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$simp$}; |
65 \node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=6mm]{$a_{2s}$}; |
50 \node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=7mm]{$r_{ns}$}; |
66 \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp$}; |
|
67 |
|
68 \node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=6mm]{$a_{ns}$}; |
51 \draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$}; |
69 \draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$}; |
52 \node (v) [circle, draw = blue, thick, right=of rns, minimum size=7mm, right=2.7cm]{$v$}; |
70 |
53 \draw[->, line width=0.2mm](rns)--(v) node[above, midway] {collect+decode}; |
71 \node (v) [circle, thick, draw, right=of rns, minimum size=6mm, right=1.7cm]{$v$}; |
|
72 \draw[->, line width=0.2mm](rns)--(v) node[above, midway] {\bmkeps} node [below, midway] {\decode}; |
54 \end{tikzpicture} |
73 \end{tikzpicture} |
55 \end{center} |
74 \caption{Regular expression size change during our $\blexersimp$ algorithm}\label{simpShrinks} |
56 \noindent |
75 \end{figure} |
57 As illustrated in the picture, |
76 \noindent |
58 each time the regular expression |
77 Each time |
59 is taken derivative of, it grows (the black nodes), |
78 a derivative is taken, a regular expression might grow a bit, |
60 and after simplification, it shrinks (the blue nodes). |
79 but simplification always takes care that |
|
80 it stays small. |
61 This intuition is depicted by the relative size |
81 This intuition is depicted by the relative size |
62 difference between the black and blue nodes. |
82 change between the black and blue nodes: |
63 We give a mechanised proof that after simplification |
83 After $\simp$ the node always shrinks. |
64 the regular expression's size (the blue ones) |
84 Our proof says that all the blue nodes |
65 $\bderssimp{a}{s}$ will never exceed a constant $N_a$ for |
85 stay below a size bound $N_a$ determined by $a$. |
66 a fixed $a$. |
86 |
67 |
87 \noindent |
68 There are two problems with this finiteness proof, though. |
88 Sulzmann and Lu's assumed something similar about their algorithm, |
69 \begin{itemize} |
89 though in fact their algorithm's size might be better depicted by the following graph: |
70 \item |
90 \begin{figure}[H] |
71 First, It is not yet a direct formalisation of our lexer's complexity, |
91 \begin{tikzpicture}[scale=2, |
72 as a complexity proof would require looking into |
92 every node/.style={minimum size=11mm}, |
73 the time it takes to execute {\bf all} the operations |
93 ->,>=stealth',shorten >=1pt,auto,thick |
74 involved in the lexer (simp, collect, decode), not just the derivative. |
94 ] |
75 \item |
95 \node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$}; |
76 Second, the bound is not yet tight, and we seek to improve $N_a$ so that |
96 \node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$}; |
77 it is polynomial on $\llbracket a \rrbracket$. |
97 \draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$}; |
78 \end{itemize} |
98 |
79 Still, we believe this size formalisation |
99 \node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$a_{1s}$}; |
80 of the algorithm is important in our context, because |
100 \draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp'$}; |
81 \begin{itemize} |
101 |
82 \item |
102 \node (r2) [rectangle, draw=black, thick, right=of r1s, minimum size = 17mm]{$a_2$}; |
83 Derivatives are the most important phases of our lexer algorithm. |
103 \draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$}; |
84 Size properties about derivatives covers the majority of the algorithm |
104 |
85 and is therefore a good indication of complexity of the entire program. |
105 \node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=14mm]{$a_{2s}$}; |
86 \item |
106 \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp'$}; |
87 What the size bound proof does ensure is an absence of |
107 |
88 catastrophic backtracking, which is prevalent in regular expression engines |
108 \node (r3) [rectangle, draw = black, thick, right= of r2s, minimum size = 22mm]{$a_3$}; |
89 in popular programming languages like Java. |
109 \draw[->,line width=0.2mm](r2s)--(r3) node[above,midway] {$\backslash c_3$}; |
90 We prove catastrophic backtracking cannot happen for {\bf all} inputs, |
110 |
91 which is an advantage over work which |
111 \node (rns) [right = of r3, draw=blue, minimum size = 20mm]{$a_{3s}$}; |
92 only gives empirical evidence on some test cases. |
112 \draw[->,line width=0.2mm] (r3)--(rns) node [above, midway] {$\simp'$}; |
93 \end{itemize} |
113 |
94 For example, Sulzmann and Lu's made claimed that their algorithm |
114 \node (rnn) [right = of rns, minimum size = 1mm]{}; |
95 with bitcodes and simplifications can lex in linear time with respect |
115 \draw[->, dashed] (rns)--(rnn) node [above, midway] {$\ldots$}; |
96 to the input string. This assumes that each |
116 |
97 derivative operation takes constant time. |
117 \end{tikzpicture} |
98 However, it turns out that on certain cases their lexer |
118 \caption{Regular expression size change during our $\blexersimp$ algorithm}\label{sulzShrinks} |
|
119 \end{figure} |
|
120 \noindent |
|
121 That is, on certain cases their lexer |
99 will have an indefinite size explosion, causing the running time |
122 will have an indefinite size explosion, causing the running time |
100 of each derivative step to grow arbitrarily large. |
123 of each derivative step to grow arbitrarily large (for example |
101 Here in our proof we state that such explosions cannot happen |
124 in \ref{SulzmannLuLexerTime}). |
102 with our simplification function. |
125 The reason they made this mistake was that |
|
126 they tested out the run time of their |
|
127 lexer on particular examples such as $(a+b+ab)^*$ |
|
128 and generalised to all cases, which |
|
129 cannot happen with our mecahnised proof.\\ |
|
130 We give details of the proof in the next sections. |
|
131 |
103 \subsection{Overview of the Proof} |
132 \subsection{Overview of the Proof} |
104 Here is a bird's eye view of how the proof of finiteness works, |
133 Here is a bird's eye view of how the proof of finiteness works, |
105 which involves three steps: |
134 which involves three steps: |
106 \begin{center} |
135 \begin{center} |
107 \begin{tikzpicture}[scale=1,font=\bf, |
136 \begin{tikzpicture}[scale=1,font=\bf, |
110 ultra thick,draw=black!50,minimum height=18mm, |
139 ultra thick,draw=black!50,minimum height=18mm, |
111 minimum width=20mm, |
140 minimum width=20mm, |
112 top color=white,bottom color=black!20}] |
141 top color=white,bottom color=black!20}] |
113 |
142 |
114 |
143 |
115 \node (0) at (-5,0) [node, text width=1.8cm, text centered] {$\llbracket \bderssimp{a}{s} \rrbracket$}; |
144 \node (0) at (-5,0) |
116 \node (A) at (0,0) [node,text width=1.6cm, text centered] {$\llbracket \rderssimp{r}{s} \rrbracket_r$}; |
145 [node, text width=1.8cm, text centered] |
117 \node (B) at (3,0) [node,text width=3.0cm, anchor=west, minimum width = 40mm] {$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$}; |
146 {$\llbracket \bderssimp{a}{s} \rrbracket$}; |
|
147 \node (A) at (0,0) |
|
148 [node,text width=1.6cm, text centered] |
|
149 {$\llbracket \rderssimp{r}{s} \rrbracket_r$}; |
|
150 \node (B) at (3,0) |
|
151 [node,text width=3.0cm, anchor=west, minimum width = 40mm] |
|
152 {$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$}; |
118 \node (C) at (9.5,0) [node, minimum width=10mm] {$N_r$}; |
153 \node (C) at (9.5,0) [node, minimum width=10mm] {$N_r$}; |
119 |
154 |
120 \draw [->,line width=0.5mm] (0) -- node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$r = a \downarrow_r$} (A); |
155 \draw [->,line width=0.5mm] (0) -- |
121 \draw [->,line width=0.5mm] (A) -- node [above,pos=0.35] {$\quad =\ldots=$} (B); |
156 node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$(r = a \downarrow_r)$} (A); |
122 \draw [->,line width=0.5mm] (B) -- node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); |
157 \draw [->,line width=0.5mm] (A) -- |
|
158 node [above,pos=0.35] {$\quad =\ldots=$} (B); |
|
159 \draw [->,line width=0.5mm] (B) -- |
|
160 node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); |
123 \end{tikzpicture} |
161 \end{tikzpicture} |
124 \end{center} |
162 \end{center} |
125 \noindent |
163 \noindent |
126 We explain the steps one by one: |
164 We explain the steps one by one: |
127 \begin{itemize} |
165 \begin{itemize} |
128 \item |
166 \item |
129 We first define a new datatype |
167 We first introduce the operations such as |
130 that is more straightforward to tweak |
168 derivatives, simplification, size calculation, etc. |
131 into the shape we want |
169 associated with $\rrexp$s, which we have given |
132 compared with an annotated regular expression, |
170 a very brief introduction to in chapter \ref{Bitcoded2}. |
133 called $\textit{rrexp}$s. |
171 \item |
134 Its inductive type definition and |
172 We have a set of equalities for this new datatype that enables one to |
135 derivative and simplification operations are |
173 rewrite $\rderssimp{r_1 \cdot r_2}{s}$ |
136 almost identical to those of the annotated regular expressions, |
174 and $\rderssimp{r^*}{s}$ (the most involved |
137 except that no bitcodes are attached. |
175 inductive cases) |
138 \item |
176 by a combinatioin of derivatives of their |
139 We have a set of equalities for this new datatype that enables one to |
177 children regular expressions ($r_1$, $r_2$, and $r$, respectively), |
140 rewrite $\rderssimp{r_1 \cdot r_2}{s}$ |
178 which we call the \emph{Closed Forms} |
141 and $\rderssimp{r^*}{s}$ (the most involved |
179 of the derivatives. |
142 inductive cases) |
180 \item |
143 by a combinatioin of derivatives of their |
181 The Closed Forms of the regular expressions |
144 children regular expressions ($r_1$, $r_2$, and $r$, respectively), |
182 are controlled by terms that |
145 which we call the \emph{Closed Forms} |
183 are easier to deal with. |
146 of the derivatives. |
184 Using inductive hypothesis, these terms |
147 \item |
185 are in turn bounded loosely |
148 The Closed Forms of the regular expressions |
186 by a large yet constant number. |
149 are controlled by terms that |
|
150 are easier to deal with. |
|
151 Using inductive hypothesis, these terms |
|
152 are in turn bounded loosely |
|
153 by a large yet constant number. |
|
154 \end{itemize} |
187 \end{itemize} |
155 We give details of these steps in the next sections. |
188 We give details of these steps in the next sections. |
156 The first step is to use |
189 The first step is to use |
157 $\textit{rrexp}$s, |
190 $\textit{rrexp}$s, |
158 something simpler than |
191 something simpler than |
159 annotated regular expressions. |
192 annotated regular expressions. |
160 |
193 |
161 \section{the $\textit{rrexp}$ Datatype and Its Size Functions} |
194 \section{the $\textit{rrexp}$ Datatype and Its Size Functions} |
|
195 |
|
196 We first recap a bit about the new datatype |
|
197 we defined in \ref{rrexpDef}, |
|
198 called $\textit{rrexp}$s. |
|
199 We chose $\rrexp$ over |
|
200 the basic regular expressions |
|
201 because it is more straightforward to tweak |
|
202 into the shape we want |
|
203 compared with an annotated regular expression. |
162 We want to prove the size property about annotated regular expressions. |
204 We want to prove the size property about annotated regular expressions. |
163 The size is |
205 The size is |
164 written $\llbracket r\rrbracket$, whose intuitive definition is as below |
206 written $\llbracket r\rrbracket$, whose intuitive definition is as below |
165 \noindent |
207 \noindent |
166 We first note that $\llbracket \_ \rrbracket$ |
208 We first note that $\llbracket \_ \rrbracket$ |