5 \label{Finite} |
5 \label{Finite} |
6 % In Chapter 4 \ref{Chapter4} we give the second guarantee |
6 % In Chapter 4 \ref{Chapter4} we give the second guarantee |
7 %of our bitcoded algorithm, that is a finite bound on the size of any |
7 %of our bitcoded algorithm, that is a finite bound on the size of any |
8 %regex's derivatives. |
8 %regex's derivatives. |
9 |
9 |
10 In this chapter we give a guarantee in terms of time complexity: |
10 In this chapter we give a guarantee in terms of size: |
11 given a regular expression $r$, for any string $s$ |
11 given an annotated regular expression $a$, for any string $s$ |
12 our algorithm's internal data structure is finitely bounded. |
12 our algorithm's internal data structure |
13 Note that it is not immediately obvious that $\llbracket \bderssimp{r}{s} \rrbracket$ (the internal |
13 $\bderssimp{a}{s}$'s size |
14 data structure used in our $\blexer$) |
14 \begin{center} |
15 is bounded by a constant $N_r$, where $N$ only depends on the regular expression |
15 $\llbracket \bderssimp{a}{s} \rrbracket$ |
16 $r$, not the string $s$. |
16 \end{center} |
17 When doing a time complexity analysis of any |
17 \noindent |
18 lexer/parser based on Brzozowski derivatives, one needs to take into account that |
18 is finitely bounded |
19 not only the "derivative steps". |
19 by a constant $N_a$ that only depends on $a$. |
20 |
20 |
21 %TODO: get a grpah for internal data structure growing arbitrary large |
21 \section{Formalising About Size} |
22 |
22 In our lexer $\blexersimp$, |
23 |
23 The regular expression is repeatedly being taken derivative of |
24 To obtain such a proof, we need to |
24 and then simplified. |
|
25 |
|
26 \begin{center} |
|
27 \begin{tikzpicture}[scale=2,node distance=1.4cm, |
|
28 every node/.style={minimum size=9mm}] |
|
29 \node (r0) {$r$}; |
|
30 \node (r1) [rectangle, draw=black, thick, right=of r0]{$r_1$}; |
|
31 \draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$}; |
|
32 \node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$r_{1s}$}; |
|
33 \draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {simp}; |
|
34 \node (r2) [rectangle, draw=black, thick, right=of r1s]{$r_2$}; |
|
35 \draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$}; |
|
36 \node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=7mm]{$r_{2s}$}; |
|
37 \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$simp$}; |
|
38 \node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=7mm]{$r_{ns}$}; |
|
39 \draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$}; |
|
40 \node (v) [circle, draw = blue, thick, right=of rns,minimum size=7mm]{$v$}; |
|
41 \draw[->, line width=0.2mm](rns)--(v) node[above, midway] {collect+decode}; |
|
42 \end{tikzpicture} |
|
43 \end{center} |
|
44 \noindent |
|
45 As illustrated in the picture, |
|
46 each time the derivative is taken derivative of, it grows, |
|
47 and when it is being simplified, it shrinks. |
|
48 The blue ones are the regular expressions after simplification, |
|
49 which would be smaller than before. |
|
50 We give a mechanised proof that after simplification |
|
51 the regular expression's size (the blue ones) |
|
52 $\bderssimp{a}{s}$ will never exceed a constant $N_a$ for |
|
53 a fixed $a$. |
|
54 While it is not yet a direct formalisation of our lexer's complexity, |
|
55 it is a stepping stone towards a complexity proof because |
|
56 if the data structures out lexer has to traverse is large, the program |
|
57 will certainly be slow. |
|
58 |
|
59 The bound is not yet tight, and we seek to improve $N_a$ so that |
|
60 it is polynomial on $\llbracket a \rrbracket$. |
|
61 We believe a formalisation of the complexity-related properties |
|
62 of the algorithm is important in our context, because we want to address |
|
63 catastrophic backtracking, which is not a correctness issue but |
|
64 in essence a performance issue, a formal proof can give |
|
65 the strongest assurance that such issues cannot arise |
|
66 regardless of what the input is. |
|
67 This level of certainty cannot come from a pencil and paper proof or |
|
68 eimpirical evidence. |
|
69 |
|
70 For example, Sulzmann and Lu's made claimed that their algorithm |
|
71 with bitcodes and simplifications can lex in linear time with respect |
|
72 to the input string. This assumes that each |
|
73 derivative operation takes constant time. |
|
74 However, it turns out that on certain cases their lexer |
|
75 will have an indefinite size explosion, causing the running time |
|
76 of each derivative step to grow arbitrarily large. |
|
77 Here in our proof we state that such explosions cannot happen |
|
78 with our simplification function. |
|
79 |
|
80 Here is a bird's eye view of how the proof of finiteness works: |
|
81 \begin{center} |
|
82 \begin{tikzpicture}[scale=1,font=\bf, |
|
83 node/.style={ |
|
84 rectangle,rounded corners=3mm, |
|
85 ultra thick,draw=black!50,minimum height=18mm, |
|
86 minimum width=20mm, |
|
87 top color=white,bottom color=black!20}] |
|
88 |
|
89 |
|
90 \node (0) at (-5,0) [node, text width=1.8cm, text centered] {$\llbracket \bderssimp{a}{s} \rrbracket$}; |
|
91 \node (A) at (0,0) [node,text width=1.6cm, text centered] {$\llbracket \rderssimp{r}{s} \rrbracket_r$}; |
|
92 \node (B) at (3,0) [node,text width=3.0cm, anchor=west, minimum width = 40mm] {$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$}; |
|
93 \node (C) at (9.5,0) [node, minimum width=10mm] {$N_r$}; |
|
94 |
|
95 \draw [->,line width=0.5mm] (0) -- node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$r = a \downarrow_r$} (A); |
|
96 \draw [->,line width=0.5mm] (A) -- node [above,pos=0.35] {$\quad =\ldots=$} (B); |
|
97 \draw [->,line width=0.5mm] (B) -- node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); |
|
98 \end{tikzpicture} |
|
99 \end{center} |
|
100 \noindent |
|
101 We explain the above steps one by one: |
25 \begin{itemize} |
102 \begin{itemize} |
26 \item |
103 \item |
27 Define an new datatype for regular expressions that makes it easy |
104 We first use a new datatype, called $\textit{rrexp}$s, whose |
28 to reason about the size of an annotated regular expression. |
105 inductive type definition and derivative and simplification operations are |
|
106 almost identical to those of the annotated regular expressions, |
|
107 except that no bitcodes are attached. |
|
108 This new datatype is more straightforward to tweak |
|
109 compared with an annotated regular expression. |
29 \item |
110 \item |
30 A set of equalities for this new datatype that enables one to |
111 We have a set of equalities for this new datatype that enables one to |
31 rewrite $\bderssimp{r_1 \cdot r_2}{s}$ and $\bderssimp{r^*}{s}$ etc. |
112 rewrite $\rderssimp{r_1 \cdot r_2}{s}$ and $\rderssimp{r^*}{s}$ etc. |
32 by their children regexes $r_1$, $r_2$, and $r$. |
113 by a combinatioin of derivatives of their |
|
114 children regular expressions ($r_1$, $r_2$, and $r$, respectively). |
|
115 These equalities are chained together to get into a shape |
|
116 that is very easy to estimate, which we call the \emph{Closed Forms} |
|
117 of the derivatives. |
33 \item |
118 \item |
34 Using those equalities to actually get those rewriting equations, which we call |
119 This closed form is controlled by terms that |
35 "closed forms". |
120 are easier to deal with, wich are in turn bounded loosely |
36 \item |
121 by a large yet constant number. |
37 Bound the closed forms, thereby bounding the original |
|
38 $\blexersimp$'s internal data structures. |
|
39 \end{itemize} |
122 \end{itemize} |
|
123 We give details of these steps in the next sections. |
|
124 The first step is to use something simpler than annotated regular expressions. |
40 |
125 |
41 \section{the $\mathbf{r}$-rexp datatype and the size functions} |
126 \section{the $\mathbf{r}$-rexp datatype and the size functions} |
42 |
127 We want to prove the size property about annotated regular expressions. |
43 We have a size function for bitcoded regular expressions, written |
128 The size is |
44 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree |
129 written $\llbracket r\rrbracket$, whose intuitive definition is as below |
45 |
|
46 \begin{center} |
130 \begin{center} |
47 \begin{tabular}{ccc} |
131 \begin{tabular}{ccc} |
48 $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ |
132 $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ |
49 $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ |
133 $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ |
50 $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ |
134 $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ |
51 $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ |
135 $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ |
52 $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ |
136 $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ |
53 $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$ |
137 $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$ |
54 \end{tabular} |
138 \end{tabular} |
55 \end{center} |
139 \end{center} |
56 |
140 \noindent |
57 \noindent |
141 We first note that $\llbracket \_ \rrbracket$ |
|
142 is unaware of bitcodes, since |
|
143 it only counts the number of nodes |
|
144 if we regard $r$ as a tree. |
|
145 This suggests that if we define a new datatype that is similar to plain $\rexp$s, |
|
146 \[ \rrexp ::= \RZERO \mid \RONE |
|
147 \mid \RCHAR{c} |
|
148 \mid \RSEQ{r_1}{r_2} |
|
149 \mid \RALTS{rs} |
|
150 \mid \RSTAR{r} |
|
151 \] |
|
152 and define |
|
153 \begin{center} |
|
154 \begin{tabular}{lcl} |
|
155 $\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\ |
|
156 $\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\ |
|
157 $\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\ |
|
158 $\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\ |
|
159 $\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\ |
|
160 $\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$ |
|
161 \end{tabular} |
|
162 \end{center} |
|
163 \noindent |
|
164 we could calculate the size of annotated regular expressions in terms of |
|
165 its un-annotated counterpart: |
|
166 \begin{lemma} |
|
167 $\rsize{\rerase a} = \asize a$ |
|
168 \end{lemma} |
|
169 \begin{proof} |
|
170 By routine induction on the structure of $a$. |
|
171 \end{proof} |
|
172 \noindent |
|
173 We call them \emph{r}-regular expressions: |
|
174 we use |
|
175 $\rrexp$ as its type name, so as to make a distinction |
|
176 with plain regular expressions. |
|
177 In $\rrexp$ we throw away the bitcodes on the annotated regular expressions, |
|
178 but keep everything else intact. |
|
179 To transform an annotated regular expression into an r-regular expression, |
|
180 we use the function \emph{rerase}, written $\downarrow_r$. |
|
181 The $r$ in the subscript of $\downarrow$ is to |
|
182 differentiate with the $\downarrow$ for the $\erase$ operation. |
|
183 Before we introduce more functions related to r-regular expressions, |
|
184 we first give out the reason why we take all the trouble |
|
185 defining a new datatype in the first place. |
|
186 \subsection{Motivation Behind a New Datatype} |
|
187 The main difference between a plain regular expression |
|
188 and an r-regular expression is that it can take |
|
189 non-binary arguments for its alternative constructor. |
|
190 This turned out to be necessary if we want our proofs to be |
|
191 simple. |
|
192 We initially started by using |
|
193 plain regular expressions and tried to prove |
|
194 equalities like |
|
195 \begin{center} |
|
196 $\llbracket a \rrbracket = \llbracket a\downarrow \rrbracket_p$ |
|
197 \end{center} |
|
198 and |
|
199 \[ |
|
200 \llbracket a \backslash_{bsimps} s \rrbracket = |
|
201 \llbracket a \downarrow \backslash_s s |
|
202 \] |
|
203 One might be able to prove that |
|
204 $\llbracket a \downarrow \rrbracket_p \leq \llbracket a \rrbracket$. |
|
205 $\rrexp$ give the exact correspondence between an annotated regular expression |
|
206 and its (r-)erased version: |
|
207 |
|
208 This does not hold for plain $\rexp$s. |
|
209 |
|
210 Of course, the bits which encode the lexing information would grow linearly with respect |
|
211 to the input, which should be taken into account when we wish to tackle the runtime comlexity. |
|
212 But at the current stage |
|
213 we can safely ignore them. |
58 Similarly there is a size function for plain regular expressions: |
214 Similarly there is a size function for plain regular expressions: |
59 \begin{center} |
215 \begin{center} |
60 \begin{tabular}{ccc} |
216 \begin{tabular}{ccc} |
61 $\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\ |
217 $\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\ |
62 $\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\ |
218 $\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\ |