89 \begin{document} |
89 \begin{document} |
90 |
90 |
91 \maketitle |
91 \maketitle |
92 |
92 |
93 \begin{abstract} |
93 \begin{abstract} |
94 Brzozowski introduced in 1964 a beautifully simple algorithm for |
94 Regular expressions, a useful concept in computer science |
95 regular expression matching based on the notion of derivatives of |
95 that was initially designed to match string patterns, |
96 regular expressions. In 2014, Sulzmann and Lu extended this |
96 need fast matching algorithms. |
97 algorithm to not just give a YES/NO answer for whether or not a |
97 If we do matching by naively converting |
98 regular expression matches a string, but in case it does also |
98 the regular expression to a |
99 answers with \emph{how} it matches the string. This is important for |
99 Nondeterministic-Finite-Automaton(NFA) and simulating it, |
100 applications such as lexing (tokenising a string). The problem is to |
100 they can run into trouble on certain evil regular expression and string |
101 make the algorithm by Sulzmann and Lu fast on all inputs without |
101 pairs. |
102 breaking its correctness. Being fast depends on a complete set of |
102 This necessitates the introudction of a |
103 simplification rules, some of which |
103 method called derivatives of regular expressions\cite{Brzozowski1964}. |
104 have been put forward by Sulzmann and Lu. We have extended their |
104 With some variation introduced by Sulzmann and Lu\cite{Sulzmann2014}, |
105 rules in order to obtain a tight bound on the size of regular expressions. |
105 it circumvents the problem of |
106 We have tested these extended rules, but have not |
106 catastrphoic backtracking elegantly when |
107 formally established their correctness. We have also not yet looked |
107 compared to backtracking NFA regex engines. |
108 at extended regular expressions, such as bounded repetitions, |
108 We believe that such a method can help to address |
109 negation and back-references. |
109 the status-quo where badly written code for |
|
110 regular expression matching can cause a software grief. |
|
111 We have come up with a set of simplification rules |
|
112 that makes the space and time requirements below a tight bound. |
|
113 We have proved certain properties of the algorithm |
|
114 and simplification. |
|
115 We have not yet worked out a proof for the correctness |
|
116 although test data suggested this. |
110 \end{abstract} |
117 \end{abstract} |
111 |
118 |
112 |
119 |
|
120 |
113 \section{Introduction} |
121 \section{Introduction} |
114 %Regular expressions' derivatives, which have received |
122 The following brief introduction gives the background for |
115 %renewed interest in the new millenium, is a beautiful.... |
123 derivatives--how they are originally defined and the |
116 While we believe derivatives of regular expressions, written |
124 variety\cite{Sulzmann2014} that is more suitable for lexing. |
117 $r\backslash s$, are a beautiful concept (in terms of ease of |
125 Regular expressions can be formalized as follows:\\ |
118 implementing them in functional programming languages and in terms of |
126 \begin{center} |
119 reasoning about them formally), they have one major drawback: every |
127 $r ::= \ZERO | \ONE | c | r_1 +r_2 | r_1 \cdot r_2 | r^*$ |
120 derivative step can make regular expressions grow drastically in |
128 \end{center} |
121 size. This in turn has negative effect on the runtime of the |
129 Derivatives of regular expression can be |
122 corresponding lexing algorithms. Consider for example the regular |
130 seen as the regular expression that corresponds to |
123 expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The |
131 the language derived by computing |
124 corresponding derivative contains already 8668 nodes where we assume |
132 the left quotient of the original regular language |
125 the derivative is given as a tree. The reason for the poor runtime of |
133 w.r.t a character. |
126 the derivative-based lexing algorithms is that they need to traverse |
134 Derivative of regular expression is also a |
127 such trees over and over again. The solution is to find a complete set |
135 regular expression because |
128 of simplification rules that keep the sizes of derivatives uniformly |
136 the left quotient of a regular language is also regular. |
129 small. |
137 With this we know such a function exist without having |
130 |
138 to worry that it might not be defined. |
131 This has been partially addressed by the function $\blexer_{simp}$, |
139 In fact, regular expressions derivatives |
132 which after the simplification the $(a+aa)^*$ example's 8000 nodes will be |
140 are not just defined but also simple and elegant. |
133 reduced to just 6 and stays constant in each derivative step. |
141 They are composed of two recursive functions, |
134 The part that still needs more work is the correctness proof of this |
142 and can be easily explained by intuition. |
135 function, namely, |
143 The function |
136 \begin{equation}\label{mainthm} |
144 $\nullable$ defined below, |
137 \blexers \; r \; s = \blexer \;r\;s |
145 tests whether the empty string belongs |
138 \end{equation} |
146 to the regular expression being checked. |
139 |
|
140 \noindent |
|
141 and this is what this report is mainly about. A condensed |
|
142 version of the last report will be provided in the next section |
|
143 to help the reader understand the report, and the attempts |
|
144 on the problem will follow. |
|
145 |
|
146 |
|
147 \section{Recapitulation of Concepts From the Last Report} |
|
148 |
|
149 \subsection*{Regular Expressions and Derivatives} |
|
150 Suppose (basic) regular expressions are given by the following grammar: |
|
151 |
|
152 \[ r ::= \ZERO \mid \ONE |
|
153 \mid c |
|
154 \mid r_1 \cdot r_2 |
|
155 \mid r_1 + r_2 |
|
156 \mid r^* |
|
157 \] |
|
158 |
|
159 \noindent |
|
160 The ingenious contribution of Brzozowski is the notion of \emph{derivatives} of |
|
161 regular expressions, written~$\_ \backslash \_$. It uses the auxiliary notion of |
|
162 $\nullable$ defined below. |
|
163 |
147 |
164 \begin{center} |
148 \begin{center} |
165 \begin{tabular}{lcl} |
149 \begin{tabular}{lcl} |
166 $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ |
150 $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ |
167 $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ |
151 $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ |
168 $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ |
152 $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ |
169 $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ |
153 $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ |
170 $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ |
154 $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ |
171 $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ |
155 $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ |
172 \end{tabular} |
156 \end{tabular} |
173 \end{center} |
157 \end{center} |
|
158 \noindent |
|
159 The empty set does not contain any string and |
|
160 therefore not the empty string, the empty string |
|
161 regular expression contains the empty string |
|
162 by definition, the character regular expression |
|
163 is the singleton that contains character only, |
|
164 and therefore does not contain the empty string, |
|
165 the alternative regular expression(or "or" expression) |
|
166 might have one of its children regular expressions |
|
167 being nullable and any one of its children being nullable |
|
168 would suffice. The sequence regular expression |
|
169 would require both children to have the empty string |
|
170 to compose an empty string and the Kleene star |
|
171 operation naturally introduced the empty string. |
|
172 All nice and easy. |
|
173 This definition is used in the derivative operation |
|
174 as a condition: |
174 |
175 |
175 \begin{center} |
176 \begin{center} |
176 \begin{tabular}{lcl} |
177 \begin{tabular}{lcl} |
177 $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ |
178 $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ |
178 $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ |
179 $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ |
184 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
185 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
185 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
186 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
186 \end{tabular} |
187 \end{tabular} |
187 \end{center} |
188 \end{center} |
188 \noindent |
189 \noindent |
189 And defines how a regular expression evolves into |
190 \noindent |
|
191 The function derivative, written $\backslash c$, |
|
192 defines how a regular expression evolves into |
190 a new regular expression after all the string it contains |
193 a new regular expression after all the string it contains |
191 is chopped off a certain head character $c$. |
194 is chopped off a certain head character $c$. |
192 |
195 The most involved cases are the sequence |
193 The main property of the derivative operation is that |
196 and star case. |
|
197 The sequence case says that if the first regular expression |
|
198 contains an empty string then second component of the sequence |
|
199 might be chosen as the target regular expression to be chopped |
|
200 off its head character. |
|
201 The star regular expression unwraps the iteration of |
|
202 regular expression and attack the star regular expression |
|
203 to its back again to make sure there are 0 or more iterations |
|
204 following this unfolded iteration. |
|
205 |
|
206 |
|
207 The main property of the derivative operation |
|
208 that enables us to reason about the correctness of |
|
209 an algorithm using derivatives is |
194 |
210 |
195 \begin{center} |
211 \begin{center} |
196 $c\!::\!s \in L(r)$ holds |
212 $c\!::\!s \in L(r)$ holds |
197 if and only if $s \in L(r\backslash c)$. |
213 if and only if $s \in L(r\backslash c)$. |
198 \end{center} |
214 \end{center} |
230 derivatives until we exhaust the string and then use \textit{nullable} |
246 derivatives until we exhaust the string and then use \textit{nullable} |
231 to test whether the result can match the empty string. It can be |
247 to test whether the result can match the empty string. It can be |
232 relatively easily shown that this matcher is correct (that is given |
248 relatively easily shown that this matcher is correct (that is given |
233 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$). |
249 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$). |
234 |
250 |
235 |
251 Beautiful and simple definition. |
|
252 |
|
253 If we implement the above algorithm naively, however, |
|
254 the algorithm can be excruciatingly slow. For example, when starting with the regular |
|
255 expression $(a + aa)^*$ and building 12 successive derivatives |
|
256 w.r.t.~the character $a$, one obtains a derivative regular expression |
|
257 with more than 8000 nodes (when viewed as a tree). Operations like |
|
258 $\backslash$ and $\nullable$ need to traverse such trees and |
|
259 consequently the bigger the size of the derivative the slower the |
|
260 algorithm. |
|
261 |
|
262 Brzozowski was quick in finding that during this process a lot useless |
|
263 $\ONE$s and $\ZERO$s are generated and therefore not optimal. |
|
264 He also introduced some "similarity rules" such |
|
265 as $P+(Q+R) = (P+Q)+R$ to merge syntactically |
|
266 different but language-equivalent sub-regexes to further decrease the size |
|
267 of the intermediate regexes. |
|
268 |
|
269 More simplifications are possible, such as deleting duplicates |
|
270 and opening up nested alternatives to trigger even more simplifications. |
|
271 And suppose we apply simplification after each derivative step, and compose |
|
272 these two operations together as an atomic one: $a \backslash_{simp}\,c \dn |
|
273 \textit{simp}(a \backslash c)$. Then we can build |
|
274 a matcher without having cumbersome regular expressions. |
|
275 |
|
276 |
|
277 If we want the size of derivatives in the algorithm to |
|
278 stay even lower, we would need more aggressive simplifications. |
|
279 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as |
|
280 deleting duplicates whenever possible. For example, the parentheses in |
|
281 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b |
|
282 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another |
|
283 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just |
|
284 $a^*+a+\ONE$. Adding these more aggressive simplification rules help us |
|
285 to achieve a very tight size bound, namely, |
|
286 the same size bound as that of the \emph{partial derivatives}. |
|
287 |
|
288 Building derivatives and then simplify them. |
|
289 So far so good. But what if we want to |
|
290 do lexing instead of just a YES/NO answer? |
|
291 This requires us to go back again to the world |
|
292 without simplification first for a moment. |
|
293 Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and |
|
294 elegant(arguably as beautiful as the original |
|
295 derivatives definition) solution for this. |
|
296 |
236 \subsection*{Values and the Lexing Algorithm by Sulzmann and Lu} |
297 \subsection*{Values and the Lexing Algorithm by Sulzmann and Lu} |
237 |
298 |
238 One limitation of Brzozowski's algorithm is that it only produces a |
299 |
239 YES/NO answer for whether a string is being matched by a regular |
300 They first defined the datatypes for storing the |
240 expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm |
301 lexing information called a \emph{value} or |
241 to allow generation of an actual matching, called a \emph{value} or |
|
242 sometimes also \emph{lexical value}. These values and regular |
302 sometimes also \emph{lexical value}. These values and regular |
243 expressions correspond to each other as illustrated in the following |
303 expressions correspond to each other as illustrated in the following |
244 table: |
304 table: |
245 |
|
246 |
305 |
247 \begin{center} |
306 \begin{center} |
248 \begin{tabular}{c@{\hspace{20mm}}c} |
307 \begin{tabular}{c@{\hspace{20mm}}c} |
249 \begin{tabular}{@{}rrl@{}} |
308 \begin{tabular}{@{}rrl@{}} |
250 \multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\ |
309 \multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\ |
269 \end{tabular} |
328 \end{tabular} |
270 \end{tabular} |
329 \end{tabular} |
271 \end{center} |
330 \end{center} |
272 |
331 |
273 \noindent |
332 \noindent |
|
333 One regular expression can have multiple lexical values. For example |
|
334 for the regular expression $(a+b)^*$, it has a infinite list of |
|
335 values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$, |
|
336 $\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$, |
|
337 $\ldots$, and vice versa. |
|
338 Even for the regular expression matching a certain string, there could |
|
339 still be more than one value corresponding to it. |
|
340 Take the example where $r= (a^*\cdot a^*)^*$ and the string |
|
341 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$. |
|
342 The number of different ways of matching |
|
343 without allowing any value under a star to be flattened |
|
344 to an empty string can be given by the following formula: |
|
345 \begin{center} |
|
346 $C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$ |
|
347 \end{center} |
|
348 and a closed form formula can be calculated to be |
|
349 \begin{equation} |
|
350 C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}} |
|
351 \end{equation} |
|
352 which is clearly in exponential order. |
|
353 A lexer aimed at getting all the possible values has an exponential |
|
354 worst case runtime. Therefore it is impractical to try to generate |
|
355 all possible matches in a run. In practice, we are usually |
|
356 interested about POSIX values, which by intuition always |
|
357 match the leftmost regular expression when there is a choice |
|
358 and always match a sub part as much as possible before proceeding |
|
359 to the next token. For example, the above example has the POSIX value |
|
360 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$. |
|
361 The output of an algorithm we want would be a POSIX matching |
|
362 encoded as a value. |
274 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
363 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
275 algorithm by a second phase (the first phase being building successive |
364 algorithm by a second phase (the first phase being building successive |
276 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value |
365 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value |
277 is generated in case the regular expression matches the string. |
366 is generated in case the regular expression matches the string. |
278 Pictorially, the Sulzmann and Lu algorithm is as follows: |
367 Pictorially, the Sulzmann and Lu algorithm is as follows: |
318 \noindent |
407 \noindent |
319 After the $\mkeps$-call, we inject back the characters one by one in order to build |
408 After the $\mkeps$-call, we inject back the characters one by one in order to build |
320 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$ |
409 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$ |
321 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. |
410 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. |
322 After injecting back $n$ characters, we get the lexical value for how $r_0$ |
411 After injecting back $n$ characters, we get the lexical value for how $r_0$ |
323 matches $s$. For this Sulzmann and Lu defined a function that reverses |
412 matches $s$. The POSIX value is maintained throught out the process. |
|
413 For this Sulzmann and Lu defined a function that reverses |
324 the ``chopping off'' of characters during the derivative phase. The |
414 the ``chopping off'' of characters during the derivative phase. The |
325 corresponding function is called \emph{injection}, written |
415 corresponding function is called \emph{injection}, written |
326 $\textit{inj}$; it takes three arguments: the first one is a regular |
416 $\textit{inj}$; it takes three arguments: the first one is a regular |
327 expression ${r_{i-1}}$, before the character is chopped off, the second |
417 expression ${r_{i-1}}$, before the character is chopped off, the second |
328 is a character ${c_{i-1}}$, the character we want to inject and the |
418 is a character ${c_{i-1}}$, the character we want to inject and the |
343 \end{tabular} |
433 \end{tabular} |
344 \end{center} |
434 \end{center} |
345 |
435 |
346 \noindent This definition is by recursion on the ``shape'' of regular |
436 \noindent This definition is by recursion on the ``shape'' of regular |
347 expressions and values. |
437 expressions and values. |
348 |
438 The clauses basically do one thing--identifying the ``holes'' on |
349 |
439 value to inject the character back into. |
350 \subsection*{Simplification of Regular Expressions} |
440 For instance, in the last clause for injecting back to a value |
351 |
441 that would turn into a new star value that corresponds to a star, |
352 The main drawback of building successive derivatives according |
442 we know it must be a sequence value. And we know that the first |
353 to Brzozowski's definition is that they can grow very quickly in size. |
443 value of that sequence corresponds to the child regex of the star |
354 This is mainly due to the fact that the derivative operation generates |
444 with the first character being chopped off--an iteration of the star |
355 often ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, if |
445 that had just been unfolded. This value is followed by the already |
356 implemented naively both algorithms by Brzozowski and by Sulzmann and Lu |
446 matched star iterations we collected before. So we inject the character |
357 are excruciatingly slow. For example when starting with the regular |
447 back to the first value and form a new value with this new iteration |
358 expression $(a + aa)^*$ and building 12 successive derivatives |
448 being added to the previous list of iterations, all under the $Stars$ |
359 w.r.t.~the character $a$, one obtains a derivative regular expression |
449 top level. |
360 with more than 8000 nodes (when viewed as a tree). Operations like |
450 |
361 $\textit{der}$ and $\nullable$ need to traverse such trees and |
451 We have mentioned before that derivatives without simplification |
362 consequently the bigger the size of the derivative the slower the |
452 can get clumsy, and this is true for values as well--they reflect |
363 algorithm. |
453 the regular expressions size by definition. |
364 |
454 |
365 Fortunately, one can simplify regular expressions after each derivative |
455 One can introduce simplification on the regex and values, but have to |
366 step. |
456 be careful in not breaking the correctness as the injection |
367 Various simplifications of regular expressions are possible, such |
457 function heavily relies on the structure of the regexes and values |
368 as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r |
458 being correct and match each other. |
369 \cdot \ONE$, and $r + r$ to just $r$. |
459 It can be achieved by recording some extra rectification functions |
370 Suppose we apply simplification after each derivative step, and compose |
460 during the derivatives step, and applying these rectifications in |
371 these two operations together as an atomic one: $a \backslash_{simp}\,c \dn |
461 each run during the injection phase. |
372 \textit{simp}(a \backslash c)$. Then we can build values without having |
462 And we can prove that the POSIX value of how |
373 a cumbersome regular expression, and fortunately if we are careful |
|
374 enough in making some extra rectifications, the POSIX value of how |
|
375 regular expressions match strings will not be affected---although is much harder |
463 regular expressions match strings will not be affected---although is much harder |
376 to establish. Some initial results in this regard have been |
464 to establish. Some initial results in this regard have been |
377 obtained in \cite{AusafDyckhoffUrban2016}. |
465 obtained in \cite{AusafDyckhoffUrban2016}. |
378 |
466 |
379 |
467 %Brzozowski, after giving the derivatives and simplification, |
380 If we want the size of derivatives in Sulzmann and Lu's algorithm to |
468 %did not explore lexing with simplification or he may well be |
381 stay even lower, we would need more aggressive simplifications. |
469 %stuck on an efficient simplificaiton with a proof. |
382 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as |
470 %He went on to explore the use of derivatives together with |
383 deleting duplicates whenever possible. For example, the parentheses in |
471 %automaton, and did not try lexing using derivatives. |
384 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b |
472 |
385 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another |
473 We want to get rid of complex and fragile rectification of values. |
386 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just |
474 Can we not create those intermediate values $v_1,\ldots v_n$, |
387 $a^*+a+\ONE$. Adding these more aggressive simplification rules help us |
475 and get the lexing information that should be already there while |
388 to achieve a very tight size bound, namely, |
476 doing derivatives in one pass, without a second phase of injection? |
389 the same size bound as that of the \emph{partial derivatives}. |
477 In the meantime, can we make sure that simplifications |
390 And we want to get rid of complex and fragile rectification of values. |
478 are easily handled without breaking the correctness of the algorithm? |
391 |
479 |
392 |
480 Sulzmann and Lu solved this problem by |
393 In order to implement the idea of ``spilling out alternatives'' and to |
481 introducing additional informtaion to the |
394 make them compatible with the $\textit{inj}$-mechanism, we use |
482 regular expressions called \emph{bitcodes}. |
395 \emph{bitcodes}. They were first introduced by Sulzmann and Lu. |
483 |
396 Here bits and bitcodes (lists of bits) are defined as: |
484 \subsection*{Bit-coded Algorithm} |
|
485 Bits and bitcodes (lists of bits) are defined as: |
397 |
486 |
398 \begin{center} |
487 \begin{center} |
399 $b ::= 1 \mid 0 \qquad |
488 $b ::= 1 \mid 0 \qquad |
400 bs ::= [] \mid b::bs |
489 bs ::= [] \mid b::bs |
401 $ |
490 $ |
672 \noindent |
761 \noindent |
673 In this definition $\_\backslash s$ is the generalisation of the derivative |
762 In this definition $\_\backslash s$ is the generalisation of the derivative |
674 operation from characters to strings (just like the derivatives for un-annotated |
763 operation from characters to strings (just like the derivatives for un-annotated |
675 regular expressions). |
764 regular expressions). |
676 |
765 |
|
766 Remember tha one of the important reasons we introduced bitcodes |
|
767 is that they can make simplification more structured and therefore guaranteeing |
|
768 the correctness. |
677 |
769 |
678 \subsection*{Our Simplification Rules} |
770 \subsection*{Our Simplification Rules} |
679 |
771 |
680 The main point of the bitcodes and annotated regular expressions is that |
772 In this section we introduce aggressive (in terms of size) simplification rules |
681 we can apply rather aggressive (in terms of size) simplification rules |
773 on annotated regular expressions |
682 in order to keep derivatives small. We have developed such |
774 in order to keep derivatives small. Such simplifications are promising |
683 ``aggressive'' simplification rules and generated test data that show |
775 as we have |
684 that the expected bound can be achieved. Obviously we could only |
776 generated test data that show |
|
777 that a good tight bound can be achieved. Obviously we could only |
685 partially cover the search space as there are infinitely many regular |
778 partially cover the search space as there are infinitely many regular |
686 expressions and strings. |
779 expressions and strings. |
687 |
780 |
688 One modification we introduced is to allow a list of annotated regular |
781 One modification we introduced is to allow a list of annotated regular |
689 expressions in the $\sum$ constructor. This allows us to not just |
782 expressions in the $\sum$ constructor. This allows us to not just |
1754 regular expression that is simplified at the final moment. |
1847 regular expression that is simplified at the final moment. |
1755 We are almost there, but a last step is needed to make the proof work. |
1848 We are almost there, but a last step is needed to make the proof work. |
1756 Hopefully in the next few weeks we will be able to find one. |
1849 Hopefully in the next few weeks we will be able to find one. |
1757 This would be an important milestone for my dissertation. |
1850 This would be an important milestone for my dissertation. |
1758 |
1851 |
|
1852 |
|
1853 \section{Future Plans} |
|
1854 |
|
1855 Before the proof comes out, a plan is needed to make |
|
1856 sure some progress is happening. |
|
1857 We plan to get |
|
1858 a practical implementation of our current |
|
1859 method, which would involve extending the set of allowed syntax |
|
1860 to include common regular expression constructs |
|
1861 such as bounded repitions, negation(complement) |
|
1862 and character range or even Boolean functions. |
|
1863 Adding these concise way of expressing regular expressions |
|
1864 while keeping the correctness of the simplification |
|
1865 makes the work more practical. |
|
1866 For example, using the added constructs we can |
|
1867 demonstrate that our implementation can actually |
|
1868 help the cloudflare web-application firewall |
|
1869 to run smoothly even if it had been fed with |
|
1870 evil regular expression |
|
1871 $.*(?:.*=.*)))$ |
|
1872 and strings like $x=xxxxxxxxxx....$. |
|
1873 This could help prove that our algorithm is not just |
|
1874 some nice theory work but is actually competent in reality |
|
1875 as this regular expression and string pair made the cloudflare |
|
1876 servers to grind |
|
1877 to a halt on July 2nd, 2019. |
|
1878 After all, simply the "basic" regular expressions |
|
1879 we gave in the introduction is really so basic, that |
|
1880 even Brzozowski's 1964 paper has included more |
|
1881 varieties than that(Boolean functions instead of |
|
1882 just alternatives)\cite{Brzozowski1964}. |
|
1883 A function only implementing the basic ones |
|
1884 would lack practical interests. |
|
1885 Making the algorithm practical also means to make it fast. |
|
1886 Right now the Scala implementation |
|
1887 is slower than the naive algorithm without bitcodes, even with those |
|
1888 drastic simplification measures, which is not suggested by theory. |
|
1889 We might need to implement it in other languages such as C |
|
1890 to see if the problem is language specific. |
|
1891 This is about 3 months' work. |
|
1892 |
|
1893 We believe that there |
|
1894 is still potential for more simplification as there are a lot |
|
1895 of rules for regular expression similarity, however, |
|
1896 this needs to be explored and checked. |
|
1897 Looking for maximal simplification and |
|
1898 best size bound and proving them would take around 3 months to complete. |
|
1899 |
|
1900 Context-free languages are a bit harder to deal with by derivatives |
|
1901 when compared with regular expressions. |
|
1902 The main reason is that the child node of a context grammar |
|
1903 can be the grammar itself--this left-recursion makes |
|
1904 the computation procedure an impossibility when trying to solve directly. |
|
1905 An example would be the following\cite{might2011cfgder}: |
|
1906 \begin{center} |
|
1907 $L = L \cdot x | \ONE$ |
|
1908 \end{center} |
|
1909 then a derivative would give us the following: |
|
1910 \begin{center} |
|
1911 $L\backslash x = (L \backslash x)\cdot x | \ONE$ |
|
1912 \end{center} |
|
1913 which is essentially what we have previously. |
|
1914 This recursion can go on forever if we do not use some |
|
1915 other methods such as introducing methods |
|
1916 or manually decide that the two equations |
|
1917 are actually equivalent and $L$ and $L\backslash x$ |
|
1918 denote the same language. And then give a mathematical proof |
|
1919 that this is actually the case. |
|
1920 This seems well beyond any mechanical algorithm one can |
|
1921 expect to be able to do..... |
|
1922 People have tried various ways to deal with the problems |
|
1923 of looping when computing derivatives of context-free grammars |
|
1924 such as memoization and laziness, with some degree of success\cite{might2011cfgder}. |
|
1925 However this field seems largely unexplored and further |
|
1926 optimizations seem possible. It would be great |
|
1927 if we could find a simple augmentation to the current |
|
1928 derivative method so that the subset of context free language |
|
1929 of interest can be handled. |
|
1930 This may take longer than the two previous tasks, but we still |
|
1931 give it 3 months time and see how it goes. |
|
1932 |
|
1933 |
|
1934 |
|
1935 |
|
1936 |
|
1937 |
1759 \bibliographystyle{plain} |
1938 \bibliographystyle{plain} |
1760 \bibliography{root} |
1939 \bibliography{root} |
1761 |
1940 |
1762 |
1941 |
1763 \end{document} |
1942 \end{document} |