Brzozowski introduced in 1964 a beautifully simple algorithm for
regular expression matching based on the notion of derivatives of
regular expressions. In 2014, Sulzmann and Lu extended this
algorithm to not just give a YES/NO answer for whether or not a
regular expression matches a string, but in case it matches also
answers with \emph{how} it matches the string. This is important for
applications such as lexing (tokenising a string). The problem is to
make the algorithm by Sulzmann and Lu fast on all inputs without
breaking its correctness. We have already developed some
simplification rules for this, but have not yet proved that they
preserve the correctness of the algorithm. We also have not yet
lexing. Given the maturity of this topic, the reader might wonder:
Surely, regular expressions must have already been studied to death?
What could possibly be \emph{not} known in this area? And surely all
implemented algorithms for regular expression matching are blindingly
fast?

Unfortunately these preconceptions are not supported by evidence: Take
for example the regular expression $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously they do not match---the expected $b$ in the last
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
\end{center}

\noindent These are clearly abysmal and possibly surprising results. One
would expect these systems to do much better than that---after all,
given a DFA and a string, deciding whether the string is matched by this
DFA should take time linear in the length of the string.
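\par\medskip\noindent
To get a feel for this ``exponential behaviour'', the blow-up can be
reproduced with a few lines of Scala that drive Java's backtracking
\texttt{java.util.regex} engine (a small sketch, not part of the
measurements above; the names are ours and the timings will of course
vary between machines):

\begin{verbatim}
import java.util.regex.Pattern

object EvilRegexDemo {
  // the "evil" regular expression (a*)*b from above
  val evil = Pattern.compile("(a*)*b")

  def main(args: Array[String]): Unit = {
    // strings of the form aa..a (without a final b): the match must
    // fail, but a backtracking engine explores exponentially many ways
    // of splitting the a's between the two stars before giving up, so
    // the time roughly doubles with each additional a
    for (n <- 20 to 28) {
      val s = "a" * n
      val start = System.nanoTime()
      evil.matcher(s).matches()
      val secs = (System.nanoTime() - start) / 1e9
      println(f"n = $n%2d   time = $secs%.3f s")
    }
  }
}
\end{verbatim}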
Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
exhibit this ``exponential behaviour''. Unfortunately, such regular
expressions are not just a few ``outliers'', but actually they are
frequent enough that a separate name has been created for
them---\emph{evil regular expressions}. In empirical work, Davis et al.\
report that they have found thousands of such evil regular expressions
in the JavaScript and Python ecosystems \cite{Davis18}.

This exponential blowup in matching algorithms sometimes causes
considerable grief in real life: for example on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources and because of this the web servers
ground to a halt. This happened when a post with 20,000 white spaces was
submitted, but importantly the white spaces were neither at the
beginning nor at the end. As a result, the regular expression matching
engine needed to backtrack over many choices. The underlying problem is
that many ``real life'' regular expression matching engines do not use
DFAs for matching. This is because they support regular expressions that
are not covered by classical automata theory, and in this more
general setting there are quite a few research questions still
unanswered and fast algorithms still need to be developed (for example
how to include bounded repetitions, negation and back-references).

There is also another under-researched problem to do with regular
expressions and lexing, i.e.~the process of breaking up strings into
sequences of tokens according to some regular expressions. In this
setting one is not just interested in whether or not a regular
The intended meaning of the constructors is as follows: $\ZERO$
cannot match any string, $\ONE$ can match the empty string, the
character regular expression $c$ can match the character $c$, and so
on.
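\par\medskip\noindent
As a concrete (and purely illustrative) data representation, the
constructors can be rendered as a small Scala datatype; the names below
are our own and merely mirror the grammar above:

\begin{verbatim}
// a sketch of the regular expression datatype
sealed abstract class Rexp
case object ZERO extends Rexp                    // matches no string
case object ONE extends Rexp                     // matches the empty string
case class CHAR(c: Char) extends Rexp            // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // alternative   r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // sequence      r1 . r2
case class STAR(r: Rexp) extends Rexp            // star          r*
\end{verbatim}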
The ingenious contribution by Brzozowski is the notion of
\emph{derivatives} of regular expressions. The idea behind this
notion is as follows: suppose a regular expression $r$ can match a
string of the form $c\!::\! s$ (that is a list of characters starting
with $c$), what does the regular expression look like that can match
just $s$? Brzozowski gave a neat answer to this question. He started
\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
relatively easily shown that this matcher is correct (that is, given
an $s$ and an $r$, it generates YES if and only if $s \in L(r)$).
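\par\medskip\noindent
For illustration, the matcher just described can be sketched in Scala as
follows, building on the hypothetical \texttt{Rexp} datatype from the
earlier sketch (the clauses for \textit{nullable} and the derivative
function are the standard ones by Brzozowski):

\begin{verbatim}
// nullable(r): can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

// der(c, r): the derivative of r with respect to the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}

// ders(s, r): the derivative with respect to a whole string
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => ders(cs, der(c, r))
}

// the matcher: r matches s iff the derivative of r by s is nullable
def matches(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
\end{verbatim}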
\section{Values and the Algorithm by Sulzmann and Lu}

One limitation, however, of Brzozowski's algorithm is that it only
\end{tabular}
\end{tabular}
\end{center}

\noindent
There is no value corresponding to $\ZERO$; $\Empty$ corresponds to
$\ONE$; $\Seq$ to the sequence regular expression and so on. The idea of
values is to encode parse trees. To see this, suppose there is a
\emph{flatten} operation, written $|v|$, which we can use to extract the
underlying string of a value $v$. For example, $|\mathit{Seq} \,
(\textit{Char x}) \, (\textit{Char y})|$ is the string $xy$. We omit the
straightforward definition of flatten. Using flatten, we can describe how
values encode parse trees: $\Seq\,v_1\, v_2$ encodes how the string
$|v_1| @ |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches the
substring $|v_1|$ and, respectively, $r_2$ matches the substring $|v_2|$.
Exactly how these two are matched is contained in the sub-structure of
$v_1$ and $v_2$.
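\par\medskip\noindent
Again purely as an illustration, the values and the flatten operation
could be rendered in Scala as follows (the constructor names are our own;
\texttt{Chr} and \texttt{Sequ} are used instead of \texttt{Char} and
\texttt{Seq} only to avoid clashes with Scala built-ins):

\begin{verbatim}
// a sketch of the values datatype
sealed abstract class Val
case object Empty extends Val                    // for ONE
case class Chr(c: Char) extends Val              // for CHAR(c)
case class Sequ(v1: Val, v2: Val) extends Val    // for SEQ: Seq v1 v2
case class Left(v: Val) extends Val              // left alternative taken
case class Right(v: Val) extends Val             // right alternative taken
case class Stars(vs: List[Val]) extends Val      // the star iterations

// flatten, written |v| above: the underlying string of a value
def flatten(v: Val): String = v match {
  case Empty        => ""
  case Chr(c)       => c.toString
  case Sequ(v1, v2) => flatten(v1) + flatten(v2)
  case Left(v1)     => flatten(v1)
  case Right(v1)    => flatten(v1)
  case Stars(vs)    => vs.map(flatten).mkString
}
\end{verbatim}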
To give a concrete example of how values work, consider the string $xy$
and the regular expression $(x + (y + xy))^*$. We can view this regular
expression as a tree and if the string $xy$ is matched by two Star
``iterations'', then the $x$ is matched by the left-most alternative in
this tree and the $y$ by the left alternative of the right alternative.
This suggests recording this matching as
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{center}

\noindent
We shall briefly explain this algorithm. For the convenience of
explanation, we use the following notation: the regular expression we
start with is $r_0$, and the given string $s$ is composed of characters
$c_0 c_1 \ldots c_n$. First, we build the derivatives $r_1$, $r_2$,
\ldots, using the characters $c_0$, $c_1$, \ldots, until we exhaust the
string and arrive at the derivative $r_n$. We test whether this
derivative is $\textit{nullable}$ or not. If not, we know the string
does not match $r_0$ and no value needs to be generated. If yes, we
start building the parse tree incrementally by \emph{injecting} back the
characters into the values $v_n, \ldots, v_0$. For this we first call
the function $\textit{mkeps}$, which builds the parse tree for how the
empty string is matched by the (nullable) regular expression $r_n$. This
function is defined as

\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
the parse tree $v_i$ for how the regex $r_i$ matches the string
$s_i$ ($s_i = c_i \ldots c_n$) from the previous parse tree $v_{i+1}$.
After injecting back $n$ characters, we get the parse tree for how $r_0$
matches $s$, exactly as we wanted. A correctness proof using induction
can be routinely established.
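\par\medskip\noindent
To make the injection phase concrete, here is a sketch of
$\textit{mkeps}$, $\textit{inj}$ and the resulting lexer in Scala,
following the standard presentation of Sulzmann and Lu's algorithm and
reusing the hypothetical \texttt{Rexp} and \texttt{Val} datatypes and the
\texttt{der} function from the earlier sketches:

\begin{verbatim}
// mkeps: a value for how a nullable regex matches the empty string
// (only nullable regexes are passed in, so ZERO/CHAR need no clause)
def mkeps(r: Rexp): Val = r match {
  case ONE         => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_)     => Stars(Nil)
}

// inj(r, c, v): turn a value v for der(c, r) into a value for r by
// injecting the character c back into the parse tree
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(_), Empty) => Chr(c)
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
}

// the lexer: build derivatives, then inject the characters back
def lex(r: Rexp, s: List[Char]): Val = s match {
  case Nil =>
    if (nullable(r)) mkeps(r) else sys.error("string not matched")
  case c :: cs => inj(r, c, lex(der(c, r), cs))
}
\end{verbatim}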
It is instructive to see how this algorithm works by way of a little
example. Suppose we have the regular expression $(a+b+ab+c+abc)^*$ and we
want to match it against the string $abc$ (when $abc$ is written as a
regular expression, the most standard way of expressing it would be
$a \cdot (b \cdot c)$; we omit the parentheses and dots here for
readability). By the POSIX rules the lexer should go for the longest
match, i.e.~it should match the string $abc$ in one star iteration, using
the longest string $abc$ in the sub-expression $a+b+ab+c+abc$ (we use $r$
to denote this sub-expression for conciseness). Here is how the lexer
obtains a parse tree for this matching. First, we build successive
derivatives until we exhaust the string, as illustrated below (we
simplify some regular expressions, such as $0 \cdot b$, to $0$ for
conciseness; similarly, we allow $\textit{ALT}$ to take a list of regular
expressions as arguments instead of just two operands in order to reduce
the nesting depth of $\textit{ALT}$):

\[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^* \xrightarrow{\backslash b}\]
\[ r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* + (0+1+0+0+0) \cdot r^* \xrightarrow{\backslash c}\]
\[ r_3 = ((0+0+0+0+1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0+1+0) \cdot r^*) + ((0+1+0+0+0) \cdot r^* + (0+0+0+1+0) \cdot r^*)
\]
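\par\medskip\noindent
As a sanity check, this running example can also be replayed with the
hypothetical Scala sketches from above (the constructors and the
\texttt{lex} function are the ones from our earlier sketches, not part of
the formal development):

\begin{verbatim}
// the sub-expression r = a + b + ab + c + abc (nested binary ALTs)
val a = CHAR('a'); val b = CHAR('b'); val c = CHAR('c')
val r = ALT(a, ALT(b, ALT(SEQ(a, b), ALT(c, SEQ(a, SEQ(b, c))))))

// lex (a + b + ab + c + abc)* against the string abc
val v = lex(STAR(r), "abc".toList)
println(flatten(v))   // prints "abc": the whole string is matched
\end{verbatim}

\noindent
The resulting value encodes a single star iteration that uses the $abc$
alternative, in line with the POSIX longest-match rule described above.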
\noindent
We would like to settle the correctness claim. It is relatively
straightforward to establish that after one simplification step, the part
of the derivative that corresponds to a POSIX value remains intact and
can still be collected; in other words,
$\textit{bmkeps}\; r \;=\; \textit{bmkeps}\; (\textit{simp}(r))$,
as this basically comes down to proving that actions like removing the
duplicate $r$ in $r+r$ do not delete important POSIX information in a
regular expression. The hard core of the problem is to prove that
$\textit{bmkeps}\; (\textit{bders}\; r) \;=\; \textit{bmkeps}\; (\textit{bders}\; (\textit{simp}(r)))$.
That is, if we build derivatives of a regular expression $r$ and of its
simplified version $\textit{simp}(r)$, they still give the same POSIX
value, if there is one. This is not as straightforward as the previous
proposition, as the two regular expressions $r$ and $\textit{simp}(r)$
might become very different regular expressions after repeated
applications of $\textit{simp}$ and derivatives. The crucial point is to
find the ``gene'' of a regular expression and how it is kept intact
during simplification. To aid this, we use the helper function
$\textit{retrieve}$ described by Sulzmann and Lu:
\\definition of retrieve\\
This function assembles the bitcode that corresponds to a parse tree for
how the current derivative matches the suffix of the string (the
characters that have not yet been processed, but are stored in the
value). Sulzmann and Lu used this to connect the bit-coded algorithm to
the older algorithm by the following equation:\\
$\textit{inj} \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash_{simp} c) \; v)$\\
A little fact that needs to be stated to aid comprehension:\\
$r^\uparrow = a$ (here $a$ stands for \emph{annotated}).\\
Fahad and Christian also used this fact to prove the correctness of the
bit-coded algorithm without simplification. Our purpose in using this,
however, is to try to establish\\
$\textit{retrieve} \; a \; v \;=\; \textit{retrieve} \; \textit{simp}(a) \; v'.$\\
The idea is that, using $v'$, a simplified version of $v$ that possibly
has gone through the same simplification steps as $\textit{simp}(a)$, we
are still able to extract the bit-sequence that gives the same parsing
information as the unsimplified one. After establishing this, we might be
able to finally bridge the gap and prove\\
$\textit{retrieve} \; (r \backslash s) \; v \;=\; \textit{retrieve} \; (\textit{simp}(r) \backslash s) \; v'$\\
and subsequently\\
$\textit{retrieve} \; (r \backslash s) \; v \;=\; \textit{retrieve} \; (r \backslash_{simp} s) \; v'.$\\
This would show that our simplified version of the regular expression
still contains all the bitcodes needed.

The second task is to speed up the more aggressive simplification.
Currently it is slower than the naive simplification (the naive version
as implemented in ADU can, of course, explode in some cases). So it needs
to be explored how to make it faster. One possibility would be to explore
again the connection to DFAs. This is very much work in progress.
\section{Conclusion}

In this PhD-project we are interested in fast algorithms for regular