68 \begin{abstract} |
68 \begin{abstract} |
69 Brzozowski introduced in 1964 a beautifully simple algorithm for |
69 Brzozowski introduced in 1964 a beautifully simple algorithm for |
70 regular expression matching based on the notion of derivatives of |
70 regular expression matching based on the notion of derivatives of |
71 regular expressions. In 2014, Sulzmann and Lu extended this |
71 regular expressions. In 2014, Sulzmann and Lu extended this |
72 algorithm to not just give a YES/NO answer for whether or not a |
72 algorithm to not just give a YES/NO answer for whether or not a |
73 regular expression matches a string, but in case it matches also |
73 regular expression matches a string, but in case it does also |
74 answers with \emph{how} it matches the string. This is important for |
74 answers with \emph{how} it matches the string. This is important for |
75 applications such as lexing (tokenising a string). The problem is to |
75 applications such as lexing (tokenising a string). The problem is to |
76 make the algorithm by Sulzmann and Lu fast on all inputs without |
76 make the algorithm by Sulzmann and Lu fast on all inputs without |
77 breaking its correctness. We have already developed some |
77 breaking its correctness. We have already developed some |
78 simplification rules for this, but have not yet proved that they |
78 simplification rules for this, but have not yet proved that they |
91 fast? |
91 fast? |
92 |
92 |
93 Unfortunately these preconceptions are not supported by evidence: Take |
93 Unfortunately these preconceptions are not supported by evidence: Take |
94 for example the regular expression $(a^*)^*\,b$ and ask whether |
94 for example the regular expression $(a^*)^*\,b$ and ask whether |
95 strings of the form $aa..a$ match this regular |
95 strings of the form $aa..a$ match this regular |
96 expression. Obviously they do not match---the expected $b$ in the last |
96 expression. Obviously not---the expected $b$ in the last |
97 position is missing. One would expect that modern regular expression |
97 position is missing. One would expect that modern regular expression |
98 matching engines can find this out very quickly. Alas, if one tries |
98 matching engines can find this out very quickly. Alas, if one tries |
99 this example in JavaScript, Python or Java 8 with strings like 28 |
99 this example in JavaScript, Python or Java 8 with strings like 28 |
100 $a$'s, one discovers that this decision takes around 30 seconds and |
100 $a$'s, one discovers that this decision takes around 30 seconds and |
101 takes considerably longer when adding a few more $a$'s, as the graphs |
101 takes considerably longer when adding a few more $a$'s, as the graphs |
169 of the form $\underbrace{aa..a}_{n}$.} |
169 of the form $\underbrace{aa..a}_{n}$.} |
170 \end{tabular} |
170 \end{tabular} |
171 \end{center} |
171 \end{center} |
172 |
172 |
173 \noindent These are clearly abysmal and possibly surprising results. One |
173 \noindent These are clearly abysmal and possibly surprising results. One |
174 would expect these systems doing much better than that---after all, |
174 would expect these systems to do much better than that---after all, |
175 given a DFA and a string, deciding whether a string is matched by this |
175 given a DFA and a string, deciding whether a string is matched by this |
176 DFA should be linear. |
176 DFA should be linear. |
177 |
177 |
178 Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to |
178 Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to |
179 exhibit this ``exponential behaviour''. Unfortunately, such regular |
179 exhibit this exponential behaviour. Unfortunately, such regular |
180 expressions are not just a few ``outliers'', but actually they are |
180 expressions are not just a few outliers. They are actually |
181 frequent enough that a separate name has been created for |
181 frequent enough to have a separate name created for |
182 them---\emph{evil regular expressions}. In empiric work, Davis et al |
182 them---\emph{evil regular expressions}. In empiric work, Davis et al |
183 report that they have found thousands of such evil regular expressions |
183 report that they have found thousands of such evil regular expressions |
184 in the JavaScript and Python ecosystems \cite{Davis18}. |
184 in the JavaScript and Python ecosystems \cite{Davis18}. |
185 |
185 |
186 This exponential blowup in matching algorithms sometimes causes |
186 This exponential blowup in matching algorithms sometimes causes |
188 regular expression brought the webpage |
188 regular expression brought the webpage |
189 \href{http://stackexchange.com}{Stack Exchange} to its |
189 \href{http://stackexchange.com}{Stack Exchange} to its |
190 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}} |
190 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}} |
191 In this instance, a regular expression intended to just trim white |
191 In this instance, a regular expression intended to just trim white |
192 spaces from the beginning and the end of a line actually consumed |
192 spaces from the beginning and the end of a line actually consumed |
193 massive amounts of CPU-resources and because of this the web servers |
193 massive amounts of CPU-resources---causing web servers to |
194 ground to a halt. This happened when a post with 20,000 white spaces was |
194 grind to a halt. This happened when a post with 20,000 white spaces was |
195 submitted, but importantly the white spaces were neither at the |
195 submitted, but importantly the white spaces were neither at the |
196 beginning nor at the end. As a result, the regular expression matching |
196 beginning nor at the end. As a result, the regular expression matching |
197 engine needed to backtrack over many choices. The underlying problem is |
197 engine needed to backtrack over many choices. The underlying problem is |
198 that many ``real life'' regular expression matching engines do not use |
198 that many ``real life'' regular expression matching engines do not use |
199 DFAs for matching. This is because they support regular expressions that |
199 DFAs for matching. This is because they support regular expressions that |
200 are not covered by the classical automata theory, and in this more |
200 are not covered by the classical automata theory, and in this more |
201 general setting there are quite a few research questions still |
201 general setting there are quite a few research questions still |
202 unanswered and fast algorithms still need to be developed (for example |
202 unanswered and fast algorithms still need to be developed (for example |
203 how to treat bounded repetitions, negation and back-references |
203 how to treat bounded repetitions, negation and back-references |
204 efficiently). |
204 efficiently). |
205 |
205 %question: dfa can have exponential states. isn't this the actual reason why they do not use dfas? |
|
206 %how do they avoid dfas exponential states if they use them for fast matching? |
206 There is also another under-researched problem to do with regular |
207 There is also another under-researched problem to do with regular |
207 expressions and lexing, i.e.~the process of breaking up strings into |
208 expressions and lexing, i.e.~the process of breaking up strings into |
208 sequences of tokens according to some regular expressions. In this |
209 sequences of tokens according to some regular expressions. In this |
209 setting one is not just interested in whether or not a regular |
210 setting one is not just interested in whether or not a regular |
210 expression matches a string, but if it matches also in \emph{how} it |
211 expression matches a string, but also in \emph{how}. Consider for example a regular expression |
211 matches the string. Consider for example a regular expression |
|
212 $r_{key}$ for recognising keywords such as \textit{if}, \textit{then} |
212 $r_{key}$ for recognising keywords such as \textit{if}, \textit{then} |
213 and so on; and a regular expression $r_{id}$ for recognising |
213 and so on; and a regular expression $r_{id}$ for recognising |
214 identifiers (say, a single character followed by characters or |
214 identifiers (say, a single character followed by characters or |
215 numbers). One can then form the compound regular expression |
215 numbers). One can then form the compound regular expression |
216 $(r_{key} + r_{id})^*$ and use it to tokenise strings. But then how |
216 $(r_{key} + r_{id})^*$ and use it to tokenise strings. But then how |
219 single identifier. Similarly, how should the string \textit{if} be |
219 single identifier. Similarly, how should the string \textit{if} be |
220 tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would |
220 tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would |
221 ``fire''---so is it an identifier or a keyword? While in applications |
221 ``fire''---so is it an identifier or a keyword? While in applications |
222 there is a well-known strategy to decide these questions, called POSIX |
222 there is a well-known strategy to decide these questions, called POSIX |
223 matching, only relatively recently precise definitions of what POSIX |
223 matching, only relatively recently precise definitions of what POSIX |
224 matching actually means have been formalised |
224 matching actually means has been formalised |
225 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. |
225 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. |
226 Such a definition has also been given by Sulzmann and Lu \cite{Sulzmann2014}, but the |
226 Such a definition has also been given by Sulzmann and Lu \cite{Sulzmann2014}, but the |
227 corresponding correctness proof turned out to be faulty \cite{AusafDyckhoffUrban2016}. |
227 corresponding correctness proof turned out to be faulty \cite{AusafDyckhoffUrban2016}. |
228 Roughly, POSIX matching means matching the longest initial substring. |
228 Roughly, POSIX matching means matching the longest initial substring. |
229 In the case of a tie, the initial submatch is chosen according to some priorities attached to the |
229 In the case of a tie, the initial sub-match is chosen according to some priorities attached to the |
230 regular expressions (e.g.~keywords have a higher priority than |
230 regular expressions (e.g.~keywords have a higher priority than |
231 identifiers). This sounds rather simple, but according to Grathwohl et |
231 identifiers). This sounds rather simple, but according to Grathwohl et |
232 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote: |
232 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote: |
233 |
233 |
234 \begin{quote} |
234 \begin{quote} |
397 \end{tabular} |
397 \end{tabular} |
398 \end{tabular} |
398 \end{tabular} |
399 \end{center} |
399 \end{center} |
400 |
400 |
401 \noindent |
401 \noindent |
402 There is no value corresponding to $\ZERO$; $\Empty$ corresponds to |
402 No value corresponds to $\ZERO$; $\Empty$ corresponds to |
403 $\ONE$; $\Char$ to the character regular expression; $\Seq$ to the |
403 $\ONE$; $\Char$ to the character regular expression; $\Seq$ to the |
404 sequence regular expression and so on. The idea of values is to encode |
404 sequence regular expression and so on. The idea of values is to encode |
405 parse trees. To see this, suppose a \emph{flatten} operation, written |
405 parse trees for how the sub-parts of a regular expression matches |
|
406 the sub-parts of a string. To see this, suppose a \emph{flatten} operation, written |
406 $|v|$. We use this function to extract the underlying string of a value |
407 $|v|$. We use this function to extract the underlying string of a value |
407 $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \, (\textit{Char |
408 $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \, (\textit{Char |
408 y})|$ is the string $xy$. Using flatten, we can describe how values |
409 y})|$ is the string $xy$. Using flatten, we can describe how values |
409 encode parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @ |
410 encode parse trees: $\Seq\,v_1\, v_2$ encodes a tree with 2 children nodes |
|
411 that tells how the string $|v_1| @ |
410 |v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches the |
412 |v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches the |
411 substring $|v_1|$ and, respectively, $r_2$ matches the substring |
413 substring $|v_1|$ and, respectively, $r_2$ matches the substring |
412 $|v_2|$. Exactly how these two are matched is contained in the |
414 $|v_2|$. Exactly how these two are matched is contained in the |
413 sub-structure of $v_1$ and $v_2$. |
415 children nodes $v_1$ and $v_2$ of parent $\textit{Seq}$ . |
414 |
416 |
415 To give a concrete example of how value works, consider the string $xy$ |
417 To give a concrete example of how value works, consider the string $xy$ |
416 and the regular expression $(x + (y + xy))^*$. We can view this regular |
418 and the regular expression $(x + (y + xy))^*$. We can view this regular |
417 expression as a tree and if the string $xy$ is matched by two Star |
419 expression as a tree and if the string $xy$ is matched by two Star |
418 ``iterations'', then the $x$ is matched by the left-most alternative in |
420 ``iterations'', then the $x$ is matched by the left-most alternative in |
453 \end{equation} |
455 \end{equation} |
454 |
456 |
455 \noindent |
457 \noindent |
456 For convenience, we shall employ the following notations: the regular expression we |
458 For convenience, we shall employ the following notations: the regular expression we |
457 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1 |
459 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1 |
458 \ldots c_n$. In the first phase, we build the derivatives $r_1$, $r_2$, \ldots according to |
460 \ldots c_{n-1}$. In the first phase, we build the derivatives $r_1$, $r_2$, \ldots according to |
459 the characters $c_0$, $c_1$,\ldots until we exhaust the string and |
461 the characters $c_0$, $c_1$,\ldots until we exhaust the string and |
460 obtain at the derivative $r_n$. We test whether this derivative is |
462 obtain at the derivative $r_n$. We test whether this derivative is |
461 $\textit{nullable}$ or not. If not, we know the string does not match |
463 $\textit{nullable}$ or not. If not, we know the string does not match |
462 $r$ and no value needs to be generated. If yes, we start building the |
464 $r$ and no value needs to be generated. If yes, we start building the |
463 parse tree incrementally by \emph{injecting} back the characters into |
465 parse tree incrementally by \emph{injecting} back the characters into |
485 regular expression on the left-hand side. This will become |
487 regular expression on the left-hand side. This will become |
486 important later on. |
488 important later on. |
487 |
489 |
488 After this, we inject back the characters one by one in order to build |
490 After this, we inject back the characters one by one in order to build |
489 the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$ |
491 the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$ |
490 ($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After |
492 ($s_i = c_i \ldots c_{n-1}$ ) from the previous parse tree $v_{i+1}$. After |
491 injecting back $n$ characters, we get the parse tree for how $r_0$ |
493 injecting back $n$ characters, we get the parse tree for how $r_0$ |
492 matches $s$. For this Sulzmann and Lu defined a function that reverses |
494 matches $s$. For this Sulzmann and Lu defined a function that reverses |
493 the ``chopping off'' of characters during the derivative phase. The |
495 the ``chopping off'' of characters during the derivative phase. The |
494 corresponding function is called $\textit{inj}$; it takes three |
496 corresponding function is called $\textit{inj}$; it takes three |
495 arguments: the first one is a regular expression ${r_{i-1}}$, before the |
497 arguments: the first one is a regular expression ${r_{i-1}}$, before the |
496 character is chopped off, the second is a character ${c_{i-1}}$, the |
498 character is chopped off, the second is a character ${c_{i-1}}$, the |
497 character we want to inject and the third argument is the value |
499 character we want to inject and the third argument is the value |
498 $textit{v_i}$, into which one wants to inject the character (it |
500 ${v_i}$, into which one wants to inject the character (it |
499 corresponds to the regular expression after the character has been |
501 corresponds to the regular expression after the character has been |
500 chopped off). The result of this function is a new value. The definition |
502 chopped off). The result of this function is a new value. The definition |
501 of $\textit{inj}$ is as follows: |
503 of $\textit{inj}$ is as follows: |
502 |
504 |
503 \begin{center} |
505 \begin{center} |
514 |
516 |
515 \noindent This definition is by recursion on the ``shape'' of regular |
517 \noindent This definition is by recursion on the ``shape'' of regular |
516 expressions and values. To understands this definition better consider |
518 expressions and values. To understands this definition better consider |
517 the situation when we build the derivative on regular expression $r_{i-1}$. |
519 the situation when we build the derivative on regular expression $r_{i-1}$. |
518 For this we chop off a character from $r_{i-1}$ to form $r_i$. This leaves a |
520 For this we chop off a character from $r_{i-1}$ to form $r_i$. This leaves a |
519 ``hole'' in $r_i$. In order to calculate a value for $r_{i-1}$, we need to |
521 ``hole'' in $r_i$ and its corresponding value $v_i$ |
520 locate where that hole is. We can find this location information by |
522 . To calculate $v_{i-1}$, we need to |
|
523 locate where that hole is and fill it. |
|
524 We can find this location by |
521 comparing $r_{i-1}$ and $v_i$. For instance, if $r_{i-1}$ is of shape |
525 comparing $r_{i-1}$ and $v_i$. For instance, if $r_{i-1}$ is of shape |
522 $r_a \cdot r_b$, and $v_i$ is of shape $\textit{Left}(\textit{Seq}(v_a, |
526 $r_a \cdot r_b$, and $v_i$ is of shape $\Left(Seq(v_1,v_2))$, we know immediately that |
523 v_b))$, we know immediately that |
|
524 % |
527 % |
525 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c,\] |
528 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c,\] |
526 |
529 |
527 \noindent |
530 \noindent |
528 otherwise if $r_a$ is not nullable, |
531 otherwise if $r_a$ is not nullable, |
529 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b,\] |
532 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b,\] |
530 |
533 |
531 \noindent |
534 \noindent |
532 the value $v_i$ should be of shape $\Seq(\ldots)$, contradicting the fact that |
535 the value $v_i$ should be $\Seq(\ldots)$, contradicting the fact that |
533 $v_i$ is actually $\Left(\ldots)$. Furthermore, since $v_i$ is of shape |
536 $v_i$ is actually of shape $\Left(\ldots)$. Furthermore, since $v_i$ is of shape |
534 $\Left(\ldots)$ instead of $\Right(\ldots)$, we know that the left |
537 $\Left(\ldots)$ instead of $\Right(\ldots)$, we know that the left |
535 branch is taken instead of the right one. We have therefore found out |
538 branch of \[ (r_a \cdot r_b)\backslash c = |
|
539 \bold{\underline{ (r_a\backslash c) \cdot r_b} }\,+\, r_b\backslash c,\](underlined) |
|
540 is taken instead of the right one. This means $c$ is chopped off |
|
541 from $r_a$ rather than $r_b$. |
|
542 We have therefore found out |
536 that the hole will be on $r_a$. So we recursively call $\inj\, |
543 that the hole will be on $r_a$. So we recursively call $\inj\, |
537 r_a\,c\,v_a$ in order to fill that hole in $v_a$. After injection, the value |
544 r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value |
538 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$. |
545 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$. |
539 Other clauses can be understood in a similar way. |
546 Other clauses can be understood in a similar way. |
540 |
547 |
541 Let us do some further examples involving $\inj$: Suppose we have the |
548 Let us do some further examples involving $\inj$: Suppose we have the |
542 regular expression $((((a+b)+ab)+c)+abc)^*$ and want to match it against |
549 regular expression $((((a+b)+ab)+c)+abc)^*$ and want to match it against |
543 the string $abc$ (when $abc$ is written as a regular expression, the most |
550 the string $abc$ (when $abc$ is written as a regular expression, the most |
544 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit |
551 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit |
545 the parentheses and dots here for readability). By POSIX rules the lexer |
552 the parentheses and dots here for readability). By POSIX rules the algorithm |
546 should go for the longest matching, i.e. it should match the string |
553 should go for the longest matching, i.e. it should match the string |
547 $abc$ in one star iteration, using the longest string $abc$ in the |
554 $abc$ in one star iteration, using the longest string $abc$ in the |
548 sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this sub-expression |
555 sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this sub-expression |
549 for conciseness). Here is how the lexer achieves a parse tree for this |
556 for conciseness). |
|
557 Here is how the lexer achieves a parse tree for this |
550 matching. First, build successive derivatives until we exhaust the |
558 matching. First, build successive derivatives until we exhaust the |
551 string, as illustrated here (we simplified some regular expressions like |
559 string, as illustrated here (we simplified some regular expressions like |
552 $0 \cdot b$ to $0$ for conciseness; we also omit some parentheses if |
560 $0 \cdot b$ to $0$ for conciseness; we also omit some parentheses if |
553 they are clear from the context): |
561 they are clear from the context): |
554 %Similarly, we allow |
562 %Similarly, we allow |