fast?

Unfortunately these preconceptions are not supported by evidence: Take
for example the regular expression $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8 with strings of 28
$a$'s, one discovers that this decision takes around 30 seconds and
takes considerably longer when adding a few more $a$'s, as the graphs
frequent enough to have a separate name created for
them---\emph{evil regular expressions}. In empirical work, Davis et al.\
report that they have found thousands of such evil regular expressions
in the JavaScript and Python ecosystems \cite{Davis18}.

This superlinear blowup in matching algorithms sometimes causes
considerable grief in real life: for example on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
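
The slowdown described above is easy to observe directly. Below is a small
Scala sketch of such an experiment (our own code, calling the backtracking
regex engine of the JVM; the exact timings of course depend on the machine):

\begin{verbatim}
// time how long the JVM regex engine needs to reject a^n for (a*)*b
import java.util.regex.Pattern

val evil = Pattern.compile("(a*)*b")
for (n <- 20 to 28) {                  // around n = 28 this takes ~30s
  val s  = "a" * n                     // a string of n a's, no final b
  val t0 = System.nanoTime()
  evil.matcher(s).matches()            // always false, but very slow
  val t1 = System.nanoTime()
  println(s"n = $n: ${(t1 - t0) / 1e9} seconds")
}
\end{verbatim}
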
The underlying problem is that many ``real life'' regular expression
matching engines do not use DFAs for matching. This is because they
support regular expressions that are not covered by the classical
automata theory, and in this more general setting there are quite a few
research questions still unanswered and fast algorithms still need to be
developed (for example how to treat bounded repetitions, negation and
back-references efficiently).
%question: dfa can have exponential states. isn't this the actual reason why they do not use dfas?
%how do they avoid dfas exponential states if they use them for fast matching?

There is also another under-researched problem to do with regular
expressions and lexing, i.e.~the process of breaking up strings into
sequences of tokens according to some regular expressions. In this
setting one is not just interested in whether or not a regular
expression matches a string, but also in \emph{how}. Consider for
example a regular expression $r_{key}$ for recognising keywords such as
\textit{if}, \textit{then} and so on; and a regular expression $r_{id}$
for recognising identifiers (say, a single character followed by
characters or numbers). One can then form the compound regular
expression $(r_{key} + r_{id})^*$ and use it to tokenise strings. But
then how should the string \textit{iffoo} be tokenised? It could be
tokenised as a keyword followed by an identifier, or the entire string
as a single identifier. Similarly, how should the string \textit{if} be
tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
``fire''---so is it an identifier or a keyword? While in applications
there is a well-known strategy to decide these questions, called POSIX
matching, only relatively recently have precise definitions of what POSIX
matching actually means been formalised
\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Such a
definition has also been given by Sulzmann and Lu \cite{Sulzmann2014},
but the corresponding correctness proof turned out to be faulty
\cite{AusafDyckhoffUrban2016}. Roughly, POSIX matching means matching
the longest initial substring. In the case of a tie, the initial
sub-match is chosen according to some priorities attached to the regular
expressions (e.g.~keywords have a higher priority than identifiers).
This sounds rather simple, but according to Grathwohl et al.\ \cite[Page
36]{CrashCourse2014} this is not the case. They wrote:

\begin{quote}
\it{}``The POSIX strategy is more complicated than the greedy because of
the dependence on information about the length of matched strings in the
various subexpressions.''
straightforwardly realised within the classic automata approach.
For the moment however, we focus only on the usual basic regular expressions.


Now if we want to find out whether a string $s$ matches with a regular
expression $r$, we can build the derivatives of $r$ w.r.t.\ (in succession)
all the characters of the string $s$. Finally, we test whether the
resulting regular expression can match the empty string. If yes, then
$r$ matches $s$, and if no, it does not.
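
The derivative operation and the nullability test can be implemented
directly. Below is a minimal sketch in Scala (the datatype and function
names are our own rendering; the clauses are the standard ones):

\begin{verbatim}
// regular expressions
sealed trait Re
case object ZERO extends Re                  // matches nothing
case object ONE extends Re                   // matches the empty string
case class CHAR(c: Char) extends Re
case class ALT(r1: Re, r2: Re) extends Re
case class SEQ(r1: Re, r2: Re) extends Re
case class STAR(r: Re) extends Re

// can the regular expression match the empty string?
def nullable(r: Re): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// Brzozowski derivative w.r.t. a single character
def der(c: Char, r: Re): Re = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}
\end{verbatim}
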
To implement this idea we can generalise the derivative operation to
strings like this:

\end{tabular}
\end{tabular}
\end{center}

\noindent
No value corresponds to $\ZERO$; $\Empty$ corresponds to $\ONE$;
$\Char$ to the character regular expression; $\Seq$ to the sequence
regular expression and so on. The idea of values is to encode a kind of
lexical value for how the sub-parts of a regular expression match the
sub-parts of a string. To see this, suppose a \emph{flatten} operation,
written $|v|$ for values. We can use this function to extract the
underlying string of a value $v$. For example, $|\mathit{Seq} \,
(\textit{Char x}) \, (\textit{Char y})|$ is the string $xy$. Using
flatten, we can describe how values record the way a string is matched:
$\Seq\,v_1\, v_2$ tells us that the string $|v_1| @ |v_2|$ matches the
regex $r_1 \cdot r_2$, whereby $r_1$ matches the substring $|v_1|$ and,
respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two
are matched is recorded in the sub-values $v_1$ and $v_2$.
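Concretely, values and the flatten operation can be represented as follows;
a sketch in Scala (the constructor names are ours, continuing the earlier
sketch):

\begin{verbatim}
// values: they record how a regular expression matched a string
sealed trait Val
case object Empty extends Val                    // for ONE
case class Chr(c: Char) extends Val              // for CHAR(c)
case class Sequ(v1: Val, v2: Val) extends Val    // for SEQ
case class Left(v: Val) extends Val              // left branch of ALT
case class Right(v: Val) extends Val             // right branch of ALT
case class Stars(vs: List[Val]) extends Val      // iterations of STAR

// flatten |v|: the underlying string of a value
def flatten(v: Val): String = v match {
  case Empty => ""
  case Chr(c) => c.toString
  case Left(v1) => flatten(v1)
  case Right(v1) => flatten(v1)
  case Sequ(v1, v2) => flatten(v1) + flatten(v2)
  case Stars(vs) => vs.map(flatten).mkString
}
\end{verbatim}
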
To give a concrete example of how values work, consider the string $xy$
and the regular expression $(x + (y + xy))^*$. We can view this regular
expression as a tree and if the string $xy$ is matched by two Star
``iterations'', then the $x$ is matched by the left-most alternative in
$\textit{nullable}$ or not. If not, we know the string does not match
$r$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm, from right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds the value for how the empty
string has been matched by the (nullable) regular expression $r_n$. This
function is defined as

\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
$\mkeps(r_{1}+r_{2})$ & $\dn$

\noindent There are no cases for $\ZERO$ and $c$, since
these regular expressions cannot match the empty string. Note
also that in case of alternatives we give preference to the
regular expression on the left-hand side. This will become
important later on for which value is calculated.
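For reference, a sketch of this function in Scala, continuing the earlier
sketches (the clauses not shown above follow the standard definition; there
are deliberately no cases for $\ZERO$ and characters):

\begin{verbatim}
// build the value for how a nullable regular expression matches
// the empty string (alternatives prefer the left branch)
def mkeps(r: Re): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}
\end{verbatim}
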
After the $\mkeps$-call, we inject back the characters one by one in order to build
the value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous value $v_{i+1}$.
After injecting back $n$ characters, we get the value for how $r_0$
matches $s$. For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value
$v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
Other clauses can be understood in a similar way.
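The injection function itself can be sketched in Scala as follows (these
are the standard clauses of Sulzmann and Lu's definition, written with the
datatypes from the earlier sketches; the three sequence cases correspond to
the shapes the derivative of a sequence can take):

\begin{verbatim}
// undo one derivative step: inject the character c back into the value
def inj(r: Re, c: Char, v: Val): Val = (r, v) match {
  case (STAR(r1), Sequ(v1, Stars(vs)))    => Stars(inj(r1, c, v1) :: vs)
  case (SEQ(r1, r2), Sequ(v1, v2))        => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Left(Sequ(v1, v2)))  => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2))           => Sequ(mkeps(r1), inj(r2, c, v2))
  case (ALT(r1, r2), Left(v1))            => Left(inj(r1, c, v1))
  case (ALT(r1, r2), Right(v2))           => Right(inj(r2, c, v2))
  case (CHAR(d), Empty)                   => Chr(c)
}
\end{verbatim}
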
The following example illustrates the effect of $\textit{inj}$ and
how Sulzmann and Lu's algorithm works as a whole. Suppose we have a
regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it
against the string $abc$ (when $abc$ is written as a regular expression,
the standard way of expressing it is $a \cdot (b \cdot c)$, but we
usually omit the parentheses and dots here for better readability). This
algorithm returns a POSIX value, which means it will produce the longest
match. Consequently, it matches the string $abc$ in one star
iteration, using the longest alternative $abc$ in the sub-expression (we
shall use $r$ to denote this sub-expression for conciseness):

\[((((a+b)+ab)+c)+\underbrace{abc}_r)\]

\noindent
Before $\textit{inj}$ is called, our lexer first builds derivatives using the
string $abc$ (we simplified some regular expressions like $\ZERO \cdot
b$ to $\ZERO$ for conciseness; we also omit parentheses if they are
clear from the context):

%Similarly, we allow
%$\textit{ALT}$ to take a list of regular expressions as an argument
%instead of just 2 operands to reduce the nested depth of
%$\textit{ALT}$

\begin{center}
\begin{tabular}{lcl}
$r^*$ & $\xrightarrow{\backslash a}$ & $r_1 = (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r^*$\\
& $\xrightarrow{\backslash b}$ & $r_2 = (\ZERO+\ZERO+\ONE \cdot \ONE + \ZERO + \ONE \cdot \ONE \cdot c) \cdot r^* +(\ZERO+\ONE+\ZERO + \ZERO + \ZERO) \cdot r^*$\\
& $\xrightarrow{\backslash c}$ & $r_3 = ((\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot \ONE) \cdot r^* + (\ZERO+\ZERO+\ZERO + \ONE + \ZERO) \cdot r^*) + $\\
\end{center}

\noindent
Since $r_3$ is nullable, we can call $\textit{mkeps}$
to construct a value for how $r_3$ matched the string $abc$.
This function gives the following value $v_3$:

\begin{center}
$\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
\end{center}
The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$ (underlined):

\begin{center}
\begin{tabular}{l}
$(\,\underline{(\ZERO+\ZERO+\ZERO+\ZERO+\ONE \cdot \ONE \cdot \ONE) \cdot r^*} \;+\; (\ZERO+\ZERO+\ZERO+\ONE+\ZERO) \cdot r^*\,)\;+$\\
$(\,(\ZERO+\ONE+\ZERO+\ZERO+\ZERO) \cdot r^* \;+\; (\ZERO+\ZERO+\ZERO+\ONE+\ZERO) \cdot r^*\,).$
\end{tabular}
\end{center}

\noindent
Note that the leftmost location of the term $(\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot
\ONE) \cdot r^*$ (which corresponds to the initial sub-match $abc$) allows
$\textit{mkeps}$ to pick it up because $\textit{mkeps}$ is defined to always choose the
left one when it is nullable. In the case of this example, $abc$ is
preferred over $a$ or $ab$. This $\Left(\Left(\ldots))$ location is
generated by two applications of the splitting clause

\begin{center}
$(r_1 \cdot r_2)\backslash c \;=\; (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c
\qquad \mbox{(when $r_1$ is nullable)}.$
\end{center}

\noindent
By this clause, we put $r_1 \backslash c \cdot r_2 $ at the
$\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. This
$\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$
\end{center}

\noindent
tells us how the empty string $[]$ is matched with $(\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot
\ONE \cdot \ONE) \cdot r^*$. We match $[]$ by a sequence of two nullable regular
expressions. The first one is an alternative, where we take the rightmost
branch---the one whose language contains the empty string. The second
nullable regular expression is a Kleene star. $\Stars$ tells us how it
matches the empty string: by zero iterations, essentially behaving like
$\ONE$. Now $\textit{inj}$ injects characters back and incrementally
$v_0= \Stars [\Right(\Seq(a, \Seq(b, c)))]$
\end{center}
for how $r^*$ matched $abc$. This completes the algorithm.
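Putting the pieces together, the whole lexing algorithm can be sketched in
a few lines of Scala (again our own rendering, assuming the der, nullable,
mkeps and inj sketches from above):

\begin{verbatim}
// phase 1: chop off characters with derivatives;
// phase 2: inject them back while returning from the recursion
def lex(r: Re, s: List[Char]): Val = s match {
  case Nil =>
    if (nullable(r)) mkeps(r) else sys.error("lexing error")
  case c :: cs => inj(r, c, lex(der(c, r), cs))
}

// e.g. lex(STAR(r), "abc".toList) produces the value v0 described above
\end{verbatim}
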
%We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}.
Readers might have noticed that the information needed for building the
value is actually already available when doing derivatives. For example,
immediately after the operation $\backslash a$ we know that if we want to
match a string that starts with $a$, we can either take the initial match to be

\begin{center}
\begin{enumerate}
\item[1)] just $a$ or
\item[2)] string $ab$ or
\item[3)] string $abc$.
\end{enumerate}
\end{center}

\noindent
In order to differentiate between these choices, we just need to
remember their positions---$a$ is on the left, $ab$ is in the middle,
and $abc$ is on the right. Which of these alternatives is chosen
later does not affect their relative position because the algorithm does
not change this order. If this lexing information can be determined and
does not change because of later derivatives, there is no point in
traversing this information twice. This leads to an optimisation---if we
store the information for the values inside the regular expression,
update it when we do derivatives on them, and collect the information
A partial derivative of a regular expression $r$ is essentially a set of
regular expressions that are either $r$'s children expressions or a
concatenation of them. Antimirov has proved a tight bound on the size of
\emph{all} partial derivatives, no matter what the string looks like.
Roughly speaking, this size is at most cubic in the size of the regular
expression. If we want the size of derivatives in Sulzmann and
Lu's algorithm to stay equal or below this bound, we would need more
aggressive simplifications. Essentially we need to delete useless
$\ZERO$s and $\ONE$s, as well as deleting duplicates whenever possible.
For example, the parentheses in $(a+b) \cdot c + bc$ can be opened up to
get $a\cdot c + b \cdot c + b \cdot c$, and then simplified to just $a
\cdot c + b \cdot c$. Another example is simplifying $(a^*+a) + (a^*+
\ONE) + (a +\ONE)$ to just $a^*+a+\ONE$. Adding these more aggressive
simplification rules helps us to achieve the same size bound as that of
the partial derivatives.
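
A simple version of such a simplification for plain regular expressions can
be sketched as follows (our own code, continuing the Re datatype from the
earlier sketches; the simplification used later on annotated regular
expressions is more involved). For instance, it rewrites
$(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to $a^*+a+\ONE$:

\begin{verbatim}
// flatten nested alternatives and drop ZEROs
def flts(rs: List[Re]): List[Re] = rs.flatMap {
  case ZERO        => Nil
  case ALT(r1, r2) => flts(List(r1, r2))
  case r           => List(r)
}

def simp(r: Re): Re = r match {
  case ALT(r1, r2) =>
    flts(List(simp(r1), simp(r2))).distinct match {   // remove duplicates
      case Nil            => ZERO
      case r1 :: Nil      => r1
      case r1 :: r2 :: rs => rs.foldLeft(ALT(r1, r2))(ALT(_, _))
    }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {     // remove ZEROs/ONEs
    case (ZERO, _)  => ZERO
    case (_, ZERO)  => ZERO
    case (ONE, s2)  => s2
    case (s1, ONE)  => s1
    case (s1, s2)   => SEQ(s1, s2)
  }
  case r => r
}
\end{verbatim}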

In order to implement the idea of ``spilling out alternatives'' and to
make them compatible with the $\textit{inj}$-mechanism, we use
\emph{bitcodes}. Bits and bitcodes (lists of bits) are just:

%This allows us to prove a tight
%bound on the size of regular expression during the running time of the
%algorithm if we can establish the connection between our simplification
%rules and partial derivatives.

$\textit{STAR}\,(bs\,@\,bs')\,a$
\end{tabular}
\end{center}

\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expressions. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of
the bitcodes:

%\begin{definition}{bder}
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
then the right branch of $\textit{ALTS}$, which is $\textit{fuse} \; (\textit{bmkeps}\,a_1)\; (a_2
\backslash c)$, will collapse the regular expression $a_1$ (as it has
already been fully matched) and store the lexing information at the
head of the regular expression $a_2 \backslash c$ by fusing it to that
expression. The bitsequence $bs$, which was initially attached to the head
of $\textit{SEQ}$, has now been elevated to the top-level of
$\textit{ALTS}$, as this information will be needed whichever way the
$\textit{SEQ}$ is matched---no matter whether $c$ belongs
to $a_1$ or $a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:

When it detects that the regular expression is an alternative or
sequence, it will try to simplify its children regular expressions
recursively and then see if one of the children turns into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\textit{ALTS}$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
$\textit{ALTS}$ and remove as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurring copy and removes all
later duplicates. Function $\textit{flatten}$ opens up nested \textit{ALTS}.
Its recursive definition is given below:

\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{flatten} \; (\textit{ALTS}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \; (\textit{fuse}\;bs)\; as) \; @ \; \textit{flatten} \; as'$ \\
$\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $\textit{flatten} \; as'$ \\
$\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise)
\end{tabular}
\end{center}

\noindent
Here $\textit{flatten}$ behaves like the traditional functional programming flatten
function, except that it also removes $\ZERO$s. In terms of regular expressions, it
removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.

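For reference, the same function rendered as Scala code (a sketch only:
$\textit{AZERO}$, $\textit{AALTS}$ and $\textit{fuse}$ stand for the bitcoded
constructors and the fuse function assumed from the surrounding text and are
not defined here):

\begin{verbatim}
// flatten a list of annotated regular expressions:
// open up nested AALTS (pushing down their bits via fuse) and drop AZEROs
def flatten(as: List[ARexp]): List[ARexp] = as match {
  case AALTS(bs, as1) :: rest => as1.map(fuse(bs, _)) ::: flatten(rest)
  case AZERO :: rest          => flatten(rest)
  case a :: rest              => a :: flatten(rest)
  case Nil                    => Nil
}
\end{verbatim}
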
Suppose we apply simplification after each derivative step, and view
these two operations as an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can use the previous natural
extension from derivative w.r.t.~character to derivative
Figure 6 yields POSIX parse trees. We have tested this claim
extensively by using the method in Figure~3 as a reference but yet
have to work out all proof details.''
\end{quote}

\noindent We would like to settle this correctness claim. It is relatively
straightforward to establish that after one simplification step, the part of a
nullable derivative that corresponds to a POSIX value remains intact and can
still be collected. In other words, we can show that

\begin{center}
$\textit{bmkeps}\; (\textit{bsimp}\; a) \;=\; \textit{bmkeps}\; a \qquad \mbox{provided $a$ is $\textit{bnullable}$}$
\end{center}

\noindent
as this basically comes down to proving that actions like removing the
additional $r$ in $r+r$ do not delete important POSIX information in
a regular expression. The hard part of this proof is to establish that

\begin{center}
$\textit{bmkeps} \; (\textit{bders\_simp}\; s \; r) \;=\; \textit{bmkeps} \; (\textit{bders}\; s\; r)$
\end{center}

\noindent That is, if we take derivatives of the regular expression $r$ and
simplify after each step, repeating this process until we exhaust the string, we get a
regular expression $r''$ ($\textit{LHS}$) that provides the POSIX matching
information, which is exactly the same as the result $r'$ ($\textit{RHS}$ of the
$\textit{POSIX}$ information of a regular expression and how it is modified,
augmented and propagated
during simplification, in parallel with the regular expression that
has not been simplified in the subsequent derivative operations. To aid this,
we use the helper function retrieve described by Sulzmann and Lu: \\definition
of retrieve TODO\\
This function assembles the bitcode that corresponds to a value for how
the current derivative matches the suffix of the string (the characters that
have not yet appeared, but will appear as the successive derivatives go
on). How do we get this ``future'' information? By the value $v$, which is
computed by a pass of the algorithm that uses