371 |
371 |
372 One limitation, however, of Brzozowski's algorithm is that it only |
372 One limitation, however, of Brzozowski's algorithm is that it only |
373 produces a YES/NO answer for whether a string is being matched by a |
373 produces a YES/NO answer for whether a string is being matched by a |
374 regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this |
374 regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this |
375 algorithm to allow generation of an actual matching, called a |
375 algorithm to allow generation of an actual matching, called a |
376 \emph{value}. Values and regular expressions correspond to each |
376 \emph{value} or sometimes lexical values. These values and regular expressions correspond to each |
377 other as illustrated in the following table: |
377 other as illustrated in the following table: |
378 |
378 |
379 |
379 |
380 \begin{center} |
380 \begin{center} |
381 \begin{tabular}{c@{\hspace{20mm}}c} |
381 \begin{tabular}{c@{\hspace{20mm}}c} |
405 |
405 |
406 \noindent |
406 \noindent |
407 No value corresponds to $\ZERO$; $\Empty$ corresponds to |
407 No value corresponds to $\ZERO$; $\Empty$ corresponds to |
408 $\ONE$; $\Char$ to the character regular expression; $\Seq$ to the |
408 $\ONE$; $\Char$ to the character regular expression; $\Seq$ to the |
409 sequence regular expression and so on. The idea of values is to encode |
409 sequence regular expression and so on. The idea of values is to encode |
410 parse trees for how the sub-parts of a regular expression matches |
410 a kind of lexical value for how the sub-parts of a regular expression match |
411 the sub-parts of a string. To see this, suppose a \emph{flatten} operation, written |
411 the sub-parts of a string. To see this, suppose a \emph{flatten} operation, written |
412 $|v|$. We use this function to extract the underlying string of a value |
412 $|v|$ for values. We can use this function to extract the underlying string of a value |
413 $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \, (\textit{Char |
413 $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \, (\textit{Char |
414 y})|$ is the string $xy$. Using flatten, we can describe how values |
414 y})|$ is the string $xy$. Using flatten, we can describe how values |
415 encode parse trees: $\Seq\,v_1\, v_2$ encodes a tree with 2 children nodes |
415 encode parse trees: $\Seq\,v_1\, v_2$ encodes a tree with 2 children nodes |
416 that tells how the string $|v_1| @ |
416 that tells how the string $|v_1| @ |
417 |v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches the |
417 |v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches the |
418 substring $|v_1|$ and, respectively, $r_2$ matches the substring |
418 substring $|v_1|$ and, respectively, $r_2$ matches the substring |
419 $|v_2|$. Exactly how these two are matched is contained in the |
419 $|v_2|$. Exactly how these two are matched is contained in the |
420 children nodes $v_1$ and $v_2$ of parent $\textit{Seq}$ . |
420 children nodes $v_1$ and $v_2$ of parent $\textit{Seq}$ . |
421 |
421 |
422 To give a concrete example of how value works, consider the string $xy$ |
422 To give a concrete example of how values work, consider the string $xy$ |
423 and the regular expression $(x + (y + xy))^*$. We can view this regular |
423 and the regular expression $(x + (y + xy))^*$. We can view this regular |
424 expression as a tree and if the string $xy$ is matched by two Star |
424 expression as a tree and if the string $xy$ is matched by two Star |
425 ``iterations'', then the $x$ is matched by the left-most alternative in |
425 ``iterations'', then the $x$ is matched by the left-most alternative in |
426 this tree and the $y$ by the right-left alternative. This suggests to |
426 this tree and the $y$ by the right-left alternative. This suggests to |
427 record this matching as |
427 record this matching as |
429 \begin{center} |
429 \begin{center} |
430 $\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$ |
430 $\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$ |
431 \end{center} |
431 \end{center} |
432 |
432 |
433 \noindent |
433 \noindent |
434 where $\Stars [\ldots]$ records all the |
434 where $\Stars \; [\ldots]$ records all the |
435 iterations; and $\Left$, respectively $\Right$, which |
435 iterations; and $\Left$, respectively $\Right$, which |
436 alternative is used. The value for |
436 alternative is used. The value for |
437 matching $xy$ in a single ``iteration'', i.e.~the POSIX value, |
437 matching $xy$ in a single ``iteration'', i.e.~the POSIX value, |
438 would look as follows |
438 would look as follows |
439 |
439 |
447 expression. |
447 expression. |
448 |
448 |
449 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
449 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
450 algorithm by a second phase (the first phase being building successive |
450 algorithm by a second phase (the first phase being building successive |
451 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value |
451 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value |
452 is generated assuming the regular expression matches the string. |
452 is generated in case the regular expression matches the string. |
453 Pictorially, the algorithm is as follows: |
453 Pictorially, the Sulzmann and Lu algorithm is as follows: |
454 |
454 |
455 \begin{ceqn} |
455 \begin{ceqn} |
456 \begin{equation}\label{graph:2} |
456 \begin{equation}\label{graph:2} |
457 \begin{tikzcd} |
457 \begin{tikzcd} |
458 r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
458 r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
463 |
463 |
464 \noindent |
464 \noindent |
465 For convenience, we shall employ the following notations: the regular expression we |
465 For convenience, we shall employ the following notations: the regular expression we |
466 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1 |
466 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1 |
467 \ldots c_{n-1}$. In the first phase, we build the derivatives $r_1$, $r_2$, \ldots according to |
467 \ldots c_{n-1}$. In the first phase, we build the derivatives $r_1$, $r_2$, \ldots according to |
468 the characters $c_0$, $c_1$,\ldots until we exhaust the string and |
468 the characters $c_0$, $c_1$ until we exhaust the string and |
469 obtain at the derivative $r_n$. We test whether this derivative is |
469 obtain the derivative $r_n$. We test whether this derivative is |
470 $\textit{nullable}$ or not. If not, we know the string does not match |
470 $\textit{nullable}$ or not. If not, we know the string does not match |
471 $r$ and no value needs to be generated. If yes, we start building the |
471 $r$ and no value needs to be generated. If yes, we start building the |
472 parse tree incrementally by \emph{injecting} back the characters into |
472 values incrementally by \emph{injecting} back the characters into |
473 the values $v_n, \ldots, v_0$. For this we first call the function |
473 the earlier values $v_n, \ldots, v_0$. For the first value $v_0$, we call the function |
474 $\textit{mkeps}$, which builds the parse tree for how the empty string |
474 $\textit{mkeps}$, which builds the parse tree for how the empty string |
475 has matched the (nullable) regular expression $r_n$. This function is defined |
475 has been matched by the (nullable) regular expression $r_n$. This function is defined |
476 as |
476 as |
477 |
477 |
478 \begin{center} |
478 \begin{center} |
479 \begin{tabular}{lcl} |
479 \begin{tabular}{lcl} |
480 $\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\ |
480 $\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\ |
523 |
523 |
524 \noindent This definition is by recursion on the ``shape'' of regular |
524 \noindent This definition is by recursion on the ``shape'' of regular |
525 expressions and values. To understands this definition better consider |
525 expressions and values. To understands this definition better consider |
526 the situation when we build the derivative on regular expression $r_{i-1}$. |
526 the situation when we build the derivative on regular expression $r_{i-1}$. |
527 For this we chop off a character from $r_{i-1}$ to form $r_i$. This leaves a |
527 For this we chop off a character from $r_{i-1}$ to form $r_i$. This leaves a |
528 ``hole'' in $r_i$ and its corresponding value $v_i$ |
528 ``hole'' in $r_i$ and its corresponding value $v_i$. |
529 . To calculate $v_{i-1}$, we need to |
529 To calculate $v_{i-1}$, we need to |
530 locate where that hole is and fill it. |
530 locate where that hole is and fill it. |
531 We can find this location by |
531 We can find this location by |
532 comparing $r_{i-1}$ and $v_i$. For instance, if $r_{i-1}$ is of shape |
532 comparing $r_{i-1}$ and $v_i$. For instance, if $r_{i-1}$ is of shape |
533 $r_a \cdot r_b$, and $v_i$ is of shape $\Left(Seq(v_1,v_2))$, we know immediately that |
533 $r_a \cdot r_b$, and $v_i$ is of shape $\Left(Seq(v_1,v_2))$, we know immediately that |
534 % |
534 % |
573 %$\textit{ALT}$ to take a list of regular expressions as an argument |
573 %$\textit{ALT}$ to take a list of regular expressions as an argument |
574 %instead of just 2 operands to reduce the nested depth of |
574 %instead of just 2 operands to reduce the nested depth of |
575 %$\textit{ALT}$ |
575 %$\textit{ALT}$ |
576 \begin{center} |
576 \begin{center} |
577 \begin{tabular}{lcl} |
577 \begin{tabular}{lcl} |
578 $r^*$ & $\xrightarrow{\backslash a}$ & $r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^*$\\ |
578 $r^*$ & $\xrightarrow{\backslash a}$ & $r_1 = (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r^*$\\ |
579 & $\xrightarrow{\backslash b}$ & $r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0 + 0 + 0) \cdot r^*$\\ |
579 & $\xrightarrow{\backslash b}$ & $r_2 = (\ZERO+\ZERO+\ONE \cdot \ONE + \ZERO + \ONE \cdot \ONE \cdot c) \cdot r^* +(\ZERO+\ONE+\ZERO + \ZERO + \ZERO) \cdot r^*$\\ |
580 & $\xrightarrow{\backslash c}$ & $r_3 = ((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0 + 1 + 0) \cdot r^*) + $\\ |
580 & $\xrightarrow{\backslash c}$ & $r_3 = ((\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot \ONE) \cdot r^* + (\ZERO+\ZERO+\ZERO + \ONE + \ZERO) \cdot r^*) + $\\ |
581 & & $\phantom{r_3 = (} ((0+1+0 + 0 + 0) \cdot r^* + (0+0+0 + 1 + 0) \cdot r^* )$ |
581 & & $\phantom{r_3 = (} ((\ZERO+\ONE+\ZERO + \ZERO + \ZERO) \cdot r^* + (\ZERO+\ZERO+\ZERO + \ONE + \ZERO) \cdot r^* )$ |
582 \end{tabular} |
582 \end{tabular} |
583 \end{center} |
583 \end{center} |
584 |
584 |
585 \noindent |
585 \noindent |
586 In case $r_3$ is nullable, we can call $mkeps$ |
586 In case $r_3$ is nullable, we can call $\textit{mkeps}$ |
587 to construct a parse tree for how $r_3$ matched the string $abc$. |
587 to construct a parse tree for how $r_3$ matched the string $abc$. |
588 $mkeps$ gives the following value $v_3$: |
588 $\textit{mkeps}$ gives the following value $v_3$: |
589 \begin{center} |
589 \begin{center} |
590 $\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$ |
590 $\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$ |
591 \end{center} |
591 \end{center} |
592 The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$(underlined): |
592 The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$(underlined): |
593 |
593 |
594 \begin{center} |
594 \begin{center} |
595 $( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0 + 1 + 0) |
595 $( \underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*} + (\ZERO+\ZERO+\ZERO + \ONE + \ZERO) |
596 \cdot r^*) +((0+1+0 + 0 + 0) \cdot r^*+(0+0+0 + 1 + 0) \cdot r^* ).$ |
596 \cdot r^*) +((\ZERO+\ONE+\ZERO + \ZERO + \ZERO) \cdot r^*+(\ZERO+\ZERO+\ZERO + \ONE + \ZERO) \cdot r^* ).$ |
597 \end{center} |
597 \end{center} |
598 |
598 |
599 \noindent |
599 \noindent |
600 Note that the leftmost location of term $((0+0+0 + 0 + 1 \cdot 1 \cdot |
600 Note that the leftmost location of term $((\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot |
601 1) \cdot r^*$ (which corresponds to the initial sub-match $abc$) allows |
601 \ONE) \cdot r^*$ (which corresponds to the initial sub-match $abc$) allows |
602 $mkeps$ to pick it up because $mkeps$ is defined to always choose the |
602 $\textit{mkeps}$ to pick it up because $\textit{mkeps}$ is defined to always choose the |
603 left one when it is nullable. In the case of this example, $abc$ is |
603 left one when it is nullable. In the case of this example, $abc$ is |
604 preferred over $a$ or $ab$. This $\Left(\Left(\ldots))$ location is |
604 preferred over $a$ or $ab$. This $\Left(\Left(\ldots))$ location is |
605 naturally generated by two applications of the splitting clause |
605 naturally generated by two applications of the splitting clause |
606 |
606 |
607 \begin{center} |
607 \begin{center} |
609 \end{center} |
609 \end{center} |
610 |
610 |
611 \noindent |
611 \noindent |
612 By this clause, we put $r_1 \backslash c \cdot r_2 $ at the |
612 By this clause, we put $r_1 \backslash c \cdot r_2 $ at the |
613 $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. This |
613 $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. This |
614 allows $mkeps$ to always pick up among two matches the one with a longer |
614 allows $\textit{mkeps}$ to always pick up among two matches the one with a longer |
615 initial sub-match. Removing the outside $\Left(\Left(...))$, the inside |
615 initial sub-match. Removing the outside $\Left(\Left(...))$, the inside |
616 sub-value |
616 sub-value |
617 |
617 |
618 \begin{center} |
618 \begin{center} |
619 $\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$ |
619 $\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$ |
620 \end{center} |
620 \end{center} |
621 |
621 |
622 \noindent |
622 \noindent |
623 tells us how the empty string $[]$ is matched with $(0+0+0 + 0 + 1 \cdot |
623 tells us how the empty string $[]$ is matched with $(\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot |
624 1 \cdot 1) \cdot r^*$. We match $[]$ by a sequence of 2 nullable regular |
624 \ONE \cdot \ONE) \cdot r^*$. We match $[]$ by a sequence of 2 nullable regular |
625 expressions. The first one is an alternative, we take the rightmost |
625 expressions. The first one is an alternative, we take the rightmost |
626 alternative---whose language contains the empty string. The second |
626 alternative---whose language contains the empty string. The second |
627 nullable regular expression is a Kleene star. $\Stars$ tells us how it |
627 nullable regular expression is a Kleene star. $\Stars$ tells us how it |
628 generates the nullable regular expression: by 0 iterations to form |
628 generates the nullable regular expression: by 0 iterations to form |
629 $\epsilon$. Now $\textit{inj}$ injects characters back and incrementally |
629 $\ONE$. Now $\textit{inj}$ injects characters back and incrementally |
630 builds a parse tree based on $v_3$. Using the value $v_3$, the character |
630 builds a parse tree based on $v_3$. Using the value $v_3$, the character |
631 c, and the regular expression $r_2$, we can recover how $r_2$ matched |
631 c, and the regular expression $r_2$, we can recover how $r_2$ matched |
632 the string $[c]$ : $\textit{inj} \; r_2 \; c \; v_3$ gives us |
632 the string $[c]$ : $\textit{inj} \; r_2 \; c \; v_3$ gives us |
633 \begin{center} |
633 \begin{center} |
634 $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$ |
634 $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$ |
670 not change this order. If this parsing information can be determined and |
670 not change this order. If this parsing information can be determined and |
671 does not change because of later derivatives, there is no point in |
671 does not change because of later derivatives, there is no point in |
672 traversing this information twice. This leads to an optimisation---if we |
672 traversing this information twice. This leads to an optimisation---if we |
673 store the information for parse trees inside the regular expression, |
673 store the information for parse trees inside the regular expression, |
674 update it when we do derivative on them, and collect the information |
674 update it when we do derivative on them, and collect the information |
675 when finished with derivatives and call $mkeps$ for deciding which |
675 when finished with derivatives and call $\textit{mkeps}$ for deciding which |
676 branch is POSIX, we can generate the parse tree in one pass, instead of |
676 branch is POSIX, we can generate the parse tree in one pass, instead of |
677 doing the rest $n$ injections. This leads to Sulzmann and Lu's novel |
677 doing the rest $n$ injections. This leads to Sulzmann and Lu's novel |
678 idea of using bitcodes in derivatives. |
678 idea of using bitcodes in derivatives. |
679 |
679 |
680 In the next section, we shall focus on the bit-coded algorithm and the |
680 In the next section, we shall focus on the bitcoded algorithm and the |
681 process of simplification of regular expressions. This is needed in |
681 process of simplification of regular expressions. This is needed in |
682 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann |
682 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann |
683 and Lu's algorithms. This is where the PhD-project aims to advance the |
683 and Lu's algorithms. This is where the PhD-project aims to advance the |
684 state-of-the-art. |
684 state-of-the-art. |
685 |
685 |
741 \noindent |
741 \noindent |
742 A partial derivative of a regular expression $r$ is essentially a set of |
742 A partial derivative of a regular expression $r$ is essentially a set of |
743 regular expressions that are either $r$'s children expressions or a |
743 regular expressions that are either $r$'s children expressions or a |
744 concatenation of them. Antimirov has proved a tight bound of the size of |
744 concatenation of them. Antimirov has proved a tight bound of the size of |
745 partial derivatives. Roughly |
745 partial derivatives. Roughly |
746 speaking the size will not exceed $t^2$, t is the number of characters |
746 speaking the size will be quadruple in the size of the regular expression. |
747 appearing in that regular expression. If we want the size of derivatives |
747 If we want the size of derivatives |
748 to stay below this bound, we would need more aggressive simplifications |
748 to stay below this bound, we would need more aggressive simplifications |
749 such as opening up alternatives to achieve the maximum level of duplicates |
749 such as opening up alternatives to achieve the maximum level of duplicates |
750 cancellation. |
750 cancellation. |
751 For example, the parentheses in $(a+b) \cdot c + bc$ can be opened up to |
751 For example, the parentheses in $(a+b) \cdot c + bc$ can be opened up to |
752 get $a\cdot c +b \cdot c |
752 get $a\cdot c +b \cdot c |
753 + b \cdot c$, and then simplified to $a \cdot c+b \cdot c$. Another example is from |
753 + b \cdot c$, and then simplified to $a \cdot c+b \cdot c$. Another example is from |
754 $(a^*+a) + (a^*+ 1) + (a +1)$ to $a^*+a+1$. |
754 $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to $a^*+a+\ONE$. |
755 Adding these more aggressive simplification rules |
755 Adding these more aggressive simplification rules |
756 helped us to achieve the same size bound as that of the partial derivatives. |
756 helped us to achieve the same size bound as that of the partial derivatives. |
757 To introduce these "spilling out alternatives" simplifications |
757 To introduce these "spilling out alternatives" simplifications |
758 and make the correctness proof easier, |
758 and make the correctness proof easier, |
759 we used bitcodes. |
759 we used bitcodes. |
760 |
760 Bitcodes look like this: |
761 %This allows us to prove a tight |
761 %This allows us to prove a tight |
762 %bound on the size of regular expression during the running time of the |
762 %bound on the size of regular expression during the running time of the |
763 %algorithm if we can establish the connection between our simplification |
763 %algorithm if we can establish the connection between our simplification |
764 %rules and partial derivatives. |
764 %rules and partial derivatives. |
765 |
765 |
766 %We believe, and have generated test |
766 %We believe, and have generated test |
767 %data, that a similar bound can be obtained for the derivatives in |
767 %data, that a similar bound can be obtained for the derivatives in |
768 %Sulzmann and Lu's algorithm. Let us give some details about this next. |
768 %Sulzmann and Lu's algorithm. Let us give some details about this next. |
769 |
769 |
770 Bitcodes look like this: |
770 |
771 \begin{center} |
771 \begin{center} |
772 $b ::= S \mid Z \; \;\; |
772 $b ::= S \mid Z \; \;\; |
773 bs ::= [] \mid b:bs |
773 bs ::= [] \mid b:bs |
774 $ |
774 $ |
775 \end{center} |
775 \end{center} |
845 \begin{center} |
845 \begin{center} |
846 \begin{tabular}{lcl} |
846 \begin{tabular}{lcl} |
847 $\textit{a}$ & $::=$ & $\textit{ZERO}$\\ |
847 $\textit{a}$ & $::=$ & $\textit{ZERO}$\\ |
848 & $\mid$ & $\textit{ONE}\;\;bs$\\ |
848 & $\mid$ & $\textit{ONE}\;\;bs$\\ |
849 & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\ |
849 & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\ |
850 & $\mid$ & $\textit{ALT}\;\;bs\,as$\\ |
850 & $\mid$ & $\textit{ALT}\;\;bs\,a_1 \, a_2$\\ |
851 & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\ |
851 & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\ |
852 & $\mid$ & $\textit{STAR}\;\;bs\,a$ |
852 & $\mid$ & $\textit{STAR}\;\;bs\,a$ |
853 \end{tabular} |
853 \end{tabular} |
854 \end{center} |
854 \end{center} |
855 |
855 %(in \textit{ALT}) |
856 \noindent |
856 \noindent |
857 where $bs$ stands for bit-sequences, and $as$ (in \textit{ALTS}) for a |
857 where $bs$ stands for bit-sequences, and $a$ for $\bold{a}$nnotated regular expressions. These bit-sequences encode |
858 list of annotated regular expressions. These bit-sequences encode |
|
859 information about the (POSIX) value that should be generated by the |
858 information about the (POSIX) value that should be generated by the |
860 Sulzmann and Lu algorithm. |
859 Sulzmann and Lu algorithm. |
861 |
860 |
862 To do lexing using annotated regular expressions, we shall first |
861 To do lexing using annotated regular expressions, we shall first |
863 transform the usual (un-annotated) regular expressions into annotated |
862 transform the usual (un-annotated) regular expressions into annotated |
926 $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\backslash c))\, |
927 $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\backslash c))\, |
927 (\textit{STAR}\,[]\,r)$ |
928 (\textit{STAR}\,[]\,r)$ |
928 \end{tabular} |
929 \end{tabular} |
929 \end{center} |
930 \end{center} |
930 %\end{definition} |
931 %\end{definition} |
931 For instance, when we unfold $STAR \; bs \; a$ into a sequence, we attach an additional bit Z to the front of $r \backslash c$ to indicate that there is one more star iteration. |
932 For instance, we attach an additional bit $\Z$ to |
932 The other example, the $SEQ$ clause is more subtle-- when $a_1$ is $bnullable$(here bnullable is exactly the same as nullable, except that it is for annotated regular expressions, therefore we omit the definition). |
933 the front of $r \backslash c$ to indicate that there |
933 Assume that $bmkeps$ correctly extracts the bitcode for how $a_1$ matches the string prior to character c(more on this later), then the right branch of $ALTS$, which is $fuse \; bmkeps \; a_1 (a_2 \backslash c)$ will collapse the regular expression $a_1$(as it has already been fully matched) and store the parsing information at the head of the regular expression $a_2 \backslash c$ by fusing to it. The bitsequence $bs$, which was initially attached to the head of $SEQ$, has now been elevated to the top-level of ALT, |
934 is one more star iteration when we unfold $\textit{STAR} \; bs \; a$ |
934 as this information will be needed whichever way the $SEQ$ is matched--no matter whether c belongs to $a_1$ or $ a_2$. |
935 into a sequence in the last clause. |
935 After carefully doing these derivatives and maintaining all the parsing information, we complete the parsing by collecting the bits using a special $mkeps$ function for annotated regular expressions--$bmkeps$: |
936 A more subtle treatment of the bits lies in the $\textit{SEQ}$ clause |
|
937 when $a_1$ is $\textit{bnullable}$($\textit{bnullable}$ |
|
938 is exactly the same as $\textit{nullable}$, except that it is for |
|
939 annotated regular expressions, therefore we omit the definition). |
|
940 $\textit{bmkeps} \, a_1$ extracts the bitcode accumulated in |
|
941 $a_1$ for how the prefix $s_1$ of the input string ending at character $c$ |
|
942 is matched by the predecessor of $a_1$(before $s_1$ was chopped off). |
|
943 Then $\textit{fuse}$ stores the information |
|
944 extracted by $\textit{bmkeps}$ at the head of the |
|
945 regular expression $a_2 \backslash c$. |
|
946 The bitsequence $bs$, which was initially |
|
947 attached to the head of $\textit{SEQ}$, has |
|
948 now been elevated to the top-level $\textit{ALT}$, |
|
949 as this information will be needed |
|
950 whichever way the $\textit{SEQ}\,bs \,a_1\,a_2$ is |
|
951 matched--no matter whether $c$ belongs to $a_1$ or $ a_2$. |
|
952 After carefully doing these derivatives and |
|
953 maintaining all the parsing information, |
|
954 we complete the parsing by collecting the bits using |
|
955 a special $\textit{mkeps}$ function for annotated |
|
956 regular expressions--$\textit{bmkeps}$: |
936 |
957 |
937 |
958 |
938 %\begin{definition}[\textit{bmkeps}]\mbox{} |
959 %\begin{definition}[\textit{bmkeps}]\mbox{} |
939 \begin{center} |
960 \begin{center} |
940 \begin{tabular}{lcl} |
961 \begin{tabular}{lcl} |
952 %\end{definition} |
973 %\end{definition} |
953 |
974 |
954 \noindent |
975 \noindent |
955 This function completes the parse tree information by |
976 This function completes the parse tree information by |
956 travelling along the path on the regular expression that corresponds to a POSIX value snd collect all the bits, and |
977 travelling along the path on the regular expression that corresponds to a POSIX value snd collect all the bits, and |
957 using S to indicate the end of star iterations. If we take the bitsproduced by $bmkeps$ and decode it, |
978 using S to indicate the end of star iterations. If we take the bits produced by $bmkeps$ and decode it, |
958 we get the parse tree we need, the working flow looks like this:\\ |
979 we get the parse tree we need, the working flow looks like this:\\ |
959 \begin{center} |
980 \begin{center} |
960 \begin{tabular}{lcl} |
981 \begin{tabular}{lcl} |
961 $\textit{blexer}\;r\,s$ & $\dn$ & |
982 $\textit{blexer}\;r\,s$ & $\dn$ & |
962 $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\ |
983 $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\ |
999 &&$ \textit{case} \; (\_, 0) \Rightarrow 0$ \\ |
1020 &&$ \textit{case} \; (\_, 0) \Rightarrow 0$ \\ |
1000 &&$ \textit{case} \; (1, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ |
1021 &&$ \textit{case} \; (1, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ |
1001 &&$ \textit{case} \; (a_1', 1) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ |
1022 &&$ \textit{case} \; (a_1', 1) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ |
1002 &&$ \textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\ |
1023 &&$ \textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\ |
1003 |
1024 |
1004 $\textit{simp} \; \textit{ALT}\;bs\,as$ & $\dn$ & $\textit{ distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\ |
1025 $\textit{simp} \; \textit{ALTS}\;bs\,as$ & $\dn$ & $\textit{ distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\ |
1005 &&$\textit{case} \; [] \Rightarrow 0$ \\ |
1026 &&$\textit{case} \; [] \Rightarrow 0$ \\ |
1006 &&$ \textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
1027 &&$ \textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
1007 &&$ \textit{case} \; as' \Rightarrow \textit{ALT bs as'}$ |
1028 &&$ \textit{case} \; as' \Rightarrow \textit{ALT bs as'}$ |
1008 \end{tabular} |
1029 \end{tabular} |
1009 \end{center} |
1030 \end{center} |
1010 |
1031 |
1011 The simplification does a pattern matching on the regular expression. When it detected that |
1032 The simplification does a pattern matching on the regular expression. When it detected that |
1012 the regular expression is an alternative or sequence, it will try to simplify its children regular expressions |
1033 the regular expression is an alternative or sequence, |
1013 recursively and then see if one of the children turn into 0 or 1, which might trigger further simplification |
1034 it will try to simplify its children regular expressions |
1014 at the current level. The most involved part is the ALTS clause, where we use two auxiliary functions |
1035 recursively and then see if one of the children turn |
1015 flatten and distinct to open up nested ALT and reduce as many duplicates as possible. |
1036 into $\ZERO$ or $\ONE$, which might trigger further simplification |
1016 Function distinct keeps the first occurring copy only and remove all later ones when detected duplicates. |
1037 at the current level. The most involved part is the $\textit{ALTS}$ |
1017 Function flatten opens up nested ALT. Its recursive definition is given below: |
1038 clause, where we use two auxiliary functions |
|
1039 flatten and distinct to open up nested $\textit{ALTS}$ and |
|
1040 reduce as many duplicates as possible. |
|
1041 Function distinct keeps the first occurring copy only and |
|
1042 remove all later ones when detected duplicates. |
|
1043 Function flatten opens up nested \textit{ALT}. Its recursive |
|
1044 definition is given below: |
1018 \begin{center} |
1045 \begin{center} |
1019 \begin{tabular}{@{}lcl@{}} |
1046 \begin{tabular}{@{}lcl@{}} |
1020 $\textit{flatten} \; (\textit{ALT}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \; |
1047 $\textit{flatten} \; (\textit{ALT}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \; |
1021 (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ |
1048 (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ |
1022 $\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $ \textit{flatten} \; as' $ \\ |
1049 $\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $ \textit{flatten} \; as' $ \\ |
1106 The crucial point is to find the indispensable information of |
1132 The crucial point is to find the indispensable information of |
1107 a regular expression and how it is kept intact during simplification so that it performs |
1133 a regular expression and how it is kept intact during simplification so that it performs |
1108 as good as a regular expression that has not been simplified in the subsequent derivative operations. |
1134 as good as a regular expression that has not been simplified in the subsequent derivative operations. |
1109 To aid this, we use the helping function retrieve described by Sulzmann and Lu: |
1135 To aid this, we use the helping function retrieve described by Sulzmann and Lu: |
1110 \\definition of retrieve\\ |
1136 \\definition of retrieve\\ |
1111 This function assembles the bitcode that corresponds to a parse tree for how the current |
1137 This function assembled the bitcode that corresponds to a parse tree for how the current |
1112 derivative matches the suffix of the string(the characters that have not yet appeared, but is stored in the value). |
1138 derivative matches the suffix of the string(the characters that have not yet appeared, but is stored in the value). |
1113 Sulzmann and Lu used this to connect the bit-coded algorithm to the older algorithm by the following equation:\\ |
1139 Sulzmann and Lu used this to connect the bitcoded algorithm to the older algorithm by the following equation: |
1114 $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash_{simp} c) v)$\\ |
1140 \begin{center} |
1115 A little fact that needs to be stated to help comprehension:\\ |
1141 $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash_{simp} c) v)$ |
1116 $r^\uparrow = a$($a$ stands for $annotated$).\\ |
1142 \end{center} |
1117 Fahad and Christian also used this fact to prove the correctness of bit-coded algorithm without simplification. |
1143 A little fact that needs to be stated to help comprehension: |
|
1144 \begin{center} |
|
1145 $r^\uparrow = a$($a$ stands for $\textit{annotated}).$ |
|
1146 \end{center} |
|
1147 Ausaf and Urban also used this fact to prove the correctness of bitcoded algorithm without simplification. |
1118 Our purpose of using this, however, is try to establish \\ |
1148 Our purpose of using this, however, is try to establish \\ |
1119 $ \textit{retrieve} \; a \; v \;=\; \textit{retrieve} \; \textit{simp}(a) \; v'.$\\ |
1149 $ \textit{retrieve} \; a \; v \;=\; \textit{retrieve} \; \textit{simp}(a) \; v'.$\\ |
1120 The idea is that using $v'$, |
1150 The idea is that using $v'$, |
1121 a simplified version of $v$ that possibly had gone through the same simplification step as $\textit{simp}(a)$ we are still able to extract the bit-sequence that gives the same parsing information as the unsimplified one. |
1151 a simplified version of $v$ that possibly had gone through the same simplification step as $\textit{simp}(a)$ we are still able to extract the bit-sequence that gives the same parsing information as the unsimplified one. |
1122 After establishing this, we might be able to finally bridge the gap of proving\\ |
1152 After establishing this, we might be able to finally bridge the gap of proving\\ |
1123 $\textit{retrieve} \; r \backslash s \; v = \;\textit{retrieve} \; \textit{simp}(r) \backslash s \; v'$\\ |
1153 $\textit{retrieve} \; r \backslash s \; v = \;\textit{retrieve} \; \textit{simp}(r) \backslash s \; v'$\\ |
1124 and subsequently\\ |
1154 and subsequently\\ |
1125 $\textit{retrieve} \; r \backslash s \; v\; = \; \textit{retrieve} \; r \backslash_{simp} s \; v'$.\\ |
1155 $\textit{retrieve} \; r \backslash s \; v\; = \; \textit{retrieve} \; r \backslash_{simp} s \; v'$.\\ |
1126 This proves that our simplified version of regular expression still contains all the bitcodes needed. |
1156 This proves that our simplified version of regular expression still contains all the bitcodes needed. |
1127 |
1157 |
1128 \item |
1158 |
1129 The second task is to speed up the more aggressive simplification. |
1159 The second task is to speed up the more aggressive simplification. |
1130 Currently it is slower than a naive simplification(the naive version as |
1160 Currently it is slower than a naive simplification(the naive version as |
1131 implemented in ADU of course can explode in some cases). So it needs to |
1161 implemented in ADU of course can explode in some cases). So it needs to |
1132 be explored how to make it faster. Our possibility would be to explore |
1162 be explored how to make it faster. Our possibility would be to explore |
1133 again the connection to DFAs. This is very much work in progress. |
1163 again the connection to DFAs. This is very much work in progress. |
1134 \end{itemize} |
|
1135 |
1164 |
1136 \section{Conclusion} |
1165 \section{Conclusion} |
1137 |
1166 |
1138 In this PhD-project we are interested in fast algorithms for regular |
1167 In this PhD-project we are interested in fast algorithms for regular |
1139 expression matching. While this seems to be a ``settled'' area, in |
1168 expression matching. While this seems to be a ``settled'' area, in |