543 that the hole will be on $r_a$. So we recursively call $\inj\, |
543 that the hole will be on $r_a$. So we recursively call $\inj\, |
544 r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value |
544 r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value |
545 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$. |
545 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$. |
546 Other clauses can be understood in a similar way. |
546 Other clauses can be understood in a similar way. |
547 |
547 |
548 Let us do some further examples involving $\inj$: Suppose we have the |
548 The following example gives a taste of $\textit{inj}$'s effect |
549 regular expression $((((a+b)+ab)+c)+abc)^*$ and want to match it against |
549 and how Sulzmann and Lu's algorithm works as a whole. |
|
550 Suppose we have a |
|
551 regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it against |
550 the string $abc$ (when $abc$ is written as a regular expression, the most |
552 the string $abc$ (when $abc$ is written as a regular expression, the most |
551 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit |
553 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit |
552 the parentheses and dots here for readability). By POSIX rules the algorithm |
554 the parentheses and dots here for readability). |
553 should go for the longest matching, i.e. it should match the string |
555 This algorithm returns a POSIX value, which means it |
554 $abc$ in one star iteration, using the longest string $abc$ in the |
556 will go for the longest matching, i.e.~it should match the string |
555 sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this sub-expression |
557 $abc$ in one star iteration, using the longest alternative $abc$ in the |
|
558 sub-expression $((((a+b)+ab)+c)+abc)$ (we use $r$ to denote this sub-expression |
556 for conciseness). |
559 for conciseness). |
557 Here is how the lexer achieves a parse tree for this |
560 Before $\textit{inj}$ comes into play, |
558 matching. First, build successive derivatives until we exhaust the |
561 our lexer first builds derivative using string $abc$ (we simplified some regular expressions like |
559 string, as illustrated here (we simplified some regular expressions like |
562 $0 \cdot b$ to $0$ for conciseness; we also omit parentheses if |
560 $0 \cdot b$ to $0$ for conciseness; we also omit some parentheses if |
|
561 they are clear from the context): |
563 they are clear from the context): |
562 %Similarly, we allow |
564 %Similarly, we allow |
563 %$\textit{ALT}$ to take a list of regular expressions as an argument |
565 %$\textit{ALT}$ to take a list of regular expressions as an argument |
564 %instead of just 2 operands to reduce the nested depth of |
566 %instead of just 2 operands to reduce the nested depth of |
565 %$\textit{ALT}$ |
567 %$\textit{ALT}$ |
571 & & $\phantom{r_3 = (} ((0+1+0 + 0 + 0) \cdot r^* + (0+0+0 + 1 + 0) \cdot r^* )$ |
573 & & $\phantom{r_3 = (} ((0+1+0 + 0 + 0) \cdot r^* + (0+0+0 + 1 + 0) \cdot r^* )$ |
572 \end{tabular} |
574 \end{tabular} |
573 \end{center} |
575 \end{center} |
574 |
576 |
575 \noindent |
577 \noindent |
576 Now instead when $nullable$ gives a $yes$ on $r_3$, we call $mkeps$ |
578 Now when $nullable$ gives a $yes$ on $r_3$, we call $mkeps$ |
577 to construct a parse tree for how $r_3$ matched the string $abc$. |
579 to construct a parse tree for how $r_3$ matched the string $abc$. |
578 $mkeps$ gives the following value $v_3$: |
580 $mkeps$ gives the following value $v_3$: |
579 \[$Left(Left(Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])))$\] |
581 \begin{center} |
580 The outer 2 $Left$ tells us the first nullable part hides in the term |
582 $\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$ |
581 \[((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* \] |
583 \end{center} |
582 in \[r_3 = ( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0 + 1 + 0) |
584 The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$(underlined): |
583 \cdot r^*) +((0+1+0 + 0 + 0) \cdot r^*+(0+0+0 + 1 + 0) \cdot r^* ) |
585 \begin{center} |
584 \](underlined). Note that the leftmost location of term \[((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* \] |
586 $( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0 + 1 + 0) |
|
587 \cdot r^*) +((0+1+0 + 0 + 0) \cdot r^*+(0+0+0 + 1 + 0) \cdot r^* ).$ |
|
588 |
|
589 \end{center} |
|
590 Note that the leftmost location of term $((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*$ |
585 (which corresponds to the initial sub-match $abc$) |
591 (which corresponds to the initial sub-match $abc$) |
586 allows $mkeps$ to choose it as the first candidate that meets the requirement of being $nullable$ |
592 allows $mkeps$ to pick it up |
587 because $mkeps$ is defined to always choose the left one when nullable. |
593 because $mkeps$ is defined to always choose the left one when it is nullable. |
588 In the case of this example, $abc$ is preferred over $a$ or $ab$. |
594 In the case of this example, $abc$ is preferred over $a$ or $ab$. |
589 This location is naturally generated by the splitting clause\[ |
595 This $\Left(\Left(\ldots))$ location is naturally generated by |
590 (r_1 \cdot r_2)\backslash c (when \; r_1 \; nullable)) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c\]. |
596 two applications of the splitting clause |
|
597 \begin{center} |
|
598 $(r_1 \cdot r_2)\backslash c (when \; r_1 \; nullable) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.$ |
|
599 \end{center} |
591 By this clause, we put |
600 By this clause, we put |
592 $r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. |
601 $r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. |
593 This allows $mkeps$ to always pick up among two matches the one with a longer initial sub-match. |
602 This allows $mkeps$ to always pick up among two matches the one with a longer initial sub-match. |
594 The inside subvalue |
603 Removing the outside $\Left(\Left(...))$, the inside sub-value |
595 \[$Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])$\] |
604 \begin{center} |
596 tells us how about the empty string matches |
605 $\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$ |
597 \[(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* \], |
606 \end{center} |
598 a sequence of 2 empty regular expressions, the first one is an alternative, we take the rightmost alternative |
607 tells us how the empty string $[]$ is matched with $(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*$. |
599 to get that empty regular expression. The second one is a star, which iterates 0 times to form the second |
608 We match $[]$ by a sequence of 2 nullable regular expressions. |
600 empty regular expression. |
609 The first one is an alternative, we take the rightmost alternative---whose language |
601 |
610 contains the empty string. The second nullable regular |
|
611 expression is a Kleene star. $\Stars$ tells us how it |
|
612 generates the nullable regular expression: by 0 iterations to form $\epsilon$. |
|
613 Now $\textit{inj}$ injects characters back and |
|
614 incrementally builds a parse tree based on $v_3$. |
602 Using the value $v_3$, the character c, and the regular expression $r_2$, |
615 Using the value $v_3$, the character c, and the regular expression $r_2$, |
603 we can recover how $r_2$ matched the string $[c]$ : |
616 we can recover how $r_2$ matched the string $[c]$ : |
604 $\textit{inj} r_2 c v_3$ gives us\[ v_2 = Left(Seq(Right(Seq(Empty, Seq(Empty, c))), Stars []))\], |
617 $\textit{inj} \; r_2 \; c \; v_3$ gives us |
605 which tells us how $r_2$ matched $c$. After this we inject back the character $b$, and get |
618 \begin{center} |
606 \[ v_1 = Seq(Right(Seq(Empty, Seq(b, c))), Stars [])\] |
619 $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$ |
|
620 \end{center} |
|
621 which tells us how $r_2$ matched $[c]$. After this we inject back the character $b$, and get |
|
622 \begin{center} |
|
623 $v_1 = \Seq(\Right(\Seq(\Empty, \Seq(b, c))), \Stars [])$ |
|
624 \end{center} |
607 for how |
625 for how |
608 \[r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*\] |
626 \begin{center} |
|
627 $r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*$ |
|
628 \end{center} |
609 matched the string $bc$ before it split into 2 pieces. |
629 matched the string $bc$ before it split into 2 pieces. |
610 Finally, after injecting character $a$ back to $v_1$, |
630 Finally, after injecting character $a$ back to $v_1$, |
611 we get the parse tree $v_0= Stars [Right(Seq(a, Seq(b, c)))]$ for how $r$ matched $abc$. |
631 we get the parse tree |
|
632 \begin{center} |
|
633 $v_0= \Stars [\Right(\Seq(a, \Seq(b, c)))]$ |
|
634 \end{center} |
|
635 for how $r$ matched $abc$. This completes the algorithm. |
|
636 |
612 %We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. |
637 %We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. |
613 Readers might have noticed that the parse tree information |
638 Readers might have noticed that the parse tree information |
614 is actually already available when doing derivatives. |
639 is actually already available when doing derivatives. |
615 For example, immediately after the operation $\backslash a$ we know that if we want to match a string that starts with $a$, |
640 For example, immediately after the operation $\backslash a$ we know that if we want to match a string that starts with $a$, |
616 we can either take the initial match to be |
641 we can either take the initial match to be |
|
642 \begin{center} |
617 \begin{enumerate} |
643 \begin{enumerate} |
618 \item[1)] just $a$ or |
644 \item[1)] just $a$ or |
619 \item[2)] string $ab$ or |
645 \item[2)] string $ab$ or |
620 \item[3)] string $abc$. |
646 \item[3)] string $abc$. |
621 \end{enumerate} |
647 \end{enumerate} |
|
648 \end{center} |
622 In order to differentiate between these choices, |
649 In order to differentiate between these choices, |
623 we just need to remember their positions--$a$ is on the left, $ab$ is in the middle , and $abc$ is on the right. |
650 we just need to remember their positions--$a$ is on the left, $ab$ is in the middle , and $abc$ is on the right. |
624 Which one of these alternatives is chosen later does not affect their relative position because our algorithm |
651 Which one of these alternatives is chosen later does not affect their relative position because our algorithm |
625 does not change this order. If this parsing information can be determined and does |
652 does not change this order. If this parsing information can be determined and does |
626 not change because of later derivatives, |
653 not change because of later derivatives, |
627 there is no point in traversing this information twice. This leads to a new approach of lexing-- if we store the information for parse trees in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and calling $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an n-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives. |
654 there is no point in traversing this information twice. This leads to an optimization---if we store the information for parse trees in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and call $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an $n$-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives. |
628 |
655 |
629 In the next section, we shall focus on the bit-coded algorithm and the |
656 In the next section, we shall focus on the bit-coded algorithm and the |
630 process of simplification of regular expressions. This is needed in |
657 process of simplification of regular expressions. This is needed in |
631 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann |
658 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann |
632 and Lu's algorithms. This is where the PhD-project aims to advance the |
659 and Lu's algorithms. This is where the PhD-project aims to advance the |