577 By this clause, we put |
577 By this clause, we put |
578 $r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. |
578 $r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. |
579 This allows $mkeps$ to always pick up among two matches the one with a longer initial sub-match. |
579 This allows $mkeps$ to always pick up among two matches the one with a longer initial sub-match. |
580 The inside subvalue |
580 The inside subvalue |
581 \[$Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])$\] |
581 \[$Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])$\] |
582 tells us how about the empty string matches : among the regular expressions in \\$(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0 + 1 + 0) \cdot r*) +((0+1+0 + 0 + 0) \cdot r*+(0+0+0 + 1 + 0) \cdot r* )$, \\ |
582 tells us how about the empty string matches |
583 we choose the left most nullable one, which is composed of a sequence of an alternative and a star. In that alternative $0+0+0 + 0 + 1 \cdot 1 \cdot 1$ we take the rightmost choice. |
583 \[(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* \], |
584 |
584 a sequence of 2 empty regular expressions, the first one is an alternative, we take the rightmost alternative |
585 Using the value $v_3$, the character c, and the regular expression $r_2$, we can recover how $r_2$ matched the string $[c]$ : we inject $c$ back to $v_3$, and get \\ $v_2 = Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, c)))))), Stars [])$, \\ |
585 to get that empty regular expression. The second one is a star, which iterates 0 times to form the second |
586 which tells us how $r_2$ matched $c$. After this we inject back the character $b$, and get\\ $v_1 = Seq(Right(Right(Right(Seq(Empty, Seq(b, c)))))), Stars [])$ for how $r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*$ matched the string $bc$ before it split into 2 pieces. Finally, after injecting character a back to $v_1$, we get the parse tree $v_0= Stars [Right(Right(Right(Seq(a, Seq(b, c)))))]$ for how r matched $abc$. |
586 empty regular expression. |
587 We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. |
587 |
588 Readers might have noticed that the parse tree information as actually already available when doing derivatives. For example, immediately after the operation $\backslash a$ we know that if we want to match a string that starts with a, we can either take the initial match to be |
588 Using the value $v_3$, the character c, and the regular expression $r_2$, |
|
589 we can recover how $r_2$ matched the string $[c]$ : |
|
590 $\textit{inj} r_2 c v_3$ gives us\[ v_2 = Left(Seq(Right(Seq(Empty, Seq(Empty, c))), Stars []))\], |
|
591 which tells us how $r_2$ matched $c$. After this we inject back the character $b$, and get |
|
592 \[ v_1 = Seq(Right(Seq(Empty, Seq(b, c))), Stars [])\] |
|
593 for how |
|
594 \[r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*\] |
|
595 matched the string $bc$ before it split into 2 pieces. |
|
596 Finally, after injecting character $a$ back to $v_1$, |
|
597 we get the parse tree $v_0= Stars [Right(Seq(a, Seq(b, c)))]$ for how $r$ matched $abc$. |
|
598 %We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. |
|
599 Readers might have noticed that the parse tree information |
|
600 is actually already available when doing derivatives. |
|
601 For example, immediately after the operation $\backslash a$ we know that if we want to match a string that starts with $a$, |
|
602 we can either take the initial match to be |
589 \begin{enumerate} |
603 \begin{enumerate} |
590 \item[1)] just $a$ or |
604 \item[1)] just $a$ or |
591 \item[2)] string $ab$ or |
605 \item[2)] string $ab$ or |
592 \item[3)] string $abc$. |
606 \item[3)] string $abc$. |
593 \end{enumerate} |
607 \end{enumerate} |
594 In order to differentiate between these choices, we just need to remember their positions--$a$ is on the left, $ab$ is in the middle , and $abc$ is on the right. Which one of these alternatives is chosen later does not affect their relative position because our algorithm does not change this order. There is no need to traverse this information twice. This leads to a new approach of lexing-- if we store the information for parse trees in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and calling $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an n-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives. |
608 In order to differentiate between these choices, |
|
609 we just need to remember their positions--$a$ is on the left, $ab$ is in the middle , and $abc$ is on the right. |
|
610 Which one of these alternatives is chosen later does not affect their relative position because our algorithm |
|
611 does not change this order. If this parsing information can be determined and does |
|
612 not change because of later derivatives, |
|
613 there is no point in traversing this information twice. This leads to a new approach of lexing-- if we store the information for parse trees in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and calling $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an n-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives. |
595 |
614 |
596 In the next section, we shall focus on the bit-coded algorithm and the natural |
615 In the next section, we shall focus on the bit-coded algorithm and the natural |
597 process of simplification of regular expressions using bit-codes, which is needed in |
616 process of simplification of regular expressions using bit-codes, which is needed in |
598 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann |
617 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann |
599 and Lu's algorithms. This is where the PhD-project hopes to advance |
618 and Lu's algorithms. This is where the PhD-project hopes to advance |
603 \section{Simplification of Regular Expressions} |
622 \section{Simplification of Regular Expressions} |
604 Using bit-codes to guide parsing is not a new idea. |
623 Using bit-codes to guide parsing is not a new idea. |
605 It was applied to context free grammars and then adapted by Henglein and Nielson for efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and Lu took a step further by integrating bitcodes into derivatives. |
624 It was applied to context free grammars and then adapted by Henglein and Nielson for efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and Lu took a step further by integrating bitcodes into derivatives. |
606 |
625 |
607 The argument for complicating the data structures from basic regular expressions to those with bitcodes |
626 The argument for complicating the data structures from basic regular expressions to those with bitcodes |
608 is that we can introduce simplification without making the algorithm crash or impossible to reason about. |
627 is that we can introduce simplification without making the algorithm crash or overly complex to reason about. |
609 The reason why we need simplification is due to the shortcoming of a naive algorithm using Brzozowski's definition only. |
628 The reason why we need simplification is due to the shortcoming of the previous algorithm. |
610 The main drawback of building successive derivatives according to |
629 The main drawback of building successive derivatives according to |
611 Brzozowski's definition is that they can grow very quickly in size. |
630 Brzozowski's definition is that they can grow very quickly in size. |
612 This is mainly due to the fact that the derivative operation generates |
631 This is mainly due to the fact that the derivative operation generates |
613 often ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, |
632 often ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, |
614 if implemented naively both algorithms by Brzozowski and by Sulzmann |
633 if implemented naively both algorithms by Brzozowski and by Sulzmann |