455 This corresponds to the leftmost term $((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* $ in $r_3$. Note that its leftmost location allows $mkeps$ to choose it as the first candidate that meets the requirement of being $nullable$. This location is naturally generated by the splitting clause\\ $(r_1 \cdot r_2)\backslash c (when \, r_1 \, nullable)) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c. \\$. By this clause, we put |
456 This corresponds to the leftmost term $((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* $ in $r_3$. Note that its leftmost location allows $mkeps$ to choose it as the first candidate that meets the requirement of being $nullable$. This location is naturally generated by the splitting clause\\ $(r_1 \cdot r_2)\backslash c (when \, r_1 \, nullable)) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c. \\$. By this clause, we put |
456 $r_1 \backslash c \cdot r_2 $ at the front and $r_2 \backslash c$ at the back. This allows $mkeps$ to always pick up among two matches the one with a longer prefix. The value \\ |
457 $r_1 \backslash c \cdot r_2 $ at the front and $r_2 \backslash c$ at the back. This allows $mkeps$ to always pick up among two matches the one with a longer prefix. The value \\ |
457 $Left(Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, Empty)))))), Stars []))$\\ |
458 $Left(Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, Empty)))))), Stars []))$\\ |
458 tells us how about the empty string matches the final regular expression after doing all the derivatives: among the regular expressions $(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0 + 1 + 0) \cdot r*) +((0+1+0 + 0 + 0) \cdot r*+(0+0+0 + 1 + 0) \cdot r* )$ we choose the left most nullable one, which is composed of a sequence of a nested alternative and a folded star that iterates 0 times. In that nested alternative we take the rightmost alternative. |
459 tells us how about the empty string matches the final regular expression after doing all the derivatives: among the regular expressions $(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0 + 1 + 0) \cdot r*) +((0+1+0 + 0 + 0) \cdot r*+(0+0+0 + 1 + 0) \cdot r* )$ we choose the left most nullable one, which is composed of a sequence of a nested alternative and a folded star that iterates 0 times. In that nested alternative we take the rightmost alternative. |
459 |
460 |
460 Using the value $v_3$, the character c, and the regular expression $r_2$, we can recover how $r_2$ matched the string $[c]$ : we inject $c$ back to $v_3$, and get $v_2 = Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, c)))))), Stars [])$, which tells us how $r_2$ matched $c$. |
461 Using the value $v_3$, the character c, and the regular expression $r_2$, we can recover how $r_2$ matched the string $[c]$ : we inject $c$ back to $v_3$, and get \\ $v_2 = Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, c)))))), Stars [])$, which tells us how $r_2$ matched $c$. After this we inject back the character $b$, and get\\ $v_1 = Seq(Right(Right(Right(Seq(Empty, Seq(b, c)))))), Stars [])$ for how $r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*$ matched the string $bc$ before it split into 2 pieces. Finally, after injecting character a back to $v_1$, we get the parse tree $v_0= Stars [Right(Right(Right(Seq(a, Seq(b, c)))))]$ for how r matched $abc$. |
461 We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. Rather, we shall focus next on the |
462 We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. |
462 process of simplification of regular expressions, which is needed in |
463 Readers might have noticed that the parse tree information as actually already available when doing derivatives. For example, immediately after the operation $\backslash a$ we know that if we want to match a string that starts with a, we can either take the initial match to be |
|
464 \begin{enumerate} |
|
465 \item[1)] just $a$ or |
|
466 \item[2)] string $ab$ or |
|
467 \item[3)] string $abc$. |
|
468 \end{enumerate} |
|
469 In order to differentiate between these choices, we just need to remember their positions--$a$ is on the left, $ab$ is in the middle , and $abc$ is on the right. Which one of these alternatives is chosen later does not affect their relative position because our algorithm does not change this order. There is no need to traverse this information twice. This leads to a new approach of lexing-- if we store the information for parse trees in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when fininshed with derivatives and calling $mkeps$ for deiciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an n-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives. |
|
470 |
|
471 In the next section, we shall focus on the bit-coded algorithm and the natural |
|
472 process of simplification of regular expressions using bit-codes, which is needed in |
463 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann |
473 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann |
464 and Lu's algorithms. This is where the PhD-project hopes to advance |
474 and Lu's algorithms. This is where the PhD-project hopes to advance |
465 the state-of-the-art. |
475 the state-of-the-art. |
466 |
476 |
467 |
477 |
468 \section{Simplification of Regular Expressions} |
478 \section{Simplification of Regular Expressions} |
|
479 Using bit-codes to guide parsing is not a new idea. |
469 |
480 |
470 The main drawback of building successive derivatives according to |
481 The main drawback of building successive derivatives according to |
471 Brzozowski's definition is that they can grow very quickly in size. |
482 Brzozowski's definition is that they can grow very quickly in size. |
472 This is mainly due to the fact that the derivative operation generates |
483 This is mainly due to the fact that the derivative operation generates |
473 often ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, |
484 often ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, |