\begin{enumerate}
\item[1)] just $a$ or
\item[2)] string $ab$ or
\item[3)] string $abc$.
\end{enumerate}
In order to differentiate between these choices, we just need to remember their positions: $a$ is on the left, $ab$ is in the middle, and $abc$ is on the right. Which one of these alternatives is chosen later does not affect their relative position, because our algorithm does not change this order. There is no need to traverse this information twice. This leads to a new approach to lexing: if we store the information for parse trees in the corresponding pieces of the regular expression, update this information when we take derivatives, and collect it by calling $mkeps$ to decide which branch is POSIX once the derivatives are finished, then we can generate the parse tree in one pass, instead of performing an $n$-step backward transformation. This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives.
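As a rough illustration of this idea (a sketch only, not the exact definitions given later in this report), the following Scala fragment attaches a list of bits to every regular expression, keeps the bits up to date inside the derivative operation, and collects them at the end. The names and the conventions used here, in particular that \texttt{Z} records a left choice or a further star iteration and \texttt{S} a right choice or the end of a star, are assumptions made for this sketch only. For the three alternatives above, for instance, lexing the string $ab$ produces a bit sequence that identifies the middle branch.
\begin{verbatim}
// A rough sketch of bit-coded derivatives; names and bit conventions are
// assumptions of this illustration, not this report's exact definitions.
object BitcodedSketch {
  sealed trait Bit
  case object Z extends Bit      // left choice / one more star iteration
  case object S extends Bit      // right choice / end of a star

  // ordinary regular expressions
  sealed trait Rexp
  case object ZERO extends Rexp
  case object ONE extends Rexp
  case class CHAR(c: Char) extends Rexp
  case class ALT(r1: Rexp, r2: Rexp) extends Rexp
  case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
  case class STAR(r: Rexp) extends Rexp

  // annotated regular expressions: every node carries a list of bits
  sealed trait ARexp
  case object AZERO extends ARexp
  case class AONE(bs: List[Bit]) extends ARexp
  case class ACHAR(bs: List[Bit], c: Char) extends ARexp
  case class AALT(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
  case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
  case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp

  // prepend bits onto the annotation of an annotated regular expression
  def fuse(bs: List[Bit], r: ARexp): ARexp = r match {
    case AZERO            => AZERO
    case AONE(cs)         => AONE(bs ++ cs)
    case ACHAR(cs, c)     => ACHAR(bs ++ cs, c)
    case AALT(cs, r1, r2) => AALT(bs ++ cs, r1, r2)
    case ASEQ(cs, r1, r2) => ASEQ(bs ++ cs, r1, r2)
    case ASTAR(cs, r1)    => ASTAR(bs ++ cs, r1)
  }

  // attach bits: alternatives record their choice (Z left, S right) up front
  def internalise(r: Rexp): ARexp = r match {
    case ZERO        => AZERO
    case ONE         => AONE(Nil)
    case CHAR(c)     => ACHAR(Nil, c)
    case ALT(r1, r2) => AALT(Nil, fuse(List(Z), internalise(r1)),
                                  fuse(List(S), internalise(r2)))
    case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
    case STAR(r1)    => ASTAR(Nil, internalise(r1))
  }

  def bnullable(r: ARexp): Boolean = r match {
    case AZERO           => false
    case AONE(_)         => true
    case ACHAR(_, _)     => false
    case AALT(_, r1, r2) => bnullable(r1) || bnullable(r2)
    case ASEQ(_, r1, r2) => bnullable(r1) && bnullable(r2)
    case ASTAR(_, _)     => true
  }

  // collect the bits along a nullable path (only called on nullable regexes)
  def bmkeps(r: ARexp): List[Bit] = r match {
    case AONE(bs)         => bs
    case AALT(bs, r1, r2) => bs ++ (if (bnullable(r1)) bmkeps(r1) else bmkeps(r2))
    case ASEQ(bs, r1, r2) => bs ++ bmkeps(r1) ++ bmkeps(r2)
    case ASTAR(bs, _)     => bs ++ List(S)   // S marks the end of the star
    case _                => sys.error("bmkeps: regex not nullable")
  }

  // derivative that keeps the bit annotations up to date
  def bder(c: Char, r: ARexp): ARexp = r match {
    case AZERO        => AZERO
    case AONE(_)      => AZERO
    case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
    case AALT(bs, r1, r2) => AALT(bs, bder(c, r1), bder(c, r2))
    case ASEQ(bs, r1, r2) =>
      if (bnullable(r1))
        AALT(bs, ASEQ(Nil, bder(c, r1), r2), fuse(bmkeps(r1), bder(c, r2)))
      else ASEQ(bs, bder(c, r1), r2)
    case ASTAR(bs, r1) => ASEQ(bs, fuse(List(Z), bder(c, r1)), ASTAR(Nil, r1))
  }

  // one pass: internalise, build derivatives, then collect the bits
  // (decoding the bits back into a parse tree is omitted in this sketch)
  def blex(r: Rexp, s: List[Char]): List[Bit] = {
    val fin = s.foldLeft(internalise(r))((a, c) => bder(c, a))
    if (bnullable(fin)) bmkeps(fin) else sys.error("no match")
  }
}
\end{verbatim}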
In the next section, we shall focus on the bit-coded algorithm and the natural
process of simplification of regular expressions using bit-codes, which is needed in
order to obtain \emph{fast} versions of Brzozowski's and of Sulzmann
and Lu's algorithms. This is where the PhD-project hopes to advance
the state-of-the-art.
\section{Simplification of Regular Expressions}
Using bit-codes to guide parsing is not a new idea.
It was applied to context-free grammars and then adapted by Henglein and Nielson for efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and Lu took this a step further by integrating bitcodes into derivatives.

The argument for complicating the data structures from basic regular expressions to those with bitcodes
is that we can introduce simplification without making the algorithm crash or impossible to reason about.
The reason why we need simplification is the shortcoming of a naive algorithm that uses Brzozowski's definition only.
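This shortcoming is easy to observe on a small example: the self-contained Scala sketch below (its names and the test expression $(a + aa)^*$ are chosen purely for illustration) implements Brzozowski's derivative without any simplification and prints the sizes of the successive derivatives with respect to the character $a$, which grow quickly even for this tiny expression.
\begin{verbatim}
// A small illustration (not this report's code): sizes of plain Brzozowski
// derivatives, with no simplification, for the regular expression (a + aa)*.
object GrowthDemo {
  sealed trait Rexp
  case object ZERO extends Rexp
  case object ONE extends Rexp
  case class CHAR(c: Char) extends Rexp
  case class ALT(r1: Rexp, r2: Rexp) extends Rexp
  case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
  case class STAR(r: Rexp) extends Rexp

  def nullable(r: Rexp): Boolean = r match {
    case ZERO        => false
    case ONE         => true
    case CHAR(_)     => false
    case ALT(r1, r2) => nullable(r1) || nullable(r2)
    case SEQ(r1, r2) => nullable(r1) && nullable(r2)
    case STAR(_)     => true
  }

  // Brzozowski derivative, with no simplification at all
  def der(c: Char, r: Rexp): Rexp = r match {
    case ZERO        => ZERO
    case ONE         => ZERO
    case CHAR(d)     => if (c == d) ONE else ZERO
    case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
    case SEQ(r1, r2) =>
      if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
      else SEQ(der(c, r1), r2)
    case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
  }

  // number of nodes in a regular expression
  def size(r: Rexp): Int = r match {
    case ZERO | ONE | CHAR(_) => 1
    case ALT(r1, r2)          => 1 + size(r1) + size(r2)
    case SEQ(r1, r2)          => 1 + size(r1) + size(r2)
    case STAR(r1)             => 1 + size(r1)
  }

  def main(args: Array[String]): Unit = {
    // repeatedly derive (a + aa)* by 'a' and print how big the terms get
    var r: Rexp = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
    for (i <- 0 to 10) {
      println(s"after $i derivatives: size ${size(r)}")
      r = der('a', r)
    }
  }
}
\end{verbatim}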
The main drawback of building successive derivatives according to
$bs \,@\, [\S]$
\end{tabular}
\end{center}
%\end{definition}
This function completes the parse tree information by
travelling along the path of the regular expression that corresponds to a POSIX value, collecting all the bits, and
using $\S$ to indicate the end of star iterations. If we take the bits produced by $bmkeps$ and decode them,
we get the parse tree we need. The workflow looks like this:\\
\begin{center}
\begin{tabular}{lcl}
$\textit{blexer}\;r\,s$ & $\dn$ &