475 the state-of-the-art. |
475 the state-of-the-art. |
476 |
476 |
477 |
477 |
478 \section{Simplification of Regular Expressions} |
478 \section{Simplification of Regular Expressions} |
479 Using bit-codes to guide parsing is not a new idea. |
479 Using bit-codes to guide parsing is not a new idea. |
480 |
480 It was applied to context free grammars and then adapted by Henglein and Nielson for efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and Lu took a step further by intergrating bitcodes into derivatives. |
|
481 |
|
482 The argument for complicating the data structures from basic regular expressions to those with bitcodes |
|
483 is that we can introduce simplification without making the algorithm crash or impossible to reason about. |
|
484 The reason why we need simplification is due to the shortcoming of a naive algorithm using Brzozowski's definition only. |
481 The main drawback of building successive derivatives according to |
485 The main drawback of building successive derivatives according to |
482 Brzozowski's definition is that they can grow very quickly in size. |
486 Brzozowski's definition is that they can grow very quickly in size. |
483 This is mainly due to the fact that the derivative operation generates |
487 This is mainly due to the fact that the derivative operation generates |
484 often ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, |
488 often ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, |
485 if implemented naively both algorithms by Brzozowski and by Sulzmann |
489 if implemented naively both algorithms by Brzozowski and by Sulzmann |
512 |
516 |
513 %We believe, and have generated test |
517 %We believe, and have generated test |
514 %data, that a similar bound can be obtained for the derivatives in |
518 %data, that a similar bound can be obtained for the derivatives in |
515 %Sulzmann and Lu's algorithm. Let us give some details about this next. |
519 %Sulzmann and Lu's algorithm. Let us give some details about this next. |
516 |
520 |
517 We first followed Sulzmann and Lu's idea of introducing |
521 Bit-codes look like this: |
518 \emph{annotated regular expressions}~\cite{Sulzmann2014}. They are |
522 \[ b ::= S \mid Z \; \;\; |
519 defined by the following grammar: |
523 bs ::= [] \mid b:bs |
520 |
524 \] |
521 \begin{center} |
525 They are just a string of bits, the names "S" and "Z" here are kind of arbitrary, we can use 0 and 1 or binary symbol to substitute them. They are a compact form of parse trees. |
522 \begin{tabular}{lcl} |
526 Here is how values and bit-codes are related: |
523 $\textit{a}$ & $::=$ & $\textit{ZERO}$\\ |
527 Bitcodes are essentially incomplete values. |
524 & $\mid$ & $\textit{ONE}\;\;bs$\\ |
|
525 & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\ |
|
526 & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\ |
|
527 & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\ |
|
528 & $\mid$ & $\textit{STAR}\;\;bs\,a$ |
|
529 \end{tabular} |
|
530 \end{center} |
|
531 |
|
532 \noindent |
|
533 where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a |
|
534 list of annotated regular expressions. These bitsequences encode |
|
535 information about the (POSIX) value that should be generated by the |
|
536 Sulzmann and Lu algorithm. Bitcodes are essentially incomplete values. |
|
537 This can be straightforwardly seen in the following transformation: |
528 This can be straightforwardly seen in the following transformation: |
538 \begin{center} |
529 \begin{center} |
539 \begin{tabular}{lcl} |
530 \begin{tabular}{lcl} |
540 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ |
531 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ |
541 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ |
532 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ |
577 \textit{else}\;\textit{None}$ |
568 \textit{else}\;\textit{None}$ |
578 \end{tabular} |
569 \end{tabular} |
579 \end{center} |
570 \end{center} |
580 %\end{definition} |
571 %\end{definition} |
581 |
572 |
|
573 |
|
574 Sulzmann and Lu's integrated the bitcodes into annotated regular expressions by attaching them to the head of every substructure of a regularexpression\emph{annotated regular expressions}~\cite{Sulzmann2014}. They are |
|
575 defined by the following grammar: |
|
576 |
|
577 \begin{center} |
|
578 \begin{tabular}{lcl} |
|
579 $\textit{a}$ & $::=$ & $\textit{ZERO}$\\ |
|
580 & $\mid$ & $\textit{ONE}\;\;bs$\\ |
|
581 & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\ |
|
582 & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\ |
|
583 & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\ |
|
584 & $\mid$ & $\textit{STAR}\;\;bs\,a$ |
|
585 \end{tabular} |
|
586 \end{center} |
|
587 |
|
588 \noindent |
|
589 where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a |
|
590 list of annotated regular expressions. These bitsequences encode |
|
591 information about the (POSIX) value that should be generated by the |
|
592 Sulzmann and Lu algorithm. |
|
593 |
582 To do lexing using annotated regular expressions, we shall first transform the |
594 To do lexing using annotated regular expressions, we shall first transform the |
583 usual (un-annotated) regular expressions into annotated regular |
595 usual (un-annotated) regular expressions into annotated regular |
584 expressions:\\ |
596 expressions:\\ |
585 %\begin{definition} |
597 %\begin{definition} |
586 \begin{center} |
598 \begin{center} |
596 $(r^*)^\uparrow$ & $\dn$ & |
608 $(r^*)^\uparrow$ & $\dn$ & |
597 $\textit{STAR}\;[]\,r^\uparrow$\\ |
609 $\textit{STAR}\;[]\,r^\uparrow$\\ |
598 \end{tabular} |
610 \end{tabular} |
599 \end{center} |
611 \end{center} |
600 %\end{definition} |
612 %\end{definition} |
601 Then we do successive derivative operations on the annotated regular expression. This derivative operation is the same as what we previously have for the simple regular expressions, except that we take special care of the bits to store the parse tree information:\\ |
613 After internalise we do successive derivative operations on the annotated regular expression. |
|
614 Here $fuse$ is an auxiliary function that helps to attach bits to the front of an annotated regular expression. |
|
615 This derivative operation is the same as what we previously have for the simple regular expressions, except that we take special care of the bits to store the parse tree information:\\ |
602 %\begin{definition}{bder} |
616 %\begin{definition}{bder} |
603 \begin{center} |
617 \begin{center} |
604 \begin{tabular}{@{}lcl@{}} |
618 \begin{tabular}{@{}lcl@{}} |
605 $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
619 $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
606 $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
620 $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |