105 formally established their correctness. We have also not yet looked |
105 formally established their correctness. We have also not yet looked |
106 at extended regular expressions, such as bounded repetitions, |
106 at extended regular expressions, such as bounded repetitions, |
107 negation and back-references. |
107 negation and back-references. |
108 \end{abstract} |
108 \end{abstract} |
109 |
109 |
|
110 |
|
111 \section{Introduction} |
|
112 |
|
113 While we believe derivatives of regular expressions, written |
|
114 $r\backslash s$, are a beautiful concept (in terms of ease of |
|
115 implementing them in functional programming languages and in terms of |
|
116 reasoning about them formally), they have one major drawback: every |
|
117 derivative step can make regular expressions grow drastically in |
|
118 size. This in turn has negative effect on the runtime of the |
|
119 corresponding lexing algorithms. Consider for example the regular |
|
120 expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The |
|
121 corresponding derivative contains already 8668 nodes where we assume |
|
122 the derivative is given as a tree. The reason for the poor runtime of |
|
123 the derivative-based lexing algorithms is that they need to traverse |
|
124 such trees over and over again. The solution is to find a complete set |
|
125 of simplification rules that keep the sizes of derivatives uniformly |
|
126 small. |
|
127 |
|
128 |
110 \section{Recapitulation of Concepts From the Last Report} |
129 \section{Recapitulation of Concepts From the Last Report} |
111 |
130 |
112 \subsection*{Regular Expressions and Derivatives} |
131 \subsection*{Regular Expressions and Derivatives} |
113 |
132 |
114 Suppose (basic) regular expressions are given by the following grammar: |
133 Suppose (basic) regular expressions are given by the following grammar: |
733 will be reduced to just 6 and stays constant, no matter how long the |
752 will be reduced to just 6 and stays constant, no matter how long the |
734 input string is. |
753 input string is. |
735 |
754 |
736 |
755 |
737 |
756 |
738 \section{Introduction} |
757 \section{Current Work and Progress} |
739 |
|
740 While we believe derivatives of regular expressions, written |
|
741 $r\backslash s$, are a beautiful concept (in terms of ease of |
|
742 implementing them in functional programming languages and in terms of |
|
743 reasoning about them formally), they have one major drawback: every |
|
744 derivative step can make regular expressions grow drastically in |
|
745 size. This in turn has negative effect on the runtime of the |
|
746 corresponding lexing algorithms. Consider for example the regular |
|
747 expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The |
|
748 corresponding derivative contains already 8668 nodes where we assume |
|
749 the derivative is given as a tree. The reason for the poor runtime of |
|
750 the derivative-based lexing algorithms is that they need to traverse |
|
751 such trees over and over again. The solution is to find a complete set |
|
752 of simplification rules that keep the sizes of derivatives uniformly |
|
753 small. |
|
754 |
|
755 For reasons beyond this report, it turns out that a complete set of |
758 For reasons beyond this report, it turns out that a complete set of |
756 simplification rules depends on values being encoded as |
759 simplification rules depends on values being encoded as |
757 bitsequences.\footnote{Values are the results the lexing algorithms |
760 bitsequences.\footnote{Values are the results the lexing algorithms |
758 generate; they encode how a regular expression matched a string.} We |
761 generate; they encode how a regular expression matched a string.} We |
759 already know that the lexing algorithm using bitsequences but |
762 already know that the lexing algorithm using bitsequences but |