100 at extended regular expressions, such as bounded repetitions, |
100 at extended regular expressions, such as bounded repetitions, |
101 negation and back-references. |
101 negation and back-references. |
102 \end{abstract} |
102 \end{abstract} |
103 |
103 |
104 \section{Recapitulation of Concepts From the Last Report} |
104 \section{Recapitulation of Concepts From the Last Report} |
105 \subsection{The Algorithm by Brzozowski based on Derivatives of Regular |
105 |
106 Expressions} |
106 \subsection*{Regular Expressions and Derivatives} |
107 |
107 |
108 Suppose (basic) regular expressions are given by the following grammar: |
108 Suppose (basic) regular expressions are given by the following grammar: |
|
109 |
109 \[ r ::= \ZERO \mid \ONE |
110 \[ r ::= \ZERO \mid \ONE |
110 \mid c |
111 \mid c |
111 \mid r_1 \cdot r_2 |
112 \mid r_1 \cdot r_2 |
112 \mid r_1 + r_2 |
113 \mid r_1 + r_2 |
113 \mid r^* |
114 \mid r^* |
114 \] |
115 \] |
115 |
116 |
116 \noindent |
117 \noindent |
117 |
118 The ingenious contribution of Brzozowski is the notion of \emph{derivatives} of |
118 The ingenious contribution by Brzozowski is the notion of |
119 regular expressions, written~$\_ \backslash \_$. It uses the auxiliary notion of |
119 \emph{derivatives} of regular expressions. |
120 $\nullable$ defined below. |
|
121 |
120 \begin{center} |
122 \begin{center} |
121 \begin{tabular}{lcl} |
123 \begin{tabular}{lcl} |
122 $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ |
124 $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ |
123 $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ |
125 $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ |
124 $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ |
126 $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ |
152 $c\!::\!s \in L(r)$ holds |
154 $c\!::\!s \in L(r)$ holds |
153 if and only if $s \in L(r\backslash c)$. |
155 if and only if $s \in L(r\backslash c)$. |
154 \end{center} |
156 \end{center} |
155 |
157 |
156 \noindent |
158 \noindent |
157 |
159 We can generalise the derivative operation shown above for single characters |
158 |
160 to strings as follows: |
159 Now we can generalise the derivative operation to strings like this: |
|
160 |
161 |
161 \begin{center} |
162 \begin{center} |
162 \begin{tabular}{lcl} |
163 \begin{tabular}{lcl} |
163 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\ |
164 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\ |
164 $r \backslash [\,] $ & $\dn$ & $r$ |
165 $r \backslash [\,] $ & $\dn$ & $r$ |
165 \end{tabular} |
166 \end{tabular} |
166 \end{center} |
167 \end{center} |
167 |
168 |
168 \noindent |
169 \noindent |
169 and then define as regular-expression matching algorithm: |
170 and then define Brzozowski's regular-expression matching algorithm as: |
|
171 |
170 \[ |
172 \[ |
171 match\;s\;r \;\dn\; nullable(r\backslash s) |
173 match\;s\;r \;\dn\; nullable(r\backslash s) |
172 \] |
174 \] |
173 |
175 |
174 \noindent |
176 \noindent |
175 This algorithm looks graphically as follows: |
177 Assuming the a string is givane as a sequence of characters, say $c_0c_1..c_n$, |
|
178 this algorithm presented graphically is as follows: |
|
179 |
176 \begin{equation}\label{graph:*} |
180 \begin{equation}\label{graph:*} |
177 \begin{tikzcd} |
181 \begin{tikzcd} |
178 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO} |
182 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO} |
179 \end{tikzcd} |
183 \end{tikzcd} |
180 \end{equation} |
184 \end{equation} |
185 to test whether the result can match the empty string. It can be |
189 to test whether the result can match the empty string. It can be |
186 relatively easily shown that this matcher is correct (that is given |
190 relatively easily shown that this matcher is correct (that is given |
187 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$). |
191 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$). |
188 |
192 |
189 |
193 |
190 \subsection{Values and the Algorithm by Sulzmann and Lu} |
194 \subsection*{Values and the Lexing Algorithm by Sulzmann and Lu} |
191 |
195 |
192 One limitation of Brzozowski's algorithm is that it only produces a |
196 One limitation of Brzozowski's algorithm is that it only produces a |
193 YES/NO answer for whether a string is being matched by a regular |
197 YES/NO answer for whether a string is being matched by a regular |
194 expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm |
198 expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm |
195 to allow generation of an actual matching, called a \emph{value} or |
199 to allow generation of an actual matching, called a \emph{value} or |
223 \end{tabular} |
227 \end{tabular} |
224 \end{tabular} |
228 \end{tabular} |
225 \end{center} |
229 \end{center} |
226 |
230 |
227 \noindent |
231 \noindent |
228 |
|
229 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
232 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
230 algorithm by a second phase (the first phase being building successive |
233 algorithm by a second phase (the first phase being building successive |
231 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value |
234 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value |
232 is generated in case the regular expression matches the string. |
235 is generated in case the regular expression matches the string. |
233 Pictorially, the Sulzmann and Lu algorithm is as follows: |
236 Pictorially, the Sulzmann and Lu algorithm is as follows: |
300 |
303 |
301 \noindent This definition is by recursion on the ``shape'' of regular |
304 \noindent This definition is by recursion on the ``shape'' of regular |
302 expressions and values. |
305 expressions and values. |
303 |
306 |
304 |
307 |
305 \subsection{Simplification of Regular Expressions} |
308 \subsection*{Simplification of Regular Expressions} |
306 |
309 |
307 The main drawback of building successive derivatives according |
310 The main drawback of building successive derivatives according |
308 to Brzozowski's definition is that they can grow very quickly in size. |
311 to Brzozowski's definition is that they can grow very quickly in size. |
309 This is mainly due to the fact that the derivative operation generates |
312 This is mainly due to the fact that the derivative operation generates |
310 often ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, if |
313 often ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, if |
329 |
332 |
330 If we want the size of derivatives in Sulzmann and Lu's algorithm to |
333 If we want the size of derivatives in Sulzmann and Lu's algorithm to |
331 stay below this bound, we would need more aggressive simplifications. |
334 stay below this bound, we would need more aggressive simplifications. |
332 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as |
335 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as |
333 deleting duplicates whenever possible. For example, the parentheses in |
336 deleting duplicates whenever possible. For example, the parentheses in |
334 $(a+b) \cdot c + bc$ can be opened up to get $a\cdot c + b \cdot c + b |
337 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b |
335 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another |
338 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another |
336 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just |
339 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just |
337 $a^*+a+\ONE$. Adding these more aggressive simplification rules helps us |
340 $a^*+a+\ONE$. Adding these more aggressive simplification rules help us |
338 to achieve the same size bound as that of the partial derivatives. |
341 to achieve the same size bound as that of the partial derivatives. |
339 |
342 |
340 In order to implement the idea of ``spilling out alternatives'' and to |
343 In order to implement the idea of ``spilling out alternatives'' and to |
341 make them compatible with the $\text{inj}$-mechanism, we use |
344 make them compatible with the $\textit{inj}$-mechanism, we use |
342 \emph{bitcodes}. Bits and bitcodes (lists of bits) are just: |
345 \emph{bitcodes}. They were first introduced by Sulzmann and Lu. |
343 |
346 Here bits and bitcodes (lists of bits) are defined as: |
344 |
347 |
345 \begin{center} |
348 \begin{center} |
346 $b ::= S \mid Z \qquad |
349 $b ::= S \mid Z \qquad |
347 bs ::= [] \mid b:bs |
350 bs ::= [] \mid b:bs |
348 $ |
351 $ |
571 |
574 |
572 \noindent |
575 \noindent |
573 In this definition $\_\backslash s$ is the generalisation of the derivative |
576 In this definition $\_\backslash s$ is the generalisation of the derivative |
574 operation from characters to strings (just like the derivatives for un-annotated |
577 operation from characters to strings (just like the derivatives for un-annotated |
575 regular expressions). |
578 regular expressions). |
|
579 |
|
580 |
|
581 \subsection*{Our Simplification Rules} |
576 |
582 |
577 The main point of the bitcodes and annotated regular expressions is that |
583 The main point of the bitcodes and annotated regular expressions is that |
578 we can apply rather aggressive (in terms of size) simplification rules |
584 we can apply rather aggressive (in terms of size) simplification rules |
579 in order to keep derivatives small. We have developed such |
585 in order to keep derivatives small. We have developed such |
580 ``aggressive'' simplification rules and generated test data that show |
586 ``aggressive'' simplification rules and generated test data that show |