63 |
63 |
64 \begin{abstract} |
64 \begin{abstract} |
65 Brzozowski introduced in 1964 a beautifully simple algorithm for |
65 Brzozowski introduced in 1964 a beautifully simple algorithm for |
66 regular expression matching based on the notion of derivatives of |
66 regular expression matching based on the notion of derivatives of |
67 regular expressions. In 2014, Sulzmann and Lu extended this |
67 regular expressions. In 2014, Sulzmann and Lu extended this |
68 algorithm to not just give a YES/NO answer for whether or not a regular |
68 algorithm to not just give a YES/NO answer for whether or not a |
69 expression matches a string, but in case it matches also \emph{how} |
69 regular expression matches a string, but in case it matches also |
70 it matches the string. This is important for applications such as |
70 \emph{how} it matches the string. This is important for |
71 lexing (tokenising a string). The problem is to make the algorithm |
71 applications such as lexing (tokenising a string). The problem is to |
72 by Sulzmann and Lu fast on all inputs without breaking its |
72 make the algorithm by Sulzmann and Lu fast on all inputs without |
73 correctness. We have already developed some simplification rules, but have not shown that they |
73 breaking its correctness. We have already developed some |
74 preserve the correctness. We also have not yet looked at extended regular expressions. |
74 simplification rules for this, but have not proved that they |
|
75 preserve the correctness of the algorithm. We also have not yet |
|
76 looked at extended regular expressions, such as bounded repetitions, |
|
77 negation and back-references. |
75 \end{abstract} |
78 \end{abstract} |
76 |
79 |
77 \section{Introduction} |
80 \section{Introduction} |
78 |
81 |
79 This PhD-project is about regular expression matching and |
82 This PhD-project is about regular expression matching and |
240 regular expressions \cite{Brzozowski1964}. We shall briefly explain |
243 regular expressions \cite{Brzozowski1964}. We shall briefly explain |
241 the algorithms next. |
244 the algorithms next. |
242 |
245 |
243 \section{The Algorithms by Brzozowski, and Sulzmann and Lu} |
246 \section{The Algorithms by Brzozowski, and Sulzmann and Lu} |
244 |
247 |
245 Suppose basic regular expressions are given by the following grammar:\\ |
248 Suppose (basic) regular expressions are given by the following grammar: |
246 \[ r ::= \ZERO \mid \ONE |
249 \[ r ::= \ZERO \mid \ONE |
247 \mid c |
250 \mid c |
248 \mid r_1 \cdot r_2 |
251 \mid r_1 \cdot r_2 |
249 \mid r_1 + r_2 |
252 \mid r_1 + r_2 |
250 \mid r^* |
253 \mid r^* |
251 \] |
254 \] |
252 |
255 |
253 \noindent |
256 \noindent |
254 The intended meaning of the regular expressions is as usual: $\ZERO$ |
257 The intended meaning of the constructors is as usual: $\ZERO$ |
255 cannot match any string, $\ONE$ can match the empty string, the |
258 cannot match any string, $\ONE$ can match the empty string, the |
256 character regular expression $c$ can match the character $c$, and so |
259 character regular expression $c$ can match the character $c$, and so |
257 on. The brilliant contribution by Brzozowski is the notion of |
260 on. |
|
261 |
|
262 The brilliant contribution by Brzozowski is the notion of |
258 \emph{derivatives} of regular expressions. The idea behind this |
263 \emph{derivatives} of regular expressions. The idea behind this |
259 notion is as follows: suppose a regular expression $r$ can match a |
264 notion is as follows: suppose a regular expression $r$ can match a |
260 string of the form $c\!::\! s$ (that is a list of characters starting |
265 string of the form $c\!::\! s$ (that is a list of characters starting |
261 with $c$), what does the regular expression look like that can match |
266 with $c$), what does the regular expression look like that can match |
262 just $s$? Brzozowski gave a neat answer to this question. He started with the definition of $nullable$: |
267 just $s$? Brzozowski gave a neat answer to this question. He started |
|
268 with the definition of $nullable$: |
263 \begin{center} |
269 \begin{center} |
264 \begin{tabular}{lcl} |
270 \begin{tabular}{lcl} |
265 $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ |
271 $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ |
266 $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ |
272 $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ |
267 $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ |
273 $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ |
314 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so |
320 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so |
315 straightforwardly realised within the classic automata approach. |
321 straightforwardly realised within the classic automata approach. |
316 For the moment however, we focus only on the usual basic regular expressions. |
322 For the moment however, we focus only on the usual basic regular expressions. |
317 |
323 |
318 |
324 |
319 Now if we want to find out whether a string $s$ |
325 Now if we want to find out whether a string $s$ matches with a regular |
320 matches with a regular expression $r$, build the derivatives of $r$ |
326 expression $r$, build the derivatives of $r$ w.r.t.\ (in succession) |
321 w.r.t.\ (in succession) all the characters of the string $s$. Finally, |
327 all the characters of the string $s$. Finally, test whether the |
322 test whether the resulting regular expression can match the empty |
328 resulting regular expression can match the empty string. If yes, then |
323 string. If yes, then $r$ matches $s$, and no in the negative |
329 $r$ matches $s$, and no in the negative case. To implement this idea |
324 case. |
330 we can generalise the derivative operation to strings like this: |
325 |
|
326 For this we can generalise the derivative operation for strings like this: |
|
327 \begin{center} |
331 \begin{center} |
328 \begin{tabular}{lcl} |
332 \begin{tabular}{lcl} |
329 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\ |
333 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\ |
330 $r \backslash \epsilon $ & $\dn$ & $r$ |
334 $r \backslash [\,] $ & $\dn$ & $r$ |
331 \end{tabular} |
335 \end{tabular} |
332 \end{center} |
336 \end{center} |
333 \noindent |
337 |
334 Using the above definition we obtain a simple and elegant regular |
338 \noindent |
|
339 Using this definition we obtain a simple and elegant regular |
335 expression matching algorithm: |
340 expression matching algorithm: |
336 \[ |
341 \[ |
337 match\;s\;r \;\dn\; nullable(r\backslash s) |
342 match\;s\;r \;\dn\; nullable(r\backslash s) |
338 \] |
343 \] |
339 This algorithm can be illustrated as follows: |
344 |
|
345 \noindent |
|
346 Pictorially this algorithm can be illustrated as follows: |
|
347 |
|
348 \begin{center} |
340 \begin{tikzcd}\label{graph:*} |
349 \begin{tikzcd}\label{graph:*} |
341 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"nullable?"] & Yes/No |
350 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"nullable?"] & Yes/No |
342 \end{tikzcd} |
351 \end{tikzcd} |
343 where we bnuild the successive derivative until we exhaust the string. |
352 \end{center} |
|
353 |
|
354 \noindent |
|
355 where we start with a regular expression $r_0$, build successive derivatives |
|
356 until we exhaust the string and then use \textit{nullable} to test whether the |
|
357 result can match the empty string. It can be relatively easily shown that this |
|
358 matcher is correct. |
344 |
359 |
345 One limitation, however, of Brzozowski's algorithm is that it only |
360 One limitation, however, of Brzozowski's algorithm is that it only |
346 produces a YES/NO answer for whether a string is being matched by a |
361 produces a YES/NO answer for whether a string is being matched by a |
347 regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this |
362 regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this |
348 algorithm to allow generation of an actual matching, called a |
363 algorithm to allow generation of an actual matching, called a |
349 \emph{value}. |
364 \emph{value}. |
350 |
365 |
351 \begin{center} |
366 \begin{center} |
352 \begin{tabular}{c@{\hspace{20mm}}c} |
367 \begin{tabular}{c@{\hspace{20mm}}c} |
353 \begin{tabular}{@{}rrl@{}} |
368 \begin{tabular}{@{}rrl@{}} |
354 \multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\ |
369 \multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\ |