68 algorithm to not just give a YES/NO answer for whether or not a regular |
68 algorithm to not just give a YES/NO answer for whether or not a regular |
69 expression matches a string, but in case it matches also \emph{how} |
69 expression matches a string, but in case it matches also \emph{how} |
70 it matches the string. This is important for applications such as |
70 it matches the string. This is important for applications such as |
71 lexing (tokenising a string). The problem is to make the algorithm |
71 lexing (tokenising a string). The problem is to make the algorithm |
72 by Sulzmann and Lu fast on all inputs without breaking its |
72 by Sulzmann and Lu fast on all inputs without breaking its |
73 correctness. |
73 correctness. We have already developed some simplification rules, but have not shown that they |
|
74 preserve the correctness. We also have not yet looked at extended regular expressions. |
74 \end{abstract} |
75 \end{abstract} |
75 |
76 |
76 \section{Introduction} |
77 \section{Introduction} |
77 |
78 |
78 This PhD-project is about regular expression matching and |
79 This PhD-project is about regular expression matching and |
239 regular expressions \cite{Brzozowski1964}. We shall briefly explain |
240 regular expressions \cite{Brzozowski1964}. We shall briefly explain |
240 the algorithms next. |
241 the algorithms next. |
241 |
242 |
242 \section{The Algorithms by Brzozowski, and Sulzmann and Lu} |
243 \section{The Algorithms by Brzozowski, and Sulzmann and Lu} |
243 |
244 |
244 Suppose regular expressions are given by the following grammar:\\ |
245 Suppose basic regular expressions are given by the following grammar:\\ |
245 $r$ $::=$ $\ZERO$ $\mid$ $\ONE$ |
246 \[ r ::= \ZERO \mid \ONE |
246 $\mid$ $c$ |
247 \mid c |
247 $\mid$ $r_1 \cdot r_2$ |
248 \mid r_1 \cdot r_2 |
248 $\mid$ $r_1 + r_2$ |
249 \mid r_1 + r_2 |
249 $\mid$ $r^*$ |
250 \mid r^* |
250 |
251 \] |
251 |
|
252 |
252 |
253 \noindent |
253 \noindent |
254 The intended meaning of the regular expressions is as usual: $\ZERO$ |
254 The intended meaning of the regular expressions is as usual: $\ZERO$ |
255 cannot match any string, $\ONE$ can match the empty string, the |
255 cannot match any string, $\ONE$ can match the empty string, the |
256 character regular expression $c$ can match the character $c$, and so |
256 character regular expression $c$ can match the character $c$, and so |
268 $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ |
268 $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ |
269 $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ |
269 $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ |
270 $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ |
270 $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ |
271 \end{tabular} |
271 \end{tabular} |
272 \end{center} |
272 \end{center} |
273 The function tests whether the empty string is in $L(r)$. |
273 This function simply tests whether the empty string is in $L(r)$. |
274 He then defined |
274 He then defined |
275 the following operation on regular expressions, written |
275 the following operation on regular expressions, written |
276 $r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$): |
276 $r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$): |
277 |
277 |
278 \begin{center} |
278 \begin{center} |
289 \end{tabular} |
289 \end{tabular} |
290 \end{center} |
290 \end{center} |
291 |
291 |
292 \noindent |
292 \noindent |
293 |
293 |
|
294 |
|
295 |
|
296 %Assuming the classic notion of a |
|
297 %\emph{language} of a regular expression, written $L(\_)$, t |
|
298 The main |
|
299 property of the derivative operation is that |
|
300 |
|
301 \begin{center} |
|
302 $c\!::\!s \in L(r)$ holds |
|
303 if and only if $s \in L(r\backslash c)$. |
|
304 \end{center} |
|
305 |
|
306 \noindent |
294 For us the main advantage is that derivatives can be |
307 For us the main advantage is that derivatives can be |
295 straightforwardly implemented in any functional programming language, |
308 straightforwardly implemented in any functional programming language, |
296 and are easily definable and reasoned about in theorem provers---the |
309 and are easily definable and reasoned about in theorem provers---the |
297 definitions just consist of inductive datatypes and simple recursive |
310 definitions just consist of inductive datatypes and simple recursive |
298 functions. Moreover, the notion of derivatives can be easily |
311 functions. Moreover, the notion of derivatives can be easily |
299 generalised to cover extended regular expression constructors such as |
312 generalised to cover extended regular expression constructors such as |
300 the not-regular expression, written $\neg\,r$, or bounded repetitions |
313 the not-regular expression, written $\neg\,r$, or bounded repetitions |
301 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so |
314 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so |
302 straightforwardly realised within the classic automata approach. |
315 straightforwardly realised within the classic automata approach. |
303 |
316 For the moment however, we focus only on the usual basic regular expressions. |
304 |
317 |
305 |
318 |
306 %Assuming the classic notion of a |
319 Now if we want to find out whether a string $s$ |
307 %\emph{language} of a regular expression, written $L(\_)$, t |
|
308 The main |
|
309 property of the derivative operation is that |
|
310 |
|
311 \begin{center} |
|
312 $c\!::\!s \in L(r)$ holds |
|
313 if and only if $s \in L(r\backslash c)$. |
|
314 \end{center} |
|
315 |
|
316 \noindent |
|
317 So if we want to find out whether a string $s$ |
|
318 matches with a regular expression $r$, build the derivatives of $r$ |
320 matches with a regular expression $r$, build the derivatives of $r$ |
319 w.r.t.\ (in succession) all the characters of the string $s$. Finally, |
321 w.r.t.\ (in succession) all the characters of the string $s$. Finally, |
320 test whether the resulting regular expression can match the empty |
322 test whether the resulting regular expression can match the empty |
321 string. If yes, then $r$ matches $s$, and no in the negative |
323 string. If yes, then $r$ matches $s$, and no in the negative |
322 case. |
324 case. |
323 |
325 |
324 We can generalise the derivative operation for strings like this: |
326 For this we can generalise the derivative operation for strings like this: |
325 \begin{center} |
327 \begin{center} |
326 \begin{tabular}{lcl} |
328 \begin{tabular}{lcl} |
327 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\ |
329 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\ |
328 $r \backslash \epsilon $ & $\dn$ & $r$ |
330 $r \backslash \epsilon $ & $\dn$ & $r$ |
329 \end{tabular} |
331 \end{tabular} |
332 Using the above definition we obtain a simple and elegant regular |
334 Using the above definition we obtain a simple and elegant regular |
333 expression matching algorithm: |
335 expression matching algorithm: |
334 \[ |
336 \[ |
335 match\;s\;r \;\dn\; nullable(r\backslash s) |
337 match\;s\;r \;\dn\; nullable(r\backslash s) |
336 \] |
338 \] |
|
339 This algorithm can be illustrated as follows: |
|
340 \begin{tikzcd}\label{graph:*} |
|
341 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n |
|
342 \end{tikzcd} |
|
343 |
|
344 |
337 One limitation, however, of Brzozowski's algorithm is that it only |
345 One limitation, however, of Brzozowski's algorithm is that it only |
338 produces a YES/NO answer for whether a string is being matched by a |
346 produces a YES/NO answer for whether a string is being matched by a |
339 regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this |
347 regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this |
340 algorithm to allow generation of an actual matching, called a |
348 algorithm to allow generation of an actual matching, called a |
341 \emph{value}. |
349 \emph{value}. |
400 expression. |
408 expression. |
401 |
409 |
402 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
410 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
403 algorithm by a second phase (the first phase being building successive |
411 algorithm by a second phase (the first phase being building successive |
404 derivatives). In this second phase, for every successful match the |
412 derivatives). In this second phase, for every successful match the |
405 corresponding POSIX value is computed. The whole process looks like the following diagram(the working flow of the simple matching algorithm that just gives a $YES/NO$ answer is provided on the right for the purpose of comparison):\\ |
413 corresponding POSIX value is computed. The whole process looks like the following diagram(the working flow of the simple matching algorithm that just gives a $YES/NO$ answer is given before \ref{graph:*}):\\ |
406 \begin{tikzcd} |
414 \begin{tikzcd} |
407 r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
415 r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
408 v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] |
416 v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] |
409 \end{tikzcd} |
417 \end{tikzcd} |
410 \begin{tikzcd} |
418 \begin{tikzcd} |