336 \[ |
336 \[ |
337 match\;s\;r \;\dn\; nullable(r\backslash s) |
337 match\;s\;r \;\dn\; nullable(r\backslash s) |
338 \] |
338 \] |
339 This algorithm can be illustrated as follows: |
339 This algorithm can be illustrated as follows: |
340 \begin{tikzcd}\label{graph:*} |
340 \begin{tikzcd}\label{graph:*} |
341 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n |
341 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"nullable?"] & Yes/No |
342 \end{tikzcd} |
342 \end{tikzcd} |
343 |
343 where we bnuild the successive derivative until we exhaust the string. |
344 |
344 |
345 One limitation, however, of Brzozowski's algorithm is that it only |
345 One limitation, however, of Brzozowski's algorithm is that it only |
346 produces a YES/NO answer for whether a string is being matched by a |
346 produces a YES/NO answer for whether a string is being matched by a |
347 regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this |
347 regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this |
348 algorithm to allow generation of an actual matching, called a |
348 algorithm to allow generation of an actual matching, called a |
373 \end{tabular} |
373 \end{tabular} |
374 \end{tabular} |
374 \end{tabular} |
375 \end{center} |
375 \end{center} |
376 |
376 |
377 \noindent |
377 \noindent |
378 Here we put the regular expression and values of the same shape on the same level to illustrate the corresponding relation between them. |
378 Values and regular expressions correspond to each other as illustrated by placing corresponding values next to the regular expressions. |
379 |
379 The idea of values is to express parse trees. |
380 The flatten notation $| v |$ means extracting the characters in the value $v$ to form a string. For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$. We omit this straightforward definition. |
380 Suppose by using a flatten operation, written $|v|$, we can extract the underlying string of v. |
381 Values are a way of expressing parse trees(the tree structure that tells how a sub-regex matches a substring). For example, $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$. |
381 For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$. We omit this straightforward definition. |
|
382 Using this flatten notation, we now elaborate how values can express parse trees. $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and respectively $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$. |
382 |
383 |
383 To give a concrete example of how value works, consider the string $xy$ and the |
384 To give a concrete example of how value works, consider the string $xy$ and the |
384 regular expression $(x + (y + xy))^*$. We can view this regular |
385 regular expression $(x + (y + xy))^*$. We can view this regular |
385 expression as a tree and if the string $xy$ is matched by two Star |
386 expression as a tree and if the string $xy$ is matched by two Star |
386 ``iterations'', then the $x$ is matched by the left-most alternative |
387 ``iterations'', then the $x$ is matched by the left-most alternative |
408 expression. |
409 expression. |
409 |
410 |
410 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
411 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
411 algorithm by a second phase (the first phase being building successive |
412 algorithm by a second phase (the first phase being building successive |
412 derivatives). In this second phase, for every successful match the |
413 derivatives). In this second phase, for every successful match the |
413 corresponding POSIX value is computed. The whole process looks like the following diagram(the working flow of the simple matching algorithm that just gives a $YES/NO$ answer is given before \ref{graph:*}):\\ |
414 corresponding POSIX value is computed. Pictorially, the Sulzmann and Lu algorithm looks like the following diagram(the working flow of the simple matching algorithm that just gives a $YES/NO$ answer is given before \ref{graph:*}):\\ |
414 \begin{tikzcd} |
415 \begin{tikzcd} |
415 r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
416 r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
416 v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] |
417 v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] |
417 \end{tikzcd} |
|
418 \begin{tikzcd} |
|
419 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n |
|
420 \end{tikzcd} |
418 \end{tikzcd} |
421 |
419 |
422 |
420 |
423 We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$. |
421 We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$. |
424 First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ... until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for how the empty word matched the empty regular expression epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string: |
422 First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ... until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for how the empty word matched the empty regular expression epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string: |