518 $r_{i-1}$, we chop off a character from $r_{i-1}$ to form $r_i$. |
518 $r_{i-1}$, we chop off a character from $r_{i-1}$ to form $r_i$. |
519 This leaves a "hole" on $r_i$. In order to calculate a value for |
519 This leaves a "hole" on $r_i$. In order to calculate a value for |
520 $r_{i-1}$, we need to locate where that hole is. We can find this location |
520 $r_{i-1}$, we need to locate where that hole is. We can find this location |
521 informtaion by comparing $r_{i-1}$ and $v_i$. |
521 informtaion by comparing $r_{i-1}$ and $v_i$. |
522 For instance, if $r_{i-1}$ is of shape $r_a \cdot r_b$, and $v_i$ |
522 For instance, if $r_{i-1}$ is of shape $r_a \cdot r_b$, and $v_i$ |
523 is of shape $\textit{Left(Seq(v_a, v_b))}$, we know immediately that |
523 is of shape $\textit{Left}(\textit{Seq}(v_a, v_b))$, we know immediately that |
524 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c\], |
524 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c\], |
525 otherwise if $r_a$ is not nullable, |
525 otherwise if $r_a$ is not nullable, |
526 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\], |
526 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\], |
527 the value $v_i$ should be of shape $Seq(\ldots)$, contradicting the fact that |
527 the value $v_i$ should be of shape $Seq(\ldots)$, contradicting the fact that |
528 $v_i$ is actually $Left(\ldots)$. Furthermore, since $v_i$ is of shape |
528 $v_i$ is actually $Left(\ldots)$. Furthermore, since $v_i$ is of shape |
529 $Left(\ldots)$ instead of $Right(\ldots)$, we know that the left |
529 $Left(\ldots)$ instead of $Right(\ldots)$, we know that the left |
530 branch is taken instead of the right one. We have therefore found out |
530 branch is taken instead of the right one. We have therefore found out |
531 that the hole will be on $r_a$. So we recursively call $\textit{inj} |
531 that the hole will be on $r_a$. So we recursively call $\textit{inj} |
532 r_a c v_a$ to fill that hole in $v_a$. After injection, the value |
532 r_a c v_a$ to fill that hole in $v_a$. After injection, the value |
533 $v_i$ for $r_i = r_a \cdot \rb$ should be $\textit{Seq}(\textit{inj} |
533 $v_i$ for $r_i = r_a \cdot r_b$ should be $\textit{Seq}(\textit{inj} |
534 r_a c v_a, v_b)$. |
534 r_a c v_a, v_b)$. |
|
535 Other clauses can be understood in a similar way. |
535 |
536 |
536 To understand what is going on, it might be best to do some |
537 To understand what is going on, it might be best to do some |
537 example calculations and compare them with Figure~\eqref{graph:2}. |
538 example calculations and compare them with Figure~\eqref{graph:2}. |
538 |
539 % A correctness proof using induction can be routinely established. |
539 |
540 Suppose we have a regular expression $((((a+b)+ab)+c)+abc)^*$ and want to |
540 |
|
541 A correctness proof using induction can be routinely established. |
|
542 |
|
543 It is instructive to see how this algorithm works by a little example. |
|
544 Suppose we have a regular expression $(a+b+ab+c+abc)^*$ and we want to |
|
545 match it against the string $abc$(when $abc$ is written as a regular |
541 match it against the string $abc$(when $abc$ is written as a regular |
546 expression, the most standard way of expressing it should be $a \cdot (b |
542 expression, the most standard way of expressing it should be $a \cdot (b |
547 \cdot c)$. We omit the parenthesis and dots here for readability). By |
543 \cdot c)$. We omit the parentheses and dots here for readability). By |
548 POSIX rules the lexer should go for the longest matching, i.e. it should |
544 POSIX rules the lexer should go for the longest matching, i.e. it should |
549 match the string $abc$ in one star iteration, using the longest string |
545 match the string $abc$ in one star iteration, using the longest string |
550 $abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this |
546 $abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this |
551 sub-expression for conciseness). Here is how the lexer achieves a parse |
547 sub-expression for conciseness). Here is how the lexer achieves a parse |
552 tree for this matching. First, we build successive derivatives until we |
548 tree for this matching. First, build successive derivatives until we |
553 exhaust the string, as illustrated here( we simplified some regular |
549 exhaust the string, as illustrated here(we simplified some regular |
554 expressions like $0 \cdot b$ to $0$ for conciseness. Similarly, we allow |
550 expressions like $0 \cdot b$ to $0$ for conciseness. We also omit |
555 $\textit{ALT}$ to take a list of regular expressions as an argument |
551 some parentheses if they are clear from the context.): |
556 instead of just 2 operands to reduce the nested depth of |
552 %Similarly, we allow |
557 $\textit{ALT}$): |
553 %$\textit{ALT}$ to take a list of regular expressions as an argument |
558 |
554 %instead of just 2 operands to reduce the nested depth of |
559 \[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r* \xrightarrow{\backslash b}\] |
555 %$\textit{ALT}$ |
560 \[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0 + 0 + 0) \cdot r* \xrightarrow{\backslash c}\] |
556 \begin{center} |
561 \[r_3 = ((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0 + 1 + 0) \cdot r*) +((0+1+0 + 0 + 0) \cdot r*+(0+0+0 + 1 + 0) \cdot r* ) |
557 \[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^* \xrightarrow{\backslash b}\] |
|
558 \[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0 + 0 + 0) \cdot r^* \xrightarrow{\backslash c}\] |
|
559 \[r_3 = (( 0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0 + 1 + 0) \cdot r^*) +((0+1+0 + 0 + 0) \cdot r^*+(0+0+0 + 1 + 0) \cdot r^* ) |
562 \] |
560 \] |
563 |
561 \end{center} |
564 Now instead of using $nullable$ to give a $yes$, we call $mkeps$ to construct a parse tree for how $r_3$ matched the string $abc$. $mkeps$ gives the following value $v_3$: \\$Left(Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, Empty)))))), Stars []))$\\ |
562 Now instead when $nullable$ gives a $yes$ on $r_3$, we call $mkeps$ |
565 This corresponds to the leftmost term |
563 to construct a parse tree for how $r_3$ matched the string $abc$. |
566 $((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* $\\ |
564 $mkeps$ gives the following value $v_3$: |
567 in $r_3$. Note that its leftmost location allows $mkeps$ to choose it as the first candidate that meets the requirement of being $nullable$. This location is naturally generated by the splitting clause\\ $(r_1 \cdot r_2)\backslash c (when \; r_1 \; nullable)) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c. \\$ By this clause, we put |
565 \[$Left(Left(Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])))$\] |
568 $r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. This allows $mkeps$ to always pick up among two matches the one with a longer prefix. The value \\ |
566 The outer 2 $Left$ tells us the first nullable part hides in the term |
569 $Left(Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, Empty)))))), Stars []))$\\ |
567 \[((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* \] |
570 tells us how about the empty string matches the final regular expression after doing all the derivatives: among the regular expressions in \\$(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0 + 1 + 0) \cdot r*) +((0+1+0 + 0 + 0) \cdot r*+(0+0+0 + 1 + 0) \cdot r* )$, \\ |
568 in \[r_3 = ( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0 + 1 + 0) |
|
569 \cdot r^*) +((0+1+0 + 0 + 0) \cdot r^*+(0+0+0 + 1 + 0) \cdot r^* ) |
|
570 \](underlined). Note that the leftmost location of term \[((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* \] |
|
571 (which corresponds to the intial sub-match $abc$) |
|
572 allows $mkeps$ to choose it as the first candidate that meets the requirement of being $nullable$ |
|
573 because $mkeps$ is defined to always choose the left one when nullable. |
|
574 In the case of this example, $abc$ is preferred over $a$ or $ab$. |
|
575 This location is naturally generated by the splitting clause\[ |
|
576 (r_1 \cdot r_2)\backslash c (when \; r_1 \; nullable)) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c\]. |
|
577 By this clause, we put |
|
578 $r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. |
|
579 This allows $mkeps$ to always pick up among two matches the one with a longer initial sub-match. |
|
580 The inside subvalue |
|
581 \[$Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])$\] |
|
582 tells us how about the empty string matches : among the regular expressions in \\$(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0 + 1 + 0) \cdot r*) +((0+1+0 + 0 + 0) \cdot r*+(0+0+0 + 1 + 0) \cdot r* )$, \\ |
571 we choose the left most nullable one, which is composed of a sequence of an alternative and a star. In that alternative $0+0+0 + 0 + 1 \cdot 1 \cdot 1$ we take the rightmost choice. |
583 we choose the left most nullable one, which is composed of a sequence of an alternative and a star. In that alternative $0+0+0 + 0 + 1 \cdot 1 \cdot 1$ we take the rightmost choice. |
572 |
584 |
573 Using the value $v_3$, the character c, and the regular expression $r_2$, we can recover how $r_2$ matched the string $[c]$ : we inject $c$ back to $v_3$, and get \\ $v_2 = Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, c)))))), Stars [])$, \\ |
585 Using the value $v_3$, the character c, and the regular expression $r_2$, we can recover how $r_2$ matched the string $[c]$ : we inject $c$ back to $v_3$, and get \\ $v_2 = Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, c)))))), Stars [])$, \\ |
574 which tells us how $r_2$ matched $c$. After this we inject back the character $b$, and get\\ $v_1 = Seq(Right(Right(Right(Seq(Empty, Seq(b, c)))))), Stars [])$ for how $r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*$ matched the string $bc$ before it split into 2 pieces. Finally, after injecting character a back to $v_1$, we get the parse tree $v_0= Stars [Right(Right(Right(Seq(a, Seq(b, c)))))]$ for how r matched $abc$. |
586 which tells us how $r_2$ matched $c$. After this we inject back the character $b$, and get\\ $v_1 = Seq(Right(Right(Right(Seq(Empty, Seq(b, c)))))), Stars [])$ for how $r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*$ matched the string $bc$ before it split into 2 pieces. Finally, after injecting character a back to $v_1$, we get the parse tree $v_0= Stars [Right(Right(Right(Seq(a, Seq(b, c)))))]$ for how r matched $abc$. |
575 We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. |
587 We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. |