diff -r 8ff7b7508824 -r c737a0259194 ninems/ninems.tex --- a/ninems/ninems.tex Sat Jul 06 19:48:20 2019 +0100 +++ b/ninems/ninems.tex Sat Jul 06 20:16:20 2019 +0100 @@ -489,8 +489,8 @@ reverses the "chopping off" for values which we did in the first phase for derivatives. The corresponding function is called $\textit{inj}$ (short for injection). This function takes three -arguments: the first one is a regular expression $\textit{r_{i-1}}$ -before the character is chopped off, the second is $\textit{c_{i-1}}$, +arguments: the first one is a regular expression ${r_{i-1}}$, +before the character is chopped off, the second is ${c_{i-1}}$, the character we want to inject and the third argument is the value $textit{v_i}$, where we @@ -520,7 +520,7 @@ $r_{i-1}$, we need to locate where that hole is. We can find this location informtaion by comparing $r_{i-1}$ and $v_i$. For instance, if $r_{i-1}$ is of shape $r_a \cdot r_b$, and $v_i$ -is of shape $\textit{Left(Seq(v_a, v_b))}$, we know immediately that +is of shape $\textit{Left}(\textit{Seq}(v_a, v_b))$, we know immediately that \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c\], otherwise if $r_a$ is not nullable, \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\], @@ -530,44 +530,56 @@ branch is taken instead of the right one. We have therefore found out that the hole will be on $r_a$. So we recursively call $\textit{inj} r_a c v_a$ to fill that hole in $v_a$. After injection, the value -$v_i$ for $r_i = r_a \cdot \rb$ should be $\textit{Seq}(\textit{inj} +$v_i$ for $r_i = r_a \cdot r_b$ should be $\textit{Seq}(\textit{inj} r_a c v_a, v_b)$. +Other clauses can be understood in a similar way. To understand what is going on, it might be best to do some example calculations and compare them with Figure~\eqref{graph:2}. - - - - A correctness proof using induction can be routinely established. - -It is instructive to see how this algorithm works by a little example. -Suppose we have a regular expression $(a+b+ab+c+abc)^*$ and we want to +% A correctness proof using induction can be routinely established. +Suppose we have a regular expression $((((a+b)+ab)+c)+abc)^*$ and want to match it against the string $abc$(when $abc$ is written as a regular expression, the most standard way of expressing it should be $a \cdot (b -\cdot c)$. We omit the parenthesis and dots here for readability). By +\cdot c)$. We omit the parentheses and dots here for readability). By POSIX rules the lexer should go for the longest matching, i.e. it should match the string $abc$ in one star iteration, using the longest string $abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this sub-expression for conciseness). Here is how the lexer achieves a parse -tree for this matching. First, we build successive derivatives until we -exhaust the string, as illustrated here( we simplified some regular -expressions like $0 \cdot b$ to $0$ for conciseness. Similarly, we allow -$\textit{ALT}$ to take a list of regular expressions as an argument -instead of just 2 operands to reduce the nested depth of -$\textit{ALT}$): - -\[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r* \xrightarrow{\backslash b}\] -\[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0 + 0 + 0) \cdot r* \xrightarrow{\backslash c}\] -\[r_3 = ((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0 + 1 + 0) \cdot r*) +((0+1+0 + 0 + 0) \cdot r*+(0+0+0 + 1 + 0) \cdot r* ) +tree for this matching. First, build successive derivatives until we +exhaust the string, as illustrated here(we simplified some regular +expressions like $0 \cdot b$ to $0$ for conciseness. We also omit +some parentheses if they are clear from the context.): +%Similarly, we allow +%$\textit{ALT}$ to take a list of regular expressions as an argument +%instead of just 2 operands to reduce the nested depth of +%$\textit{ALT}$ +\begin{center} +\[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^* \xrightarrow{\backslash b}\] +\[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0 + 0 + 0) \cdot r^* \xrightarrow{\backslash c}\] +\[r_3 = (( 0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0 + 1 + 0) \cdot r^*) +((0+1+0 + 0 + 0) \cdot r^*+(0+0+0 + 1 + 0) \cdot r^* ) \] - -Now instead of using $nullable$ to give a $yes$, we call $mkeps$ to construct a parse tree for how $r_3$ matched the string $abc$. $mkeps$ gives the following value $v_3$: \\$Left(Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, Empty)))))), Stars []))$\\ -This corresponds to the leftmost term -$((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* $\\ - in $r_3$. Note that its leftmost location allows $mkeps$ to choose it as the first candidate that meets the requirement of being $nullable$. This location is naturally generated by the splitting clause\\ $(r_1 \cdot r_2)\backslash c (when \; r_1 \; nullable)) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c. \\$ By this clause, we put -$r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. This allows $mkeps$ to always pick up among two matches the one with a longer prefix. The value \\ -$Left(Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, Empty)))))), Stars []))$\\ -tells us how about the empty string matches the final regular expression after doing all the derivatives: among the regular expressions in \\$(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0 + 1 + 0) \cdot r*) +((0+1+0 + 0 + 0) \cdot r*+(0+0+0 + 1 + 0) \cdot r* )$, \\ +\end{center} +Now instead when $nullable$ gives a $yes$ on $r_3$, we call $mkeps$ +to construct a parse tree for how $r_3$ matched the string $abc$. +$mkeps$ gives the following value $v_3$: +\[$Left(Left(Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])))$\] +The outer 2 $Left$ tells us the first nullable part hides in the term +\[((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* \] + in \[r_3 = ( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0 + 1 + 0) + \cdot r^*) +((0+1+0 + 0 + 0) \cdot r^*+(0+0+0 + 1 + 0) \cdot r^* ) + \](underlined). Note that the leftmost location of term \[((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* \] + (which corresponds to the intial sub-match $abc$) + allows $mkeps$ to choose it as the first candidate that meets the requirement of being $nullable$ + because $mkeps$ is defined to always choose the left one when nullable. + In the case of this example, $abc$ is preferred over $a$ or $ab$. + This location is naturally generated by the splitting clause\[ + (r_1 \cdot r_2)\backslash c (when \; r_1 \; nullable)) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c\]. + By this clause, we put +$r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. +This allows $mkeps$ to always pick up among two matches the one with a longer initial sub-match. + The inside subvalue + \[$Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])$\] +tells us how about the empty string matches : among the regular expressions in \\$(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0 + 1 + 0) \cdot r*) +((0+1+0 + 0 + 0) \cdot r*+(0+0+0 + 1 + 0) \cdot r* )$, \\ we choose the left most nullable one, which is composed of a sequence of an alternative and a star. In that alternative $0+0+0 + 0 + 1 \cdot 1 \cdot 1$ we take the rightmost choice. Using the value $v_3$, the character c, and the regular expression $r_2$, we can recover how $r_2$ matched the string $[c]$ : we inject $c$ back to $v_3$, and get \\ $v_2 = Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, c)))))), Stars [])$, \\