cst_tests: comparison ninems/ninems.tex


-:747c8cf666ca
+:8ea861a19d70
 A correctness proof using induction can be routinely established.
 It is instructive to see how this algorithm works on a small example. Suppose we have the regular expression $(a+b+ab+c+abc)^*$ and we want to match it against the string $abc$ (when $abc$ is written
 as a regular expression, the standard way of expressing it is $a \cdot (b \cdot c)$; we omit the parentheses and dots here for readability).
 By the POSIX rules the lexer should prefer the longest match, i.e. it should match the string $abc$ in one star iteration, using the longest string $abc$ in the sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this sub-expression for conciseness). Here is how the lexer obtains a parse tree for this match.
-First, we build successive derivatives until we exhaust the string, as illustrated here( we simplified some regular expressions like $0 \cdot b$ to $0$ for conciseness):
+First, we build successive derivatives until we exhaust the string, as illustrated here (we simplified some regular expressions like $0 \cdot b$ to $0$ for conciseness; similarly, we allow $\textit{ALT}$ to take a list of regular expressions as an argument instead of just two operands, to reduce the nesting depth of $\textit{ALT}$):
 \[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^* \xrightarrow{\backslash b}\]
 \[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0  + 0 + 0) \cdot r^* \xrightarrow{\backslash c}\]
 \[r_3 = ((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* )
 \]
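The derivative chain above can be reproduced mechanically. The following is a minimal Python sketch, not the implementation used in this work: the tuple encoding of regular expressions and the names `char`, `alt`, `seq`, `star`, `nullable`, `der` and `matches` are assumptions made for illustration. It uses strictly binary alternatives and performs no simplification, so the intermediate derivatives are larger than the simplified $r_1$, $r_2$, $r_3$ shown above, but they accept the same strings.

```python
# Regular expressions as tuples; this encoding is an assumption of the sketch.
ZERO, ONE = ("zero",), ("one",)
def char(c): return ("chr", c)
def alt(r1, r2): return ("alt", r1, r2)
def seq(r1, r2): return ("seq", r1, r2)
def star(r): return ("star", r)

def nullable(r):
    """True if r can match the empty string."""
    tag = r[0]
    if tag in ("zero", "chr"):
        return False
    if tag in ("one", "star"):
        return True
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # seq

def der(c, r):
    """Brzozowski derivative of r with respect to the character c."""
    tag = r[0]
    if tag in ("zero", "one"):
        return ZERO
    if tag == "chr":
        return ONE if r[1] == c else ZERO
    if tag == "alt":
        return alt(der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        front = seq(der(c, r[1]), r[2])
        # Splitting clause: if r1 is nullable, c may also be consumed by r2.
        return alt(front, der(c, r[2])) if nullable(r[1]) else front
    return seq(der(c, r[1]), r)  # star

def matches(r, s):
    """Take derivatives until the string is exhausted, then test nullability."""
    for c in s:
        r = der(c, r)
    return nullable(r)

a, b, c = char("a"), char("b"), char("c")
# r = a + b + ab + c + abc, nested to the right
r = alt(a, alt(b, alt(seq(a, b), alt(c, seq(a, seq(b, c))))))
print(matches(star(r), "abc"))
```

Running the sketch reports that $r^*$ matches $abc$; it answers only yes or no and does not yet say how the string was matched.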
 $(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*$\\
 in $r_3$. Note that its leftmost location allows $mkeps$ to choose it as the first candidate that meets the requirement of being $nullable$. This location is naturally generated by the splitting clause\\ $(r_1 \cdot r_2)\backslash c \;\, (\textit{when} \; r_1 \; \textit{nullable}) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.$ \\ By this clause, we put
 $(r_1 \backslash c) \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. This allows $mkeps$ to always pick, among two matches, the one with the longer prefix. The value \\
 $Left(Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, Empty))))), Stars [])))$\\
 tells us how the empty string matches the final regular expression after all the derivatives have been taken: among the regular expressions in \\$((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* )$, \\
-we choose the left most nullable one, which is composed of a sequence of an alternative and a star. In that alternative we take the rightmost choice.
+we choose the leftmost nullable one, which is a sequence of an alternative and a star. In that alternative, $0+0+0 + 0 + 1 \cdot 1 \cdot 1$, we take the rightmost choice.
-Using the value $v_3$, the character c, and the regular expression $r_2$, we can recover how $r_2$ matched the string $[c]$ : we inject $c$ back to $v_3$, and get \\ $v_2 = Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, c)))))), Stars [])$, which tells us how $r_2$ matched $c$. After this we inject back the character $b$, and get\\ $v_1 = Seq(Right(Right(Right(Seq(Empty, Seq(b, c)))))), Stars [])$ for how $r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*$ matched  the string $bc$ before it split into 2 pieces. Finally, after injecting character a back to $v_1$, we get  the parse tree $v_0= Stars [Right(Right(Right(Seq(a, Seq(b, c)))))]$ for how r matched $abc$.
+Using the value $v_3$, the character $c$, and the regular expression $r_2$, we can recover how $r_2$ matched the string $[c]$: we inject $c$ back into $v_3$ and get \\ $v_2 = Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, c))))), Stars []))$, \\
+which tells us how $r_2$ matched $c$. After this we inject back the character $b$ and get\\ $v_1 = Seq(Right(Right(Right(Seq(Empty, Seq(b, c))))), Stars [])$ for how $r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^*$ matched the string $bc$ before it split into two pieces. Finally, after injecting the character $a$ back into $v_1$, we get the parse tree $v_0= Stars [Right(Right(Right(Seq(a, Seq(b, c)))))]$ for how $r^*$ matched $abc$.
 We omit the details of the injection function, which is given in Sulzmann and Lu's paper \cite{Sulzmann2014}.
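For readers who want to experiment, the whole pipeline (derivatives, $mkeps$, injection) can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the tuple encoding and the names `lex`, `inj` and `mkeps` as Python functions are assumptions, and the clauses of `inj` follow our reading of the standard Sulzmann and Lu definition. Alternatives are strictly binary and nested to the right, and no simplification is performed, so the intermediate values differ in shape from $v_3$, $v_2$, $v_1$ above.

```python
# Regexes and values are plain tuples; this encoding is an assumption of the sketch.
ZERO, ONE = ("zero",), ("one",)
def char(c): return ("chr", c)
def alt(r1, r2): return ("alt", r1, r2)
def seq(r1, r2): return ("seq", r1, r2)
def star(r): return ("star", r)

def nullable(r):
    tag = r[0]
    if tag in ("zero", "chr"):
        return False
    if tag in ("one", "star"):
        return True
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # seq

def der(c, r):
    tag = r[0]
    if tag in ("zero", "one"):
        return ZERO
    if tag == "chr":
        return ONE if r[1] == c else ZERO
    if tag == "alt":
        return alt(der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        front = seq(der(c, r[1]), r[2])
        return alt(front, der(c, r[2])) if nullable(r[1]) else front
    return seq(der(c, r[1]), r)  # star

def mkeps(r):
    """Value describing how a nullable r matches the empty string (leftmost choice)."""
    tag = r[0]
    if tag == "one":
        return ("Empty",)
    if tag == "alt":
        return ("Left", mkeps(r[1])) if nullable(r[1]) else ("Right", mkeps(r[2]))
    if tag == "seq":
        return ("Seq", mkeps(r[1]), mkeps(r[2]))
    if tag == "star":
        return ("Stars", [])
    raise ValueError("mkeps called on a non-nullable regular expression")

def inj(r, c, v):
    """Turn a value v for der(c, r) into a value for r by putting c back."""
    tag, vtag = r[0], v[0]
    if tag == "chr":
        return ("Chr", c)
    if tag == "alt":
        if vtag == "Left":
            return ("Left", inj(r[1], c, v[1]))
        return ("Right", inj(r[2], c, v[1]))
    if tag == "seq":
        if vtag == "Seq":    # derivative did not split
            return ("Seq", inj(r[1], c, v[1]), v[2])
        if vtag == "Left":   # split, front part chosen: v = Left(Seq(v1, v2))
            return ("Seq", inj(r[1], c, v[1][1]), v[1][2])
        return ("Seq", mkeps(r[1]), inj(r[2], c, v[1]))  # split, back part chosen
    if tag == "star":        # v = Seq(v1, Stars vs)
        return ("Stars", [inj(r[1], c, v[1])] + v[2][1])
    raise ValueError("no injection clause applies")

def lex(r, s):
    """Derive to the end of s, call mkeps, then inject the characters back."""
    if not s:
        if nullable(r):
            return mkeps(r)
        raise ValueError("no match")
    return inj(r, s[0], lex(der(s[0], r), s[1:]))

a, b, c = char("a"), char("b"), char("c")
r = alt(a, alt(b, alt(seq(a, b), alt(c, seq(a, seq(b, c))))))
tree = lex(star(r), "abc")
print(tree)
```

The resulting tree contains a single star iteration that selects the alternative $abc$, in agreement with $v_0$. Because the sketch nests the five alternatives in binary fashion, that alternative sits under four `Right` constructors, one more than in the list-style notation used in the text.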
 Readers might have noticed that the parse tree information is actually already available while the derivatives are being taken. For example, immediately after the operation $\backslash a$ we know that if we want to match a string that starts with $a$, we can either take the initial match to be
 \begin{enumerate}
 \item[1)] just $a$ or
 \item[2)] string $ab$ or

changeset 57	8ea861a19d70
parent 56	747c8cf666ca
child 58	f0360e17080e