cst_tests: comparison ninems/ninems.tex


-:8ff7b7508824
+:c737a0259194
 We need a function that
 reverses the "chopping off" for values which we did in the
 first phase for derivatives. The corresponding function is
 called $\textit{inj}$ (short for injection). This function takes three
-arguments: the first one is a regular expression $\textit{r_{i-1}}$
+arguments: the first one is a regular expression ${r_{i-1}}$,
-before the character is chopped off, the second is  $\textit{c_{i-1}}$,
+before the character is chopped off, the second is  ${c_{i-1}}$,
 the character
 we
 want to inject, and the third argument is the value $v_i$, into which we
 will inject the character (it corresponds to the regular expression
 after the character
 $r_{i-1}$, we chop off a character from $r_{i-1}$ to form $r_i$.
 This leaves a "hole" on $r_i$. In order to calculate a value for
 $r_{i-1}$, we need to locate where that hole is. We can find this location
 information by comparing $r_{i-1}$ and $v_i$.
 For instance, if $r_{i-1}$ is of shape $r_a \cdot r_b$, and $v_i$
-is of shape $\textit{Left(Seq(v_a, v_b))}$, we know immediately that
+is of shape $\textit{Left}(\textit{Seq}(v_a, v_b))$, we know immediately that
 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c\],
 otherwise if $r_a$ is not nullable,
 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\],
 the value $v_i$ should be of shape $Seq(\ldots)$, contradicting the fact that
 $v_i$ is actually $Left(\ldots)$. Furthermore, since $v_i$ is of shape
 $Left(\ldots)$ instead of $Right(\ldots)$, we know that the left
 branch is taken instead of the right one. We have therefore found out
 that the hole will be in $r_a$. So we recursively call $\textit{inj}\;
 r_a\; c\; v_a$ to fill that hole in $v_a$. After injection, the value
-$v_i$ for $r_i = r_a \cdot \rb$ should be $\textit{Seq}(\textit{inj}
+$v_{i-1}$ for $r_{i-1} = r_a \cdot r_b$ should be $\textit{Seq}(\textit{inj}\;
 r_a\; c\; v_a, v_b)$.
+Other clauses can be understood in a similar way.
 To understand what is going on, it might be best to do some
 example calculations and compare them with Figure~\ref{graph:2}.
+% A correctness proof using induction can be routinely established.
+Suppose we have a regular expression $((((a+b)+ab)+c)+abc)^*$ and want to
-A correctness proof using induction can be routinely established.
-It is instructive to see how this algorithm works by a little example.
-Suppose we have a regular expression $(a+b+ab+c+abc)^*$ and we want to
 match it against the string $abc$ (when $abc$ is written as a regular
 expression, the most standard way of expressing it should be $a \cdot (b
-\cdot c)$. We omit the parenthesis and dots here for readability). By
+\cdot c)$. We omit the parentheses and dots here for readability). By
 POSIX rules the lexer should go for the longest match, i.e. it should
 match the string $abc$ in one star iteration, using the longest string
 $abc$ in the sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this
 sub-expression for conciseness). Here is how the lexer achieves a parse
-tree for this matching. First, we build successive derivatives until we
+tree for this matching. First, build successive derivatives until we
-exhaust the string, as illustrated here( we simplified some regular
+exhaust the string, as illustrated here (we simplified some regular
-expressions like $0 \cdot b$ to $0$ for conciseness. Similarly, we allow
+expressions like $0 \cdot b$ to $0$ for conciseness. We also omit
-$\textit{ALT}$ to take a list of regular expressions as an argument
+some parentheses if they are clear from the context.):
-instead of just 2 operands to reduce the nested depth of
+%Similarly, we allow
-$\textit{ALT}$):
+%$\textit{ALT}$ to take a list of regular expressions as an argument
+%instead of just 2 operands to reduce the nested depth of
-\[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r* \xrightarrow{\backslash b}\]
+%$\textit{ALT}$
-\[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0  + 0 + 0) \cdot r* \xrightarrow{\backslash c}\]
+\begin{center}
-\[r_3 = ((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r*) +((0+1+0  + 0 + 0) \cdot r*+(0+0+0  + 1 + 0) \cdot r* )
+\[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^* \xrightarrow{\backslash b}\]
+\[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0  + 0 + 0) \cdot r^* \xrightarrow{\backslash c}\]
+\[r_3 = (( 0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* )
 \]
+\end{center}
-Now instead of using $nullable$ to give a $yes$, we  call $mkeps$ to construct a parse tree for how $r_3$ matched the string $abc$. $mkeps$ gives the following value $v_3$: \\$Left(Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, Empty)))))), Stars []))$\\
+Now, when $nullable$ gives a $yes$ on $r_3$, we call $mkeps$
-This corresponds to the leftmost term
+to construct a parse tree for how $r_3$ matched the string $abc$.
-$((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* $\\
+$mkeps$ gives the following value $v_3$:
-in $r_3$. Note that its leftmost location allows $mkeps$ to choose  it as the first candidate that meets the requirement of being $nullable$. This location is naturally generated by the splitting clause\\ $(r_1 \cdot r_2)\backslash c  (when \; r_1 \; nullable)) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.  \\$ By this clause, we put
+\[Left(Left(Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars\,[])))\]
-$r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. This allows $mkeps$ to always pick up among two matches the one with a longer prefix. The value \\
+The two outer $Left$s tell us that the first nullable part hides in the term
-$Left(Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, Empty)))))), Stars []))$\\
+\[(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* \]
-tells us how about the empty string matches the final regular expression after doing all the derivatives: among the regular expressions in \\$(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r*) +((0+1+0  + 0 + 0) \cdot r*+(0+0+0  + 1 + 0) \cdot r* )$, \\
+in \[r_3 = ( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0  + 1 + 0)
+\cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* )
+\] (underlined). Note that the leftmost location of the term \[(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* \]
+(which corresponds to the initial sub-match $abc$)
+allows $mkeps$ to choose it as the first candidate that meets the requirement of being $nullable$,
+because $mkeps$ is defined to always choose the left alternative when it is nullable.
+In the case of this example, $abc$ is preferred over $a$ or $ab$.
+This location is naturally generated by the splitting clause\[
+(r_1 \cdot r_2)\backslash c \;\; (when \; r_1 \; nullable) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c\].
+By this clause, we put
+$r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$.
+This allows $mkeps$ to always pick, among two matches, the one with the longer initial sub-match.
+The inner subvalue
+\[Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars\,[])\]
+tells us how the empty string matches: among the regular expressions in \\$((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* )$, \\
 we choose the leftmost nullable one, which is composed of a sequence of an alternative and a star. In that alternative $0+0+0 + 0 + 1 \cdot 1 \cdot 1$ we take the rightmost choice.
 Using the value $v_3$, the character $c$, and the regular expression $r_2$, we can recover how $r_2$ matched the string $[c]$: we inject $c$ back into $v_3$, and get \\ $v_2 = Left(Seq(Right(Seq(Empty, Seq(Empty, c))), Stars\,[]))$, \\
 which tells us how $r_2$ matched $c$. After this we inject back the character $b$, and get\\ $v_1 = Seq(Right(Seq(Empty, Seq(b, c))), Stars\,[])$ for how $r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^*$ matched the string $bc$ before it split into 2 pieces. Finally, after injecting the character $a$ back into $v_1$, we get the parse tree $v_0= Stars\,[Right(Seq(a, Seq(b, c)))]$ for how $r^*$ matched $abc$.
 We omit the details of the injection function, which are provided in Sulzmann and Lu's paper \cite{Sulzmann2014}.
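
For readers who want to replay the example, the following is a minimal Python sketch of the two phases described above: derivatives forward, then $mkeps$ followed by repeated injection backward. It is an illustration under stated assumptions, not the paper's implementation: the tagged-tuple encoding, the constructor names, the `Chr` tag for character values and the driver `lex` are all choices made for this sketch, and the clauses of `inj` follow the standard Sulzmann and Lu formulation. No simplification is performed, so the intermediate values contain more nested `Right`s than the hand-simplified ones in the text, but the final value is the same.

```python
# Regular expressions as tagged tuples (this encoding is an assumption of the sketch).
ZERO, ONE = ("ZERO",), ("ONE",)
def CHAR(c): return ("CHAR", c)
def ALT(r1, r2): return ("ALT", r1, r2)
def SEQ(r1, r2): return ("SEQ", r1, r2)
def STAR(r): return ("STAR", r)

def nullable(r):
    tag = r[0]
    if tag in ("ONE", "STAR"):
        return True
    if tag in ("ZERO", "CHAR"):
        return False
    if tag == "ALT":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # SEQ

def der(c, r):
    """Derivative of r with respect to the character c (no simplification)."""
    tag = r[0]
    if tag in ("ZERO", "ONE"):
        return ZERO
    if tag == "CHAR":
        return ONE if r[1] == c else ZERO
    if tag == "ALT":
        return ALT(der(c, r[1]), der(c, r[2]))
    if tag == "SEQ":
        left = SEQ(der(c, r[1]), r[2])
        # Splitting clause: the r1-derivative goes to the front, r2's to the back.
        return ALT(left, der(c, r[2])) if nullable(r[1]) else left
    return SEQ(der(c, r[1]), r)  # STAR

def mkeps(r):
    """Value for how a nullable r matches the empty string, preferring the left."""
    tag = r[0]
    if tag == "ONE":
        return ("Empty",)
    if tag == "ALT":
        return ("Left", mkeps(r[1])) if nullable(r[1]) else ("Right", mkeps(r[2]))
    if tag == "SEQ":
        return ("Seq", mkeps(r[1]), mkeps(r[2]))
    if tag == "STAR":
        return ("Stars", [])
    raise ValueError("regular expression is not nullable")

def inj(r, c, v):
    """Inject c into v (a value for der(c, r)) to obtain a value for r."""
    tag = r[0]
    if tag == "CHAR":                      # v is Empty
        return ("Chr", c)
    if tag == "ALT":
        if v[0] == "Left":
            return ("Left", inj(r[1], c, v[1]))
        return ("Right", inj(r[2], c, v[1]))
    if tag == "SEQ":
        if v[0] == "Seq":                  # r1 was not nullable
            return ("Seq", inj(r[1], c, v[1]), v[2])
        if v[0] == "Left":                 # v = Left(Seq(v_a, v_b)): hole is in r1
            return ("Seq", inj(r[1], c, v[1][1]), v[1][2])
        # v = Right(v_b): r1 matched the empty string, hole is in r2
        return ("Seq", mkeps(r[1]), inj(r[2], c, v[1]))
    if tag == "STAR":                      # v = Seq(v_a, Stars vs)
        return ("Stars", [inj(r[1], c, v[1])] + v[2][1])
    raise ValueError("no injection clause for this regular expression")

def lex(r, s):
    """Build derivatives until s is exhausted, then inject the characters back."""
    if not s:
        return mkeps(r)
    return inj(r, s[0], lex(der(s[0], r), s[1:]))

# The running example: ((((a+b)+ab)+c)+abc)^* matched against "abc".
a, b, c = CHAR("a"), CHAR("b"), CHAR("c")
r_star = STAR(ALT(ALT(ALT(ALT(a, b), SEQ(a, b)), c), SEQ(a, SEQ(b, c))))
v0 = lex(r_star, "abc")
# v0 == ("Stars", [("Right", ("Seq", ("Chr", "a"),
#                             ("Seq", ("Chr", "b"), ("Chr", "c"))))])
```

The result corresponds to $Stars\,[Right(Seq(a, Seq(b, c)))]$: a single star iteration that takes the $abc$ branch, as POSIX matching requires. When the string does not match, `lex` raises `ValueError` from `mkeps`.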

changeset 60	c737a0259194
parent 59	8ff7b7508824
child 61	580c7b84f900