ninems/ninems.tex
changeset 60 c737a0259194
parent 59 8ff7b7508824
child 61 580c7b84f900
equal deleted inserted replaced
59:8ff7b7508824 60:c737a0259194
   487 
   487 
   488 We need a function that
   488 We need a function that
   489 reverses the "chopping off" for values which we did in the
   489 reverses the "chopping off" for values which we did in the
   490 first phase for derivatives. The corresponding function is
   490 first phase for derivatives. The corresponding function is
   491 called $\textit{inj}$ (short for injection). This function takes three
   491 called $\textit{inj}$ (short for injection). This function takes three
   492 arguments: the first one is a regular expression $\textit{r_{i-1}}$ 
   492 arguments: the first one is a regular expression ${r_{i-1}}$, 
   493 before the character is chopped off, the second is  $\textit{c_{i-1}}$,
   493 before the character is chopped off, the second is  ${c_{i-1}}$,
   494 the character
   494 the character
   495 we
   495 we
   496 want to inject and the third argument is the value $textit{v_i}$, where we
   496 want to inject and the third argument is the value $textit{v_i}$, where we
   497 will inject the character(it corresponds to the regular expression
   497 will inject the character(it corresponds to the regular expression
   498  after the character
   498  after the character
   518 $r_{i-1}$, we chop off a character from $r_{i-1}$ to form $r_i$.
   518 $r_{i-1}$, we chop off a character from $r_{i-1}$ to form $r_i$.
   519 This leaves a "hole" on $r_i$. In order to calculate a value for
   519 This leaves a "hole" on $r_i$. In order to calculate a value for
   520 $r_{i-1}$, we need to locate where that hole is. We can find this location
   520 $r_{i-1}$, we need to locate where that hole is. We can find this location
   521 informtaion by comparing $r_{i-1}$ and $v_i$.
   521 informtaion by comparing $r_{i-1}$ and $v_i$.
   522 For instance, if $r_{i-1}$ is of shape $r_a \cdot r_b$, and $v_i$
   522 For instance, if $r_{i-1}$ is of shape $r_a \cdot r_b$, and $v_i$
   523 is of shape $\textit{Left(Seq(v_a, v_b))}$, we know immediately that 
   523 is of shape $\textit{Left}(\textit{Seq}(v_a, v_b))$, we know immediately that 
   524 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c\],
   524 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c\],
   525 otherwise if $r_a$ is not nullable,
   525 otherwise if $r_a$ is not nullable,
   526 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\],
   526 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\],
   527 the value $v_i$ should be of shape $Seq(\ldots)$, contradicting the fact that
   527 the value $v_i$ should be of shape $Seq(\ldots)$, contradicting the fact that
   528 $v_i$ is actually $Left(\ldots)$. Furthermore, since $v_i$ is of shape
   528 $v_i$ is actually $Left(\ldots)$. Furthermore, since $v_i$ is of shape
   529 $Left(\ldots)$ instead of $Right(\ldots)$, we know that the left
   529 $Left(\ldots)$ instead of $Right(\ldots)$, we know that the left
   530 branch is taken instead of the right one. We have therefore found out 
   530 branch is taken instead of the right one. We have therefore found out 
   531 that the hole will be on $r_a$. So we recursively call $\textit{inj} 
   531 that the hole will be on $r_a$. So we recursively call $\textit{inj} 
   532 r_a c v_a$ to fill that hole in $v_a$. After injection, the value 
   532 r_a c v_a$ to fill that hole in $v_a$. After injection, the value 
   533 $v_i$ for $r_i = r_a \cdot \rb$ should be $\textit{Seq}(\textit{inj}
   533 $v_i$ for $r_i = r_a \cdot r_b$ should be $\textit{Seq}(\textit{inj}
   534 r_a c v_a, v_b)$.
   534 r_a c v_a, v_b)$.
       
   535 Other clauses can be understood in a similar way.
   535 
   536 
   536 To understand what is going on, it might be best to do some
   537 To understand what is going on, it might be best to do some
   537 example calculations and compare them with Figure~\eqref{graph:2}.
   538 example calculations and compare them with Figure~\eqref{graph:2}.
   538 
   539 % A correctness proof using induction can be routinely established.
   539 
   540 Suppose we have a regular expression $((((a+b)+ab)+c)+abc)^*$ and want to
   540 
       
   541  A correctness proof using induction can be routinely established.
       
   542 
       
   543 It is instructive to see how this algorithm works by a little example.
       
   544 Suppose we have a regular expression $(a+b+ab+c+abc)^*$ and we want to
       
   545 match it against the string $abc$(when $abc$ is written as a regular
   541 match it against the string $abc$(when $abc$ is written as a regular
   546 expression, the most standard way of expressing it should be $a \cdot (b
   542 expression, the most standard way of expressing it should be $a \cdot (b
   547 \cdot c)$. We omit the parenthesis and dots here for readability). By
   543 \cdot c)$. We omit the parentheses and dots here for readability). By
   548 POSIX rules the lexer should go for the longest matching, i.e. it should
   544 POSIX rules the lexer should go for the longest matching, i.e. it should
   549 match the string $abc$ in one star iteration, using the longest string
   545 match the string $abc$ in one star iteration, using the longest string
   550 $abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this
   546 $abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this
   551 sub-expression for conciseness). Here is how the lexer achieves a parse
   547 sub-expression for conciseness). Here is how the lexer achieves a parse
   552 tree for this matching. First, we build successive derivatives until we
   548 tree for this matching. First, build successive derivatives until we
   553 exhaust the string, as illustrated here( we simplified some regular
   549 exhaust the string, as illustrated here(we simplified some regular
   554 expressions like $0 \cdot b$ to $0$ for conciseness. Similarly, we allow
   550 expressions like $0 \cdot b$ to $0$ for conciseness. We also omit 
   555 $\textit{ALT}$ to take a list of regular expressions as an argument
   551 some parentheses if they are clear from the context.):
   556 instead of just 2 operands to reduce the nested depth of
   552 %Similarly, we allow
   557 $\textit{ALT}$):
   553 %$\textit{ALT}$ to take a list of regular expressions as an argument
   558 
   554 %instead of just 2 operands to reduce the nested depth of
   559 \[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r* \xrightarrow{\backslash b}\]
   555 %$\textit{ALT}$
   560 \[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0  + 0 + 0) \cdot r* \xrightarrow{\backslash c}\] 
   556 \begin{center}
   561 \[r_3 = ((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r*) +((0+1+0  + 0 + 0) \cdot r*+(0+0+0  + 1 + 0) \cdot r* )
   557 \[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^* \xrightarrow{\backslash b}\]
       
   558 \[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0  + 0 + 0) \cdot r^* \xrightarrow{\backslash c}\] 
       
   559 \[r_3 = (( 0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* )
   562 \]
   560 \]
   563 
   561 \end{center}
   564 Now instead of using $nullable$ to give a $yes$, we  call $mkeps$ to construct a parse tree for how $r_3$ matched the string $abc$. $mkeps$ gives the following value $v_3$: \\$Left(Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, Empty)))))), Stars []))$\\
   562 Now instead when $nullable$ gives a $yes$ on $r_3$, we  call $mkeps$ 
   565 This corresponds to the leftmost term 
   563 to construct a parse tree for how $r_3$ matched the string $abc$. 
   566 $((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* $\\
   564 $mkeps$ gives the following value $v_3$: 
   567  in $r_3$. Note that its leftmost location allows $mkeps$ to choose  it as the first candidate that meets the requirement of being $nullable$. This location is naturally generated by the splitting clause\\ $(r_1 \cdot r_2)\backslash c  (when \; r_1 \; nullable)) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.  \\$ By this clause, we put
   565 \[$Left(Left(Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])))$\]
   568 $r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. This allows $mkeps$ to always pick up among two matches the one with a longer prefix. The value \\
   566 The outer 2 $Left$ tells us the first nullable part  hides in the term
   569 $Left(Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, Empty)))))), Stars []))$\\
   567 \[((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* \]
   570 tells us how about the empty string matches the final regular expression after doing all the derivatives: among the regular expressions in \\$(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r*) +((0+1+0  + 0 + 0) \cdot r*+(0+0+0  + 1 + 0) \cdot r* )$, \\
   568  in \[r_3 = ( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0  + 1 + 0)
       
   569   \cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* )
       
   570  \](underlined). Note that the leftmost location of term \[((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* \]
       
   571  (which corresponds to the intial sub-match $abc$)
       
   572   allows $mkeps$ to choose  it as the first candidate that meets the requirement of being $nullable$
       
   573   because $mkeps$ is defined to always choose the left one when nullable.
       
   574   In the case of this example, $abc$ is preferred over $a$ or $ab$.
       
   575    This location is naturally generated by the splitting clause\[
       
   576      (r_1 \cdot r_2)\backslash c  (when \; r_1 \; nullable)) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c\].
       
   577        By this clause, we put
       
   578 $r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. 
       
   579 This allows $mkeps$ to always pick up among two matches the one with a longer initial sub-match.
       
   580  The inside subvalue 
       
   581  \[$Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])$\]
       
   582 tells us how about the empty string matches : among the regular expressions in \\$(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r*) +((0+1+0  + 0 + 0) \cdot r*+(0+0+0  + 1 + 0) \cdot r* )$, \\
   571 we choose the left most nullable one, which is composed of a sequence of an alternative and a star. In that alternative $0+0+0 + 0 + 1 \cdot 1 \cdot 1$ we take the rightmost choice.
   583 we choose the left most nullable one, which is composed of a sequence of an alternative and a star. In that alternative $0+0+0 + 0 + 1 \cdot 1 \cdot 1$ we take the rightmost choice.
   572 
   584 
   573 Using the value $v_3$, the character c, and the regular expression $r_2$, we can recover how $r_2$ matched the string $[c]$ : we inject $c$ back to $v_3$, and get \\ $v_2 = Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, c)))))), Stars [])$, \\
   585 Using the value $v_3$, the character c, and the regular expression $r_2$, we can recover how $r_2$ matched the string $[c]$ : we inject $c$ back to $v_3$, and get \\ $v_2 = Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, c)))))), Stars [])$, \\
   574 which tells us how $r_2$ matched $c$. After this we inject back the character $b$, and get\\ $v_1 = Seq(Right(Right(Right(Seq(Empty, Seq(b, c)))))), Stars [])$ for how $r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*$ matched  the string $bc$ before it split into 2 pieces. Finally, after injecting character a back to $v_1$, we get  the parse tree $v_0= Stars [Right(Right(Right(Seq(a, Seq(b, c)))))]$ for how r matched $abc$.
   586 which tells us how $r_2$ matched $c$. After this we inject back the character $b$, and get\\ $v_1 = Seq(Right(Right(Right(Seq(Empty, Seq(b, c)))))), Stars [])$ for how $r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*$ matched  the string $bc$ before it split into 2 pieces. Finally, after injecting character a back to $v_1$, we get  the parse tree $v_0= Stars [Right(Right(Right(Seq(a, Seq(b, c)))))]$ for how r matched $abc$.
   575 We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. 
   587 We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}.