ninems/ninems.tex
changeset 41 a1f90febbc7f
parent 40 8063792920ef
child 42 2a469388c989
equal deleted inserted replaced
40:8063792920ef 41:a1f90febbc7f
   434 
   434 
   435 
   435 
   436 We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$.
   436 We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$.
   437 First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ...  until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for how the empty word matched the empty regular expression epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string:
   437 First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ...  until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for how the empty word matched the empty regular expression epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string:
   438 
   438 
       
   439 $mkeps $ $1 \,[] $ $= Empty$
       
   440 ......
       
   441 
       
   442 
   439  After this, we inject back the characters one by one, in reverse order as they were chopped off, to build the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$($s_i$ means the string s with the first $i$ characters being chopped off) from the previous parse tree. After $n$ transformations, we get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
   443  After this, we inject back the characters one by one, in reverse order as they were chopped off, to build the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$($s_i$ means the string s with the first $i$ characters being chopped off) from the previous parse tree. After $n$ transformations, we get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
   440 An inductive proof can be routinely established.
   444 An inductive proof can be routinely established.
       
   445 
       
   446 It is instructive to see how it works by a little example. Suppose we have a regular expression $(a+b+ab+c+abc)*$ and we want to match it against the string $abc$. By POSIX rules the lexer should go for the longest matching, i.e. it should match the string $abc$ in one star iteration, using the longest string $abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this sub-expression for conciseness). Here is how the lexer achieves a parse tree for this matching.
       
   447 First, we build successive derivatives until we exhaust the string, as illustrated here( we omitted some parenthesis for better readability):
       
   448 
       
   449 \[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r* \xrightarrow{\backslash b}\]
       
   450 \[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0  + 0 + 0) \cdot r* \xrightarrow{\backslash c}\] 
       
   451 \[r_3 = ((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r*) +((0+1+0  + 0 + 0) \cdot r*+(0+0+0  + 1 + 0) \cdot r* )
       
   452 \]
       
   453 
       
   454 Now instead of using $nullable$ to give a $yes$, we  call $mkeps$ to construct a parse tree for how $r_3$ matched the string $abc$. $mkeps$ gives the following value $v_3$: \\$Left(Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, Empty)))))), Stars []))$\\
       
   455 This corresponds to the leftmost term $((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* $ in $r_3$. Note that its leftmost location allows $mkeps$ to choose  it as the first candidate that meets the requirement of being $nullable$. This location is naturally generated by the splitting clause\\ $(r_1 \cdot r_2)\backslash c  (when \, r_1 \, nullable)) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.  \\$. By this clause, we put
       
   456 $r_1 \backslash c \cdot r_2 $ at the front and $r_2 \backslash c$ at the back. This allows $mkeps$ to always pick up among two matches the one with a longer prefix. The value \\
       
   457 $Left(Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, Empty)))))), Stars []))$\\
       
   458 tells us how about the empty string matches the final regular expression after doing all the derivatives: among the regular expressions $(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r*) +((0+1+0  + 0 + 0) \cdot r*+(0+0+0  + 1 + 0) \cdot r* )$ we choose the left most nullable one, which is composed of a sequence of a nested alternative and a folded star that iterates 0 times. In that nested alternative we take the rightmost alternative.
       
   459 
       
   460 Using the value $v_3$, the character c, and the regular expression $r_2$, we can recover how $r_2$ matched the string $[c]$ : we inject $c$ back to $v_3$, and get $v_2 = Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, c)))))), Stars [])$, which tells us how $r_2$ matched $c$.
   441 We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. Rather, we shall focus next on the
   461 We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. Rather, we shall focus next on the
   442 process of simplification of regular expressions, which is needed in
   462 process of simplification of regular expressions, which is needed in
   443 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann
   463 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann
   444 and Lu's algorithms.  This is where the PhD-project hopes to advance
   464 and Lu's algorithms.  This is where the PhD-project hopes to advance
   445 the state-of-the-art.
   465 the state-of-the-art.