ninems/ninems.tex
changeset 55 488d35c20949
parent 54 4bfd28722884
child 56 747c8cf666ca
equal deleted inserted replaced
54:4bfd28722884 55:488d35c20949
   476 		\end{tabular}
   476 		\end{tabular}
   477 	\end{center}
   477 	\end{center}
   478 
   478 
   479  After this, we inject back the characters one by one in order to build
   479  After this, we inject back the characters one by one in order to build
   480 the parse tree $v_i$ for how the regex $r_i$ matches the string
   480 the parse tree $v_i$ for how the regex $r_i$ matches the string
   481 $s_i$ ($s_i$ means the string s with the first $i$ characters being
   481 $s_i$ ($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After injecting back $n$ characters, we
   482 chopped off) from the previous parse tree. After $n$ transformations, we
   482 get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
   483 get the parse tree for how $r_0$ matches $s$, exactly as we wanted. An
   483  A correctness proof using induction can be routinely established.
   484 inductive proof can be routinely established.
   484 
   485 
   485 It is instructive to see how this algorithm works by a little example. Suppose we have a regular expression $(a+b+ab+c+abc)^*$ and we want to match it against the string $abc$(when $abc$ is written
   486 It is instructive to see how it works by a little example. Suppose we have a regular expression $(a+b+ab+c+abc)*$ and we want to match it against the string $abc$. By POSIX rules the lexer should go for the longest matching, i.e. it should match the string $abc$ in one star iteration, using the longest string $abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this sub-expression for conciseness). Here is how the lexer achieves a parse tree for this matching.
   486 as a regular expression, the most standard way of expressing it should be $a \cdot (b \cdot c)$. We omit the parenthesis and dots here for readability). 
   487 First, we build successive derivatives until we exhaust the string, as illustrated here( we omitted some parenthesis for better readability):
   487 By POSIX rules the lexer should go for the longest matching, i.e. it should match the string $abc$ in one star iteration, using the longest string $abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this sub-expression for conciseness). Here is how the lexer achieves a parse tree for this matching.
       
   488 First, we build successive derivatives until we exhaust the string, as illustrated here( we simplified some regular expressions like $0 \cdot b$ to $0$ for conciseness):
   488 
   489 
   489 \[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r* \xrightarrow{\backslash b}\]
   490 \[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r* \xrightarrow{\backslash b}\]
   490 \[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0  + 0 + 0) \cdot r* \xrightarrow{\backslash c}\] 
   491 \[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0  + 0 + 0) \cdot r* \xrightarrow{\backslash c}\] 
   491 \[r_3 = ((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r*) +((0+1+0  + 0 + 0) \cdot r*+(0+0+0  + 1 + 0) \cdot r* )
   492 \[r_3 = ((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r*) +((0+1+0  + 0 + 0) \cdot r*+(0+0+0  + 1 + 0) \cdot r* )
   492 \]
   493 \]