476 \end{tabular} |
476 \end{tabular} |
477 \end{center} |
477 \end{center} |
478 |
478 |
479 After this, we inject back the characters one by one in order to build |
479 After this, we inject back the characters one by one in order to build |
480 the parse tree $v_i$ for how the regex $r_i$ matches the string |
480 the parse tree $v_i$ for how the regex $r_i$ matches the string |
481 $s_i$ ($s_i$ means the string s with the first $i$ characters being |
481 $s_i$ ($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After injecting back $n$ characters, we |
482 chopped off) from the previous parse tree. After $n$ transformations, we |
482 get the parse tree for how $r_0$ matches $s$, exactly as we wanted. |
483 get the parse tree for how $r_0$ matches $s$, exactly as we wanted. An |
483 A correctness proof using induction can be routinely established. |
484 inductive proof can be routinely established. |
484 |
485 |
485 It is instructive to see how this algorithm works by a little example. Suppose we have a regular expression $(a+b+ab+c+abc)^*$ and we want to match it against the string $abc$(when $abc$ is written |
486 It is instructive to see how it works by a little example. Suppose we have a regular expression $(a+b+ab+c+abc)*$ and we want to match it against the string $abc$. By POSIX rules the lexer should go for the longest matching, i.e. it should match the string $abc$ in one star iteration, using the longest string $abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this sub-expression for conciseness). Here is how the lexer achieves a parse tree for this matching. |
486 as a regular expression, the most standard way of expressing it should be $a \cdot (b \cdot c)$. We omit the parenthesis and dots here for readability). |
487 First, we build successive derivatives until we exhaust the string, as illustrated here( we omitted some parenthesis for better readability): |
487 By POSIX rules the lexer should go for the longest matching, i.e. it should match the string $abc$ in one star iteration, using the longest string $abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this sub-expression for conciseness). Here is how the lexer achieves a parse tree for this matching. |
|
488 First, we build successive derivatives until we exhaust the string, as illustrated here( we simplified some regular expressions like $0 \cdot b$ to $0$ for conciseness): |
488 |
489 |
489 \[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r* \xrightarrow{\backslash b}\] |
490 \[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r* \xrightarrow{\backslash b}\] |
490 \[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0 + 0 + 0) \cdot r* \xrightarrow{\backslash c}\] |
491 \[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0 + 0 + 0) \cdot r* \xrightarrow{\backslash c}\] |
491 \[r_3 = ((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0 + 1 + 0) \cdot r*) +((0+1+0 + 0 + 0) \cdot r*+(0+0+0 + 1 + 0) \cdot r* ) |
492 \[r_3 = ((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0 + 1 + 0) \cdot r*) +((0+1+0 + 0 + 0) \cdot r*+(0+0+0 + 1 + 0) \cdot r* ) |
492 \] |
493 \] |