ninems/ninems.tex
changeset 39 7d18745dd7c9
parent 38 b5363c0dcd99
child 40 8063792920ef
equal deleted inserted replaced
38:b5363c0dcd99 39:7d18745dd7c9
   336 \[
   336 \[
   337 match\;s\;r \;\dn\; nullable(r\backslash s)
   337 match\;s\;r \;\dn\; nullable(r\backslash s)
   338 \]
   338 \]
   339 This algorithm can be illustrated as follows:
   339 This algorithm can be illustrated as follows:
   340 \begin{tikzcd}\label{graph:*} 
   340 \begin{tikzcd}\label{graph:*} 
   341 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  
   341 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"nullable?"] & Yes/No
   342 \end{tikzcd}
   342 \end{tikzcd}
   343 
   343 where we bnuild the successive derivative until we exhaust the string.
   344 
   344 
   345 One limitation, however, of Brzozowski's algorithm is that it only
   345 One limitation, however, of Brzozowski's algorithm is that it only
   346 produces a YES/NO answer for whether a string is being matched by a
   346 produces a YES/NO answer for whether a string is being matched by a
   347 regular expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this
   347 regular expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this
   348 algorithm to allow generation of an actual matching, called a
   348 algorithm to allow generation of an actual matching, called a
   373 		\end{tabular}
   373 		\end{tabular}
   374 	\end{tabular}
   374 	\end{tabular}
   375 \end{center}
   375 \end{center}
   376 
   376 
   377 \noindent
   377 \noindent
   378  Here we put the regular expression and values of the same shape on the same level to illustrate the corresponding relation between them. 
   378 Values and regular expressions correspond to each other as illustrated by placing corresponding values next to the regular expressions.
   379  
   379 The idea of values is to express parse trees. 
   380  The flatten notation $| v |$ means extracting the characters in the value $v$ to form a string. For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$. We omit this straightforward definition.
   380 Suppose by using a flatten operation, written $|v|$, we can extract the underlying string of v.
   381  Values are a way of expressing parse trees(the tree structure that tells how a sub-regex matches a substring). For example, $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$. 
   381 For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$. We omit this straightforward definition.
       
   382 Using this flatten notation, we now elaborate how values can express parse trees. $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and respectively $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$. 
   382 
   383 
   383  To give a concrete example of how value works, consider the string $xy$ and the
   384  To give a concrete example of how value works, consider the string $xy$ and the
   384 regular expression $(x + (y + xy))^*$. We can view this regular
   385 regular expression $(x + (y + xy))^*$. We can view this regular
   385 expression as a tree and if the string $xy$ is matched by two Star
   386 expression as a tree and if the string $xy$ is matched by two Star
   386 ``iterations'', then the $x$ is matched by the left-most alternative
   387 ``iterations'', then the $x$ is matched by the left-most alternative
   408 expression.
   409 expression.
   409 
   410 
   410 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   411 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   411 algorithm by a second phase (the first phase being building successive
   412 algorithm by a second phase (the first phase being building successive
   412 derivatives). In this second phase, for every successful match the
   413 derivatives). In this second phase, for every successful match the
   413 corresponding POSIX value is computed. The whole process looks like the following diagram(the working flow of the simple matching algorithm that just gives a $YES/NO$ answer is given before \ref{graph:*}):\\
   414 corresponding POSIX value is computed. Pictorially, the Sulzmann and Lu algorithm looks like the following diagram(the working flow of the simple matching algorithm that just gives a $YES/NO$ answer is given before \ref{graph:*}):\\
   414 \begin{tikzcd}
   415 \begin{tikzcd}
   415 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
   416 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
   416 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
   417 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
   417 \end{tikzcd}
       
   418 \begin{tikzcd}
       
   419 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n   
       
   420 \end{tikzcd}
   418 \end{tikzcd}
   421 
   419 
   422 
   420 
   423 We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$.
   421 We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$.
   424 First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ...  until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for how the empty word matched the empty regular expression epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string:
   422 First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ...  until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for how the empty word matched the empty regular expression epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string: