cst_tests: comparison ninems/ninems.tex

equal deleted inserted replaced

-:c737a0259194
+:580c7b84f900
 By this clause, we put
 $r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$.
 This allows $mkeps$ to always pick up among two matches the one with a longer initial sub-match.
 The inside subvalue
 \[$Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])$\]
-tells us how about the empty string matches : among the regular expressions in \\$(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r*) +((0+1+0  + 0 + 0) \cdot r*+(0+0+0  + 1 + 0) \cdot r* )$, \\
+tells us how about the empty string matches
-we choose the left most nullable one, which is composed of a sequence of an alternative and a star. In that alternative $0+0+0 + 0 + 1 \cdot 1 \cdot 1$ we take the rightmost choice.
+\[(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*  \],
+a sequence of 2 empty regular expressions, the first one is an alternative, we take the rightmost alternative
-Using the value $v_3$, the character c, and the regular expression $r_2$, we can recover how $r_2$ matched the string $[c]$ : we inject $c$ back to $v_3$, and get \\ $v_2 = Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, c)))))), Stars [])$, \\
+to get that empty regular expression. The second one is a star, which iterates 0 times to form the second
-which tells us how $r_2$ matched $c$. After this we inject back the character $b$, and get\\ $v_1 = Seq(Right(Right(Right(Seq(Empty, Seq(b, c)))))), Stars [])$ for how $r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*$ matched  the string $bc$ before it split into 2 pieces. Finally, after injecting character a back to $v_1$, we get  the parse tree $v_0= Stars [Right(Right(Right(Seq(a, Seq(b, c)))))]$ for how r matched $abc$.
+empty regular expression.
-We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}.
-Readers might have noticed that the parse tree information as actually already available when doing derivatives. For example, immediately after the operation $\backslash a$ we know that if we want to match a string that starts with a, we can either take the initial match to be
+Using the value $v_3$, the character c, and the regular expression $r_2$,
+we can recover how $r_2$ matched the string $[c]$ :
+$\textit{inj} r_2 c v_3$ gives us\[ v_2 = Left(Seq(Right(Seq(Empty, Seq(Empty, c))), Stars []))\],
+which tells us how $r_2$ matched $c$. After this we inject back the character $b$, and get
+\[ v_1 = Seq(Right(Seq(Empty, Seq(b, c))), Stars [])\]
+for how
+\[r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*\]
+matched  the string $bc$ before it split into 2 pieces.
+Finally, after injecting character $a$ back to $v_1$,
+we get  the parse tree $v_0= Stars [Right(Seq(a, Seq(b, c)))]$ for how $r$ matched $abc$.
+%We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}.
+Readers might have noticed that the parse tree information
+is actually already available when doing derivatives.
+For example, immediately after the operation $\backslash a$ we know that if we want to match a string that starts with $a$,
+we can either take the initial match to be
 \begin{enumerate}
 \item[1)] just $a$ or
 \item[2)] string $ab$ or
 \item[3)] string $abc$.
 \end{enumerate}
-In order to differentiate between these choices, we just need to remember their positions--$a$ is on the left, $ab$ is in the middle , and $abc$ is on the right. Which one of these alternatives is chosen later does not affect their relative position because our algorithm does not change this order. There is no need to traverse this information twice. This leads to a new approach of lexing-- if we store the information for parse trees  in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and calling $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an n-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives.
+In order to differentiate between these choices,
+we just need to remember their positions--$a$ is on the left, $ab$ is in the middle , and $abc$ is on the right.
+Which one of these alternatives is chosen later does not affect their relative position because our algorithm
+does not change this order. If this parsing information can be determined and does
+not change because of later derivatives,
+there is no point in traversing this information twice. This leads to a new approach of lexing-- if we store the information for parse trees  in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and calling $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an n-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives.
 In the next section, we shall focus on the bit-coded algorithm and the natural
 process of simplification of regular expressions using bit-codes, which is needed in
 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann
 and Lu's algorithms.  This is where the PhD-project hopes to advance
 \section{Simplification of Regular Expressions}
 Using bit-codes to guide  parsing is not a new idea.
 It was applied to context free grammars and then adapted by Henglein and Nielson for efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and Lu took a step further by integrating bitcodes into derivatives.
 The argument for complicating the data structures from basic regular expressions to those with bitcodes
-is that we can introduce simplification without making the algorithm crash or impossible to reason about.
+is that we can introduce simplification without making the algorithm crash or overly complex to reason about.
-The reason why we need simplification is due to the shortcoming of a naive algorithm using Brzozowski's definition only.
+The reason why we need simplification is due to the shortcoming of the previous algorithm.
 The main drawback of building successive derivatives according to
 Brzozowski's definition is that they can grow very quickly in size.
 This is mainly due to the fact that the derivative operation generates
 often ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result,
 if implemented naively both algorithms by Brzozowski and by Sulzmann

changeset 61	580c7b84f900
parent 60	c737a0259194
child 62	5b10d83b0834