cst_tests: comparison ninems/ninems.tex

equal deleted inserted replaced

-:afd0d702a4fe
+:df2e0faccb23
 that the hole will be on $r_a$. So we recursively call $\inj\,
 r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value
 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
 Other clauses can be understood in a similar way.
-Let us do some further examples involving $\inj$: Suppose we have the
+The following example gives a taste of $\textit{inj}$'s effect
-regular expression $((((a+b)+ab)+c)+abc)^*$ and want to match it against
+and how Sulzmann and Lu's algorithm works as a whole.
+Suppose we have a
+regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it against
 the string $abc$ (when $abc$ is written as a regular expression, the most
 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit
-the parentheses and dots here for readability). By POSIX rules the algorithm
+the parentheses and dots here for readability).
-should go for the longest matching, i.e. it should match the string
+This algorithm returns a POSIX value, which means it
-$abc$ in one star iteration, using the longest string $abc$ in the
+will go for the longest matching, i.e.~it should match the string
-sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this sub-expression
+$abc$ in one star iteration, using the longest alternative $abc$ in the
+sub-expression $((((a+b)+ab)+c)+abc)$ (we use $r$ to denote this sub-expression
 for conciseness).
-Here is how the lexer achieves a parse tree for this
+Before $\textit{inj}$ comes into play,
-matching. First, build successive derivatives until we exhaust the
+our lexer first builds derivative using string $abc$ (we simplified some regular expressions like
-string, as illustrated here (we simplified some regular expressions like
+$0 \cdot b$ to $0$ for conciseness; we also omit parentheses if
-$0 \cdot b$ to $0$ for conciseness; we also omit some parentheses if
 they are clear from the context):
 %Similarly, we allow
 %$\textit{ALT}$ to take a list of regular expressions as an argument
 %instead of just 2 operands to reduce the nested depth of
 %$\textit{ALT}$
 &                              & $\phantom{r_3 = (} ((0+1+0  + 0 + 0) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^* )$
 \end{tabular}
 \end{center}
 \noindent
-Now instead when $nullable$ gives a $yes$ on $r_3$, we  call $mkeps$
+Now when $nullable$ gives a $yes$ on $r_3$, we  call $mkeps$
 to construct a parse tree for how $r_3$ matched the string $abc$.
 $mkeps$ gives the following value $v_3$:
-\[$Left(Left(Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])))$\]
+\begin{center}
-The outer 2 $Left$ tells us the first nullable part  hides in the term
+$\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
-\[((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* \]
+\end{center}
-in \[r_3 = ( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0  + 1 + 0)
+The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$(underlined):
-\cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* )
+\begin{center}
-\](underlined). Note that the leftmost location of term \[((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* \]
+$( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0  + 1 + 0)
+\cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* ).$
+\end{center}
+Note that the leftmost location of term $((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*$
 (which corresponds to the initial sub-match $abc$)
-allows $mkeps$ to choose  it as the first candidate that meets the requirement of being $nullable$
+allows $mkeps$ to pick it up
-because $mkeps$ is defined to always choose the left one when nullable.
+because $mkeps$ is defined to always choose the left one when it is nullable.
 In the case of this example, $abc$ is preferred over $a$ or $ab$.
-This location is naturally generated by the splitting clause\[
+This $\Left(\Left(\ldots))$ location is naturally generated by
-(r_1 \cdot r_2)\backslash c  (when \; r_1 \; nullable)) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c\].
+two applications of the splitting clause
+\begin{center}
+$(r_1 \cdot r_2)\backslash c  (when \; r_1 \; nullable) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.$
+\end{center}
 By this clause, we put
 $r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$.
 This allows $mkeps$ to always pick up among two matches the one with a longer initial sub-match.
-The inside subvalue
+Removing the outside $\Left(\Left(...))$, the inside sub-value
-\[$Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])$\]
+\begin{center}
-tells us how about the empty string matches
+$\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$
-\[(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*  \],
+\end{center}
-a sequence of 2 empty regular expressions, the first one is an alternative, we take the rightmost alternative
+tells us how the empty string $[]$ is matched with $(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*$.
-to get that empty regular expression. The second one is a star, which iterates 0 times to form the second
+We match $[]$ by a sequence of 2 nullable regular expressions.
-empty regular expression.
+The first one is an alternative, we take the rightmost alternative---whose language
+contains the empty string. The second nullable regular
+expression is a Kleene star. $\Stars$ tells us how it
+generates the nullable regular expression: by 0 iterations to form $\epsilon$.
+Now $\textit{inj}$ injects characters back and
+incrementally builds a parse tree based on $v_3$.
 Using the value $v_3$, the character c, and the regular expression $r_2$,
 we can recover how $r_2$ matched the string $[c]$ :
-$\textit{inj} r_2 c v_3$ gives us\[ v_2 = Left(Seq(Right(Seq(Empty, Seq(Empty, c))), Stars []))\],
+$\textit{inj} \; r_2 \; c \; v_3$ gives us
-which tells us how $r_2$ matched $c$. After this we inject back the character $b$, and get
+\begin{center}
-\[ v_1 = Seq(Right(Seq(Empty, Seq(b, c))), Stars [])\]
+$v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$
+\end{center}
+which tells us how $r_2$ matched $[c]$. After this we inject back the character $b$, and get
+\begin{center}
+$v_1 = \Seq(\Right(\Seq(\Empty, \Seq(b, c))), \Stars [])$
+\end{center}
 for how
-\[r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*\]
+\begin{center}
+$r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*$
+\end{center}
 matched  the string $bc$ before it split into 2 pieces.
 Finally, after injecting character $a$ back to $v_1$,
-we get  the parse tree $v_0= Stars [Right(Seq(a, Seq(b, c)))]$ for how $r$ matched $abc$.
+we get  the parse tree
+\begin{center}
+$v_0= \Stars [\Right(\Seq(a, \Seq(b, c)))]$
+\end{center}
+for how $r$ matched $abc$. This completes the algorithm.
 %We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}.
 Readers might have noticed that the parse tree information
 is actually already available when doing derivatives.
 For example, immediately after the operation $\backslash a$ we know that if we want to match a string that starts with $a$,
 we can either take the initial match to be
+\begin{center}
 \begin{enumerate}
 \item[1)] just $a$ or
 \item[2)] string $ab$ or
 \item[3)] string $abc$.
 \end{enumerate}
+\end{center}
 In order to differentiate between these choices,
 we just need to remember their positions--$a$ is on the left, $ab$ is in the middle , and $abc$ is on the right.
 Which one of these alternatives is chosen later does not affect their relative position because our algorithm
 does not change this order. If this parsing information can be determined and does
 not change because of later derivatives,
-there is no point in traversing this information twice. This leads to a new approach of lexing-- if we store the information for parse trees  in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and calling $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an n-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives.
+there is no point in traversing this information twice. This leads to an optimization---if we store the information for parse trees  in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and call $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an $n$-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives.
 In the next section, we shall focus on the bit-coded algorithm and the
 process of simplification of regular expressions. This is needed in
 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann
 and Lu's algorithms.  This is where the PhD-project aims to advance the

changeset 65	df2e0faccb23
parent 64	afd0d702a4fe
child 66	8c8c82c0515f