ninems/ninems.tex
changeset 65 df2e0faccb23
parent 64 afd0d702a4fe
child 66 8c8c82c0515f
--- a/ninems/ninems.tex	Sun Jul 07 18:43:13 2019 +0100
+++ b/ninems/ninems.tex	Sun Jul 07 21:38:35 2019 +0100
@@ -545,19 +545,21 @@
 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
 Other clauses can be understood in a similar way.
 
-Let us do some further examples involving $\inj$: Suppose we have the
-regular expression $((((a+b)+ab)+c)+abc)^*$ and want to match it against
+The following example gives a taste of $\textit{inj}$'s effect
+and how Sulzmann and Lu's algorithm works as a whole.
+ Suppose we have a
+regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it against
 the string $abc$ (when $abc$ is written as a regular expression, the most
 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit
-the parentheses and dots here for readability). By POSIX rules the algorithm
-should go for the longest matching, i.e. it should match the string
-$abc$ in one star iteration, using the longest string $abc$ in the
-sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this sub-expression
+the parentheses and dots here for readability). 
+This algorithm returns a POSIX value, which means it
+will go for the longest matching, i.e.~it should match the string
+$abc$ in one star iteration, using the longest alternative $abc$ in the
+sub-expression $((((a+b)+ab)+c)+abc)$ (we use $r$ to denote this sub-expression
 for conciseness). 
-Here is how the lexer achieves a parse tree for this
-matching. First, build successive derivatives until we exhaust the
-string, as illustrated here (we simplified some regular expressions like
-$0 \cdot b$ to $0$ for conciseness; we also omit some parentheses if
+Before $\textit{inj}$ comes into play, 
+our lexer first builds derivative using string $abc$ (we simplified some regular expressions like
+$0 \cdot b$ to $0$ for conciseness; we also omit parentheses if
 they are clear from the context):
 %Similarly, we allow
 %$\textit{ALT}$ to take a list of regular expressions as an argument
@@ -573,58 +575,83 @@
 \end{center}
 
 \noindent
-Now instead when $nullable$ gives a $yes$ on $r_3$, we  call $mkeps$ 
+Now when $nullable$ gives a $yes$ on $r_3$, we  call $mkeps$ 
 to construct a parse tree for how $r_3$ matched the string $abc$. 
 $mkeps$ gives the following value $v_3$: 
-\[$Left(Left(Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])))$\]
-The outer 2 $Left$ tells us the first nullable part  hides in the term
-\[((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* \]
- in \[r_3 = ( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0  + 1 + 0)
-  \cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* )
- \](underlined). Note that the leftmost location of term \[((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* \]
+\begin{center}
+$\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
+\end{center}
+The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$(underlined):
+\begin{center}
+   $( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0  + 1 + 0)
+  \cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* ).$
+  
+ \end{center}
+ Note that the leftmost location of term $((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*$
  (which corresponds to the initial sub-match $abc$)
-  allows $mkeps$ to choose  it as the first candidate that meets the requirement of being $nullable$
-  because $mkeps$ is defined to always choose the left one when nullable.
+  allows $mkeps$ to pick it up
+  because $mkeps$ is defined to always choose the left one when it is nullable.
   In the case of this example, $abc$ is preferred over $a$ or $ab$.
-   This location is naturally generated by the splitting clause\[
-     (r_1 \cdot r_2)\backslash c  (when \; r_1 \; nullable)) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c\].
+   This $\Left(\Left(\ldots))$ location is naturally generated by 
+   two applications of the splitting clause
+   \begin{center}
+     $(r_1 \cdot r_2)\backslash c  (when \; r_1 \; nullable) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.$
+     \end{center}
        By this clause, we put
 $r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. 
 This allows $mkeps$ to always pick up among two matches the one with a longer initial sub-match.
- The inside subvalue 
- \[$Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])$\]
-tells us how about the empty string matches 
-\[(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*  \],
-a sequence of 2 empty regular expressions, the first one is an alternative, we take the rightmost alternative 
-to get that empty regular expression. The second one is a star, which iterates 0 times to form the second
-empty regular expression.
-
+ Removing the outside $\Left(\Left(...))$, the inside sub-value 
+ \begin{center}
+ $\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$
+ \end{center}
+tells us how the empty string $[]$ is matched with $(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*$.
+We match $[]$ by a sequence of 2 nullable regular expressions. 
+The first one is an alternative, we take the rightmost alternative---whose language
+contains the empty string. The second nullable regular 
+expression is a Kleene star. $\Stars$ tells us how it
+generates the nullable regular expression: by 0 iterations to form $\epsilon$.
+Now $\textit{inj}$ injects characters back and
+ incrementally builds a parse tree based on $v_3$.
 Using the value $v_3$, the character c, and the regular expression $r_2$, 
 we can recover how $r_2$ matched the string $[c]$ : 
- $\textit{inj} r_2 c v_3$ gives us\[ v_2 = Left(Seq(Right(Seq(Empty, Seq(Empty, c))), Stars []))\],
-which tells us how $r_2$ matched $c$. After this we inject back the character $b$, and get
-\[ v_1 = Seq(Right(Seq(Empty, Seq(b, c))), Stars [])\]
+ $\textit{inj} \; r_2 \; c \; v_3$ gives us
+ \begin{center}
+ $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$
+ \end{center}
+which tells us how $r_2$ matched $[c]$. After this we inject back the character $b$, and get
+\begin{center}
+$v_1 = \Seq(\Right(\Seq(\Empty, \Seq(b, c))), \Stars [])$
+\end{center}
  for how 
- \[r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*\]
+ \begin{center}
+ $r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*$
+ \end{center}
   matched  the string $bc$ before it split into 2 pieces. 
   Finally, after injecting character $a$ back to $v_1$, 
-  we get  the parse tree $v_0= Stars [Right(Seq(a, Seq(b, c)))]$ for how $r$ matched $abc$.
+  we get  the parse tree 
+  \begin{center}
+  $v_0= \Stars [\Right(\Seq(a, \Seq(b, c)))]$
+  \end{center}
+   for how $r$ matched $abc$. This completes the algorithm.
+   
 %We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. 
 Readers might have noticed that the parse tree information 
 is actually already available when doing derivatives. 
 For example, immediately after the operation $\backslash a$ we know that if we want to match a string that starts with $a$,
  we can either take the initial match to be 
+ \begin{center}
 \begin{enumerate}
     \item[1)] just $a$ or
     \item[2)] string $ab$ or 
     \item[3)] string $abc$.
 \end{enumerate}
+\end{center}
 In order to differentiate between these choices, 
 we just need to remember their positions--$a$ is on the left, $ab$ is in the middle , and $abc$ is on the right. 
 Which one of these alternatives is chosen later does not affect their relative position because our algorithm 
 does not change this order. If this parsing information can be determined and does
 not change because of later derivatives,
-there is no point in traversing this information twice. This leads to a new approach of lexing-- if we store the information for parse trees  in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and calling $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an n-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives.
+there is no point in traversing this information twice. This leads to an optimization---if we store the information for parse trees  in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and call $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an $n$-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives.
 
 In the next section, we shall focus on the bit-coded algorithm and the
 process of simplification of regular expressions. This is needed in