ninems/ninems.tex
changeset 65 df2e0faccb23
parent 64 afd0d702a4fe
child 66 8c8c82c0515f
equal deleted inserted replaced
64:afd0d702a4fe 65:df2e0faccb23
   543 that the hole will be on $r_a$. So we recursively call $\inj\, 
   543 that the hole will be on $r_a$. So we recursively call $\inj\, 
   544 r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value 
   544 r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value 
   545 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
   545 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
   546 Other clauses can be understood in a similar way.
   546 Other clauses can be understood in a similar way.
   547 
   547 
   548 Let us do some further examples involving $\inj$: Suppose we have the
   548 The following example gives a taste of $\textit{inj}$'s effect
   549 regular expression $((((a+b)+ab)+c)+abc)^*$ and want to match it against
   549 and how Sulzmann and Lu's algorithm works as a whole.
       
   550  Suppose we have a
       
   551 regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it against
   550 the string $abc$ (when $abc$ is written as a regular expression, the most
   552 the string $abc$ (when $abc$ is written as a regular expression, the most
   551 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit
   553 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit
   552 the parentheses and dots here for readability). By POSIX rules the algorithm
   554 the parentheses and dots here for readability). 
   553 should go for the longest matching, i.e. it should match the string
   555 This algorithm returns a POSIX value, which means it
   554 $abc$ in one star iteration, using the longest string $abc$ in the
   556 will go for the longest matching, i.e.~it should match the string
   555 sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this sub-expression
   557 $abc$ in one star iteration, using the longest alternative $abc$ in the
       
   558 sub-expression $((((a+b)+ab)+c)+abc)$ (we use $r$ to denote this sub-expression
   556 for conciseness). 
   559 for conciseness). 
   557 Here is how the lexer achieves a parse tree for this
   560 Before $\textit{inj}$ comes into play, 
   558 matching. First, build successive derivatives until we exhaust the
   561 our lexer first builds derivative using string $abc$ (we simplified some regular expressions like
   559 string, as illustrated here (we simplified some regular expressions like
   562 $0 \cdot b$ to $0$ for conciseness; we also omit parentheses if
   560 $0 \cdot b$ to $0$ for conciseness; we also omit some parentheses if
       
   561 they are clear from the context):
   563 they are clear from the context):
   562 %Similarly, we allow
   564 %Similarly, we allow
   563 %$\textit{ALT}$ to take a list of regular expressions as an argument
   565 %$\textit{ALT}$ to take a list of regular expressions as an argument
   564 %instead of just 2 operands to reduce the nested depth of
   566 %instead of just 2 operands to reduce the nested depth of
   565 %$\textit{ALT}$
   567 %$\textit{ALT}$
   571       &                              & $\phantom{r_3 = (} ((0+1+0  + 0 + 0) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^* )$
   573       &                              & $\phantom{r_3 = (} ((0+1+0  + 0 + 0) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^* )$
   572 \end{tabular}
   574 \end{tabular}
   573 \end{center}
   575 \end{center}
   574 
   576 
   575 \noindent
   577 \noindent
   576 Now instead when $nullable$ gives a $yes$ on $r_3$, we  call $mkeps$ 
   578 Now when $nullable$ gives a $yes$ on $r_3$, we  call $mkeps$ 
   577 to construct a parse tree for how $r_3$ matched the string $abc$. 
   579 to construct a parse tree for how $r_3$ matched the string $abc$. 
   578 $mkeps$ gives the following value $v_3$: 
   580 $mkeps$ gives the following value $v_3$: 
   579 \[$Left(Left(Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])))$\]
   581 \begin{center}
   580 The outer 2 $Left$ tells us the first nullable part  hides in the term
   582 $\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
   581 \[((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* \]
   583 \end{center}
   582  in \[r_3 = ( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0  + 1 + 0)
   584 The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$(underlined):
   583   \cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* )
   585 \begin{center}
   584  \](underlined). Note that the leftmost location of term \[((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* \]
   586    $( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0  + 1 + 0)
       
   587   \cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* ).$
       
   588   
       
   589  \end{center}
       
   590  Note that the leftmost location of term $((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*$
   585  (which corresponds to the initial sub-match $abc$)
   591  (which corresponds to the initial sub-match $abc$)
   586   allows $mkeps$ to choose  it as the first candidate that meets the requirement of being $nullable$
   592   allows $mkeps$ to pick it up
   587   because $mkeps$ is defined to always choose the left one when nullable.
   593   because $mkeps$ is defined to always choose the left one when it is nullable.
   588   In the case of this example, $abc$ is preferred over $a$ or $ab$.
   594   In the case of this example, $abc$ is preferred over $a$ or $ab$.
   589    This location is naturally generated by the splitting clause\[
   595    This $\Left(\Left(\ldots))$ location is naturally generated by 
   590      (r_1 \cdot r_2)\backslash c  (when \; r_1 \; nullable)) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c\].
   596    two applications of the splitting clause
       
   597    \begin{center}
       
   598      $(r_1 \cdot r_2)\backslash c  (when \; r_1 \; nullable) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.$
       
   599      \end{center}
   591        By this clause, we put
   600        By this clause, we put
   592 $r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. 
   601 $r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. 
   593 This allows $mkeps$ to always pick up among two matches the one with a longer initial sub-match.
   602 This allows $mkeps$ to always pick up among two matches the one with a longer initial sub-match.
   594  The inside subvalue 
   603  Removing the outside $\Left(\Left(...))$, the inside sub-value 
   595  \[$Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])$\]
   604  \begin{center}
   596 tells us how about the empty string matches 
   605  $\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$
   597 \[(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*  \],
   606  \end{center}
   598 a sequence of 2 empty regular expressions, the first one is an alternative, we take the rightmost alternative 
   607 tells us how the empty string $[]$ is matched with $(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*$.
   599 to get that empty regular expression. The second one is a star, which iterates 0 times to form the second
   608 We match $[]$ by a sequence of 2 nullable regular expressions. 
   600 empty regular expression.
   609 The first one is an alternative, we take the rightmost alternative---whose language
   601 
   610 contains the empty string. The second nullable regular 
       
   611 expression is a Kleene star. $\Stars$ tells us how it
       
   612 generates the nullable regular expression: by 0 iterations to form $\epsilon$.
       
   613 Now $\textit{inj}$ injects characters back and
       
   614  incrementally builds a parse tree based on $v_3$.
   602 Using the value $v_3$, the character c, and the regular expression $r_2$, 
   615 Using the value $v_3$, the character c, and the regular expression $r_2$, 
   603 we can recover how $r_2$ matched the string $[c]$ : 
   616 we can recover how $r_2$ matched the string $[c]$ : 
   604  $\textit{inj} r_2 c v_3$ gives us\[ v_2 = Left(Seq(Right(Seq(Empty, Seq(Empty, c))), Stars []))\],
   617  $\textit{inj} \; r_2 \; c \; v_3$ gives us
   605 which tells us how $r_2$ matched $c$. After this we inject back the character $b$, and get
   618  \begin{center}
   606 \[ v_1 = Seq(Right(Seq(Empty, Seq(b, c))), Stars [])\]
   619  $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$
       
   620  \end{center}
       
   621 which tells us how $r_2$ matched $[c]$. After this we inject back the character $b$, and get
       
   622 \begin{center}
       
   623 $v_1 = \Seq(\Right(\Seq(\Empty, \Seq(b, c))), \Stars [])$
       
   624 \end{center}
   607  for how 
   625  for how 
   608  \[r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*\]
   626  \begin{center}
       
   627  $r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*$
       
   628  \end{center}
   609   matched  the string $bc$ before it split into 2 pieces. 
   629   matched  the string $bc$ before it split into 2 pieces. 
   610   Finally, after injecting character $a$ back to $v_1$, 
   630   Finally, after injecting character $a$ back to $v_1$, 
   611   we get  the parse tree $v_0= Stars [Right(Seq(a, Seq(b, c)))]$ for how $r$ matched $abc$.
   631   we get  the parse tree 
       
   632   \begin{center}
       
   633   $v_0= \Stars [\Right(\Seq(a, \Seq(b, c)))]$
       
   634   \end{center}
       
   635    for how $r$ matched $abc$. This completes the algorithm.
       
   636    
   612 %We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. 
   637 %We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. 
   613 Readers might have noticed that the parse tree information 
   638 Readers might have noticed that the parse tree information 
   614 is actually already available when doing derivatives. 
   639 is actually already available when doing derivatives. 
   615 For example, immediately after the operation $\backslash a$ we know that if we want to match a string that starts with $a$,
   640 For example, immediately after the operation $\backslash a$ we know that if we want to match a string that starts with $a$,
   616  we can either take the initial match to be 
   641  we can either take the initial match to be 
       
   642  \begin{center}
   617 \begin{enumerate}
   643 \begin{enumerate}
   618     \item[1)] just $a$ or
   644     \item[1)] just $a$ or
   619     \item[2)] string $ab$ or 
   645     \item[2)] string $ab$ or 
   620     \item[3)] string $abc$.
   646     \item[3)] string $abc$.
   621 \end{enumerate}
   647 \end{enumerate}
       
   648 \end{center}
   622 In order to differentiate between these choices, 
   649 In order to differentiate between these choices, 
   623 we just need to remember their positions--$a$ is on the left, $ab$ is in the middle , and $abc$ is on the right. 
   650 we just need to remember their positions--$a$ is on the left, $ab$ is in the middle , and $abc$ is on the right. 
   624 Which one of these alternatives is chosen later does not affect their relative position because our algorithm 
   651 Which one of these alternatives is chosen later does not affect their relative position because our algorithm 
   625 does not change this order. If this parsing information can be determined and does
   652 does not change this order. If this parsing information can be determined and does
   626 not change because of later derivatives,
   653 not change because of later derivatives,
   627 there is no point in traversing this information twice. This leads to a new approach of lexing-- if we store the information for parse trees  in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and calling $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an n-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives.
   654 there is no point in traversing this information twice. This leads to an optimization---if we store the information for parse trees  in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and call $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an $n$-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives.
   628 
   655 
   629 In the next section, we shall focus on the bit-coded algorithm and the
   656 In the next section, we shall focus on the bit-coded algorithm and the
   630 process of simplification of regular expressions. This is needed in
   657 process of simplification of regular expressions. This is needed in
   631 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann
   658 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann
   632 and Lu's algorithms.  This is where the PhD-project aims to advance the
   659 and Lu's algorithms.  This is where the PhD-project aims to advance the