ninems/ninems.tex
changeset 61 580c7b84f900
parent 60 c737a0259194
child 62 5b10d83b0834
By this clause, we put
$r_1 \backslash c \cdot r_2$ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$.
This allows $mkeps$ to always pick, among the two matches, the one with the longer initial sub-match.
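
To make these definitions concrete, the following is a minimal sketch in Scala of $\textit{nullable}$ and of the derivative with exactly this sequence clause; the names (\texttt{Rexp}, \texttt{der}, and so on) are our own illustrative choices and are not taken verbatim from \cite{Sulzmann2014}.
\begin{verbatim}
// Sketch of regular expressions and Brzozowski derivatives in Scala.
// The 0 and 1 of the text are written ZERO and ONE here.
abstract class Rexp
case object ZERO extends Rexp                    // 0: matches nothing
case object ONE extends Rexp                     // 1: matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // r*

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    // the clause discussed above: r1\c . r2 at the front, r2\c at the back
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}
\end{verbatim}
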
The inside subvalue
\[Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])\]
tells us how the empty string is matched by
\[(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*,\]
a sequence of two nullable regular expressions: the first one is an alternative, in which we take the rightmost branch (the only nullable one) to match the empty string; the second one is a star, which iterates zero times and therefore also matches the empty string.

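The following continues the Scala sketch above with values and $mkeps$; the constructor names (for instance \texttt{Sequ} instead of the $Seq$ written in the text, to avoid a clash with Scala's built-in \texttt{Seq}) are again our own illustration.
\begin{verbatim}
// Values record *how* a regular expression matched a string.
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val     // written Seq(...) in the text
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// mkeps: the value for how a nullable regular expression matches the empty
// string. For alternatives it prefers the left branch whenever that branch is
// nullable; in the alternative above only the rightmost branch is nullable,
// hence the Right in the subvalue.
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) => if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)                      // zero iterations
}
\end{verbatim}
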
Using the value $v_3$, the character $c$, and the regular expression $r_2$,
we can recover how $r_2$ matched the string $[c]$:
$\textit{inj}\;r_2\;c\;v_3$ gives us
\[v_2 = Left(Seq(Right(Seq(Empty, Seq(Empty, c))), Stars [])),\]
which tells us how $r_2$ matched $c$. After this we inject back the character $b$ and get
\[v_1 = Seq(Right(Seq(Empty, Seq(b, c))), Stars [])\]
for how
\[r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^*\]
matched the string $bc$ before it was split into two pieces.
Finally, after injecting the character $a$ back into $v_1$,
we obtain the parse tree $v_0 = Stars\,[Right(Seq(a, Seq(b, c)))]$ for how $r$ matched $abc$.
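
A sketch of the injection function, following Sulzmann and Lu's definition \cite{Sulzmann2014} and continuing the Scala sketch above, is:
\begin{verbatim}
// inj "injects" the character c back into the value v, undoing one derivative
// step: from a value for der(c, r) it builds a value for r.
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (STAR(r1), Sequ(v1, Stars(vs)))   => Stars(inj(r1, c, v1) :: vs)
  case (SEQ(r1, r2), Sequ(v1, v2))       => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2))          => Sequ(mkeps(r1), inj(r2, c, v2))
  case (ALT(r1, r2), Left(v1))           => Left(inj(r1, c, v1))
  case (ALT(r1, r2), Right(v2))          => Right(inj(r2, c, v2))
  case (CHAR(d), Empty)                  => Chr(d)
}
\end{verbatim}
The clauses for sequences mirror the sequence clause of the derivative: the \texttt{Right} case is the one where the nullable front part $r_1$ contributed only the empty string, so its value is re-created with $\textit{mkeps}$.
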
       
%We omit the details of the injection function, which is provided in Sulzmann and Lu's paper \cite{Sulzmann2014}.
Readers might have noticed that the parse tree information
is actually already available when doing derivatives.
For example, immediately after the operation $\backslash a$ we know that if we want to match a string that starts with $a$,
we can take the initial match to be
\begin{enumerate}
    \item[1)] just $a$ or
    \item[2)] string $ab$ or
    \item[3)] string $abc$.
\end{enumerate}
In order to differentiate between these choices,
we just need to remember their positions: $a$ is on the left, $ab$ is in the middle, and $abc$ is on the right.
Which one of these alternatives is chosen later does not affect their relative position, because our algorithm
does not change this order. If this parsing information can be determined and does
not change because of later derivatives,
there is no point in traversing it twice. This leads to a new approach to lexing: if we store the parse tree information in the corresponding pieces of the regular expression, update this information when we take derivatives, and collect it once we are finished with the derivatives and call $mkeps$ to decide which branch is POSIX, then we can generate the parse tree in one pass, instead of doing an $n$-step backward transformation. This leads to Sulzmann and Lu's novel idea of using bit-codes in derivatives.
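
One possible way to store this information is to annotate each regular expression with a list of bits; the following Scala sketch, with constructor names of our own choosing rather than necessarily those of Sulzmann and Lu, illustrates the idea.
\begin{verbatim}
// Bits record parsing decisions: Z/S mark, for example, which branch of an
// alternative was taken and how many iterations a star performed.
abstract class Bit
case object Z extends Bit
case object S extends Bit

// Annotated ("bit-coded") regular expressions: the same shape as Rexp above,
// but every node carries the bits collected for it so far.
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALT(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp

// fuse attaches freshly made decisions to the front of an annotated regular
// expression; a bit-coded derivative can use it to push, e.g., a Z into the
// left alternative and an S into the right one.
def fuse(bs: List[Bit], r: ARexp): ARexp = r match {
  case AZERO            => AZERO
  case AONE(cs)         => AONE(bs ++ cs)
  case ACHAR(cs, c)     => ACHAR(bs ++ cs, c)
  case AALT(cs, r1, r2) => AALT(bs ++ cs, r1, r2)
  case ASEQ(cs, r1, r2) => ASEQ(bs ++ cs, r1, r2)
  case ASTAR(cs, r1)    => ASTAR(bs ++ cs, r1)
}
\end{verbatim}
The intention is that a bit-coded version of $mkeps$ then only needs to collect the bits of the remaining nullable expression, from which the parse tree can be decoded against the original regular expression in a single pass.
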

In the next section, we shall focus on the bit-coded algorithm and the natural
process of simplification of regular expressions using bit-codes, which is needed in
order to obtain \emph{fast} versions of Brzozowski's and of Sulzmann
and Lu's algorithms.  This is where the PhD-project hopes to advance
\section{Simplification of Regular Expressions}
Using bit-codes to guide parsing is not a new idea.
It was applied to context-free grammars and then adapted by Henglein and Nielsen for efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and Lu took this a step further by integrating bitcodes into derivatives.

The argument for complicating the data structures from basic regular expressions to those with bitcodes
is that we can introduce simplification without making the algorithm crash or overly complex to reason about.
The reason why we need simplification is the shortcoming of the previous algorithm.
The main drawback of building successive derivatives according to
Brzozowski's definition is that they can grow very quickly in size.
This is mainly due to the fact that the derivative operation often generates
``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result,
if implemented naively, both algorithms by Brzozowski and by Sulzmann