ninems/ninems.tex
changeset 42 2a469388c989
parent 41 a1f90febbc7f
child 43 52a3cec0a5c7
equal deleted inserted replaced
41:a1f90febbc7f 42:2a469388c989
     3 \usepackage{data}
     3 \usepackage{data}
     4 \usepackage{tikz-cd}
     4 \usepackage{tikz-cd}
     5 \usepackage{algorithm}
     5 \usepackage{algorithm}
     6 \usepackage{amsmath}
     6 \usepackage{amsmath}
     7 \usepackage[noend]{algpseudocode}
     7 \usepackage[noend]{algpseudocode}
       
     8 \usepackage{enumitem}
     8 % \documentclass{article}
     9 % \documentclass{article}
     9 %\usepackage[utf8]{inputenc}
    10 %\usepackage[utf8]{inputenc}
    10 %\usepackage[english]{babel}
    11 %\usepackage[english]{babel}
    11 %\usepackage{listings}
    12 %\usepackage{listings}
    12 % \usepackage{amsthm}
    13 % \usepackage{amsthm}
   455 This corresponds to the leftmost term $((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* $ in $r_3$. Note that its leftmost location allows $mkeps$ to choose  it as the first candidate that meets the requirement of being $nullable$. This location is naturally generated by the splitting clause\\ $(r_1 \cdot r_2)\backslash c  (when \, r_1 \, nullable)) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.  \\$. By this clause, we put
   456 This corresponds to the leftmost term $((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* $ in $r_3$. Note that its leftmost location allows $mkeps$ to choose  it as the first candidate that meets the requirement of being $nullable$. This location is naturally generated by the splitting clause\\ $(r_1 \cdot r_2)\backslash c  (when \, r_1 \, nullable)) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.  \\$. By this clause, we put
   456 $r_1 \backslash c \cdot r_2 $ at the front and $r_2 \backslash c$ at the back. This allows $mkeps$ to always pick up among two matches the one with a longer prefix. The value \\
   457 $r_1 \backslash c \cdot r_2 $ at the front and $r_2 \backslash c$ at the back. This allows $mkeps$ to always pick up among two matches the one with a longer prefix. The value \\
   457 $Left(Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, Empty)))))), Stars []))$\\
   458 $Left(Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, Empty)))))), Stars []))$\\
   458 tells us how about the empty string matches the final regular expression after doing all the derivatives: among the regular expressions $(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r*) +((0+1+0  + 0 + 0) \cdot r*+(0+0+0  + 1 + 0) \cdot r* )$ we choose the left most nullable one, which is composed of a sequence of a nested alternative and a folded star that iterates 0 times. In that nested alternative we take the rightmost alternative.
   459 tells us how about the empty string matches the final regular expression after doing all the derivatives: among the regular expressions $(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r*) +((0+1+0  + 0 + 0) \cdot r*+(0+0+0  + 1 + 0) \cdot r* )$ we choose the left most nullable one, which is composed of a sequence of a nested alternative and a folded star that iterates 0 times. In that nested alternative we take the rightmost alternative.
   459 
   460 
   460 Using the value $v_3$, the character c, and the regular expression $r_2$, we can recover how $r_2$ matched the string $[c]$ : we inject $c$ back to $v_3$, and get $v_2 = Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, c)))))), Stars [])$, which tells us how $r_2$ matched $c$.
   461 Using the value $v_3$, the character c, and the regular expression $r_2$, we can recover how $r_2$ matched the string $[c]$ : we inject $c$ back to $v_3$, and get \\ $v_2 = Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, c)))))), Stars [])$, which tells us how $r_2$ matched $c$. After this we inject back the character $b$, and get\\ $v_1 = Seq(Right(Right(Right(Seq(Empty, Seq(b, c)))))), Stars [])$ for how $r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r*$ matched  the string $bc$ before it split into 2 pieces. Finally, after injecting character a back to $v_1$, we get  the parse tree $v_0= Stars [Right(Right(Right(Seq(a, Seq(b, c)))))]$ for how r matched $abc$.
   461 We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. Rather, we shall focus next on the
   462 We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. 
   462 process of simplification of regular expressions, which is needed in
   463 Readers might have noticed that the parse tree information as actually already available when doing derivatives. For example, immediately after the operation $\backslash a$ we know that if we want to match a string that starts with a, we can either take the initial match to be 
       
   464 \begin{enumerate}
       
   465     \item[1)] just $a$ or
       
   466     \item[2)] string $ab$ or 
       
   467     \item[3)] string $abc$.
       
   468 \end{enumerate}
       
   469 In order to differentiate between these choices, we just need to remember their positions--$a$ is on the left, $ab$ is in the middle , and $abc$ is on the right. Which one of these alternatives is chosen later does not affect their relative position because our algorithm does not change this order. There is no need to traverse this information twice. This leads to a new approach of lexing-- if we store the information for parse trees  in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when fininshed with derivatives and calling $mkeps$ for deiciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an n-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives.
       
   470 
       
   471 In the next section, we shall focus on the bit-coded algorithm and the natural
       
   472 process of simplification of regular expressions using bit-codes, which is needed in
   463 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann
   473 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann
   464 and Lu's algorithms.  This is where the PhD-project hopes to advance
   474 and Lu's algorithms.  This is where the PhD-project hopes to advance
   465 the state-of-the-art.
   475 the state-of-the-art.
   466 
   476 
   467 
   477 
   468 \section{Simplification of Regular Expressions}
   478 \section{Simplification of Regular Expressions}
       
   479 Using bit-codes to guide  parsing is not a new idea.
   469 
   480 
   470 The main drawback of building successive derivatives according to
   481 The main drawback of building successive derivatives according to
   471 Brzozowski's definition is that they can grow very quickly in size.
   482 Brzozowski's definition is that they can grow very quickly in size.
   472 This is mainly due to the fact that the derivative operation generates
   483 This is mainly due to the fact that the derivative operation generates
   473 often ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result,
   484 often ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result,