ninems/ninems.tex
changeset 43 52a3cec0a5c7
parent 42 2a469388c989
child 44 4d674a971852
equal deleted inserted replaced
42:2a469388c989 43:52a3cec0a5c7
   475 the state-of-the-art.
   475 the state-of-the-art.
   476 
   476 
   477 
   477 
   478 \section{Simplification of Regular Expressions}
   478 \section{Simplification of Regular Expressions}
   479 Using bit-codes to guide  parsing is not a new idea.
   479 Using bit-codes to guide  parsing is not a new idea.
   480 
   480 It was applied to context free grammars and then adapted by Henglein and Nielson for efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and Lu took a step further by intergrating bitcodes into derivatives.
       
   481 
       
   482 The argument for complicating the data structures from basic regular expressions to those with bitcodes
       
   483 is that we can introduce simplification without making the algorithm crash or impossible to reason about.
       
   484 The reason why we need simplification is due to the shortcoming of a naive algorithm using Brzozowski's definition only. 
   481 The main drawback of building successive derivatives according to
   485 The main drawback of building successive derivatives according to
   482 Brzozowski's definition is that they can grow very quickly in size.
   486 Brzozowski's definition is that they can grow very quickly in size.
   483 This is mainly due to the fact that the derivative operation generates
   487 This is mainly due to the fact that the derivative operation generates
   484 often ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result,
   488 often ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result,
   485 if implemented naively both algorithms by Brzozowski and by Sulzmann
   489 if implemented naively both algorithms by Brzozowski and by Sulzmann
   512 
   516 
   513  %We believe, and have generated test
   517  %We believe, and have generated test
   514 %data, that a similar bound can be obtained for the derivatives in
   518 %data, that a similar bound can be obtained for the derivatives in
   515 %Sulzmann and Lu's algorithm. Let us give some details about this next.
   519 %Sulzmann and Lu's algorithm. Let us give some details about this next.
   516 
   520 
   517 We first followed Sulzmann and Lu's idea of introducing
   521 Bit-codes look like this:
   518 \emph{annotated regular expressions}~\cite{Sulzmann2014}. They are
   522 \[			b ::=   S \mid  Z \; \;\;
   519 defined by the following grammar:
   523 bs ::= [] \mid b:bs    
   520 
   524 \]
   521 \begin{center}
   525 They are just a string of bits, the names "S" and "Z"  here are kind of arbitrary, we can use 0 and 1 or binary symbol to substitute them. They are a compact form of parse trees.
   522 \begin{tabular}{lcl}
   526 Here is how values and bit-codes are related:
   523   $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
   527 Bitcodes are essentially incomplete values.
   524                   & $\mid$ & $\textit{ONE}\;\;bs$\\
       
   525                   & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
       
   526                   & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
       
   527                   & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
       
   528                   & $\mid$ & $\textit{STAR}\;\;bs\,a$
       
   529 \end{tabular}    
       
   530 \end{center}  
       
   531 
       
   532 \noindent
       
   533 where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a
       
   534 list of annotated regular expressions. These bitsequences encode
       
   535 information about the (POSIX) value that should be generated by the
       
   536 Sulzmann and Lu algorithm. Bitcodes are essentially incomplete values.
       
   537 This can be straightforwardly seen in the following transformation: 
   528 This can be straightforwardly seen in the following transformation: 
   538 \begin{center}
   529 \begin{center}
   539 \begin{tabular}{lcl}
   530 \begin{tabular}{lcl}
   540   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
   531   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
   541   $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
   532   $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
   577        \textit{else}\;\textit{None}$                       
   568        \textit{else}\;\textit{None}$                       
   578 \end{tabular}    
   569 \end{tabular}    
   579 \end{center}    
   570 \end{center}    
   580 %\end{definition}
   571 %\end{definition}
   581 
   572 
       
   573 
       
   574 Sulzmann and Lu's integrated the bitcodes into annotated regular expressions by attaching them to the head of every substructure of a regularexpression\emph{annotated regular expressions}~\cite{Sulzmann2014}. They are
       
   575 defined by the following grammar:
       
   576 
       
   577 \begin{center}
       
   578 \begin{tabular}{lcl}
       
   579   $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
       
   580                   & $\mid$ & $\textit{ONE}\;\;bs$\\
       
   581                   & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
       
   582                   & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
       
   583                   & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
       
   584                   & $\mid$ & $\textit{STAR}\;\;bs\,a$
       
   585 \end{tabular}    
       
   586 \end{center}  
       
   587 
       
   588 \noindent
       
   589 where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a
       
   590 list of annotated regular expressions. These bitsequences encode
       
   591 information about the (POSIX) value that should be generated by the
       
   592 Sulzmann and Lu algorithm. 
       
   593 
   582 To do lexing using annotated regular expressions, we shall first transform the
   594 To do lexing using annotated regular expressions, we shall first transform the
   583 usual (un-annotated) regular expressions into annotated regular
   595 usual (un-annotated) regular expressions into annotated regular
   584 expressions:\\
   596 expressions:\\
   585 %\begin{definition}
   597 %\begin{definition}
   586 \begin{center}
   598 \begin{center}
   596   $(r^*)^\uparrow$ & $\dn$ &
   608   $(r^*)^\uparrow$ & $\dn$ &
   597          $\textit{STAR}\;[]\,r^\uparrow$\\
   609          $\textit{STAR}\;[]\,r^\uparrow$\\
   598 \end{tabular}    
   610 \end{tabular}    
   599 \end{center}    
   611 \end{center}    
   600 %\end{definition}
   612 %\end{definition}
   601 Then we do successive derivative operations on the annotated regular expression. This derivative operation is the same as what we previously have for the simple regular expressions, except that we take special care of the bits to store the parse tree information:\\
   613 After internalise we do successive derivative operations on the annotated regular expression.
       
   614 Here $fuse$ is an auxiliary  function that helps to attach bits to the front of an annotated regular expression.
       
   615  This derivative operation is the same as what we previously have for the simple regular expressions, except that we take special care of the bits to store the parse tree information:\\
   602 %\begin{definition}{bder}
   616 %\begin{definition}{bder}
   603 \begin{center}
   617 \begin{center}
   604   \begin{tabular}{@{}lcl@{}}
   618   \begin{tabular}{@{}lcl@{}}
   605   $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
   619   $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
   606   $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
   620   $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\