cst_tests: comparison ninems/ninems.tex


-:2a469388c989
+:52a3cec0a5c7
 the state-of-the-art.
 \section{Simplification of Regular Expressions}
 Using bit-codes to guide  parsing is not a new idea.
+It was first applied to context-free grammars and then adapted by Henglein and Nielsen for efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and Lu took this a step further by integrating bitcodes into derivatives.
+The argument for moving from basic regular expressions to the more complicated data structure with bitcodes
+is that it lets us introduce simplification without breaking the algorithm or making it impossible to reason about.
+We need simplification because of a shortcoming of the naive algorithm that uses only Brzozowski's definition.
 The main drawback of building successive derivatives according to
 Brzozowski's definition is that they can grow very quickly in size.
 This is mainly because the derivative operation often generates
 ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result,
 if implemented naively both algorithms by Brzozowski and by Sulzmann
 %We believe, and have generated test
 %data, that a similar bound can be obtained for the derivatives in
 %Sulzmann and Lu's algorithm. Let us give some details about this next.
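The growth of unsimplified derivatives can be observed with a small experiment. The following Python sketch is only an illustration and not the implementation discussed in this paper: the class names, the helper \textit{size} and the helper \textit{sizes\_along} are assumptions introduced here. It builds successive Brzozowski derivatives of $(a^*)^*$ without any simplification and records the size of each one.

```python
from dataclasses import dataclass

class Rexp:
    """Base class for plain (un-annotated) regular expressions."""

@dataclass(frozen=True)
class Zero(Rexp): pass

@dataclass(frozen=True)
class One(Rexp): pass

@dataclass(frozen=True)
class Char(Rexp):
    c: str

@dataclass(frozen=True)
class Alt(Rexp):
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Seq(Rexp):
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Star(Rexp):
    r: Rexp

def nullable(r):
    """True if r matches the empty string."""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    return False  # Zero and Char

def der(c, r):
    """Brzozowski derivative of r w.r.t. c, with no simplification."""
    if isinstance(r, Char):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        left = Seq(der(c, r.r1), r.r2)
        return Alt(left, der(c, r.r2)) if nullable(r.r1) else left
    if isinstance(r, Star):
        return Seq(der(c, r.r), r)
    return Zero()  # Zero and One

def size(r):
    """Number of constructors in r (hypothetical size measure)."""
    if isinstance(r, (Alt, Seq)):
        return 1 + size(r.r1) + size(r.r2)
    if isinstance(r, Star):
        return 1 + size(r.r)
    return 1

def sizes_along(r, s):
    """Sizes of r and of each successive derivative along the string s."""
    result = [size(r)]
    for c in s:
        r = der(c, r)
        result.append(size(r))
    return result

# (a*)* grows from size 3 to 8 to 22 after two derivative steps.
print(sizes_along(Star(Star(Char("a"))), "aa"))
```

The second derivative already contains the subterm $\ZERO \cdot a^*$, which can never match anything; this is the kind of ``useless'' structure that simplification is meant to remove.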
-We first followed Sulzmann and Lu's idea of introducing
+Bit-codes look like this:
-\emph{annotated regular expressions}~\cite{Sulzmann2014}. They are
+\[ b ::= S \mid Z \qquad
-defined by the following grammar:
+bs ::= [] \mid b:bs
+\]
-\begin{center}
+They are simply sequences of bits. The names $S$ and $Z$ are arbitrary; we could equally use 0 and 1 or any other pair of binary symbols. Bit-codes are a compact representation of parse trees.
-\begin{tabular}{lcl}
+Here is how values and bit-codes are related:
-$\textit{a}$ & $::=$  & $\textit{ZERO}$\\
+Bitcodes are essentially incomplete values.
-& $\mid$ & $\textit{ONE}\;\;bs$\\
-& $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
-& $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
-& $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
-& $\mid$ & $\textit{STAR}\;\;bs\,a$
-\end{tabular}
-\end{center}
-\noindent
-where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a
-list of annotated regular expressions. These bitsequences encode
-information about the (POSIX) value that should be generated by the
-Sulzmann and Lu algorithm. Bitcodes are essentially incomplete values.
 This can be straightforwardly seen in the following transformation:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
 \textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
 %\end{definition}
+Sulzmann and Lu integrated bitcodes into regular expressions by attaching them to the head of every substructure of a regular expression, which yields \emph{annotated regular expressions}~\cite{Sulzmann2014}. They are
+defined by the following grammar:
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{a}$ & $::=$  & $\textit{ZERO}$\\
+& $\mid$ & $\textit{ONE}\;\;bs$\\
+& $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
+& $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
+& $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
+& $\mid$ & $\textit{STAR}\;\;bs\,a$
+\end{tabular}
+\end{center}
+\noindent
+where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a
+list of annotated regular expressions. These bitsequences encode
+information about the (POSIX) value that should be generated by the
+Sulzmann and Lu algorithm.
 To do lexing using annotated regular expressions, we shall first transform the
 usual (un-annotated) regular expressions into annotated regular
 expressions:\\
 %\begin{definition}
 \begin{center}
 $(r^*)^\uparrow$ & $\dn$ &
 $\textit{STAR}\;[]\,r^\uparrow$\\
 \end{tabular}
 \end{center}
 %\end{definition}
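To make the data structure concrete, the grammar of annotated regular expressions and the operation of attaching bits to the front of one might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation behind this paper: the representation of bits as the strings \texttt{"S"} and \texttt{"Z"}, the field name \texttt{rs} (used because \texttt{as} is a Python keyword) and the behaviour of \texttt{fuse} on \textit{ZERO} (which carries no bits and is returned unchanged) are choices made here for illustration.

```python
from dataclasses import dataclass, replace
from typing import Tuple

Bits = Tuple[str, ...]  # assumed encoding: each entry is "S" or "Z"

class ARexp:
    """Base class for annotated regular expressions."""

@dataclass(frozen=True)
class ZERO(ARexp): pass          # carries no bits

@dataclass(frozen=True)
class ONE(ARexp):
    bs: Bits

@dataclass(frozen=True)
class CHAR(ARexp):
    bs: Bits
    c: str

@dataclass(frozen=True)
class ALTS(ARexp):
    bs: Bits
    rs: Tuple[ARexp, ...]        # the list `as` of the grammar

@dataclass(frozen=True)
class SEQ(ARexp):
    bs: Bits
    a1: ARexp
    a2: ARexp

@dataclass(frozen=True)
class STAR(ARexp):
    bs: Bits
    a: ARexp

def fuse(bs: Bits, a: ARexp) -> ARexp:
    """Attach the bits bs to the front of the top-level bits of a."""
    if isinstance(a, ZERO):
        return a                 # assumption: nothing to attach to
    return replace(a, bs=bs + a.bs)

# Prepending Z to a character that already carries S.
print(fuse(("Z",), CHAR(("S",), "a")))
```

Only the outermost constructor is touched: the bits stored deeper inside the expression stay where they are.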
-Then we do successive derivative operations on the annotated regular expression. This derivative operation is the same as what we previously have for the simple regular expressions, except that we take special care of the bits to store the parse tree information:\\
+In the definition above, $\textit{fuse}$ is an auxiliary function that attaches bits to the front of an annotated regular expression.
+After internalising the regular expression, we apply successive derivative operations to the annotated regular expression.
+This derivative operation is the same as the one we defined previously for simple regular expressions, except that we take special care of the bits, which store the parse tree information:\\
 %\begin{definition}{bder}
 \begin{center}
 \begin{tabular}{@{}lcl@{}}
 $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$\\
 $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\

changeset 43	52a3cec0a5c7
parent 42	2a469388c989
child 44	4d674a971852