cst_tests: comparison ninems/ninems.tex


-:4c7173b7ddca
+:cab5eab1f6f1
 \usepackage{tikz-cd}
 %\usepackage{algorithm}
 \usepackage{amsmath}
 \usepackage[noend]{algpseudocode}
 \usepackage{enumitem}
+\usepackage{nccmath}
 \definecolor{darkblue}{rgb}{0,0,0.6}
 \hypersetup{colorlinks=true,allcolors=darkblue}
+\newcommand{\comment}[1]%
+{{\color{red}$\Rightarrow$}\marginpar{\raggedright\small{\bf\color{red}#1}}}
 % \documentclass{article}
 %\usepackage[utf8]{inputenc}
 %\usepackage[english]{babel}
 %\usepackage{listings}
 % \usepackage{amsthm}
 looked at extended regular expressions, such as bounded repetitions,
 negation and back-references.
 \end{abstract}
 \section{Introduction}
 This PhD-project is about regular expression matching and
 lexing. Given the maturity of this topic, the reader might wonder:
 Surely, regular expressions must have already been studied to death?
 What could possibly be \emph{not} known in this area? And surely all
 algorithm by a second phase (the first phase being building successive
 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
 is generated assuming the regular expression matches  the string.
 Pictorially, the algorithm is as follows:
+\begin{ceqn}
 \begin{equation}\label{graph:2}
 \begin{tikzcd}
 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
 \end{tikzcd}
 \end{equation}
+\end{ceqn}
 \noindent
 For convenience, we shall employ the following notations: the regular expression we
 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
 \ldots c_{n-1}$. In  the first phase, we build the derivatives $r_1$, $r_2$, \ldots  according to
 that the hole will be on $r_a$. So we recursively call $\inj\,
 r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value
 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
 Other clauses can be understood in a similar way.
-The following example gives a taste of $\textit{inj}$'s effect
+The following example gives a \comment{Other word: insight?}taste of $\textit{inj}$'s effect
 and how Sulzmann and Lu's algorithm works as a whole.
 Suppose we have a
 regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it against
 the string $abc$ (when $abc$ is written as a regular expression, the most
 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit
 &                              & $\phantom{r_3 = (} ((0+1+0  + 0 + 0) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^* )$
 \end{tabular}
 \end{center}
 \noindent
-Now when $nullable$ gives a $yes$ on $r_3$, we  call $mkeps$
+In  case $r_3$ is nullable, we can call $mkeps$
 to construct a parse tree for how $r_3$ matched the string $abc$.
 $mkeps$ gives the following value $v_3$:
 \begin{center}
 $\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
 \end{center}
 The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$ (underlined):
 \begin{center}
 $( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0  + 1 + 0)
 \cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* ).$
 \end{center}
-Note that the leftmost location of term $((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*$
-(which corresponds to the initial sub-match $abc$)
+\noindent
-allows $mkeps$ to pick it up
+Note that the leftmost location of term $(0+0+0 + 0 + 1 \cdot 1 \cdot
-because $mkeps$ is defined to always choose the left one when it is nullable.
+1) \cdot r^*$ (which corresponds to the initial sub-match $abc$) allows
-In the case of this example, $abc$ is preferred over $a$ or $ab$.
+$mkeps$ to pick it up because $mkeps$ is defined to always choose the
-This $\Left(\Left(\ldots))$ location is naturally generated by
+left one when it is nullable. In the case of this example, $abc$ is
-two applications of the splitting clause
+preferred over $a$ or $ab$. This $\Left(\Left(\ldots))$ location is
-\begin{center}
+naturally generated by two applications of the splitting clause
+\begin{center}
 $(r_1 \cdot r_2)\backslash c \quad (\textit{when} \; r_1 \; \textit{nullable}) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.$
 \end{center}
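+\noindent
+As a small illustration of this clause (the choice $r_1 = r_2 = a^*$ is
+only an assumed example of ours, using the standard clauses
+$a \backslash a = 1$ and $r^* \backslash c = (r \backslash c) \cdot r^*$),
+we obtain
+\begin{center}
+$(a^* \cdot a^*)\backslash a \;=\; (a^*\backslash a) \cdot a^* \,+\, a^*\backslash a \;=\; (1 \cdot a^*) \cdot a^* \,+\, 1 \cdot a^*.$
+\end{center}
+\noindent
+Both alternatives are nullable. The left one records that the first
+$a^*$ consumed the character $a$, the right one that the second $a^*$
+did.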
-By this clause, we put
-$r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$.
+\noindent
-This allows $mkeps$ to always pick up among two matches the one with a longer initial sub-match.
+By this clause, we put $r_1 \backslash c \cdot r_2 $ at the
-Removing the outside $\Left(\Left(...))$, the inside sub-value
+$\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. This
-\begin{center}
+allows $mkeps$ to always pick up among two matches the one with a longer
+initial sub-match. Removing the outside $\Left(\Left(\ldots))$, the inside
+sub-value
+\begin{center}
 $\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$
 \end{center}
-tells us how the empty string $[]$ is matched with $(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*$.
-We match $[]$ by a sequence of 2 nullable regular expressions.
+\noindent
-The first one is an alternative, we take the rightmost alternative---whose language
+tells us how the empty string $[]$ is matched with $(0+0+0 + 0 + 1 \cdot
-contains the empty string. The second nullable regular
+1 \cdot 1) \cdot r^*$. We match $[]$ by a sequence of 2 nullable regular
-expression is a Kleene star. $\Stars$ tells us how it
+expressions. The first one is an alternative, where we take the rightmost
-generates the nullable regular expression: by 0 iterations to form $\epsilon$.
+alternative---whose language contains the empty string. The second
-Now $\textit{inj}$ injects characters back and
+nullable regular expression is a Kleene star. $\Stars$ tells us how it
-incrementally builds a parse tree based on $v_3$.
+generates the nullable regular expression: by 0 iterations to form
-Using the value $v_3$, the character c, and the regular expression $r_2$,
+$\epsilon$. Now $\textit{inj}$ injects characters back and incrementally
-we can recover how $r_2$ matched the string $[c]$ :
+builds a parse tree based on $v_3$. Using the value $v_3$, the character
-$\textit{inj} \; r_2 \; c \; v_3$ gives us
+$c$, and the regular expression $r_2$, we can recover how $r_2$ matched
+the string $[c]$: $\textit{inj} \; r_2 \; c \; v_3$ gives us
 \begin{center}
 $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$
 \end{center}
 which tells us how $r_2$ matched $[c]$. After this we inject back the character $b$, and get
 \begin{center}
 \item[1)] just $a$ or
 \item[2)] string $ab$ or
 \item[3)] string $abc$.
 \end{enumerate}
 \end{center}
-In order to differentiate between these choices,
-we just need to remember their positions--$a$ is on the left, $ab$ is in the middle , and $abc$ is on the right.
+\noindent
-Which one of these alternatives is chosen later does not affect their relative position because our algorithm
+In order to differentiate between these choices, we just need to
-does not change this order. If this parsing information can be determined and does
+remember their positions---$a$ is on the left, $ab$ is in the middle,
-not change because of later derivatives,
+and $abc$ is on the right. Which one of these alternatives is chosen
-there is no point in traversing this information twice. This leads to an optimisation---if we store the information for parse trees inside the regular expression, update it when we do derivative on them, and collect the information when finished with derivatives and call $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing the rest $n$ injections. This leads to Sulzmann and Lu's novel idea of using bit-codes in derivatives.
+later does not affect their relative position because our algorithm does
+not change this order. If this parsing information can be determined and
+does not change because of later derivatives, there is no point in
+traversing this information twice. This leads to an optimisation---if we
+store the information for parse trees inside the regular expression,
+update it whenever we take a derivative, and collect the information
+once all derivatives have been taken by calling $mkeps$ to decide which
+branch is POSIX, we can generate the parse tree in one pass, instead of
+performing the remaining $n$ injections. This leads to Sulzmann and Lu's novel
+idea of using bit-codes in derivatives.
 In the next section, we shall focus on the bit-coded algorithm and the
 process of simplification of regular expressions. This is needed in
 order to obtain \emph{fast} versions of Brzozowski's, and Sulzmann
 and Lu's algorithms.  This is where the PhD-project aims to advance the
 state-of-the-art.
 \section{Simplification of Regular Expressions}
+Using bitcodes to guide  parsing is not a novel idea. It was applied to
-Using bit-codes to guide  parsing is not a novel idea. It was applied to
 context-free grammars and then adapted by Henglein and Nielsen for
-efficient regular expression parsing using DFAs \cite{nielson11bcre}. Sulzmann and
+efficient regular expression parsing using DFAs~\cite{nielson11bcre}.
-Lu took a step further by integrating bitcodes into derivatives.
+Sulzmann and Lu took this idea of bitcodes a step further by integrating
-The reason why we want to use bitcodes in
+bitcodes into derivatives. The reason why we want to use bitcodes in
-this project is that we want to introduce aggressive
+this project is that we want to introduce more aggressive
-simplifications.
+simplifications in order to keep the size of derivatives small
+throughout. This is because the main drawback of building successive
-The main drawback of building
+derivatives according to Brzozowski's definition is that they can grow
-successive derivatives according to Brzozowski's definition is that they
+very quickly in size. This is mainly due to the fact that the derivative
-can grow very quickly in size. This is mainly due to the fact that the
+operation often generates ``useless'' $\ZERO$s and $\ONE$s in
-derivative operation generates often ``useless'' $\ZERO$s and $\ONE$s in
 derivatives.  As a result, if implemented naively both algorithms by
-Brzozowski and by Sulzmann and Lu are excruciatingly slow.
+Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example
-For example when starting with the regular expression $(a + aa)^*$ and building 12
+when starting with the regular expression $(a + aa)^*$ and building 12
 successive derivatives w.r.t.~the character $a$, one obtains a
 derivative regular expression with more than 8000 nodes (when viewed as
 a tree). Operations like derivative and $\nullable$ need to traverse
 such trees and consequently the bigger the size of the derivative the
 slower the algorithm.
-Fortunately, one can simplify regular expressions
-after each derivative step. Various simplifications of regular
+Fortunately, one can simplify regular expressions after each derivative
-expressions are possible, such as the simplifications of $\ZERO + r$, $r
+step. Various simplifications of regular expressions are possible, such
-+ \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just $r$. These
+as the simplifications of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
-simplifications do not affect the answer for whether a regular
+\cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
-expression matches a string or not, but fortunately also do not affect
+affect the answer for whether a regular expression matches a string or
-the POSIX strategy of how regular expressions match strings---although
+not, but fortunately also do not affect the POSIX strategy of how
-the latter is much harder to establish.
+regular expressions match strings---although the latter is much harder
-The argument for complicating the data structures from basic regular
+to establish. \comment{Does not make sense.} The argument for
-expressions to those with bitcodes is that we can introduce
+complicating the data structures from basic regular expressions to those
-simplification without making the algorithm crash or overly complex to
+with bitcodes is that we can introduce simplification without making the
-reason about. The latter is crucial for a correctness proof.
+algorithm crash or overly complex to reason about. The latter is crucial
-Some initial results in this
+for a correctness proof. Some initial results in this regard have been
-regard have been obtained in \cite{AusafDyckhoffUrban2016}. However,
+obtained in \cite{AusafDyckhoffUrban2016}.
-what has not been achieved yet is correctness for the bit-coded algorithm
-that involves simplifications and a very tight bound for the size. Such
+Unfortunately, the simplification rules outlined above  are not
-a tight bound is suggested by work of Antimirov who proved that
+sufficient to prevent an explosion for all regular expressions. We
-(partial) derivatives can be bound by the number of characters contained
+believe a tighter bound can be achieved that prevents an explosion in
-in the initial regular expression \cite{Antimirov95}.
+all cases. Such a tighter bound is suggested by work of Antimirov who
+proved that (partial) derivatives can be bounded by the number of
-Antimirov defined the \emph{partial derivatives} of regular expressions to be this:
+characters contained in the initial regular expression
-%TODO definition of partial derivatives
+\cite{Antimirov95}. He defined the \emph{partial derivatives} of regular
+expressions as follows:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{pder} \; c \; 0$ & $\dn$ & $\emptyset$\\
 $\textit{pder} \; c \; 1$ & $\dn$ & $\emptyset$ \\
 $\textit{pder} \; c \; d$ & $\dn$ & $\textit{if} \; c \,=\, d \; \{  1   \}  \; \textit{else} \; \emptyset$ \\
 $\textit{pder} \; c \; r_1+r_2$ & $\dn$ & $pder \; c \; r_1 \cup pder \; c \;  r_2$ \\
-$\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; nullable \; r_1 \; \{  r \cdot r_2 \mid r \in pder \; c \; r_1   \}  \cup pder \; c \; r_2 \; \textit{else} \; \{  r \cdot r_2 \mid r \in pder \; c \; r_1   \} $ \\
+$\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; nullable \; r_1 $\\
+& & $\textit{then} \; \{  r \cdot r_2 \mid r \in pder \; c \; r_1   \}  \cup pder \; c \; r_2 \;$\\
+& & $\textit{else} \; \{  r \cdot r_2 \mid r \in pder \; c \; r_1   \} $ \\
 $\textit{pder} \; c \; r^*$ & $\dn$ & $ \{  r' \cdot r^* \mid r' \in pder \; c \; r   \}  $ \\
 \end{tabular}
 \end{center}
-A partial derivative of a regular expression $r$ is essentially a set of regular expressions that are either $r$'s children expressions or a concatenation of them.
-Antimirov has proved a nice size bound of the size of partial derivatives. Roughly speaking the size will not exceed the fourth power of the number of nodes in that regular expression.  Interestingly, we observed from experiment that after the simplification step, our regular expression has the same size or is smaller than the partial derivatives. This allows us to prove a tight bound on the size of regular expression during the running time of the algorithm if we can establish the connection between our simplification rules and partial derivatives.
+\noindent
+A partial derivative of a regular expression $r$ is essentially a set of
+regular expressions that are either subexpressions of $r$ or
+concatenations of them. Antimirov has proved a tight bound on the size of
+partial derivatives. \comment{That looks too preliminary to me.} Roughly
+speaking the size will not exceed the fourth power of the number of
+nodes in that regular expression. \comment{Improve: which
+simplifications?}Interestingly, we observed from experiment that after
+the simplification step, our regular expression has the same size or is
+smaller than the partial derivatives. This would allow us to prove a tight
+bound on the size of regular expressions during the run of the
+algorithm, provided we can establish the connection between our simplification
+rules and partial derivatives.
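+\noindent
+To make the definition concrete, here is a small worked instance (our
+own illustration, reusing the regular expression $r = (a + aa)^*$ from
+above and writing $aa$ as $a \cdot a$). Since $a$ is not nullable, the
+sequence clause takes its $\textit{else}$ branch:
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{pder} \; a \; a$ & $=$ & $\{ 1 \}$\\
+$\textit{pder} \; a \; (a \cdot a)$ & $=$ & $\{ 1 \cdot a \}$\\
+$\textit{pder} \; a \; (a + a \cdot a)$ & $=$ & $\{ 1, \; 1 \cdot a \}$\\
+$\textit{pder} \; a \; r$ & $=$ & $\{ 1 \cdot r, \; (1 \cdot a) \cdot r \}$
+\end{tabular}
+\end{center}
+\noindent
+The result is a set of two regular expressions, each a concatenation of
+a small residue with $r$ itself.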
 %We believe, and have generated test
 %data, that a similar bound can be obtained for the derivatives in
 %Sulzmann and Lu's algorithm. Let us give some details about this next.
 $\textit{code}(\Stars\,[])$ & $\dn$ & $[\Z]$\\
 $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\S :: code(v) \;@\;
 code(\Stars\,vs)$
 \end{tabular}
 \end{center}
-Here code encodes a value into a bit-sequence by converting Left into $\Z$, Right into $\S$, the start point of a non-empty star iteration into $\S$, and the border where a local star terminates into $\Z$. This conversion is apparently lossy, as it throws away the character information, and does not decode the boundary between the two operands of the sequence constructor. Moreover, with only the bitcode we cannot even tell whether the $\S$s and $\Z$s are for $\Left/\Right$ or $\Stars$. The reason for choosing this compact way of storing information is that the relatively small size of bits can be easily moved around. In order to recover the bitcode back into values, we will need the regular expression as the extra information and decode it back into value:\\
+Here $\textit{code}$ encodes a value into a bit-sequence by converting $\Left$ into
+$\Z$, $\Right$ into $\S$, the start point of a non-empty star iteration
+into $\S$, and the border where a local star terminates into $\Z$. This
+conversion is lossy, as it throws away the character
+information, and does not encode the boundary between the two operands
+of the sequence constructor. Moreover, with only the bitcode we cannot
+even tell whether the $\S$s and $\Z$s are for $\Left/\Right$ or
+$\Stars$. The reason for choosing this compact way of storing
+information is that the relatively small bit-sequences can be easily
+moved around. In order to recover values from the bitcode, we will
+need the regular expression as extra information and decode the bits back
+into a value:\\
 %\begin{definition}[Bitdecoding of Values]\mbox{}
 \begin{center}
 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
 \end{tabular}
 \end{center}
 %\end{definition}
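+\noindent
+As a small illustration of the encoding (the value is our own example;
+we only use the two clauses for $\Stars$ shown above and the fact that
+$\Left$ is encoded as $\Z$ while characters contribute no bits):
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{code}(\Stars\,[\Left(\Char\,a)])$ & $=$ & $\S :: \textit{code}(\Left(\Char\,a)) \;@\; \textit{code}(\Stars\,[])$\\
+& $=$ & $\S :: [\Z] \;@\; [\Z]$\\
+& $=$ & $[\S, \Z, \Z]$
+\end{tabular}
+\end{center}
+\noindent
+The first $\Z$ stems from $\Left$ and the second from the end of the
+star; only the regular expression tells the two apart.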
-Sulzmann and Lu's integrated the bitcodes into annotated regular expressions by attaching them to the head of every substructure of a regular expression\cite{Sulzmann2014}. They are
+Sulzmann and Lu integrated the bitcodes into annotated regular
-defined by the following grammar:
+expressions by attaching them to the head of every substructure of a
+regular expression~\cite{Sulzmann2014}. They are defined by the following
+grammar:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
 & $\mid$ & $\textit{ONE}\;\;bs$\\
 where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a
 list of annotated regular expressions. These bitsequences encode
 information about the (POSIX) value that should be generated by the
 Sulzmann and Lu algorithm.
-To do lexing using annotated regular expressions, we shall first transform the
+To do lexing using annotated regular expressions, we shall first
-usual (un-annotated) regular expressions into annotated regular
+transform the usual (un-annotated) regular expressions into annotated
-expressions:\\
+regular expressions. This operation is called \emph{internalisation} and
+defined as follows:
 %\begin{definition}
 \begin{center}
 \begin{tabular}{lcl}
 $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
 $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
 $\textit{STAR}\;[]\,r^\uparrow$\\
 \end{tabular}
 \end{center}
 %\end{definition}
-Here $fuse$ is an auxiliary  function that helps to attach bits to the front of an annotated regular expression. Its definition goes as follows:
+\noindent
+In the fourth clause, $fuse$ is an auxiliary function that helps to attach bits to the
+front of an annotated regular expression. Its definition is as follows:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{fuse}\,bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
 $\textit{fuse}\,bs\,(\textit{ONE}\,bs')$ & $\dn$ &
 $\textit{ONE}\,(bs\,@\,bs')$\\
 $\textit{fuse}\,bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ &
 $\textit{STAR}\,(bs\,@\,bs')\,a$
 \end{tabular}
 \end{center}
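+\noindent
+For instance (a worked example of ours, using only the clauses above):
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{fuse}\,[\Z]\,(\textit{ONE}\,[\S])$ & $=$ & $\textit{ONE}\,([\Z]\,@\,[\S]) \;=\; \textit{ONE}\,[\Z,\S]$\\
+$\textit{fuse}\,[\Z]\,(\textit{STAR}\,[]\,a)$ & $=$ & $\textit{STAR}\,([\Z]\,@\,[])\,a \;=\; \textit{STAR}\,[\Z]\,a$\\
+$\textit{fuse}\,[\Z]\,(\textit{ZERO})$ & $=$ & $\textit{ZERO}$
+\end{tabular}
+\end{center}
+\noindent
+The new bits are always placed in front of the bits already present,
+and $\textit{ZERO}$ carries no bits at all.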
-After internalise we do successive derivative operations on the annotated regular expression.
+\noindent
-This derivative operation is the same as what we previously have for the simple regular expressions, except that we take special care of the bits :\\
+After internalisation we perform successive derivative operations on the
-%\begin{definition}{bder}
+annotated regular expression. This derivative operation is the same as
+the one we defined previously for simple regular expressions, except that
+we take special care of the bits:\\
+%\begin{definition}{bder}
 \begin{center}
 \begin{tabular}{@{}lcl@{}}
 $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$\\
 $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\
 $(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ &
 $\textit{bmkeps}\,(\textit{STAR}\,bs\,a)$ & $\dn$ &
 $bs \,@\, [\S]$
 \end{tabular}
 \end{center}
 %\end{definition}
+\noindent
 This function completes the parse tree information by
 travelling along the path on the regular expression that corresponds to a POSIX value and collecting all the bits,
 using $\S$ to indicate the end of star iterations. If we take the bits produced by $bmkeps$ and decode them,
 we get the parse tree we need. The workflow looks like this:\\
 \begin{center}
 flatten and distinct to open up nested ALT and reduce as many duplicates as possible.
 Function distinct keeps only the first occurring copy and removes all later duplicates.
 Function flatten opens up nested ALT. Its recursive definition is given below:
 \begin{center}
 \begin{tabular}{@{}lcl@{}}
-$\textit{flatten} \; (\textit{ALT}\;bs\,as) :: as'$ & $\dn$ & $(\textit{ map fuse}( \textit{bs, \_} )  \textit{ as}) \; +\!+ \; \textit{flatten} \; as' $ \\
+$\textit{flatten} \; (\textit{ALT}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \;
+(\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
 $\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $ \textit{flatten} \;  as' $ \\
-$\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as' $
+$\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise)
 \end{tabular}
 \end{center}
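+\noindent
+As a worked example (our own, assuming the base case
+$\textit{flatten}\;[] = []$, which is not listed above, and assuming
+$a_3$ is neither $\textit{ZERO}$ nor an $\textit{ALT}$):
+\begin{center}
+\begin{tabular}{@{}lcl@{}}
+$\textit{flatten} \; [\textit{ZERO},\; \textit{ALT}\;bs\,[a_1, a_2],\; a_3]$ & $=$ & $\textit{flatten} \; [\textit{ALT}\;bs\,[a_1, a_2],\; a_3]$\\
+& $=$ & $(\textit{map} \; (\textit{fuse}\;bs)\; [a_1, a_2]) \; @ \; \textit{flatten} \; [a_3]$\\
+& $=$ & $[\textit{fuse}\;bs\;a_1,\; \textit{fuse}\;bs\;a_2,\; a_3]$
+\end{tabular}
+\end{center}
+\noindent
+The $\textit{ZERO}$ is dropped and the bits $bs$ of the inner
+$\textit{ALT}$ are pushed onto its children, so no bit information is lost.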
-Here flatten behaves like the traditional functional programming flatten function,
+\noindent
+\comment{No: functional flatten  does not remove ZEROs}Here flatten behaves like the traditional functional programming flatten function,
 in that it basically removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
 Suppose we apply simplification after each derivative step,
 and view these two operations as an atomic one: $a \backslash_{simp} c \dn \textit{simp}(a \backslash c)$.
 Then we can use the previous natural extension from derivatives w.r.t.~a character to derivatives w.r.t.~a string:
 This algorithm effectively keeps the regular expression size small. For example,
 with this simplification the 8000 nodes of our previous $(a + aa)^*$ example are reduced to only 6, and the size stays constant however long the input string is.
-We are currently engaged in 2 tasks related to this algorithm.
+\section{Current Work}
+We are currently engaged in two tasks related to this algorithm.
+\begin{itemize}
+\item
 The first one is proving that our simplification rules
 actually do not affect the POSIX value that should be generated by the
 algorithm according to the specification of a POSIX value
 and furthermore obtaining a much
 tighter bound on the sizes of derivatives. The result is that our
 have to work out all proof details.''
 \end{quote}
 \noindent
 We would like to settle this correctness claim.
-It is relatively straightforward to establish that after 1 simplification step, the part of derivative that corresponds to a POSIX value remains intact and can still be collected, in other words,
+It is relatively straightforward to establish that after one simplification step, the part of derivative that corresponds to a POSIX value remains intact and can still be collected, in other words,
-bmkeps r = bmkeps simp r
+\comment{that  only holds when r is nullable}$\textit{bmkeps} \; r = \textit{bmkeps} \; (\textit{simp} \; r)$
 as this basically comes down to proving that actions like removing the additional $r$ in $r+r$ do not delete important POSIX information in a regular expression.
 The hard core of this problem is to prove that
 $\textit{bmkeps} \; (\textit{bders} \; r) = \textit{bmkeps} \; (\textit{bders} \; (\textit{simp} \; r))$.
 That is, if we take derivatives of the regular expression $r$ and of its simplified version, they still produce the same POSIX value if there is one. This is not as straightforward as the previous proposition, as the two regular expressions $r$ and $\textit{simp} \; r$ might become very different regular expressions after repeated application of $\textit{simp}$ and derivatives.
-The crucial point is to find the "gene" of a regular expression and how it is kept intact during simplification.
+The crucial point is to find the \comment{What?}``gene'' of a regular expression and how it is kept intact during simplification.
-To aid this, we are utilizing the helping function retrieve described by Sulzmann and Lu:
+To aid this, we use the helper function $\textit{retrieve}$ described by Sulzmann and Lu:
 \\definition of retrieve\\
 This function assembles the bitcode that corresponds to a parse tree for how the current derivative matches the suffix of the string (the characters that have not yet appeared, but are stored in the value).
 Sulzmann and Lu used this to connect the bit-coded algorithm to the older algorithm by the following equation:\\
 $\textit{inj} \;r\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash_{simp} c) \; v)$\\
 A little fact that needs to be stated to help comprehension:\\
 $\textit{retrieve} \; r   \backslash  s \; v = \;\textit{retrieve} \; \textit{simp}(r)  \backslash  s \; v'$\\
 and subsequently\\
 $\textit{retrieve} \; r \backslash  s \; v\; = \; \textit{retrieve} \; r  \backslash_{simp}   s \; v'$.\\
 This proves that our simplified version of regular expression still contains all the bitcodes needed.
-The second task is to speed up the more aggressive simplification. Currently it is slower than a naive simplification(the naive version as implemented in ADU of course can explode in some cases).
+\item
-So it needs to be explored how to make it faster. Our possibility would be to explore again the connection to DFAs. This is very much work in progress.
+The second task is to speed up the more aggressive simplification.
+Currently it is slower than a naive simplification (the naive version as
+implemented in \cite{AusafDyckhoffUrban2016} can of course explode in some cases). So it needs to
+be explored how to make it faster. One possibility would be to explore
+again the connection to DFAs. This is very much work in progress.
+\end{itemize}
 \section{Conclusion}
 In this PhD-project we are interested in fast algorithms for regular
 expression matching. While this seems to be a ``settled'' area, in

changeset 70	cab5eab1f6f1
parent 69	4c7173b7ddca
child 71	0573615e41a3