ninems/ninems.tex
changeset 58 f0360e17080e
parent 57 8ea861a19d70
child 59 8ff7b7508824
Brzozowski introduced in 1964 a beautifully simple algorithm for
regular expression matching based on the notion of derivatives of
regular expressions. In 2014, Sulzmann and Lu extended this
algorithm to not just give a YES/NO answer for whether or not a
regular expression matches a string, but, in case it matches, also
answers with \emph{how} it matches the string. This is important for
applications such as lexing (tokenising a string). The problem is to
make the algorithm by Sulzmann and Lu fast on all inputs without
breaking its correctness. We have already developed some
simplification rules for this, but have not proved yet that they
preserve the correctness of the algorithm. We also have not yet

lexing. Given the maturity of this topic, the reader might wonder:
Surely, regular expressions must have already been studied to death?
What could possibly be \emph{not} known in this area? And surely all
implemented algorithms for regular expression matching are blindingly
fast?

Unfortunately, these preconceptions are not supported by evidence: Take
for example the regular expression $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously they do not match---the expected $b$ in the last

\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
           of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
\end{center}

\noindent These are clearly abysmal and possibly surprising results. One
would expect these systems to do much better than that---after all,
given a DFA and a string, deciding whether the string is matched by this
DFA should take time linear in the length of the string.

Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
exhibit this ``exponential behaviour''. Unfortunately, such regular
expressions are not just a few ``outliers''; they are frequent enough
that a separate name has been created for them---\emph{evil regular
expressions}. In empirical work, Davis et al.\ report that they have
found thousands of such evil regular expressions in the JavaScript and
Python ecosystems \cite{Davis18}.

This exponential blowup in matching algorithms sometimes causes
considerable grief in real life: for example on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources, and because of this the web servers
ground to a halt. This happened when a post with 20,000 white spaces was
submitted, but importantly the white spaces were neither at the
beginning nor at the end. As a result, the regular expression matching
engine needed to backtrack over many choices. The underlying problem is
that many ``real life'' regular expression matching engines do not use
DFAs for matching. This is because they support regular expressions that
are not covered by classical automata theory, and in this more general
setting there are quite a few research questions still unanswered and
fast algorithms still need to be developed (for example how to include
bounded repetitions, negation and back-references).

There is also another under-researched problem to do with regular
expressions and lexing, i.e.~the process of breaking up strings into
sequences of tokens according to some regular expressions. In this
setting one is not just interested in whether or not a regular

Our focus is on an algorithm introduced by Sulzmann and Lu in 2014 for
regular expression matching according to the POSIX strategy
\cite{Sulzmann2014}. Their algorithm is based on an older algorithm by
Brzozowski from 1964 where he introduced the notion of derivatives of
regular expressions \cite{Brzozowski1964}. We shall briefly explain
this algorithm next.

\section{The Algorithm by Brzozowski based on Derivatives of Regular
Expressions}

Suppose (basic) regular expressions are given by the following grammar:

The intended meaning of the constructors is as follows: $\ZERO$
cannot match any string, $\ONE$ can match the empty string, the
character regular expression $c$ can match the character $c$, and so
on.

The ingenious contribution by Brzozowski is the notion of
\emph{derivatives} of regular expressions. The idea behind this
notion is as follows: suppose a regular expression $r$ can match a
string of the form $c\!::\! s$ (that is a list of characters starting
with $c$), what does the regular expression look like that can match
just $s$? Brzozowski gave a neat answer to this question. He started

\[
\textit{match}\;s\;r \;\dn\; \textit{nullable}(r\backslash s)
\]

\noindent
This algorithm can be illustrated graphically as follows:
\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}
\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
shown relatively easily that this matcher is correct (that is, given
an $s$ and an $r$, it generates YES if and only if $s \in L(r)$).
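
To make these definitions concrete, here is a minimal sketch of this
matcher in Scala. It is only meant as an illustration, not as the
implementation used in this project: the constructor names
(\texttt{ZERO}, \texttt{ONE}, \texttt{CHAR}, \texttt{ALT}, \texttt{SEQ},
\texttt{STAR}) mirror the regular expression constructors described
above, while the function names are chosen here purely for readability.

\begin{verbatim}
// Regular expressions, derivatives and the Brzozowski matcher.
abstract class Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches the empty string
case class CHAR(c: Char) extends Rexp            // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // alternative  r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // sequence     r1 . r2
case class STAR(r: Rexp) extends Rexp            // star         r*

// can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// derivative of r with respect to the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// derivative with respect to a whole string, and the matcher
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matches(r: Rexp, s: List[Char]): Boolean = nullable(ders(s, r))

// e.g. matches(SEQ(STAR(STAR(CHAR('a'))), CHAR('b')), "ab".toList) == true
\end{verbatim}

\noindent
Note how the clause for \texttt{SEQ} reflects that, when $r_1$ is
nullable, the character $c$ may be consumed either by $r_1$ or already
by $r_2$.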

\section{Values and the Algorithm by Sulzmann and Lu}

One limitation, however, of Brzozowski's algorithm is that it only

		\end{tabular}
	\end{tabular}
\end{center}

\noindent
There is no value corresponding to $\ZERO$; $\Empty$ corresponds to
$\ONE$; $\Seq$ corresponds to the sequence regular expression and so on.
The idea of values is to encode parse trees. To see this, suppose there
is a \emph{flatten} operation, written $|v|$, which we can use to extract
the underlying string of a value $v$. For example,
$|\Seq \, (\textit{Char}\;x) \, (\textit{Char}\;y)|$ is the string $xy$.
We omit the straightforward definition of flatten. Using flatten, we can
describe how values encode parse trees: $\Seq\,v_1\, v_2$ encodes how the
string $|v_1| @ |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches
the substring $|v_1|$ and, respectively, $r_2$ matches the substring
$|v_2|$. Exactly how these two are matched is contained in the
sub-structure of $v_1$ and $v_2$.

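The following is a small Scala sketch of values and the flatten
operation, continuing the illustration from the previous section. The
constructors for the alternative and star cases (\texttt{Left},
\texttt{Right}, \texttt{Stars}) follow the standard presentation of
Sulzmann and Lu; the names \texttt{Chr} and \texttt{Sequ} are only
renamed here to avoid clashes with Scala's built-in \texttt{Char} and
the regular expression constructor \texttt{SEQ}.

\begin{verbatim}
// Values encode parse trees: one constructor per way a regex can match.
abstract class Val
case object Empty extends Val                    // for ONE
case class Chr(c: Char) extends Val              // for CHAR(c)
case class Sequ(v1: Val, v2: Val) extends Val    // for SEQ
case class Left(v: Val) extends Val              // left branch of ALT
case class Right(v: Val) extends Val             // right branch of ALT
case class Stars(vs: List[Val]) extends Val      // iterations of STAR

// the underlying string a value stands for
def flatten(v: Val): String = v match {
  case Empty => ""
  case Chr(c) => c.toString
  case Sequ(v1, v2) => flatten(v1) + flatten(v2)
  case Left(v1) => flatten(v1)
  case Right(v1) => flatten(v1)
  case Stars(vs) => vs.map(flatten).mkString
}

// e.g. flatten(Sequ(Chr('x'), Chr('y'))) == "xy"
\end{verbatim}
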
To give a concrete example of how values work, consider the string $xy$
and the regular expression $(x + (y + xy))^*$. We can view this regular
expression as a tree and if the string $xy$ is matched by two Star
``iterations'', then the $x$ is matched by the left-most alternative in
this tree and the $y$ by the left alternative of the right branch (that
is, the $y$ in $y + xy$). This suggests recording this matching as
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
\end{tikzcd}
\end{center}

\noindent
For convenience, we shall use the following notation: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_n$. First, we build the derivatives
$r_1$, $r_2$, \ldots according to the characters $c_0$, $c_1$, \ldots
until we exhaust the string and arrive at the derivative $r_n$. We test
whether this derivative is $\textit{nullable}$ or not. If not, we know
the string does not match $r_0$ and no value needs to be generated. If
yes, we start building the parse tree incrementally by \emph{injecting}
back the characters into the values $v_n, \ldots, v_0$. For this we
first call the function $\textit{mkeps}$, which builds the parse tree
for how the empty string has matched the (nullable) regular expression
$r_n$. This function is defined as

	\begin{center}
		\begin{tabular}{lcl}
			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\

the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_n$) from the previous parse tree $v_{i+1}$. After
injecting back $n$ characters, we get the parse tree for how $r_0$
matches $s$, exactly as we wanted. A correctness proof by induction can
be established routinely.

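Continuing the Scala sketch (and assuming the \texttt{Rexp} and
\texttt{Val} definitions from above), $\textit{mkeps}$ and the
injection-based lexer can be written roughly as follows. The clauses are
the ones commonly given in this line of work (Sulzmann and Lu; Ausaf,
Dyckhoff and Urban) and match the derivative clauses of the earlier
sketch; the precise definitions in our own development may differ in
presentation.

\begin{verbatim}
// parse tree for how a nullable regex matches the empty string
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}

// inject the character c back into a value for der(c, r),
// obtaining a value for r itself
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(_), Empty) => Chr(c)
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
}

// the lexer: derivatives on the way in, injections on the way out
def lex(r: Rexp, s: List[Char]): Val = s match {
  case Nil =>
    if (nullable(r)) mkeps(r) else sys.error("string not matched")
  case c :: cs => inj(r, c, lex(der(c, r), cs))
}
\end{verbatim}

\noindent
The last clause of \texttt{lex} is the diagram above read from right to
left: a value for the derivative \texttt{der(c, r)} is turned into a
value for \texttt{r} by injecting \texttt{c} back.
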
It is instructive to see how this algorithm works by a little example.
Suppose we have the regular expression $(a+b+ab+c+abc)^*$ and we want to
match it against the string $abc$ (when $abc$ is written as a regular
expression, the standard way of expressing it is $a \cdot (b \cdot c)$;
we omit the parentheses and dots here for readability). By the POSIX
rules the lexer should go for the longest match, i.e.~it should match
the string $abc$ in one star iteration, using the longest alternative
$abc$ in the sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this
sub-expression for conciseness). Here is how the lexer achieves a parse
tree for this matching. First, we build successive derivatives until we
exhaust the string, as illustrated here (we simplify some regular
expressions like $0 \cdot b$ to $0$ for conciseness; similarly, we allow
$\textit{ALT}$ to take a list of regular expressions as an argument
instead of just two operands, to reduce the nesting depth of
$\textit{ALT}$):

\[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^* \xrightarrow{\backslash b}\]
\[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* + (0+1+0+0+0) \cdot r^* \xrightarrow{\backslash c}\]
\[r_3 = ((0+0+0+0+1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0+1+0) \cdot r^*) + ((0+0+0+0+0) \cdot r^* + (0+0+0+1+0) \cdot r^*)
\]
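
As a quick sanity check of these derivative steps, the toy matcher and
lexer sketched earlier can confirm that this regular expression indeed
matches $abc$ (using nested binary \texttt{ALT}s instead of the
list-version of $\textit{ALT}$ used above):

\begin{verbatim}
// the sub-expression r = a + b + ab + c + abc, with binary ALTs
val r: Rexp =
  ALT(CHAR('a'), ALT(CHAR('b'), ALT(SEQ(CHAR('a'), CHAR('b')),
    ALT(CHAR('c'), SEQ(CHAR('a'), SEQ(CHAR('b'), CHAR('c')))))))

// nullable(((STAR(r) \ a) \ b) \ c) should be true
assert(matches(STAR(r), "abc".toList))

// the lexer returns a value that flattens back to "abc"
assert(flatten(lex(STAR(r), "abc".toList)) == "abc")
\end{verbatim}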

\noindent
We would like to settle the correctness claim. It is relatively
straightforward to establish that after one simplification step, the
part of the derivative that corresponds to a POSIX value remains intact
and can still be collected; in other words,
\[
\textit{bmkeps}\; r \;=\; \textit{bmkeps}\; (\textit{simp}\; r)
\]
as this basically comes down to proving that actions like removing the
additional $r$ in $r+r$ do not delete important POSIX information from a
regular expression. The hard core of this problem is to prove that
\[
\textit{bmkeps}\; (r\backslash s) \;=\; \textit{bmkeps}\; ((\textit{simp}\; r)\backslash s)
\]
That is, if we build the derivatives of a regular expression $r$ and of
its simplified version $\textit{simp}(r)$ with respect to a string $s$,
they still provide the same POSIX value if there is one. This is not as
straightforward as the previous proposition, as the two regular
expressions $r$ and $\textit{simp}(r)$ might become very different
regular expressions after repeated applications of $\textit{simp}$ and
the derivative. The crucial point is to find the ``gene'' of a regular
expression and how it is kept intact during simplification. To aid this,
we are using the helper function $\textit{retrieve}$ described by
Sulzmann and Lu:
\\definition of retrieve\\
This function assembles the bitcode that corresponds to a parse tree for
how the current derivative matches the suffix of the string (the
characters that have not yet appeared, but are stored in the value).
Sulzmann and Lu used this to connect the bit-coded algorithm to the
older algorithm by the following equation:
\[
\textit{inj}\;a\;c\;v = \textit{decode}\;(\textit{retrieve}\;((\textit{internalise}\; r)\backslash_{simp} c)\;v)
\]
A little fact that needs to be stated to aid comprehension:
$r^\uparrow = a$ (here $a$ stands for the \emph{annotated} version of
$r$). Fahad and Christian also used this fact to prove the correctness
of the bit-coded algorithm without simplification. Our purpose of using
it, however, is to try to establish
\[
\textit{retrieve} \; a \; v \;=\; \textit{retrieve} \; (\textit{simp}(a)) \; v'.
\]
The idea is that, using $v'$, a simplified version of $v$ that has
possibly gone through the same simplification steps as
$\textit{simp}(a)$, we are still able to extract the bit-sequence that
gives the same parsing information as the unsimplified one. After
establishing this, we might be able to finally bridge the gap of proving
\[
\textit{retrieve} \; (r\backslash s) \; v \;=\; \textit{retrieve} \; ((\textit{simp}(r))\backslash s) \; v'
\]
and subsequently
\[
\textit{retrieve} \; (r\backslash s) \; v \;=\; \textit{retrieve} \; (r\backslash_{simp} s) \; v'.
\]
This would prove that our simplified version of the regular expression
still contains all the bitcodes needed.

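To illustrate, on the level of plain (unannotated) regular expressions,
what such a simplification has to preserve, here is a toy \texttt{simp}
for the \texttt{Rexp} sketch from earlier. It only applies the rules
$r+r \Rightarrow r$, $\ZERO + r \Rightarrow r$, $\ZERO \cdot r \Rightarrow
\ZERO$ and $\ONE \cdot r \Rightarrow r$ (and their mirrored variants),
and it is emphatically \emph{not} the bit-coded $\textit{simp}$ discussed
in this section; it merely lets one test that interleaving simplification
with derivatives does not change the matching outcome.

\begin{verbatim}
// Toy simplification on plain Rexps (not the bit-coded simp of the text):
// it must never change the language of the regular expression.
def simp(r: Rexp): Rexp = r match {
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, s2) => s2
    case (s1, ZERO) => s1
    case (s1, s2) => if (s1 == s2) s1 else ALT(s1, s2)
  }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _) => ZERO
    case (_, ZERO) => ZERO
    case (ONE, s2) => s2
    case (s1, ONE) => s1
    case (s1, s2) => SEQ(s1, s2)
  }
  case r1 => r1
}

// derivatives with simplification after each step
def dersSimp(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => dersSimp(cs, simp(der(c, r)))
}

// matching outcome is unchanged, e.g. for the evil expression (a*)*b
val evil = SEQ(STAR(STAR(CHAR('a'))), CHAR('b'))
val input = List.fill(10)('a') ++ List('b')
assert(nullable(ders(input, evil)) == nullable(dersSimp(input, evil)))
\end{verbatim}

\noindent
The real difficulty, as explained above, is the analogous property for
the bit-coded version, where the bitcodes collected by
$\textit{bmkeps}$ must also survive the simplification.
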
The second task is to speed up the more aggressive simplification.
Currently it is slower than a naive simplification (the naive version as
implemented in ADU can of course explode in some cases). So it needs to
be explored how to make it faster. One possibility would be to explore
again the connection to DFAs. This is very much work in progress.

\section{Conclusion}

In this PhD-project we are interested in fast algorithms for regular