ninems/ninems.tex
changeset 63 d3c22f809dde
parent 62 5b10d83b0834
child 64 afd0d702a4fe
equal deleted inserted replaced
62:5b10d83b0834 63:d3c22f809dde
     1 \documentclass[a4paper,UKenglish]{lipics}
     1 \documentclass[a4paper,UKenglish]{lipics}
     2 \usepackage{graphic}
     2 \usepackage{graphic}
     3 \usepackage{data}
     3 \usepackage{data}
     4 \usepackage{tikz-cd}
     4 \usepackage{tikz-cd}
     5 \usepackage{algorithm}
     5 %\usepackage{algorithm}
     6 \usepackage{amsmath}
     6 \usepackage{amsmath}
     7 \usepackage[noend]{algpseudocode}
     7 \usepackage[noend]{algpseudocode}
     8 \usepackage{enumitem}
     8 \usepackage{enumitem}
       
     9 
       
    10 \definecolor{darkblue}{rgb}{0,0,0.6}
       
    11 \hypersetup{colorlinks=true,allcolors=darkblue}
     9 % \documentclass{article}
    12 % \documentclass{article}
    10 %\usepackage[utf8]{inputenc}
    13 %\usepackage[utf8]{inputenc}
    11 %\usepackage[english]{babel}
    14 %\usepackage[english]{babel}
    12 %\usepackage{listings}
    15 %\usepackage{listings}
    13 % \usepackage{amsthm}
    16 % \usepackage{amsthm}
    14 % \usepackage{hyperref}
    17 %\usepackage{hyperref}
    15 % \usepackage[margin=0.5in]{geometry}
    18 % \usepackage[margin=0.5in]{geometry}
    16 %\usepackage{pmboxdraw}
    19 %\usepackage{pmboxdraw}
    17  
    20  
    18 \title{POSIX Regular Expression Matching and Lexing}
    21 \title{POSIX Regular Expression Matching and Lexing}
    19 \author{Chengsong Tan}
    22 \author{Chengsong Tan}
   179 them---\emph{evil regular expressions}. In empiric work, Davis et al
   182 them---\emph{evil regular expressions}. In empiric work, Davis et al
   180 report that they have found thousands of such evil regular expressions
   183 report that they have found thousands of such evil regular expressions
   181 in the JavaScript and Python ecosystems \cite{Davis18}.
   184 in the JavaScript and Python ecosystems \cite{Davis18}.
   182 
   185 
   183 This exponential blowup in matching algorithms sometimes causes
   186 This exponential blowup in matching algorithms sometimes causes
   184 considerable disturbance in real life: for example on 20 July 2016 one evil
   187 considerable grief in real life: for example on 20 July 2016 one evil
   185 regular expression brought the webpage
   188 regular expression brought the webpage
   186 \href{http://stackexchange.com}{Stack Exchange} to its
   189 \href{http://stackexchange.com}{Stack Exchange} to its
   187 knees.\footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
   190 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
   188 In this instance, a regular expression intended to just trim white
   191 In this instance, a regular expression intended to just trim white
   189 spaces from the beginning and the end of a line actually consumed
   192 spaces from the beginning and the end of a line actually consumed
   190 massive amounts of CPU-resources and because of this the web servers
   193 massive amounts of CPU-resources and because of this the web servers
   191 ground to a halt. This happened when a post with 20,000 white spaces was
   194 ground to a halt. This happened when a post with 20,000 white spaces was
   192 submitted, but importantly the white spaces were neither at the
   195 submitted, but importantly the white spaces were neither at the
   195 that many ``real life'' regular expression matching engines do not use
   198 that many ``real life'' regular expression matching engines do not use
   196 DFAs for matching. This is because they support regular expressions that
   199 DFAs for matching. This is because they support regular expressions that
   197 are not covered by the classical automata theory, and in this more
   200 are not covered by the classical automata theory, and in this more
   198 general setting there are quite a few research questions still
   201 general setting there are quite a few research questions still
   199 unanswered and fast algorithms still need to be developed (for example
   202 unanswered and fast algorithms still need to be developed (for example
   200 how to include bounded repetitions, negation and  back-references).
   203 how to treat bounded repetitions, negation and  back-references
       
   204 efficiently).
   201 
   205 
   202 There is also another under-researched problem to do with regular
   206 There is also another under-researched problem to do with regular
   203 expressions and lexing, i.e.~the process of breaking up strings into
   207 expressions and lexing, i.e.~the process of breaking up strings into
   204 sequences of tokens according to some regular expressions. In this
   208 sequences of tokens according to some regular expressions. In this
   205 setting one is not just interested in whether or not a regular
   209 setting one is not just interested in whether or not a regular
   394 	\end{tabular}
   398 	\end{tabular}
   395 \end{center}
   399 \end{center}
   396 
   400 
   397 \noindent
   401 \noindent
   398 There is no value  corresponding to $\ZERO$; $\Empty$ corresponds to
   402 There is no value  corresponding to $\ZERO$; $\Empty$ corresponds to
   399 $\ONE$; $\Seq$ to the sequence regular expression and so on. The idea of
   403 $\ONE$; $\Char$ to the character regular expression; $\Seq$ to the
   400 values is to encode parse trees. To see this, suppose a \emph{flatten}
   404 sequence regular expression and so on. The idea of values is to encode
   401 operation, written $|v|$, which we can use to extract the underlying
   405 parse trees. To see this, suppose a \emph{flatten} operation, written
   402 string of a value $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \,
   406 $|v|$. We use this function to extract the underlying string of a value
   403 (\textit{Char y})|$ is the string $xy$. We omit the straightforward
   407 $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \, (\textit{Char
   404 definition of flatten. Using flatten, we can describe how values encode
   408 y})|$ is the string $xy$.  Using flatten, we can describe how values
   405 parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @ |v_2|$
   409 encode parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @
   406 matches the regex $r_1 \cdot r_2$: $r_1$ matches the substring $|v_1|$ and,
   410 |v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches the
   407 respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two are matched
   411 substring $|v_1|$ and, respectively, $r_2$ matches the substring
   408 is contained in the sub-structure of $v_1$ and $v_2$. 
   412 $|v_2|$. Exactly how these two are matched is contained in the
       
   413 sub-structure of $v_1$ and $v_2$. 
   409 
   414 
   410  To give a concrete example of how value works, consider the string $xy$
   415  To give a concrete example of how value works, consider the string $xy$
   411 and the regular expression $(x + (y + xy))^*$. We can view this regular
   416 and the regular expression $(x + (y + xy))^*$. We can view this regular
   412 expression as a tree and if the string $xy$ is matched by two Star
   417 expression as a tree and if the string $xy$ is matched by two Star
   413 ``iterations'', then the $x$ is matched by the left-most alternative in
   418 ``iterations'', then the $x$ is matched by the left-most alternative in
   448 \end{equation}
   453 \end{equation}
   449 
   454 
   450 \noindent
   455 \noindent
   451 For convenience, we shall employ the following notations: the regular expression we
   456 For convenience, we shall employ the following notations: the regular expression we
   452 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
   457 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
   453 \ldots c_n$. First, we build the derivatives $r_1$, $r_2$, \ldots  according to
   458 \ldots c_n$. In  the first phase, we build the derivatives $r_1$, $r_2$, \ldots  according to
   454 the characters $c_0$, $c_1$,\ldots  until we exhaust the string and
   459 the characters $c_0$, $c_1$,\ldots  until we exhaust the string and
   455 obtain at the derivative $r_n$. We test whether this derivative is
   460 obtain at the derivative $r_n$. We test whether this derivative is
   456 $\textit{nullable}$ or not. If not, we know the string does not match
   461 $\textit{nullable}$ or not. If not, we know the string does not match
   457 $r$ and no value needs to be generated. If yes, we start building the
   462 $r$ and no value needs to be generated. If yes, we start building the
   458 parse tree incrementally by \emph{injecting} back the characters into
   463 parse tree incrementally by \emph{injecting} back the characters into
   479 also that in case of alternatives we give preference to the
   484 also that in case of alternatives we give preference to the
   480 regular expression on the left-hand side. This will become
   485 regular expression on the left-hand side. This will become
   481 important later on.
   486 important later on.
   482 
   487 
   483 After this, we inject back the characters one by one in order to build
   488 After this, we inject back the characters one by one in order to build
   484 the parse tree $v_i$ for how the regex $r_i$ matches the string
   489 the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$
   485 $s_i$ ($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After injecting back $n$ characters, we
   490 ($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After
   486 get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
   491 injecting back $n$ characters, we get the parse tree for how $r_0$
   487 
   492 matches $s$. For this Sulzmann and Lu defined a function that reverses
   488 We need a function that
   493 the ``chopping off'' of characters during the derivative phase. The
   489 reverses the "chopping off" for values which we did in the
   494 corresponding function is called $\textit{inj}$; it takes three
   490 first phase for derivatives. The corresponding function is
   495 arguments: the first one is a regular expression ${r_{i-1}}$, before the
   491 called $\textit{inj}$ (short for injection). This function takes three
   496 character is chopped off, the second is a character ${c_{i-1}}$, the
   492 arguments: the first one is a regular expression ${r_{i-1}}$, 
   497 character we want to inject and the third argument is the value
   493 before the character is chopped off, the second is  ${c_{i-1}}$,
   498 $textit{v_i}$, into which one wants to inject the character (it
   494 the character
   499 corresponds to the regular expression after the character has been
   495 we
   500 chopped off). The result of this function is a new value. The definition
   496 want to inject and the third argument is the value $textit{v_i}$, where we
   501 of $\textit{inj}$ is as follows: 
   497 will inject the character(it corresponds to the regular expression
       
   498  after the character
       
   499 has been chopped off). The result of this function is a
       
   500 new value. The definition of $\textit{inj}$ is as follows: 
       
   501 
   502 
   502 \begin{center}
   503 \begin{center}
   503 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
   504 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
   504   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
   505   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
   505   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
   506   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
   509   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
   510   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
   510   $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
   511   $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
   511 \end{tabular}
   512 \end{tabular}
   512 \end{center}
   513 \end{center}
   513 
   514 
   514 \noindent This definition is by recursion on the regular
   515 \noindent This definition is by recursion on the ``shape'' of regular
   515 expressions' and values' shapes. 
   516 expressions and values. To understands this definition better consider
   516 To understands this definition in a more vivid way, the reader 
   517 the situation when we build the derivative on regular expression $r_{i-1}$.
   517 might imagine this: when we do the derivative on regular expression 
   518 For this we chop off a character from $r_{i-1}$ to form $r_i$. This leaves a
   518 $r_{i-1}$, we chop off a character from $r_{i-1}$ to form $r_i$.
   519 ``hole'' in $r_i$. In order to calculate a value for $r_{i-1}$, we need to
   519 This leaves a "hole" on $r_i$. In order to calculate a value for
   520 locate where that hole is. We can find this location information by
   520 $r_{i-1}$, we need to locate where that hole is. We can find this location
   521 comparing $r_{i-1}$ and $v_i$. For instance, if $r_{i-1}$ is of shape
   521 informtation by comparing $r_{i-1}$ and $v_i$.
   522 $r_a \cdot r_b$, and $v_i$ is of shape $\textit{Left}(\textit{Seq}(v_a,
   522 For instance, if $r_{i-1}$ is of shape $r_a \cdot r_b$, and $v_i$
   523 v_b))$, we know immediately that 
   523 is of shape $\textit{Left}(\textit{Seq}(v_a, v_b))$, we know immediately that 
   524 %
   524 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c\],
   525 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c,\]
       
   526 
       
   527 \noindent
   525 otherwise if $r_a$ is not nullable,
   528 otherwise if $r_a$ is not nullable,
   526 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\],
   529 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b,\]
   527 the value $v_i$ should be of shape $Seq(\ldots)$, contradicting the fact that
   530 
   528 $v_i$ is actually $Left(\ldots)$. Furthermore, since $v_i$ is of shape
   531 \noindent
   529 $Left(\ldots)$ instead of $Right(\ldots)$, we know that the left
   532 the value $v_i$ should be of shape $\Seq(\ldots)$, contradicting the fact that
       
   533 $v_i$ is actually $\Left(\ldots)$. Furthermore, since $v_i$ is of shape
       
   534 $\Left(\ldots)$ instead of $\Right(\ldots)$, we know that the left
   530 branch is taken instead of the right one. We have therefore found out 
   535 branch is taken instead of the right one. We have therefore found out 
   531 that the hole will be on $r_a$. So we recursively call $\textit{inj} 
   536 that the hole will be on $r_a$. So we recursively call $\inj\, 
   532 r_a c v_a$ to fill that hole in $v_a$. After injection, the value 
   537 r_a\,c\,v_a$ in order to fill that hole in $v_a$. After injection, the value 
   533 $v_i$ for $r_i = r_a \cdot r_b$ should be $\textit{Seq}(\textit{inj}
   538 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
   534 r_a c v_a, v_b)$.
       
   535 Other clauses can be understood in a similar way.
   539 Other clauses can be understood in a similar way.
   536 
   540 
   537 To understand what is going on, it might be best to do some
   541 Let us do some further examples involving $\inj$: Suppose we have the
   538 example calculations and compare them with Figure~\eqref{graph:2}.
   542 regular expression $((((a+b)+ab)+c)+abc)^*$ and want to match it against
   539 % A correctness proof using induction can be routinely established.
   543 the string $abc$ (when $abc$ is written as a regular expression, the most
   540 Suppose we have a regular expression $((((a+b)+ab)+c)+abc)^*$ and want to
   544 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit
   541 match it against the string $abc$(when $abc$ is written as a regular
   545 the parentheses and dots here for readability). By POSIX rules the lexer
   542 expression, the most standard way of expressing it should be $a \cdot (b
   546 should go for the longest matching, i.e. it should match the string
   543 \cdot c)$. We omit the parentheses and dots here for readability). By
   547 $abc$ in one star iteration, using the longest string $abc$ in the
   544 POSIX rules the lexer should go for the longest matching, i.e. it should
   548 sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this sub-expression
   545 match the string $abc$ in one star iteration, using the longest string
   549 for conciseness). Here is how the lexer achieves a parse tree for this
   546 $abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this
   550 matching. First, build successive derivatives until we exhaust the
   547 sub-expression for conciseness). Here is how the lexer achieves a parse
   551 string, as illustrated here (we simplified some regular expressions like
   548 tree for this matching. First, build successive derivatives until we
   552 $0 \cdot b$ to $0$ for conciseness; we also omit some parentheses if
   549 exhaust the string, as illustrated here(we simplified some regular
   553 they are clear from the context):
   550 expressions like $0 \cdot b$ to $0$ for conciseness. We also omit 
       
   551 some parentheses if they are clear from the context.):
       
   552 %Similarly, we allow
   554 %Similarly, we allow
   553 %$\textit{ALT}$ to take a list of regular expressions as an argument
   555 %$\textit{ALT}$ to take a list of regular expressions as an argument
   554 %instead of just 2 operands to reduce the nested depth of
   556 %instead of just 2 operands to reduce the nested depth of
   555 %$\textit{ALT}$
   557 %$\textit{ALT}$
   556 \begin{center}
   558 \begin{center}
   557 \[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^* \xrightarrow{\backslash b}\]
   559 \begin{tabular}{lcl}
   558 \[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0  + 0 + 0) \cdot r^* \xrightarrow{\backslash c}\] 
   560 $r^*$ & $\xrightarrow{\backslash a}$ & $r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^*$\\
   559 \[r_3 = (( 0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* )
   561       & $\xrightarrow{\backslash b}$ & $r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0  + 0 + 0) \cdot r^*$\\
   560 \]
   562       & $\xrightarrow{\backslash c}$ & $r_3 = ((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^*) + $\\ 
       
   563       &                              & $\phantom{r_3 = (} ((0+1+0  + 0 + 0) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^* )$
       
   564 \end{tabular}
   561 \end{center}
   565 \end{center}
       
   566 
       
   567 \noindent
   562 Now instead when $nullable$ gives a $yes$ on $r_3$, we  call $mkeps$ 
   568 Now instead when $nullable$ gives a $yes$ on $r_3$, we  call $mkeps$ 
   563 to construct a parse tree for how $r_3$ matched the string $abc$. 
   569 to construct a parse tree for how $r_3$ matched the string $abc$. 
   564 $mkeps$ gives the following value $v_3$: 
   570 $mkeps$ gives the following value $v_3$: 
   565 \[$Left(Left(Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])))$\]
   571 \[$Left(Left(Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])))$\]
   566 The outer 2 $Left$ tells us the first nullable part  hides in the term
   572 The outer 2 $Left$ tells us the first nullable part  hides in the term
   610 Which one of these alternatives is chosen later does not affect their relative position because our algorithm 
   616 Which one of these alternatives is chosen later does not affect their relative position because our algorithm 
   611 does not change this order. If this parsing information can be determined and does
   617 does not change this order. If this parsing information can be determined and does
   612 not change because of later derivatives,
   618 not change because of later derivatives,
   613 there is no point in traversing this information twice. This leads to a new approach of lexing-- if we store the information for parse trees  in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and calling $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an n-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives.
   619 there is no point in traversing this information twice. This leads to a new approach of lexing-- if we store the information for parse trees  in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and calling $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an n-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives.
   614 
   620 
   615 In the next section, we shall focus on the bit-coded algorithm and the natural
   621 In the next section, we shall focus on the bit-coded algorithm and the
   616 process of simplification of regular expressions using bit-codes, which is needed in
   622 process of simplification of regular expressions. This is needed in
   617 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann
   623 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann
   618 and Lu's algorithms.  This is where the PhD-project hopes to advance
   624 and Lu's algorithms.  This is where the PhD-project aims to advance the
   619 the state-of-the-art.
   625 state-of-the-art.
   620 
   626 
   621 
   627 
   622 \section{Simplification of Regular Expressions}
   628 \section{Simplification of Regular Expressions}
   623 Using bit-codes to guide  parsing is not a new idea.
   629 
   624 It was applied to context free grammars and then adapted by Henglein and Nielson for efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and Lu took a step further by integrating bitcodes into derivatives.
   630 
   625 
   631 Using bit-codes to guide  parsing is not a novel idea. It was applied to
   626 The argument for complicating the data structures from basic regular expressions to those with bitcodes
   632 context free grammars and then adapted by Henglein and Nielson for
   627 is that we can introduce simplification without making the algorithm crash or overly complex to reason about.
   633 efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and
   628 The reason why we need simplification is due to the shortcoming of the previous algorithm. 
   634 Lu took a step further by integrating bitcodes into derivatives.
   629 The main drawback of building successive derivatives according to
   635 
   630 Brzozowski's definition is that they can grow very quickly in size.
   636 The argument for complicating the data structures from basic regular
   631 This is mainly due to the fact that the derivative operation generates
   637 expressions to those with bitcodes is that we can introduce
   632 often ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result,
   638 simplification without making the algorithm crash or overly complex to
   633 if implemented naively both algorithms by Brzozowski and by Sulzmann
   639 reason about. The reason why we need simplification is due to the
   634 and Lu are excruciatingly slow. For example when starting with the
   640 shortcoming of the previous algorithm. The main drawback of building
   635 regular expression $(a + aa)^*$ and building 12 successive derivatives
   641 successive derivatives according to Brzozowski's definition is that they
   636 w.r.t.~the character $a$, one obtains a derivative regular expression
   642 can grow very quickly in size. This is mainly due to the fact that the
   637 with more than 8000 nodes (when viewed as a tree). Operations like
   643 derivative operation generates often ``useless'' $\ZERO$s and $\ONE$s in
   638 derivative and $\nullable$ need to traverse such trees and
   644 derivatives.  As a result, if implemented naively both algorithms by
   639 consequently the bigger the size of the derivative the slower the
   645 Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example
   640 algorithm. Fortunately, one can simplify regular expressions after
   646 when starting with the regular expression $(a + aa)^*$ and building 12
   641 each derivative step. Various simplifications of regular expressions
   647 successive derivatives w.r.t.~the character $a$, one obtains a
   642 are possible, such as the simplifications of $\ZERO + r$,
   648 derivative regular expression with more than 8000 nodes (when viewed as
   643 $r + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just
   649 a tree). Operations like derivative and $\nullable$ need to traverse
   644 $r$. These simplifications do not affect the answer for whether a
   650 such trees and consequently the bigger the size of the derivative the
   645 regular expression matches a string or not, but fortunately also do
   651 slower the algorithm. Fortunately, one can simplify regular expressions
   646 not affect the POSIX strategy of how regular expressions match
   652 after each derivative step. Various simplifications of regular
   647 strings---although the latter is much harder to establish. Some
   653 expressions are possible, such as the simplifications of $\ZERO + r$, $r
   648 initial results in this regard have been obtained in
   654 + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just $r$. These
   649 \cite{AusafDyckhoffUrban2016}. However, what has not been achieved yet
   655 simplifications do not affect the answer for whether a regular
   650 is a very tight bound for the size. Such a tight bound is suggested by
   656 expression matches a string or not, but fortunately also do not affect
   651 work of Antimirov who proved that (partial) derivatives can be bound
   657 the POSIX strategy of how regular expressions match strings---although
   652 by the number of characters contained in the initial regular
   658 the latter is much harder to establish. Some initial results in this
   653 expression \cite{Antimirov95}.
   659 regard have been obtained in \cite{AusafDyckhoffUrban2016}. However,
   654 
   660 what has not been achieved yet is a very tight bound for the size. Such
   655 Antimirov defined the "partial derivatives" of regular expressions to be this:
   661 a tight bound is suggested by work of Antimirov who proved that
       
   662 (partial) derivatives can be bound by the number of characters contained
       
   663 in the initial regular expression \cite{Antimirov95}.
       
   664 
       
   665 Antimirov defined the \emph{partial derivatives} of regular expressions to be this:
   656 %TODO definition of partial derivatives
   666 %TODO definition of partial derivatives
   657 \begin{center}
   667 \begin{center}
   658 \begin{tabular}{lcl}
   668 \begin{tabular}{lcl}
   659  $\textit{pder} \; c \; 0$ & $\dn$ & $\emptyset$\\
   669  $\textit{pder} \; c \; 0$ & $\dn$ & $\emptyset$\\
   660  $\textit{pder} \; c \; 1$ & $\dn$ & $\emptyset$ \\
   670  $\textit{pder} \; c \; 1$ & $\dn$ & $\emptyset$ \\