ninems/ninems.tex
changeset 64 afd0d702a4fe
parent 63 d3c22f809dde
child 65 df2e0faccb23
equal deleted inserted replaced
63:d3c22f809dde 64:afd0d702a4fe
    68 \begin{abstract}
    68 \begin{abstract}
    69   Brzozowski introduced in 1964 a beautifully simple algorithm for
    69   Brzozowski introduced in 1964 a beautifully simple algorithm for
    70   regular expression matching based on the notion of derivatives of
    70   regular expression matching based on the notion of derivatives of
    71   regular expressions. In 2014, Sulzmann and Lu extended this
    71   regular expressions. In 2014, Sulzmann and Lu extended this
    72   algorithm to not just give a YES/NO answer for whether or not a
    72   algorithm to not just give a YES/NO answer for whether or not a
    73   regular expression matches a string, but in case it matches also
    73   regular expression matches a string, but in case it does also
    74   answers with \emph{how} it matches the string.  This is important for
    74   answers with \emph{how} it matches the string.  This is important for
    75   applications such as lexing (tokenising a string). The problem is to
    75   applications such as lexing (tokenising a string). The problem is to
    76   make the algorithm by Sulzmann and Lu fast on all inputs without
    76   make the algorithm by Sulzmann and Lu fast on all inputs without
    77   breaking its correctness. We have already developed some
    77   breaking its correctness. We have already developed some
    78   simplification rules for this, but have not yet proved that they
    78   simplification rules for this, but have not yet proved that they
    91 fast?
    91 fast?
    92 
    92 
    93 Unfortunately these preconceptions are not supported by evidence: Take
    93 Unfortunately these preconceptions are not supported by evidence: Take
    94 for example the regular expression $(a^*)^*\,b$ and ask whether
    94 for example the regular expression $(a^*)^*\,b$ and ask whether
    95 strings of the form $aa..a$ match this regular
    95 strings of the form $aa..a$ match this regular
    96 expression. Obviously they do not match---the expected $b$ in the last
    96 expression. Obviously not---the expected $b$ in the last
    97 position is missing. One would expect that modern regular expression
    97 position is missing. One would expect that modern regular expression
    98 matching engines can find this out very quickly. Alas, if one tries
    98 matching engines can find this out very quickly. Alas, if one tries
    99 this example in JavaScript, Python or Java 8 with strings like 28
    99 this example in JavaScript, Python or Java 8 with strings like 28
   100 $a$'s, one discovers that this decision takes around 30 seconds and
   100 $a$'s, one discovers that this decision takes around 30 seconds and
   101 takes considerably longer when adding a few more $a$'s, as the graphs
   101 takes considerably longer when adding a few more $a$'s, as the graphs
   169            of the form $\underbrace{aa..a}_{n}$.}
   169            of the form $\underbrace{aa..a}_{n}$.}
   170 \end{tabular}    
   170 \end{tabular}    
   171 \end{center}  
   171 \end{center}  
   172 
   172 
   173 \noindent These are clearly abysmal and possibly surprising results. One
   173 \noindent These are clearly abysmal and possibly surprising results. One
   174 would expect these systems doing much better than that---after all,
   174 would expect these systems to do  much better than that---after all,
   175 given a DFA and a string, deciding whether a string is matched by this
   175 given a DFA and a string, deciding whether a string is matched by this
   176 DFA should be linear.
   176 DFA should be linear.
   177 
   177 
   178 Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
   178 Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
   179 exhibit this ``exponential behaviour''.  Unfortunately, such regular
   179 exhibit this exponential behaviour.  Unfortunately, such regular
   180 expressions are not just a few ``outliers'', but actually they are
   180 expressions are not just a few outliers. They are actually 
   181 frequent enough that a separate name has been created for
   181 frequent enough to have a separate name created for
   182 them---\emph{evil regular expressions}. In empiric work, Davis et al
   182 them---\emph{evil regular expressions}. In empiric work, Davis et al
   183 report that they have found thousands of such evil regular expressions
   183 report that they have found thousands of such evil regular expressions
   184 in the JavaScript and Python ecosystems \cite{Davis18}.
   184 in the JavaScript and Python ecosystems \cite{Davis18}.
   185 
   185 
   186 This exponential blowup in matching algorithms sometimes causes
   186 This exponential blowup in matching algorithms sometimes causes
   188 regular expression brought the webpage
   188 regular expression brought the webpage
   189 \href{http://stackexchange.com}{Stack Exchange} to its
   189 \href{http://stackexchange.com}{Stack Exchange} to its
   190 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
   190 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
   191 In this instance, a regular expression intended to just trim white
   191 In this instance, a regular expression intended to just trim white
   192 spaces from the beginning and the end of a line actually consumed
   192 spaces from the beginning and the end of a line actually consumed
   193 massive amounts of CPU-resources and because of this the web servers
   193 massive amounts of CPU-resources---causing web servers to
   194 ground to a halt. This happened when a post with 20,000 white spaces was
   194 grind to a halt. This happened when a post with 20,000 white spaces was
   195 submitted, but importantly the white spaces were neither at the
   195 submitted, but importantly the white spaces were neither at the
   196 beginning nor at the end. As a result, the regular expression matching
   196 beginning nor at the end. As a result, the regular expression matching
   197 engine needed to backtrack over many choices. The underlying problem is
   197 engine needed to backtrack over many choices. The underlying problem is
   198 that many ``real life'' regular expression matching engines do not use
   198 that many ``real life'' regular expression matching engines do not use
   199 DFAs for matching. This is because they support regular expressions that
   199 DFAs for matching. This is because they support regular expressions that
   200 are not covered by the classical automata theory, and in this more
   200 are not covered by the classical automata theory, and in this more
   201 general setting there are quite a few research questions still
   201 general setting there are quite a few research questions still
   202 unanswered and fast algorithms still need to be developed (for example
   202 unanswered and fast algorithms still need to be developed (for example
   203 how to treat bounded repetitions, negation and  back-references
   203 how to treat bounded repetitions, negation and  back-references
   204 efficiently).
   204 efficiently).
   205 
   205 %question: dfa can have exponential states. isn't this the actual reason why they do not use dfas?
       
   206 %how do they avoid dfas exponential states if they use them for fast matching?
   206 There is also another under-researched problem to do with regular
   207 There is also another under-researched problem to do with regular
   207 expressions and lexing, i.e.~the process of breaking up strings into
   208 expressions and lexing, i.e.~the process of breaking up strings into
   208 sequences of tokens according to some regular expressions. In this
   209 sequences of tokens according to some regular expressions. In this
   209 setting one is not just interested in whether or not a regular
   210 setting one is not just interested in whether or not a regular
   210 expression matches a string, but if it matches also in \emph{how} it
   211 expression matches a string, but also in \emph{how}.  Consider for example a regular expression
   211 matches the string.  Consider for example a regular expression
       
   212 $r_{key}$ for recognising keywords such as \textit{if}, \textit{then}
   212 $r_{key}$ for recognising keywords such as \textit{if}, \textit{then}
   213 and so on; and a regular expression $r_{id}$ for recognising
   213 and so on; and a regular expression $r_{id}$ for recognising
   214 identifiers (say, a single character followed by characters or
   214 identifiers (say, a single character followed by characters or
   215 numbers). One can then form the compound regular expression
   215 numbers). One can then form the compound regular expression
   216 $(r_{key} + r_{id})^*$ and use it to tokenise strings.  But then how
   216 $(r_{key} + r_{id})^*$ and use it to tokenise strings.  But then how
   219 single identifier.  Similarly, how should the string \textit{if} be
   219 single identifier.  Similarly, how should the string \textit{if} be
   220 tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
   220 tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
   221 ``fire''---so is it an identifier or a keyword?  While in applications
   221 ``fire''---so is it an identifier or a keyword?  While in applications
   222 there is a well-known strategy to decide these questions, called POSIX
   222 there is a well-known strategy to decide these questions, called POSIX
   223 matching, only relatively recently precise definitions of what POSIX
   223 matching, only relatively recently precise definitions of what POSIX
   224 matching actually means have been formalised
   224 matching actually means has been formalised
   225 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. 
   225 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. 
   226 Such a definition has also been given by Sulzmann and  Lu \cite{Sulzmann2014}, but the
   226 Such a definition has also been given by Sulzmann and  Lu \cite{Sulzmann2014}, but the
   227 corresponding correctness proof turned out to be  faulty \cite{AusafDyckhoffUrban2016}.
   227 corresponding correctness proof turned out to be  faulty \cite{AusafDyckhoffUrban2016}.
   228 Roughly, POSIX matching means matching the longest initial substring.
   228 Roughly, POSIX matching means matching the longest initial substring.
   229 In the case of a tie, the initial submatch is chosen according to some priorities attached to the
   229 In the case of a tie, the initial sub-match is chosen according to some priorities attached to the
   230 regular expressions (e.g.~keywords have a higher priority than
   230 regular expressions (e.g.~keywords have a higher priority than
   231 identifiers). This sounds rather simple, but according to Grathwohl et
   231 identifiers). This sounds rather simple, but according to Grathwohl et
   232 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
   232 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
   233 
   233 
   234 \begin{quote}
   234 \begin{quote}
   345 \[
   345 \[
   346 match\;s\;r \;\dn\; nullable(r\backslash s)
   346 match\;s\;r \;\dn\; nullable(r\backslash s)
   347 \]
   347 \]
   348 
   348 
   349 \noindent
   349 \noindent
   350 This algorithm can be illustrated graphically as follows
   350 This algorithm looks graphically as follows:
   351 \begin{equation}\label{graph:*}
   351 \begin{equation}\label{graph:*}
   352 \begin{tikzcd}
   352 \begin{tikzcd}
   353 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
   353 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
   354 \end{tikzcd}
   354 \end{tikzcd}
   355 \end{equation}
   355 \end{equation}
   357 \noindent
   357 \noindent
   358 where we start with  a regular expression  $r_0$, build successive
   358 where we start with  a regular expression  $r_0$, build successive
   359 derivatives until we exhaust the string and then use \textit{nullable}
   359 derivatives until we exhaust the string and then use \textit{nullable}
   360 to test whether the result can match the empty string. It can  be
   360 to test whether the result can match the empty string. It can  be
   361 relatively  easily shown that this matcher is correct  (that is given
   361 relatively  easily shown that this matcher is correct  (that is given
   362 an $s$ and a $r$, it generates YES if and only if $s \in L(r)$).
   362 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
   363 
   363 
   364  
   364  
   365 \section{Values and the Algorithm by Sulzmann and Lu}
   365 \section{Values and the Algorithm by Sulzmann and Lu}
   366 
   366 
   367 One limitation, however, of Brzozowski's algorithm is that it only
   367 One limitation, however, of Brzozowski's algorithm is that it only
   397 		\end{tabular}
   397 		\end{tabular}
   398 	\end{tabular}
   398 	\end{tabular}
   399 \end{center}
   399 \end{center}
   400 
   400 
   401 \noindent
   401 \noindent
   402 There is no value  corresponding to $\ZERO$; $\Empty$ corresponds to
   402 No value  corresponds to $\ZERO$; $\Empty$ corresponds to
   403 $\ONE$; $\Char$ to the character regular expression; $\Seq$ to the
   403 $\ONE$; $\Char$ to the character regular expression; $\Seq$ to the
   404 sequence regular expression and so on. The idea of values is to encode
   404 sequence regular expression and so on. The idea of values is to encode
   405 parse trees. To see this, suppose a \emph{flatten} operation, written
   405 parse trees for how the sub-parts of a regular expression matches
       
   406 the sub-parts of a string. To see this, suppose a \emph{flatten} operation, written
   406 $|v|$. We use this function to extract the underlying string of a value
   407 $|v|$. We use this function to extract the underlying string of a value
   407 $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \, (\textit{Char
   408 $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \, (\textit{Char
   408 y})|$ is the string $xy$.  Using flatten, we can describe how values
   409 y})|$ is the string $xy$.  Using flatten, we can describe how values
   409 encode parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @
   410 encode parse trees: $\Seq\,v_1\, v_2$ encodes a tree with 2 children nodes
       
   411 that tells how the string $|v_1| @
   410 |v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches the
   412 |v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches the
   411 substring $|v_1|$ and, respectively, $r_2$ matches the substring
   413 substring $|v_1|$ and, respectively, $r_2$ matches the substring
   412 $|v_2|$. Exactly how these two are matched is contained in the
   414 $|v_2|$. Exactly how these two are matched is contained in the
   413 sub-structure of $v_1$ and $v_2$. 
   415 children nodes $v_1$ and $v_2$ of parent $\textit{Seq}$ . 
   414 
   416 
   415  To give a concrete example of how value works, consider the string $xy$
   417  To give a concrete example of how value works, consider the string $xy$
   416 and the regular expression $(x + (y + xy))^*$. We can view this regular
   418 and the regular expression $(x + (y + xy))^*$. We can view this regular
   417 expression as a tree and if the string $xy$ is matched by two Star
   419 expression as a tree and if the string $xy$ is matched by two Star
   418 ``iterations'', then the $x$ is matched by the left-most alternative in
   420 ``iterations'', then the $x$ is matched by the left-most alternative in
   422 \begin{center}
   424 \begin{center}
   423 $\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
   425 $\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
   424 \end{center}
   426 \end{center}
   425 
   427 
   426 \noindent
   428 \noindent
   427 where $\Stars$ records how many
   429 where $\Stars [\ldots]$ records all the
   428 iterations were used; and $\Left$, respectively $\Right$, which
   430 iterations; and $\Left$, respectively $\Right$, which
   429 alternative is used. The value for
   431 alternative is used. The value for
   430 matching $xy$ in a single ``iteration'', i.e.~the POSIX value,
   432 matching $xy$ in a single ``iteration'', i.e.~the POSIX value,
   431 would look as follows
   433 would look as follows
   432 
   434 
   433 \begin{center}
   435 \begin{center}
   453 \end{equation}
   455 \end{equation}
   454 
   456 
   455 \noindent
   457 \noindent
   456 For convenience, we shall employ the following notations: the regular expression we
   458 For convenience, we shall employ the following notations: the regular expression we
   457 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
   459 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
   458 \ldots c_n$. In  the first phase, we build the derivatives $r_1$, $r_2$, \ldots  according to
   460 \ldots c_{n-1}$. In  the first phase, we build the derivatives $r_1$, $r_2$, \ldots  according to
   459 the characters $c_0$, $c_1$,\ldots  until we exhaust the string and
   461 the characters $c_0$, $c_1$,\ldots  until we exhaust the string and
   460 obtain at the derivative $r_n$. We test whether this derivative is
   462 obtain at the derivative $r_n$. We test whether this derivative is
   461 $\textit{nullable}$ or not. If not, we know the string does not match
   463 $\textit{nullable}$ or not. If not, we know the string does not match
   462 $r$ and no value needs to be generated. If yes, we start building the
   464 $r$ and no value needs to be generated. If yes, we start building the
   463 parse tree incrementally by \emph{injecting} back the characters into
   465 parse tree incrementally by \emph{injecting} back the characters into
   485 regular expression on the left-hand side. This will become
   487 regular expression on the left-hand side. This will become
   486 important later on.
   488 important later on.
   487 
   489 
   488 After this, we inject back the characters one by one in order to build
   490 After this, we inject back the characters one by one in order to build
   489 the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$
   491 the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$
   490 ($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After
   492 ($s_i = c_i \ldots c_{n-1}$ ) from the previous parse tree $v_{i+1}$. After
   491 injecting back $n$ characters, we get the parse tree for how $r_0$
   493 injecting back $n$ characters, we get the parse tree for how $r_0$
   492 matches $s$. For this Sulzmann and Lu defined a function that reverses
   494 matches $s$. For this Sulzmann and Lu defined a function that reverses
   493 the ``chopping off'' of characters during the derivative phase. The
   495 the ``chopping off'' of characters during the derivative phase. The
   494 corresponding function is called $\textit{inj}$; it takes three
   496 corresponding function is called $\textit{inj}$; it takes three
   495 arguments: the first one is a regular expression ${r_{i-1}}$, before the
   497 arguments: the first one is a regular expression ${r_{i-1}}$, before the
   496 character is chopped off, the second is a character ${c_{i-1}}$, the
   498 character is chopped off, the second is a character ${c_{i-1}}$, the
   497 character we want to inject and the third argument is the value
   499 character we want to inject and the third argument is the value
   498 $textit{v_i}$, into which one wants to inject the character (it
   500 ${v_i}$, into which one wants to inject the character (it
   499 corresponds to the regular expression after the character has been
   501 corresponds to the regular expression after the character has been
   500 chopped off). The result of this function is a new value. The definition
   502 chopped off). The result of this function is a new value. The definition
   501 of $\textit{inj}$ is as follows: 
   503 of $\textit{inj}$ is as follows: 
   502 
   504 
   503 \begin{center}
   505 \begin{center}
   514 
   516 
   515 \noindent This definition is by recursion on the ``shape'' of regular
   517 \noindent This definition is by recursion on the ``shape'' of regular
   516 expressions and values. To understands this definition better consider
   518 expressions and values. To understands this definition better consider
   517 the situation when we build the derivative on regular expression $r_{i-1}$.
   519 the situation when we build the derivative on regular expression $r_{i-1}$.
   518 For this we chop off a character from $r_{i-1}$ to form $r_i$. This leaves a
   520 For this we chop off a character from $r_{i-1}$ to form $r_i$. This leaves a
   519 ``hole'' in $r_i$. In order to calculate a value for $r_{i-1}$, we need to
   521 ``hole'' in $r_i$ and its corresponding value $v_i$
   520 locate where that hole is. We can find this location information by
   522 . To calculate $v_{i-1}$, we need to
       
   523 locate where that hole is and fill it. 
       
   524 We can find this location by
   521 comparing $r_{i-1}$ and $v_i$. For instance, if $r_{i-1}$ is of shape
   525 comparing $r_{i-1}$ and $v_i$. For instance, if $r_{i-1}$ is of shape
   522 $r_a \cdot r_b$, and $v_i$ is of shape $\textit{Left}(\textit{Seq}(v_a,
   526 $r_a \cdot r_b$, and $v_i$ is of shape $\Left(Seq(v_1,v_2))$, we know immediately that 
   523 v_b))$, we know immediately that 
       
   524 %
   527 %
   525 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c,\]
   528 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c,\]
   526 
   529 
   527 \noindent
   530 \noindent
   528 otherwise if $r_a$ is not nullable,
   531 otherwise if $r_a$ is not nullable,
   529 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b,\]
   532 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b,\]
   530 
   533 
   531 \noindent
   534 \noindent
   532 the value $v_i$ should be of shape $\Seq(\ldots)$, contradicting the fact that
   535 the value $v_i$ should be  $\Seq(\ldots)$, contradicting the fact that
   533 $v_i$ is actually $\Left(\ldots)$. Furthermore, since $v_i$ is of shape
   536 $v_i$ is actually of shape $\Left(\ldots)$. Furthermore, since $v_i$ is of shape
   534 $\Left(\ldots)$ instead of $\Right(\ldots)$, we know that the left
   537 $\Left(\ldots)$ instead of $\Right(\ldots)$, we know that the left
   535 branch is taken instead of the right one. We have therefore found out 
   538 branch of \[ (r_a \cdot r_b)\backslash c =
       
   539 \bold{\underline{ (r_a\backslash c) \cdot r_b} }\,+\, r_b\backslash c,\](underlined)
       
   540  is taken instead of the right one. This means $c$ is chopped off 
       
   541 from $r_a$ rather than $r_b$.
       
   542 We have therefore found out 
   536 that the hole will be on $r_a$. So we recursively call $\inj\, 
   543 that the hole will be on $r_a$. So we recursively call $\inj\, 
   537 r_a\,c\,v_a$ in order to fill that hole in $v_a$. After injection, the value 
   544 r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value 
   538 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
   545 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
   539 Other clauses can be understood in a similar way.
   546 Other clauses can be understood in a similar way.
   540 
   547 
   541 Let us do some further examples involving $\inj$: Suppose we have the
   548 Let us do some further examples involving $\inj$: Suppose we have the
   542 regular expression $((((a+b)+ab)+c)+abc)^*$ and want to match it against
   549 regular expression $((((a+b)+ab)+c)+abc)^*$ and want to match it against
   543 the string $abc$ (when $abc$ is written as a regular expression, the most
   550 the string $abc$ (when $abc$ is written as a regular expression, the most
   544 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit
   551 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit
   545 the parentheses and dots here for readability). By POSIX rules the lexer
   552 the parentheses and dots here for readability). By POSIX rules the algorithm
   546 should go for the longest matching, i.e. it should match the string
   553 should go for the longest matching, i.e. it should match the string
   547 $abc$ in one star iteration, using the longest string $abc$ in the
   554 $abc$ in one star iteration, using the longest string $abc$ in the
   548 sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this sub-expression
   555 sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this sub-expression
   549 for conciseness). Here is how the lexer achieves a parse tree for this
   556 for conciseness). 
       
   557 Here is how the lexer achieves a parse tree for this
   550 matching. First, build successive derivatives until we exhaust the
   558 matching. First, build successive derivatives until we exhaust the
   551 string, as illustrated here (we simplified some regular expressions like
   559 string, as illustrated here (we simplified some regular expressions like
   552 $0 \cdot b$ to $0$ for conciseness; we also omit some parentheses if
   560 $0 \cdot b$ to $0$ for conciseness; we also omit some parentheses if
   553 they are clear from the context):
   561 they are clear from the context):
   554 %Similarly, we allow
   562 %Similarly, we allow