ninems/ninems.tex
changeset 46 9b48724ec609
parent 45 60cb82639691
child 47 d2a7e87ea6e1
equal deleted inserted replaced
45:60cb82639691 46:9b48724ec609
    70   regular expression matches a string, but in case it matches also
    70   regular expression matches a string, but in case it matches also
    71   \emph{how} it matches the string.  This is important for
    71   \emph{how} it matches the string.  This is important for
    72   applications such as lexing (tokenising a string). The problem is to
    72   applications such as lexing (tokenising a string). The problem is to
    73   make the algorithm by Sulzmann and Lu fast on all inputs without
    73   make the algorithm by Sulzmann and Lu fast on all inputs without
    74   breaking its correctness. We have already developed some
    74   breaking its correctness. We have already developed some
    75   simplification rules for this, but have not proved that they
    75   simplification rules for this, but have not proved yet that they
    76   preserve the correctness of the algorithm. We also have not yet
    76   preserve the correctness of the algorithm. We also have not yet
    77   looked at extended regular expressions, such as bounded repetitions,
    77   looked at extended regular expressions, such as bounded repetitions,
    78   negation and back-references.
    78   negation and back-references.
    79 \end{abstract}
    79 \end{abstract}
    80 
    80 
   180 frequent enough that a separate name has been created for
   180 frequent enough that a separate name has been created for
   181 them---\emph{evil regular expressions}. In empiric work, Davis et al
   181 them---\emph{evil regular expressions}. In empiric work, Davis et al
   182 report that they have found thousands of such evil regular expressions
   182 report that they have found thousands of such evil regular expressions
   183 in the JavaScript and Python ecosystems \cite{Davis18}.
   183 in the JavaScript and Python ecosystems \cite{Davis18}.
   184 
   184 
   185 This exponential blowup sometimes causes real pain in real life:
   185 This exponential blowup sometimes causes pain in real life:
   186 for example on 20 July 2016 one evil regular expression brought the
   186 for example on 20 July 2016 one evil regular expression brought the
   187 webpage \href{http://stackexchange.com}{Stack Exchange} to its knees \footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}.
   187 webpage \href{http://stackexchange.com}{Stack Exchange} to its 
       
   188 knees.\footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
   188 In this instance, a regular expression intended to just trim white
   189 In this instance, a regular expression intended to just trim white
   189 spaces from the beginning and the end of a line actually consumed
   190 spaces from the beginning and the end of a line actually consumed
   190 massive amounts of CPU-resources and because of this the web servers
   191 massive amounts of CPU-resources and because of this the web servers
   191 ground to a halt. This happened when a post with 20,000 white spaces
   192 ground to a halt. This happened when a post with 20,000 white spaces
   192 was submitted, but importantly the white spaces were neither at the
   193 was submitted, but importantly the white spaces were neither at the
   196 The underlying problem is that many ``real life'' regular expression
   197 The underlying problem is that many ``real life'' regular expression
   197 matching engines do not use DFAs for matching. This is because they
   198 matching engines do not use DFAs for matching. This is because they
   198 support regular expressions that are not covered by the classical
   199 support regular expressions that are not covered by the classical
   199 automata theory, and in this more general setting there are quite a
   200 automata theory, and in this more general setting there are quite a
   200 few research questions still unanswered and fast algorithms still need
   201 few research questions still unanswered and fast algorithms still need
   201 to be developed.
   202 to be developed (for example how to include bounded repetitions, negation
       
   203 and  back-references).
   202 
   204 
   203 There is also another under-researched problem to do with regular
   205 There is also another under-researched problem to do with regular
   204 expressions and lexing, i.e.~the process of breaking up strings into
   206 expressions and lexing, i.e.~the process of breaking up strings into
   205 sequences of tokens according to some regular expressions. In this
   207 sequences of tokens according to some regular expressions. In this
   206 setting one is not just interested in whether or not a regular
   208 setting one is not just interested in whether or not a regular
   217 tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
   219 tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
   218 ``fire''---so is it an identifier or a keyword?  While in applications
   220 ``fire''---so is it an identifier or a keyword?  While in applications
   219 there is a well-known strategy to decide these questions, called POSIX
   221 there is a well-known strategy to decide these questions, called POSIX
   220 matching, only relatively recently precise definitions of what POSIX
   222 matching, only relatively recently precise definitions of what POSIX
   221 matching actually means have been formalised
   223 matching actually means have been formalised
   222 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly,
   224 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. 
   223 POSIX matching means matching the longest initial substring.
   225 Such a definition has also been given by Sulzmann and  Lu \cite{Sulzmann2014}, but the
       
   226 corresponding correctness proof turned out to be  faulty \cite{AusafDyckhoffUrban2016}.
       
   227 Roughly, POSIX matching means matching the longest initial substring.
   224 In the case of a tie, the initial submatch is chosen according to some priorities attached to the
   228 In the case of a tie, the initial submatch is chosen according to some priorities attached to the
   225 regular expressions (e.g.~keywords have a higher priority than
   229 regular expressions (e.g.~keywords have a higher priority than
   226 identifiers). This sounds rather simple, but according to Grathwohl et
   230 identifiers). This sounds rather simple, but according to Grathwohl et
   227 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
   231 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
   228 
   232 
   240 Our focus is on an algorithm introduced by Sulzmann and Lu in 2014 for
   244 Our focus is on an algorithm introduced by Sulzmann and Lu in 2014 for
   241 regular expression matching according to the POSIX strategy
   245 regular expression matching according to the POSIX strategy
   242 \cite{Sulzmann2014}. Their algorithm is based on an older algorithm by
   246 \cite{Sulzmann2014}. Their algorithm is based on an older algorithm by
   243 Brzozowski from 1964 where he introduced the notion of derivatives of
   247 Brzozowski from 1964 where he introduced the notion of derivatives of
   244 regular expressions \cite{Brzozowski1964}. We shall briefly explain
   248 regular expressions \cite{Brzozowski1964}. We shall briefly explain
   245 the algorithms next.
   249 this algorithms next.
   246 
   250 
   247 \section{The Algorithms by  Brzozowski, and Sulzmann and Lu}
   251 \section{The Algorithm by Brzozowski based on Derivatives of Regular
       
   252 Expressions}
   248 
   253 
   249 Suppose (basic) regular expressions are given by the following grammar:
   254 Suppose (basic) regular expressions are given by the following grammar:
   250 \[			r ::=   \ZERO \mid  \ONE
   255 \[			r ::=   \ZERO \mid  \ONE
   251 			 \mid  c  
   256 			 \mid  c  
   252 			 \mid  r_1 \cdot r_2
   257 			 \mid  r_1 \cdot r_2
   253 			 \mid  r_1 + r_2   
   258 			 \mid  r_1 + r_2   
   254 			 \mid r^*         
   259 			 \mid r^*         
   255 \]
   260 \]
   256 
   261 
   257 \noindent
   262 \noindent
   258 The intended meaning of the constructors is as usual: $\ZERO$
   263 The intended meaning of the constructors is as follows: $\ZERO$
   259 cannot match any string, $\ONE$ can match the empty string, the
   264 cannot match any string, $\ONE$ can match the empty string, the
   260 character regular expression $c$ can match the character $c$, and so
   265 character regular expression $c$ can match the character $c$, and so
   261 on.
   266 on.
   262 
   267 
   263 The brilliant contribution by Brzozowski is the notion of
   268 The brilliant contribution by Brzozowski is the notion of
   294 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
   299 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
   295 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
   300 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
   296 \end{tabular}
   301 \end{tabular}
   297 \end{center}
   302 \end{center}
   298 
   303 
   299 \noindent
   304 %Assuming the classic notion of a
   300 
       
   301 
       
   302 
       
   303  %Assuming the classic notion of a
       
   304 %\emph{language} of a regular expression, written $L(\_)$, t
   305 %\emph{language} of a regular expression, written $L(\_)$, t
       
   306 
   305 \noindent
   307 \noindent
   306 The main property of the derivative operation is that
   308 The main property of the derivative operation is that
   307 
   309 
   308 \begin{center}
   310 \begin{center}
   309 $c\!::\!s \in L(r)$ holds
   311 $c\!::\!s \in L(r)$ holds
   310 if and only if $s \in L(r\backslash c)$.
   312 if and only if $s \in L(r\backslash c)$.
   311 \end{center}
   313 \end{center}
   312 
   314 
   313 \noindent
   315 \noindent
   314  For us the main advantage is that derivatives can be
   316 For us the main advantage is that derivatives can be
   315 straightforwardly implemented in any functional programming language,
   317 straightforwardly implemented in any functional programming language,
   316 and are easily definable and reasoned about in theorem provers---the
   318 and are easily definable and reasoned about in theorem provers---the
   317 definitions just consist of inductive datatypes and simple recursive
   319 definitions just consist of inductive datatypes and simple recursive
   318 functions. Moreover, the notion of derivatives can be easily
   320 functions. Moreover, the notion of derivatives can be easily
   319 generalised to cover extended regular expression constructors such as
   321 generalised to cover extended regular expression constructors such as
   327 expression $r$, build the derivatives of $r$ w.r.t.\ (in succession)
   329 expression $r$, build the derivatives of $r$ w.r.t.\ (in succession)
   328 all the characters of the string $s$. Finally, test whether the
   330 all the characters of the string $s$. Finally, test whether the
   329 resulting regular expression can match the empty string.  If yes, then
   331 resulting regular expression can match the empty string.  If yes, then
   330 $r$ matches $s$, and no in the negative case. To implement this idea
   332 $r$ matches $s$, and no in the negative case. To implement this idea
   331 we can generalise the derivative operation to strings like this:
   333 we can generalise the derivative operation to strings like this:
       
   334 
   332 \begin{center}
   335 \begin{center}
   333 \begin{tabular}{lcl}
   336 \begin{tabular}{lcl}
   334 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
   337 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
   335 $r \backslash [\,] $ & $\dn$ & $r$
   338 $r \backslash [\,] $ & $\dn$ & $r$
   336 \end{tabular}
   339 \end{tabular}
   337 \end{center}
   340 \end{center}
   338 
   341 
   339 \noindent
   342 \noindent
   340 Using this definition we obtain a simple and elegant regular
   343 and then define as  regular-expression matching algorithm: 
   341 expression matching algorithm: 
       
   342 \[
   344 \[
   343 match\;s\;r \;\dn\; nullable(r\backslash s)
   345 match\;s\;r \;\dn\; nullable(r\backslash s)
   344 \]
   346 \]
   345 
   347 
   346 \noindent
   348 \noindent
   347 Pictorially this algorithm can be illustrated as follows:
   349 This algorithm can be illustrated as follows:
   348 
   350 \begin{equation}\label{graph:*}
   349 \begin{center}
   351 \begin{tikzcd}
   350 \begin{tikzcd}\label{graph:*} 
   352 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
   351 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"nullable?"] & Yes/No
       
   352 \end{tikzcd}
   353 \end{tikzcd}
   353 \end{center}
   354 \end{equation}
   354 
   355 
   355 \noindent
   356 \noindent
   356 where we start with  a regular expression  $r_0$, build successive derivatives
   357 where we start with  a regular expression  $r_0$, build successive
   357 until we exhaust the string and then use \textit{nullable} to test whether the
   358 derivatives until we exhaust the string and then use \textit{nullable}
   358 result can match the empty string. It can  be relatively  easily shown that this
   359 to test whether the result can match the empty string. It can  be
   359 matcher is correct.
   360 relatively  easily shown that this matcher is correct  (that is given
       
   361 $s$ and $r$, it generates YES if and only if $s \in L(r)$).
       
   362 
       
   363  
       
   364 \section{Values and the Algorithm by Sulzmann and Lu}
   360 
   365 
   361 One limitation, however, of Brzozowski's algorithm is that it only
   366 One limitation, however, of Brzozowski's algorithm is that it only
   362 produces a YES/NO answer for whether a string is being matched by a
   367 produces a YES/NO answer for whether a string is being matched by a
   363 regular expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this
   368 regular expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this
   364 algorithm to allow generation of an actual matching, called a
   369 algorithm to allow generation of an actual matching, called a
   365 \emph{value}.
   370 \emph{value}. Values and regular expressions correspond to each 
       
   371 other as illustrated in the following table:
       
   372 
   366 
   373 
   367 \begin{center}
   374 \begin{center}
   368 	\begin{tabular}{c@{\hspace{20mm}}c}
   375 	\begin{tabular}{c@{\hspace{20mm}}c}
   369 		\begin{tabular}{@{}rrl@{}}
   376 		\begin{tabular}{@{}rrl@{}}
   370 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
   377 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
   389 		\end{tabular}
   396 		\end{tabular}
   390 	\end{tabular}
   397 	\end{tabular}
   391 \end{center}
   398 \end{center}
   392 
   399 
   393 \noindent
   400 \noindent
   394 Values and regular expressions correspond to each other as illustrated by placing corresponding values next to the regular expressions.
   401 The idea of values is to express parse trees. Suppose a flatten
   395 The idea of values is to express parse trees. 
   402 operation, written $|v|$, which we can use to extract the underlying
   396 Suppose by using a flatten operation, written $|v|$, we can extract the underlying string of v.
   403 string of $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \,
   397 For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$. We omit this straightforward definition.
   404 (\textit{Char y})|$ is the string $xy$. We omit the straightforward
   398 Using this flatten notation, we now elaborate how values can express parse trees. $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and respectively $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$. 
   405 definition of flatten. Using flatten, we can describe how
   399 
   406 values encode parse trees: $\Seq\,v_1\, v_2$ tells us how the
   400  To give a concrete example of how value works, consider the string $xy$ and the
   407 string $|v_1| @ |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$
   401 regular expression $(x + (y + xy))^*$. We can view this regular
   408 matches $|v_1|$ and, respectively, $r_2$ matches $|v_2|$. Exactly how
       
   409 these two are matched is contained in the sub-structure of $v_1$ and
       
   410 $v_2$. 
       
   411 
       
   412  To give a concrete example of how value works, consider the string $xy$
       
   413 and the regular expression $(x + (y + xy))^*$. We can view this regular
   402 expression as a tree and if the string $xy$ is matched by two Star
   414 expression as a tree and if the string $xy$ is matched by two Star
   403 ``iterations'', then the $x$ is matched by the left-most alternative
   415 ``iterations'', then the $x$ is matched by the left-most alternative in
   404 in this tree and the $y$ by the right-left alternative. This suggests
   416 this tree and the $y$ by the right-left alternative. This suggests to
   405 to record this matching as
   417 record this matching as
   406 
   418 
   407 \begin{center}
   419 \begin{center}
   408 $\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
   420 $\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
   409 \end{center}
   421 \end{center}
   410 
   422 
   424 and $\Seq$ indicates that $xy$ is matched by a sequence regular
   436 and $\Seq$ indicates that $xy$ is matched by a sequence regular
   425 expression.
   437 expression.
   426 
   438 
   427 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   439 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   428 algorithm by a second phase (the first phase being building successive
   440 algorithm by a second phase (the first phase being building successive
   429 derivatives). In this second phase, for every successful match the
   441 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
   430 corresponding POSIX value is computed. Pictorially, the Sulzmann and Lu algorithm looks like the following diagram(the working flow of the simple matching algorithm that just gives a $YES/NO$ answer is given before \ref{graph:*}):\\
   442 is generated assuming the regular expression matches  the string. 
       
   443 Pictorially, the algorithm as follows:
       
   444 
       
   445 \begin{center}
   431 \begin{tikzcd}
   446 \begin{tikzcd}
   432 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
   447 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
   433 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
   448 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
   434 \end{tikzcd}
   449 \end{tikzcd}
   435 
   450 \end{center}
   436 
   451 
   437 We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$.
   452 \noindent
   438 First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ...  until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for how the empty word matched the empty regular expression epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string:
   453 We shall briefly explain this algorithm. For the convenience of
       
   454 explanation, we have the following notations: the regular expression we
       
   455 start with is $r_0$ and the string $s$ is composed characters $c_0 c_1
       
   456 \ldots c_n$. First, we build the derivatives $r_1$, $r_2$, \ldots, using
       
   457 the characters $c_0$, $c_1$,\ldots  until we exhaust the string and
       
   458 arrive at the derivative $r_n$. We test whether this derivative is
       
   459 $\textit{nullable}$ or not. If not, we know the string does not match
       
   460 $r$ and no value needs to be generated. If yes, we start building the
       
   461 parse tree incrementally by \emph{injecting} back the characters into
       
   462 the values $v_n, \ldots, v_0$. We first call the function
       
   463 $\textit{mkeps}$, which builds the parse tree for how the empty string
       
   464 is matched the empty regular expression $r_n$. This function is defined
       
   465 as
   439 
   466 
   440 $mkeps $ $1 \,[] $ $= Empty$
   467 $mkeps $ $1 \,[] $ $= Empty$
   441 ......
   468 ......
   442 
   469 
   443 
   470 
   444  After this, we inject back the characters one by one, in reverse order as they were chopped off, to build the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$($s_i$ means the string s with the first $i$ characters being chopped off) from the previous parse tree. After $n$ transformations, we get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
   471  After this, we inject back the characters one by one in order to build
   445 An inductive proof can be routinely established.
   472 the parse tree $v_i$ for how the regex $r_i$ matches the string
       
   473 $s_i$ ($s_i$ means the string s with the first $i$ characters being
       
   474 chopped off) from the previous parse tree. After $n$ transformations, we
       
   475 get the parse tree for how $r_0$ matches $s$, exactly as we wanted. An
       
   476 inductive proof can be routinely established.
   446 
   477 
   447 It is instructive to see how it works by a little example. Suppose we have a regular expression $(a+b+ab+c+abc)*$ and we want to match it against the string $abc$. By POSIX rules the lexer should go for the longest matching, i.e. it should match the string $abc$ in one star iteration, using the longest string $abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this sub-expression for conciseness). Here is how the lexer achieves a parse tree for this matching.
   478 It is instructive to see how it works by a little example. Suppose we have a regular expression $(a+b+ab+c+abc)*$ and we want to match it against the string $abc$. By POSIX rules the lexer should go for the longest matching, i.e. it should match the string $abc$ in one star iteration, using the longest string $abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this sub-expression for conciseness). Here is how the lexer achieves a parse tree for this matching.
   448 First, we build successive derivatives until we exhaust the string, as illustrated here( we omitted some parenthesis for better readability):
   479 First, we build successive derivatives until we exhaust the string, as illustrated here( we omitted some parenthesis for better readability):
   449 
   480 
   450 \[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r* \xrightarrow{\backslash b}\]
   481 \[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r* \xrightarrow{\backslash b}\]