regular expression matches a string, but, in case it matches, also
\emph{how} it matches the string. This is important for
applications such as lexing (tokenising a string). The problem is to
make the algorithm by Sulzmann and Lu fast on all inputs without
breaking its correctness. We have already developed some
simplification rules for this, but have not yet proved that they
preserve the correctness of the algorithm. We also have not yet
looked at extended regular expressions, such as bounded repetitions,
negation and back-references.
\end{abstract}

frequent enough that a separate name has been created for
them---\emph{evil regular expressions}. In empirical work, Davis et
al.\ report that they have found thousands of such evil regular
expressions in the JavaScript and Python ecosystems \cite{Davis18}.

This exponential blowup sometimes causes pain in real life:
for example, on 20 July 2016, one evil regular expression brought the
webpage \href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources and, because of this, the web servers
ground to a halt. This happened when a post with 20,000 white spaces
was submitted, but importantly the white spaces were neither at the
The underlying problem is that many ``real life'' regular expression
matching engines do not use DFAs for matching. This is because they
support regular expressions that are not covered by the classical
automata theory, and in this more general setting there are still quite
a few unanswered research questions and fast algorithms still need to
be developed (for example, how to include bounded repetitions, negation
and back-references).

There is also another under-researched problem to do with regular
expressions and lexing, i.e.~the process of breaking up strings into
sequences of tokens according to some regular expressions. In this
setting one is not just interested in whether or not a regular
tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
``fire''---so is it an identifier or a keyword? While in applications
there is a well-known strategy to decide these questions, called POSIX
matching, only relatively recently precise definitions of what POSIX
matching actually means have been formalised
\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}.
Such a definition has also been given by Sulzmann and Lu
\cite{Sulzmann2014}, but the corresponding correctness proof turned out
to be faulty \cite{AusafDyckhoffUrban2016}. Roughly, POSIX matching
means matching the longest initial substring. In the case of a tie, the
initial submatch is chosen according to some priorities attached to the
regular expressions (e.g.~keywords have a higher priority than
identifiers). This sounds rather simple, but according to Grathwohl et
al.\ \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:

Our focus is on an algorithm introduced by Sulzmann and Lu in 2014 for
regular expression matching according to the POSIX strategy
\cite{Sulzmann2014}. Their algorithm is based on an older algorithm by
Brzozowski from 1964 where he introduced the notion of derivatives of
regular expressions \cite{Brzozowski1964}. We shall briefly explain
these algorithms next.

\section{The Algorithm by Brzozowski based on Derivatives of Regular
  Expressions}

Suppose (basic) regular expressions are given by the following grammar:
\[ r ::= \ZERO \mid \ONE
         \mid c
         \mid r_1 \cdot r_2
         \mid r_1 + r_2
         \mid r^*
\]

\noindent
The intended meaning of the constructors is as follows: $\ZERO$
cannot match any string, $\ONE$ can match the empty string, the
character regular expression $c$ can match the character $c$, and so
on.

The brilliant contribution by Brzozowski is the notion of
                      & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
  $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}

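To make these definitions more concrete, below is a small sketch in
Scala of how the regular expressions and the derivative operation can be
implemented (the datatype and function names are ours; the clauses for
the cases not shown in the table above follow Brzozowski's standard
definition):

\begin{verbatim}
// regular expressions as an algebraic datatype
abstract class Rexp
case object ZERO extends Rexp                    // matches no string
case object ONE extends Rexp                     // matches the empty string
case class CHAR(c: Char) extends Rexp            // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // r*

// whether a regular expression can match the empty string
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// derivative of a regular expression with respect to a character
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}
\end{verbatim}
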
\noindent
The main property of the derivative operation is that

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}

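\noindent
As a small illustration of this property, consider $r = (a\cdot b) + b$.
Using the derivative clauses (including the ones for characters and
alternatives, which are not all shown above) we obtain
\[
r\backslash a \;=\; (\ONE\cdot b) + \ZERO
\]
and indeed $ab \in L(r)$ holds, just as $b \in L(r\backslash a)$ does.
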
\noindent
For us the main advantage is that derivatives can be
straightforwardly implemented in any functional programming language,
and are easily definable and reasoned about in theorem provers---the
definitions just consist of inductive datatypes and simple recursive
functions. Moreover, the notion of derivatives can be easily
generalised to cover extended regular expression constructors such as
expression $r$, build the derivatives of $r$ w.r.t.\ (in succession)
all the characters of the string $s$. Finally, test whether the
resulting regular expression can match the empty string. If yes, then
$r$ matches $s$; otherwise it does not. To implement this idea
we can generalise the derivative operation to strings like this:

\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
Using this definition we obtain a simple and elegant regular
expression matching algorithm:
\[
match\;s\;r \;\dn\; nullable(r\backslash s)
\]

\noindent
This algorithm can be illustrated as follows:
\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
relatively easily shown that this matcher is correct (that is, given
$s$ and $r$, it generates YES if and only if $s \in L(r)$).
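
In the Scala sketch from above, the derivative with respect to a string
and this matcher can be implemented as follows (again, the function
names ders and matches are ours):

\begin{verbatim}
// derivative with respect to a string (list of characters)
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

// the matcher: does r match the string s?
def matches(s: String, r: Rexp): Boolean =
  nullable(ders(s.toList, r))

// example: matches("ab", SEQ(CHAR('a'), CHAR('b'))) evaluates to true
\end{verbatim}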
|
\section{Values and the Algorithm by Sulzmann and Lu}

One limitation, however, of Brzozowski's algorithm is that it only
produces a YES/NO answer for whether a string is being matched by a
regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this
algorithm to allow generation of an actual matching, called a
\emph{value}. Values and regular expressions correspond to each
other as illustrated in the following table:

\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
\end{tabular}
\end{tabular}
\end{center}

\noindent
The idea of values is to express parse trees. Suppose there is a flatten
operation, written $|v|$, which we can use to extract the underlying
string of $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \,
(\textit{Char y})|$ is the string $xy$. We omit the straightforward
definition of flatten. Using flatten, we can describe how values encode
parse trees: $\Seq\,v_1\, v_2$ tells us how the string $|v_1| @ |v_2|$
matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and,
respectively, $r_2$ matches $|v_2|$. Exactly how these two are matched
is contained in the sub-structure of $v_1$ and $v_2$.
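
In the Scala sketch, values and the flatten operation can be represented
as follows (we rename $\Seq$ and $\Char$ to Sequ and Chr in order to
avoid clashes with Scala's built-in names; the function name flat is
ours):

\begin{verbatim}
// values record how a regular expression matched a string
abstract class Val
case object Empty extends Val                 // for ONE
case class Chr(c: Char) extends Val           // for CHAR(c)
case class Sequ(v1: Val, v2: Val) extends Val // for SEQ
case class Left(v: Val) extends Val           // left alternative of ALT
case class Right(v: Val) extends Val          // right alternative of ALT
case class Stars(vs: List[Val]) extends Val   // for STAR

// the flatten operation |v|: the underlying string of a value
def flat(v: Val): String = v match {
  case Empty => ""
  case Chr(c) => c.toString
  case Sequ(v1, v2) => flat(v1) + flat(v2)
  case Left(v) => flat(v)
  case Right(v) => flat(v)
  case Stars(vs) => vs.map(flat).mkString
}
\end{verbatim}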
|
To give a concrete example of how values work, consider the string $xy$
and the regular expression $(x + (y + xy))^*$. We can view this regular
expression as a tree and if the string $xy$ is matched by two Star
``iterations'', then the $x$ is matched by the left-most alternative in
this tree and the $y$ by the left alternative inside the right
alternative. This suggests to record this matching as

\begin{center}
$\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
\end{center}

and $\Seq$ indicates that $xy$ is matched by a sequence regular
expression.

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
is generated, assuming the regular expression matches the string.
Pictorially, the algorithm works as follows:

\begin{center}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
\end{tikzcd}
\end{center}

\noindent
We shall briefly explain this algorithm. For the convenience of
explanation, we use the following notations: the regular expression we
start with is $r_0$ and the string $s$ is composed of the characters
$c_0 c_1 \ldots c_n$. First, we build the derivatives $r_1$, $r_2$,
\ldots, using the characters $c_0$, $c_1$, \ldots until we exhaust the
string and arrive at the derivative $r_n$. We test whether this
derivative is $\textit{nullable}$ or not. If not, we know immediately
that the string does not match $r_0$ and no value needs to be generated.
If yes, we start building the parse tree incrementally by
\emph{injecting} back the characters into the values $v_n, \ldots, v_0$.
We first call the function $\textit{mkeps}$, which builds the parse tree
for how the final (nullable) derivative $r_n$ matches the empty string.
This function is defined as

\begin{center}
$\textit{mkeps}\,(\ONE) \;\dn\; \textit{Empty}$\\
\ldots
\end{center}

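\noindent
In the Scala sketch, the full definition of $\textit{mkeps}$ might look
as follows (a sketch based on the standard clauses, which prefer the
left alternative whenever it is nullable; $\textit{mkeps}$ is only ever
called on nullable regular expressions):

\begin{verbatim}
// a value for how a nullable regular expression matches the empty string
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}
\end{verbatim}
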
After this, we inject back the characters one by one, in the reverse
order to that in which they were chopped off, in order to build the
parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$ ($s_i$
means the string $s$ with the first $i$ characters chopped off) from the
previous parse tree. After $n$ transformations, we obtain the parse tree
for how $r_0$ matches $s$, exactly as we wanted. An inductive proof can
be routinely established.

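To make this second phase concrete, here is a sketch in the Scala
notation from above of how the injection function and the overall lexer
might be implemented (the names inj and lexer are ours; the clauses are
our rendering of the standard injection rules):

\begin{verbatim}
// inject the character c back into a value for der(c, r)
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(_), Empty) => Chr(c)
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
}

// derivatives forwards, injections backwards;
// assumes that r matches s (otherwise a MatchError is raised)
def lexer(r: Rexp, s: List[Char]): Val = s match {
  case Nil => mkeps(r)
  case c :: cs => inj(r, c, lexer(der(c, r), cs))
}
\end{verbatim}
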
It is instructive to see how the algorithm works on a little example.
Suppose we have the regular expression $(a+b+ab+c+abc)^*$ and we want to
match it against the string $abc$. By the POSIX rules the lexer should
go for the longest matching, i.e.~it should match the string $abc$ in
one star iteration, using the longest alternative $abc$ in the
sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this sub-expression
for conciseness). Here is how the lexer achieves a parse tree for this
matching. First, we build successive derivatives until we exhaust the
string, as illustrated below (we omit some parentheses for better
readability):

\[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^* \xrightarrow{\backslash b}\]