41 %\theoremstyle{lemma} |
44 %\theoremstyle{lemma} |
42 %\newtheorem{lemma}{Lemma} |
45 %\newtheorem{lemma}{Lemma} |
43 %\newcommand{\lemmaautorefname}{Lemma} |
46 %\newcommand{\lemmaautorefname}{Lemma} |
44 %\theoremstyle{definition} |
47 %\theoremstyle{definition} |
45 %\newtheorem{definition}{Definition} |
48 %\newtheorem{definition}{Definition} |
|
49 \algnewcommand\algorithmicswitch{\textbf{switch}} |
|
50 \algnewcommand\algorithmiccase{\textbf{case}} |
|
51 \algnewcommand\algorithmicassert{\texttt{assert}} |
|
52 \algnewcommand\Assert[1]{\State \algorithmicassert(#1)}% |
|
53 % New "environments" |
|
54 \algdef{SE}[SWITCH]{Switch}{EndSwitch}[1]{\algorithmicswitch\ #1\ \algorithmicdo}{\algorithmicend\ \algorithmicswitch}% |
|
55 \algdef{SE}[CASE]{Case}{EndCase}[1]{\algorithmiccase\ #1}{\algorithmicend\ \algorithmiccase}% |
|
56 \algtext*{EndSwitch}% |
|
57 \algtext*{EndCase}% |
46 |
58 |
47 |
59 |
48 \begin{document} |
60 \begin{document} |
49 |
61 |
50 \maketitle |
62 \maketitle |
313 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\ |
325 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\ |
314 $r \backslash \epsilon $ & $\dn$ & $r$ |
326 $r \backslash \epsilon $ & $\dn$ & $r$ |
315 \end{tabular} |
327 \end{tabular} |
316 \end{center} |
328 \end{center} |
317 |
329 |
318 |
|
319 We obtain a simple and elegant regular |
|
320 expression matching algorithm: |
|
321 \begin{definition}{matcher} |
|
322 \[ |
|
323 match\;s\;r \;\dn\; nullable(r\backslash s) |
|
324 \] |
|
325 \end{definition} |
|
326 |
|
327 For us the main advantage is that derivatives can be |
330 For us the main advantage is that derivatives can be |
328 straightforwardly implemented in any functional programming language, |
331 straightforwardly implemented in any functional programming language, |
329 and are easily definable and reasoned about in theorem provers---the |
332 and are easily definable and reasoned about in theorem provers---the |
330 definitions just consist of inductive datatypes and simple recursive |
333 definitions just consist of inductive datatypes and simple recursive |
331 functions. Moreover, the notion of derivatives can be easily |
334 functions. Moreover, the notion of derivatives can be easily |
332 generalised to cover extended regular expression constructors such as |
335 generalised to cover extended regular expression constructors such as |
333 the not-regular expression, written $\neg\,r$, or bounded repetitions |
336 the not-regular expression, written $\neg\,r$, or bounded repetitions |
334 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so |
337 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so |
335 straightforwardly realised within the classic automata approach. |
338 straightforwardly realised within the classic automata approach. |
|
339 |
|
340 |
|
341 We obtain a simple and elegant regular |
|
342 expression matching algorithm: |
|
343 \begin{definition}{matcher} |
|
344 \[ |
|
345 match\;s\;r \;\dn\; nullable(r\backslash s) |
|
346 \] |
|
347 \end{definition} |
|
348 |
336 |
349 |
337 |
350 |
338 One limitation, however, of Brzozowski's algorithm is that it only |
351 One limitation, however, of Brzozowski's algorithm is that it only |
339 produces a YES/NO answer for whether a string is being matched by a |
352 produces a YES/NO answer for whether a string is being matched by a |
340 regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this |
353 regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this |
366 \end{tabular} |
379 \end{tabular} |
367 \end{tabular} |
380 \end{tabular} |
368 \end{center} |
381 \end{center} |
369 |
382 |
370 \noindent |
383 \noindent |
371 Here we put the regular expression and values of the same shape on the same level to illustrate the corresponding relation between them. \\ |
384 Here we put the regular expression and values of the same shape on the same level to illustrate the corresponding relation between them. |
372 Values are a way of expressing parse trees(the tree structure that tells how a sub-regex matches a substring). For example, $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$. The flatten notation $| v |$ means extracting the characters in the value $v$ to form a string. For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$.\\ |
385 |
|
386 Values are a way of expressing parse trees(the tree structure that tells how a sub-regex matches a substring). For example, $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$. The flatten notation $| v |$ means extracting the characters in the value $v$ to form a string. For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$. |
373 |
387 |
374 To give a concrete example of how value works, consider the string $xy$ and the |
388 To give a concrete example of how value works, consider the string $xy$ and the |
375 regular expression $(x + (y + xy))^*$. We can view this regular |
389 regular expression $(x + (y + xy))^*$. We can view this regular |
376 expression as a tree and if the string $xy$ is matched by two Star |
390 expression as a tree and if the string $xy$ is matched by two Star |
377 ``iterations'', then the $x$ is matched by the left-most alternative |
391 ``iterations'', then the $x$ is matched by the left-most alternative |
399 expression. |
413 expression. |
400 |
414 |
401 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
415 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
402 algorithm by a second phase (the first phase being building successive |
416 algorithm by a second phase (the first phase being building successive |
403 derivatives). In this second phase, for every successful match the |
417 derivatives). In this second phase, for every successful match the |
404 corresponding POSIX value is computed. The whole process looks like this diagram:\\ |
418 corresponding POSIX value is computed. The whole process looks like the following diagram(the working flow of the simple matching algorithm that just gives a $YES/NO$ answer is provided on the right for the purpose of comparison):\\ |
405 \begin{tikzcd} |
419 \begin{tikzcd} |
406 r_0 \arrow[r, "c_0"] \arrow[d] & r_1 \arrow[r, "c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
420 r_0 \arrow[r, "c_0"] \arrow[d] & r_1 \arrow[r, "c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
407 v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] |
421 v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] |
408 \end{tikzcd} |
422 \end{tikzcd} |
|
423 |
409 We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$. |
424 We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$. |
410 First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ... until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for the empty word epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string: |
425 First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ... until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for the empty word epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string: |
411 |
426 |
412 After this, we inject back the characters one by one, in reverse order as they were chopped off, to build the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$($s_i$ means the string s with the first $i$ characters being chopped off) from the previous parse tree. After $n$ transformations, we get the parse tree for how $r_0$ matches $s$, exactly as we wanted. |
427 After this, we inject back the characters one by one, in reverse order as they were chopped off, to build the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$($s_i$ means the string s with the first $i$ characters being chopped off) from the previous parse tree. After $n$ transformations, we get the parse tree for how $r_0$ matches $s$, exactly as we wanted. |
413 An inductive proof can be routinely established. |
428 An inductive proof can be routinely established. |
442 initial results in this regard have been obtained in |
457 initial results in this regard have been obtained in |
443 \cite{AusafDyckhoffUrban2016}. However, what has not been achieved yet |
458 \cite{AusafDyckhoffUrban2016}. However, what has not been achieved yet |
444 is a very tight bound for the size. Such a tight bound is suggested by |
459 is a very tight bound for the size. Such a tight bound is suggested by |
445 work of Antimirov who proved that (partial) derivatives can be bound |
460 work of Antimirov who proved that (partial) derivatives can be bound |
446 by the number of characters contained in the initial regular |
461 by the number of characters contained in the initial regular |
447 expression \cite{Antimirov95}. We believe, and have generated test |
462 expression \cite{Antimirov95}. |
448 data, that a similar bound can be obtained for the derivatives in |
463 |
449 Sulzmann and Lu's algorithm. Let us give some details about this next. |
464 Antimirov defined the "partial derivatives" of regular expressions to be this: |
|
465 %TODO definition of partial derivatives |
|
466 |
|
467 it is essentially a set of regular expressions that come from the sub-structure of the original regular expression. |
|
468 Antimirov has proved a nice size bound of the size of partial derivatives. Roughly speaking the size will not exceed the fourth power of the number of nodes in that regular expression. Interestingly, we observed from experiment that after the simplification step, our regular expression has the same size or is smaller than the partial derivatives. This allows us to prove a tight bound on the size of regular expression during the running time of the algorithm if we can establish the connection between our simplification rules and partial derivatives. |
|
469 |
|
470 %We believe, and have generated test |
|
471 %data, that a similar bound can be obtained for the derivatives in |
|
472 %Sulzmann and Lu's algorithm. Let us give some details about this next. |
450 |
473 |
451 We first followed Sulzmann and Lu's idea of introducing |
474 We first followed Sulzmann and Lu's idea of introducing |
452 \emph{annotated regular expressions}~\cite{Sulzmann2014}. They are |
475 \emph{annotated regular expressions}~\cite{Sulzmann2014}. They are |
453 defined by the following grammar: |
476 defined by the following grammar: |
454 |
477 |
514 is to allow a list of annotated regular expressions in the |
537 is to allow a list of annotated regular expressions in the |
515 \textit{ALTS} constructor. This allows us to not just delete |
538 \textit{ALTS} constructor. This allows us to not just delete |
516 unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also |
539 unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also |
517 unnecessary ``copies'' of regular expressions (very similar to |
540 unnecessary ``copies'' of regular expressions (very similar to |
518 simplifying $r + r$ to just $r$, but in a more general |
541 simplifying $r + r$ to just $r$, but in a more general |
519 setting). Another modification is that we use simplification rules |
542 setting). |
|
543 A psuedocode version of our algorithm is given below:\\ |
|
544 |
|
545 \begin{algorithm} |
|
546 \caption{simplification of annotated regular expression}\label{euclid} |
|
547 \begin{algorithmic}[1] |
|
548 \Procedure{$Simp$}{$areg$} |
|
549 \Switch{$areg$} |
|
550 \Case{$ALTS(bs, rs)$} |
|
551 \For{\textit{rs[i] in array rs}} |
|
552 \State $\textit{rs[i]} \gets$ \textit{Simp(rs[i])} |
|
553 \EndFor |
|
554 \For{\textit{rs[i] in array rs}} |
|
555 \If{$rs[i] == ALTS(bs', rs')$} |
|
556 \State $rs'' \gets$ attach bits $bs'$ to all elements in $rs'$ |
|
557 \State Insert $rs''$ into $rs$ at position $i$ ($rs[i]$ is destroyed, replaced by its list of children regular expressions) |
|
558 \EndIf |
|
559 \EndFor |
|
560 \State Remove all duplicates in $rs$, only keeping the first copy for multiple occurrences of the same regular expression |
|
561 \State Remove all $0$s in $rs$ |
|
562 \If{$ rs.length == 0$} \Return $ZERO$ \EndIf |
|
563 \If {$ rs.length == 1$} \Return$ rs[0] $\EndIf |
|
564 \EndCase |
|
565 \Case{$SEQ(bs, r_1, r_2)$} |
|
566 \If{$ r_1$ or $r_2$ is $ZERO$} \Return ZERO \EndIf |
|
567 \State update $r_1$ and $r_2$ by attaching $bs$ to their front |
|
568 \If {$r_1$ or $r_2$ is $ONE(bs')$} \Return $r_2$ or $r_1$ \EndIf |
|
569 \EndCase |
|
570 \Case{$Others$} |
|
571 \Return $areg$ as it is |
|
572 \EndCase |
|
573 \EndSwitch |
|
574 \EndProcedure |
|
575 \end{algorithmic} |
|
576 \end{algorithm} |
|
577 |
|
578 |
|
579 Another modification is that we use simplification rules |
520 inspired by Antimirov's work on partial derivatives. They maintain the |
580 inspired by Antimirov's work on partial derivatives. They maintain the |
521 idea that only the first ``copy'' of a regular expression in an |
581 idea that only the first ``copy'' of a regular expression in an |
522 alternative contributes to the calculation of a POSIX value. All |
582 alternative contributes to the calculation of a POSIX value. All |
523 subsequent copies can be pruned from the regular expression. |
583 subsequent copies can be pruned from the regular expression. |
|
584 |
524 |
585 |
525 We are currently engaged with proving that our simplification rules |
586 We are currently engaged with proving that our simplification rules |
526 actually do not affect the POSIX value that should be generated by the |
587 actually do not affect the POSIX value that should be generated by the |
527 algorithm according to the specification of a POSIX value and |
588 algorithm according to the specification of a POSIX value and |
528 furthermore that our derivatives stay small for all derivatives. For |
589 furthermore that our derivatives stay small for all derivatives. For |