ad hoc limits for bounded regular expressions: For example in the
regular expression matching library in the Go language and also in Google's RE2 library the regular expression
@{term "NTIMES a 1001"} is not permitted, because no counter can be
above 1000; and in the built-in regular expression library in Rust
expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error
message for being too big. Until recently,\footnote{Until version 1.5.4 of the regex
library in Rust; see also CVE-2022-24713.} however, Rust
happily generated an automaton for the regular expression
@{text "a\<^bsup>{0}{4294967295}\<^esup>"}. This was due to a bug
in the algorithm that decides when a regular expression is acceptable
or too big according to Rust's classification. We shall come back to
this example later in the paper.
These problems can of course be solved in matching
algorithms where automata go beyond the classic notion and for
instance include explicit counters (see for example~\cite{CountingSet2020}).
The point here is that Brzozowski derivatives and the algorithms by
Sulzmann and Lu can be straightforwardly extended to deal with
bounded regular expressions and moreover the resulting code
still consists of only simple recursive functions and inductive
datatypes. Finally, bounded regular expressions
do not destroy our finite boundedness property, which we shall
prove later on.

%, because during the lexing process counters will only be
%decremented.

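To make this point concrete, here is a Scala sketch of the extension
(our own illustration, not the Isabelle formalisation; all constructor
names are ours). The only additions for bounded regular expressions
are the two @{text NTIMES}-clauses at the end of the functions
@{text nullable} and @{text der}, which are introduced formally in a
moment; note that the counter is only ever decremented:

{\small
\begin{verbatim}
sealed trait Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches []
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp
case class NTIMES(r: Rexp, n: Int) extends Rexp  // bounded: r{n}

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
  case NTIMES(r1, n) => n == 0 || nullable(r1)   // r{0} matches []
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
  case NTIMES(r1, n) =>                          // counter decremented,
    if (n == 0) ZERO                             // never expanded
    else SEQ(der(c, r1), NTIMES(r1, n - 1))
}
\end{verbatim}}
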
Central to Brzozowski's regular expression matcher are two functions
called @{text nullable} and \emph{derivative}. The latter is written
Fig~\ref{POSIXrules}).

\begin{figure}[t]
\begin{center}\small%
\begin{tabular}{@ {\hspace{-2mm}}c@ {}}
\\[-8.5mm]
@{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad
@{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad
@{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad
@{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\
$\mprset{flushleft}
\inferrule
{@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
@{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
@{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}
{@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\<open>PS\<close>\medskip\\
@{thm[mode=Axiom] Posix.intros(7)}\<open>P[]\<close>\qquad
$\mprset{flushleft}
\inferrule
{@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
@{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
@{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
@{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
{@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\medskip\\
\mprset{sep=4mm}
@{thm[mode=Rule] Posix.intros(9)}\<open>Pn[]\<close>\quad\;
$\mprset{flushleft}
\inferrule
{@{thm (prem 1) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
@{thm (prem 2) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
@{thm (prem 3) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \\\\
\begin{tabular}{@ {}lcl@ {}}
@{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\
@{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
@{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}
\end{tabular}
%\end{tabular}
%\end{center}
\smallskip\\

%\begin{center}
%\begin{tabular}{@ {}cc@ {}}
\begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}}
@{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)[of "c"]}\\
@{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ &
@{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
@{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ &
@{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
%!
The function @{text mkeps} is run when the last derivative is nullable, that is,
when the string to be matched is in the language of the regular expression. It generates
a value for how the last derivative can match the empty string. In case
of @{term "NTIMES r n"} we use the function @{term replicate} in order to generate
a list of exactly @{term n} copies, which is the length of the list we expect in this
case. The injection function\footnote{While the character argument @{text c} is not
strictly necessary in the @{text inj}-function for the fragment of regular expressions we
use in this paper, it is necessary for extended regular expressions, for example for the range regular expression of the form @{text "[a-z]"}.
We therefore keep this argument from the original formulation of @{text inj} by Sulzmann and Lu.}
then calculates the corresponding value for each intermediate derivative until
a value for the original regular expression is generated.
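Building on the earlier sketch, a hypothetical Scala rendering of
@{text mkeps} is given below (value constructor names are ours;
@{text "List.fill"} plays the role of @{term replicate}; clauses for
non-nullable regular expressions are omitted since @{text mkeps} is
only ever called on nullable ones):

{\small
\begin{verbatim}
sealed trait Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val
case class Nt(vs: List[Val]) extends Val      // for NTIMES

def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
  case NTIMES(r1, n) => Nt(List.fill(n)(mkeps(r1)))  // n copies
}
\end{verbatim}}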
Graphically the algorithm by
Sulzmann and Lu can be illustrated by the following picture %in Figure~\ref{Sulz}
where the path from the left to the right involving @{term derivatives}/@{const
nullable} is the first phase of the algorithm (calculating successive
\Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
left, the second phase.
%
\begin{center}
\begin{tikzpicture}[scale=0.85,node distance=8mm,
                    every node/.style={minimum size=6mm}]
\node (r1)  {@{term "r\<^sub>1"}};
\node (r2) [right=of r1]{@{term "r\<^sub>2"}};
\draw[->,line width=1mm](r1)--(r2) node[above,midway] {@{term "der a DUMMY"}};
\node (r3) [right=of r2]{@{term "r\<^sub>3"}};
Sulzmann and Lu describe another algorithm that also generates POSIX
values but dispenses with the second phase where characters are
injected ``back'' into values. For this they annotate regular expressions
with bitcodes; we define these bitcoded regular expressions in Isabelle/HOL as the datatype\medskip

\begin{center}
%\noindent
%\begin{minipage}{0.9\textwidth}
\,@{term breg} $\,::=\,$ @{term "AZERO"}
$\,\mid$ @{term "AONE bs"}
$\,\mid$ @{term "ACHAR bs c"}
$\,\mid$ @{term "AALTs bs rs"}
$\,\mid$ @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}
$\,\mid$ \mbox{@{term "ANTIMES bs r n"}}%\hfill\mbox{}
%\end{minipage}\medskip
\end{center}

\noindent where @{text bs} stands for bitsequences; @{text r},
@{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular
expressions; and @{text rs} for lists of bitcoded regular
expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"}
\end{tabular}
\end{tabular}
\end{center}

\noindent
As can be seen, this coding is ``lossy'' in the sense that it records
neither character values nor sequence values explicitly (for
the latter it just appends two bitsequences). However, the
different alternatives for @{text Left}, respectively @{text Right}, are recorded as @{text Z} and
@{text S} followed by some bitsequence. Similarly, we use @{text Z} to indicate
if there is still a value coming in the list of @{text Stars}, whereas @{text S}
indicates the end of the list.
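A Scala sketch of this coding function (ours; @{text Z} and @{text S}
as the two bits, and we assume that values for @{term "NTIMES r n"}
are coded in the same way as those for @{text Stars}) makes the
lossiness explicit:

{\small
\begin{verbatim}
sealed trait Bit
case object Z extends Bit
case object S extends Bit

def code(v: Val): List[Bit] = v match {
  case Empty          => Nil                    // nothing recorded
  case Chr(_)         => Nil                    // characters dropped
  case Sequ(v1, v2)   => code(v1) ::: code(v2)  // just appended
  case Left(v1)       => Z :: code(v1)
  case Right(v1)      => S :: code(v1)
  case Stars(Nil)     => List(S)                // S ends the list
  case Stars(v :: vs) => Z :: (code(v) ::: code(Stars(vs)))
  case Nt(Nil)        => List(S)                // assumption: coded
  case Nt(v :: vs)    => Z :: (code(v) ::: code(Nt(vs)))  // like Stars
}
\end{verbatim}}

\noindent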
The lossiness makes the process of
decoding a bit more involved, but the point is that if we have a
is the same as if we first erase the bitcoded regular expression and
then perform the ``standard'' derivative operation.

\begin{lemma}\label{bnullable}%\mbox{}\smallskip\\
\textit{(1)} $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\
\mbox{\hspace{21.5mm}}\textit{(2)} $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
\mbox{\hspace{21.5mm}}\textit{(3)} $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$
%\begin{tabular}{ll}
%\textit{(1)} & $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\
%\textit{(2)} & $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
%\textit{(3)} & $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ \;provided\; $\textit{nullable}(r^\downarrow)$.
%\end{tabular}
\end{lemma}

%\begin{proof}
% All properties are by induction on annotated regular expressions.
cannot make any further simplifications. This is a problem because
the outermost alternative contains two copies of the same
regular expression (underlined with $r$). These copies will
spawn new copies in later derivative steps, and these in turn spawn even more copies. This
destroys any hope of taming the size of the derivatives. But the
second copy of $r$ in~\eqref{derivex} will never contribute to a
value, because POSIX lexing will always prefer matching a string
with the first copy. So it could be safely removed without affecting the correctness of the algorithm.
The issue with the simple-minded
simplification rules above is that the rule $r + r \Rightarrow r$
will never be applicable because, as can be seen in this example, the
@{term distinctWith} where @{text eq} is an equivalence that deletes bitsequences from bitcoded regular expressions
before comparing the components. One way to define this in Isabelle/HOL is by the following recursive function from
bitcoded regular expressions to @{text bool}:
%
\begin{center}
\begin{tabular}{@ {}cc@ {}}
\begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}}
@{thm (lhs) eq1.simps(1)} & $\dn$ & @{thm (rhs) eq1.simps(1)}\\
@{thm (lhs) eq1.simps(2)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(2)[of DUMMY DUMMY]}\\
@{thm (lhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]}\\
@{thm (lhs) eq1.simps(5)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(5)[of DUMMY DUMMY]}\\
\mbox{}
\end{tabular}
&
\begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}}
@{thm (lhs) eq1.simps(3)[of DUMMY "c" DUMMY "d"]} & $\dn$ & @{thm (rhs) eq1.simps(3)[of DUMMY "c" DUMMY "d"]}\\
@{thm (lhs) eq1.simps(4)[of DUMMY "r\<^sub>1\<^sub>1" "r\<^sub>1\<^sub>2" DUMMY "r\<^sub>2\<^sub>1" "r\<^sub>2\<^sub>2"]} & $\dn$ &
@{thm (rhs) eq1.simps(4)[of DUMMY "r\<^sub>1\<^sub>1" "r\<^sub>1\<^sub>2" DUMMY "r\<^sub>2\<^sub>1" "r\<^sub>2\<^sub>2"]}\\
@{thm (lhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]}\\
@{thm (lhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]} & $\dn$ & \\
\multicolumn{3}{r}{@{thm (rhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]}}
\end{tabular}
\end{tabular}
\end{center}

\noindent
where all other cases are set to @{text False}.
but is needed in order to make the removal of unnecessary copies
work properly.
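
For illustration, @{term distinctWith} itself can be rendered
generically in Scala as follows (a sketch under our naming; the
accumulator @{text acc} records the elements seen so far, compared via
@{text eq}):

{\small
\begin{verbatim}
def distinctWith[A](xs: List[A], eq: (A, A) => Boolean,
                    acc: Set[A] = Set()): List[A] = xs match {
  case Nil => Nil
  case x :: rest =>
    if (acc.exists(y => eq(x, y)))
      distinctWith(rest, eq, acc)            // duplicate of an earlier
    else                                     // element: drop x
      x :: distinctWith(rest, eq, acc + x)   // otherwise keep x
}
\end{verbatim}}

\noindent
With @{text eq} instantiated to the equivalence above, copies that
differ only in their bitsequences are removed.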

Our simplification function depends on three more helper functions; one is called
@{text flts} and analyses lists of regular expressions coming from alternatives.
It is defined by four clauses as follows:

\begin{center}
\begin{tabular}{c}
@{thm (lhs) flts.simps(1)} $\dn$ @{thm (rhs) flts.simps(1)} \qquad
@{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)} \qquad
@{text "flts ((ALTs bs' rs') :: rs)"}
%@{ thm (lhs) flts.simps(3)[of "bs'" "rs'"]}
$\dn$ @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\smallskip\\
@{text "flts (r :: rs)"} $\dn$ @{text "r :: flts rs"}\hspace{5mm}(otherwise)
\end{tabular}
\end{center}

\noindent
The second clause of @{text flts} removes all instances of @{text ZERO} in alternatives and
the third ``de-nests'' alternatives (but retains the
bitsequence @{text "bs'"} accumulated in the inner alternative). There are
some corner cases to be considered when the resulting list inside an alternative is
empty or a singleton list. We take care of those cases in the
@{text "bsimpALTs"} function; similarly we define a helper function that simplifies
sequences according to the usual rules about @{text ZERO}s and @{text ONE}s:
expression anymore. So unless one can somehow
synchronise the change in the simplified regular expressions with
the original POSIX value, there is no hope of appealing to @{text retrieve} in the
correctness argument for @{term blexer_simp}.

For our proof we found it more helpful to introduce the rewriting systems shown in
Fig~\ref{SimpRewrites}. The idea is to generate
simplified regular expressions in small steps (unlike the @{text bsimp}-function which
does the same in a big step), and show that each of
the small steps preserves the bitcodes that lead to the POSIX value.
The rewrite system is organised such that $\leadsto$ is for bitcoded regular
where in (1) the set $\textit{Suf}(@{text "r"}_1, s)$ contains all the suffixes $s'$ of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable.
In (3) we know that $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is
bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands
for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
We reason similarly for @{text STAR} and @{text NT}.\smallskip
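
Put schematically (our paraphrase of how the estimates from Steps (3)
and (5) combine in the sequence case; the constant $c$ accounts for the
outer alternative node and is immaterial for finiteness):
%
\[
\llbracket@{term "bders_simp (ASEQ bs r\<^sub>1 r\<^sub>2) s"}\rrbracket \;\leq\; (N_1 + \llbracket{}r_2\rrbracket + 1) \,+\, l_{N_2}\cdot N_2 \,+\, c
\]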

Clearly, in this finiteness argument (Step (5)) we give a very loose bound that is
far from the actual bound we can expect. We can do better than this, but this does not improve
the finiteness property we are proving. If we are interested in a polynomial bound,

\noindent
Essentially it matches the string with the longer @{text "Right"}-alternative in the
first sequence (and then the `rest' with the empty regular expression @{const ONE} from the second sequence).
If we add the simplification above, then we obtain the following value
@{term "Seq (Left (Char a)) (Left (Char b))"}
where the @{text "Left"}-alternatives get priority. However, this violates
the POSIX rules and we have not been able to
resolve this problem. Therefore we leave better bounds for future work.%!\\[-6.5mm]

Note that while Antimirov was able to give a bound on the \emph{size}
of his partial derivatives~\cite{Antimirov95}, Brzozowski gave a bound
on the \emph{number} of derivatives, provided they are quotiented via
ACI rules \cite{Brzozowski1964}. Brzozowski's result is crucial when one
uses his derivatives for obtaining a DFA (it essentially bounds
the number of states). However, this result does \emph{not}
transfer to our setting where we are interested in the \emph{size} of the
derivatives. For example it is \emph{not} true for our derivatives that the
set of derivatives $r \backslash s$ for a given $r$ and all strings
$s$ is finite (even when simplified). This is because for our annotated regular expressions
the bitcode annotation is dependent on the number of iterations that
are needed for @{text STAR}-regular expressions. This is not a problem for us: since we intend to do lexing
by calculating (as fast as possible) derivatives, the bound on the size
of the derivatives is important, not their number. % of derivatives.


*}


the correctness of the first algorithm
\cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
introduce our own specification about what POSIX values are,
because the informal definition given by Sulzmann and Lu did not
stand up to formal proof. Also for the second algorithm we needed
to introduce our own definitions and proof ideas in order to
establish the correctness. Our interest in the second algorithm
lies in the fact that by using bitcoded regular expressions and an
aggressive simplification method there is a chance that the
derivatives can be kept universally small (we established in this
paper that for a given $r$ they can be kept finitely bounded for
all strings). Our formalisation is approximately 7500 lines of
Isabelle code. A little more than half of this code concerns the finiteness bound
obtained in Section 5. This slight ``bloat'' in the latter part is because
we had to introduce an intermediate datatype for annotated regular expressions and repeat many
definitions for this intermediate datatype. But overall this made our
formalisation work smoother.
%This is important if one is after
%an efficient POSIX lexing algorithm based on derivatives.

Having proved the correctness of the POSIX lexing algorithm, which
lessons have we learned? Well, we feel this is a very good example
obscure, examples.
%We found that from an implementation
%point-of-view it is really important to have the formal proofs of
%the corresponding properties at hand.

With the results reported here, we can of course only make a claim
about the correctness of the algorithm and the sizes of the
derivatives, not about the efficiency or runtime of our version of
Sulzmann and Lu's algorithm. But we found the size is an important
first indicator about efficiency: clearly if the derivatives can
grow to arbitrarily big sizes and the algorithm needs to traverse
the derivatives possibly several times, then the algorithm will be
slow---excruciatingly slow that is. Other works seem to make
stronger claims, but during our formalisation work we have
developed a healthy suspicion when, for example, experimental data is
used to back up efficiency claims. For instance Sulzmann and Lu
write about their equivalent of @{term blexer_simp} \textit{``...we
can incrementally compute bitcoded parse trees in linear time in
the size of the input''} \cite[Page 14]{Sulzmann2014}. Given the
growth of the derivatives in some cases even after aggressive
simplification, this is a hard-to-believe claim. A similar claim
about a theoretical runtime of @{text "O(n\<^sup>2)"} for one
specific list of regular expressions is made for the Verbatim
lexer, which calculates tokens according to POSIX
rules~\cite{verbatim}. For this, Verbatim uses Brzozowski's
derivatives like in our work. About their empirical data, the authors write:
\textit{``The results of our empirical tests [..] confirm that
Verbatim has @{text "O(n\<^sup>2)"} time complexity.''}
\cite[Section~VII]{verbatim}. While their correctness proof for
Verbatim is formalised in Coq, the claim about the runtime
complexity is only supported by some empirical evidence obtained by
using the code extraction facilities of Coq. Given our observation
with the ``growth problem'' of derivatives, this runtime bound
is unlikely to hold universally: indeed we tried out their extracted OCaml
code with the example \mbox{@{text "(a + aa)\<^sup>*"}} as a single
lexing rule, and it took us around 5 minutes to tokenise a
string of 40 $a$'s and that increased to approximately 19 minutes
when the string was 50 $a$'s long. Taking into account that
derivatives are not simplified in the Verbatim lexer, such numbers
are not surprising. Clearly our result of having finite
derivatives might sound rather weak in this context but we think
such efficiency claims really require further scrutiny. The
contribution of this paper is to make sure derivatives do not grow
arbitrarily big (universally). In the example \mbox{@{text "(a +
aa)\<^sup>*"}}, \emph{all} derivatives have a size of 17 or
less. The result is that lexing a string of, say, 50\,000 $a$'s with
the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes
approximately 10 seconds with our Scala implementation of the
presented algorithm.

derivatives is never greater than 5 in this example. Even in the
example from Section 2, where Rust raises an error message, namely
\mbox{@{text "a\<^bsup>{1000}{100}{5}\<^esup>"}}, the maximum size for
our derivatives is a moderate 14.

Let us also return to the example @{text
"a\<^bsup>{0}{4294967295}\<^esup>"}, which until recently Rust
deemed acceptable. But this was due to a bug. It turns out that it took Rust
more than 11 minutes to generate an automaton for this regular
expression and then to determine that a string of just one(!)
@{text a} does \emph{not} match this regular expression. Therefore
it is probably a wise choice that in newer versions of Rust's
regular expression library such regular expressions are declared as
``too big''. While this is clearly
a contrived example, the safety guarantees Rust wants to provide necessitate
this conservative approach.
However, with the derivatives and simplifications we have shown
in this paper, the example can be solved with ease:
it essentially only involves operations on
integers and our Scala implementation takes only a few seconds to
find out that this string, or even much larger strings, do not match.

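For illustration, with the sketch from Section 2 (hypothetically
widening the counter field of @{text NTIMES} to @{text Long} so that
the bound fits), the check is immediate:

{\small
\begin{verbatim}
// assumes NTIMES(r: Rexp, n: Long) from the earlier sketch,
// with List.fill adapted accordingly where needed
val r = NTIMES(NTIMES(CHAR('a'), 0), 4294967295L) // a{0}{4294967295}
val d = der('a', r)  // one step: SEQ(ZERO, NTIMES(..., 4294967294))
println(nullable(d)) // false -- the single 'a' does not match
\end{verbatim}}
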
Let us also compare our work to the verified Verbatim++ lexer where
the authors of the Verbatim lexer introduced a number of
improvements and optimisations, for example memoisation
\cite{verbatimpp}. However, unlike Verbatim, which works with
derivatives like in our work, Verbatim++ first compiles a regular
translation using the traditional notion of DFAs so that we can improve on this. Therefore we
prefer to stick with calculating derivatives, but attempt to make
this calculation (in the future) as fast as possible. What we can guarantee
with the presented work is that the maximum size of the derivatives
for \mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}$\,\cdot\,$@{term "STAR a"}} is never bigger than 9. This means our Scala
implementation again only needs a few seconds for this example and matching 50\,000 $a$'s, say.
%
%
%Possible ideas are
%zippers which have been used in the context of parsing of
%context-free grammars \cite{zipperparser}.
\\[-4mm] %\smallskip

\noindent
Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
\\[-10mm]


%%\bibliographystyle{plain}
\bibliography{root}
*}