254 \cite{AusafDyckhoffUrban2016}). |
257 \cite{AusafDyckhoffUrban2016}). |
255 |
258 |
256 In our work here we also add to the usual ``basic'' regular |
259 In our work here we also add to the usual ``basic'' regular |
257 expressions the \emph{bounded} regular expression @{term "NTIMES r |
260 expressions the \emph{bounded} regular expression @{term "NTIMES r |
258 n"} where the @{term n} specifies that @{term r} should match |
261 n"} where the @{term n} specifies that @{term r} should match |
259 exactly @{term n}-times. Again for brevity we omit the other bounded |
262 exactly @{term n}-times (it is not included in Sulzmann and Lu's original work). For brevity we omit the other bounded |
260 regular expressions @{text "r"}$^{\{..n\}}$, @{text "r"}$^{\{n..\}}$ |
263 regular expressions @{text "r"}$^{\{..\textit{n}\}}$, |
261 and @{text "r"}$^{\{n..m\}}$ which specify intervals for how many |
264 @{text "r"}$^{\{\textit{n}..\}}$ |
|
265 and @{text "r"}$^{\{\textit{n}..\textit{m}\}}$ which specify intervals for how many |
262 times @{text r} should match. The results presented in this paper |
266 times @{text r} should match. The results presented in this paper |
263 extend straightforwardly to them too. The importance of the bounded |
267 extend straightforwardly to them too. The importance of the bounded |
264 regular expressions is that they are often used in practical |
268 regular expressions is that they are often used in practical |
265 applications, such as Snort (a system for detecting network |
269 applications, such as Snort (a system for detecting network |
266 intrusions) and also in XML Schema definitions. According to Bj\"{o}rklund et |
270 intrusions) and also in XML Schema definitions. According to Bj\"{o}rklund et |
267 al~\cite{BjorklundMartensTimm2015}, bounded regular expressions |
271 al~\cite{BjorklundMartensTimm2015}, bounded regular expressions |
268 occur frequently in the latter and can have counters of up to |
272 occur frequently in the latter and can have counters of up to |
269 ten million. The problem is that tools based on the classic notion |
273 ten million. The problem is that tools based on the classic notion |
270 of automata need to expand @{text "r"}$^{\{n\}}$ into @{text n} |
274 of automata need to expand @{text "r"}$^{\{\textit{n}\}}$ into @{text n} |
271 connected copies of the automaton for @{text r}. This leads to very |
275 connected copies of the automaton for @{text r}. This leads to very |
272 inefficient matching algorithms or algorithms that consume large |
276 inefficient matching algorithms or algorithms that consume large |
273 amounts of memory. A classic example is the regular expression |
277 amounts of memory. A classic example is the regular expression |
274 \mbox{@{term "SEQ (SEQ (STAR (ALT a b)) a) (NTIMES (ALT a b) n)"}} |
278 \mbox{@{term "SEQ (SEQ (STAR (ALT a b)) a) (NTIMES (ALT a b) n)"}} |
275 where the minimal DFA requires at least $2^{n + 1}$ states (see |
279 where the minimal DFA requires at least $2^{n + 1}$ states (see |
276 \cite{CountingSet2020}). Therefore regular expression matching |
280 \cite{CountingSet2020}). Therefore regular expression matching |
277 libraries that rely on the classic notion of DFAs often impose |
281 libraries that rely on the classic notion of DFAs often impose |
278 adhoc limits for bounded regular expressions: For example in the |
282 adhoc limits for bounded regular expressions: For example in the |
279 regular expression matching library in the Go language and also in Google's RE2 library the regular expression |
283 regular expression matching library in the Go language and also in Google's RE2 library the regular expression |
280 @{term "NTIMES a 1001"} is not permitted, because no counter can be |
284 @{term "NTIMES a 1001"} is not permitted, because no counter can be |
281 above 1000; and in the built-in regular expression library in Rust |
285 above 1000; and in the regular expression library in Rust |
282 expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error |
286 expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error |
283 message for being too big. These problems can of course be solved in matching |
287 message for being too big. Up until recently,\footnote{up until version 1.5.4 of the regex |
|
288 library in Rust; see also CVE-2022-24713.} Rust |
|
289 however happily generated automata for regular expressions such as |
|
290 @{text "a\<^bsup>{0}{4294967295}\<^esup>"}. This was due to a bug |
|
291 in the algorithm that decides when a regular expression is acceptable |
|
292 or too big according to Rust's classification (it did not account for the fact that @{text "a\<^bsup>{0}\<^esup>"} and similar examples can match the empty string). We shall come back to |
|
293 this example later in the paper. |
|
294 These problems can of course be solved in matching |
284 algorithms where automata go beyond the classic notion and for |
295 algorithms where automata go beyond the classic notion and for |
285 instance include explicit counters (see~\cite{CountingSet2020}). |
296 instance include explicit counters (e.g.~\cite{CountingSet2020}). |
286 The point here is that Brzozowski derivatives and the algorithms by |
297 The point here is that Brzozowski derivatives and the algorithms by |
287 Sulzmann and Lu can be straightforwardly extended to deal with |
298 Sulzmann and Lu can be straightforwardly extended to deal with |
288 bounded regular expressions and moreover the resulting code |
299 bounded regular expressions and moreover the resulting code |
289 still consists of only simple recursive functions and inductive |
300 still consists of only simple recursive functions and inductive |
290 datatypes. Finally, bounded regular expressions |
301 datatypes. Finally, bounded regular expressions |
291 do not destroy our finite boundedness property, which we shall |
302 do not destroy our finite boundedness property, which we shall |
292 prove later on.%, because during the lexing process counters will only be |
303 prove later on. |
|
304 |
|
305 %, because during the lexing process counters will only be |
293 %decremented. |
306 %decremented. |
294 |
307 |
295 |
308 |
296 Central to Brzozowski's regular expression matcher are two functions |
309 Central to Brzozowski's regular expression matcher are two functions |
297 called @{text nullable} and \emph{derivative}. The latter is written |
310 called @{text nullable} and \emph{derivative}. The latter is written |
383 inhabitation relation that associates values to regular expressions. Our |
396 inhabitation relation that associates values to regular expressions. Our |
384 version of this relation is defined by the following six rules: |
397 version of this relation is defined by the following six rules: |
385 % |
398 % |
386 \begin{center} |
399 \begin{center} |
387 \begin{tabular}{@ {}l@ {}} |
400 \begin{tabular}{@ {}l@ {}} |
388 @{thm[mode=Axiom] Prf.intros(4)}\quad |
401 @{thm[mode=Axiom] Prf.intros(4)}\qquad |
389 @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\quad |
402 @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\qquad |
390 @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>2" "r\<^sub>1"]}\quad |
403 @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>2" "r\<^sub>1"]}\qquad |
391 @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\ |
404 @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\ |
392 @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\qquad |
405 @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\qquad |
393 @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\qquad |
406 @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\qquad |
394 $\mprset{flushleft}\inferrule{ |
407 $\mprset{flushleft}\inferrule{ |
395 @{thm (prem 1) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}\\\\ |
408 @{thm (prem 1) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}\\\\ |
396 @{thm (prem 2) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}\\\\ |
409 @{thm (prem 2) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}\quad |
397 @{thm (prem 3) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]} |
410 @{thm (prem 3) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]} |
398 } |
411 } |
399 {@{thm (concl) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}} |
412 {@{thm (concl) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}} |
400 $ |
413 $ |
401 \end{tabular} |
414 \end{tabular} |
435 Fig~\ref{POSIXrules}). |
448 Fig~\ref{POSIXrules}). |
436 |
449 |
437 \begin{figure}[t] |
450 \begin{figure}[t] |
438 \begin{center}\small% |
451 \begin{center}\small% |
439 \begin{tabular}{@ {\hspace{-2mm}}c@ {}} |
452 \begin{tabular}{@ {\hspace{-2mm}}c@ {}} |
440 \\[-4.5mm] |
453 \\[-8.5mm] |
441 @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad |
454 @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad |
442 @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad |
455 @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad |
443 @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad |
456 @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad |
444 @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\ |
457 @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\ |
445 $\mprset{flushleft} |
458 $\mprset{flushleft} |
446 \inferrule |
459 \inferrule |
447 {@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad |
460 {@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad |
448 @{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\ |
461 @{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad |
449 @{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}} |
462 @{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}} |
450 {@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\<open>PS\<close>\medskip\smallskip\\ |
463 {@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\<open>PS\<close>\medskip\\ |
451 @{thm[mode=Axiom] Posix.intros(7)}\<open>P[]\<close>\qquad |
464 @{thm[mode=Axiom] Posix.intros(7)}\<open>P[]\<close>\qquad |
452 $\mprset{flushleft} |
465 $\mprset{flushleft} |
453 \inferrule |
466 \inferrule |
454 {@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad |
467 {@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad |
455 @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad |
468 @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad |
456 @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\ |
469 @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\ |
457 @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}} |
470 @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}} |
458 {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\medskip\smallskip\\ |
471 {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\medskip\\ |
459 \mprset{sep=4mm} |
472 \mprset{sep=4mm} |
460 @{thm[mode=Rule] Posix.intros(9)}\<open>Pn[]\<close>\quad |
473 @{thm[mode=Rule] Posix.intros(9)}\<open>Pn[]\<close>\quad\; |
461 $\mprset{flushleft} |
474 $\mprset{flushleft} |
462 \inferrule |
475 \inferrule |
463 {@{thm (prem 1) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad |
476 {@{thm (prem 1) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad |
464 @{thm (prem 2) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad |
477 @{thm (prem 2) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad |
465 @{thm (prem 3) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \\\\ |
478 @{thm (prem 3) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \\\\ |
485 \begin{tabular}{@ {}lcl@ {}} |
498 \begin{tabular}{@ {}lcl@ {}} |
486 @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\ |
499 @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\ |
487 @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
500 @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
488 @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
501 @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
489 @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\ |
502 @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\ |
490 @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\ |
503 @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)} |
491 \end{tabular} |
504 \end{tabular} |
492 \medskip\\ |
505 %\end{tabular} |
493 |
506 %\end{center} |
494 %!\begin{tabular}{@ {}cc@ {}} |
507 \smallskip\\ |
|
508 |
|
509 %\begin{center} |
|
510 %\begin{tabular}{@ {}cc@ {}} |
495 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}} |
511 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}} |
496 @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\ |
512 @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)[of "c"]}\\ |
497 @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & |
513 @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & |
498 @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\ |
514 @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\ |
499 @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & |
515 @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & |
500 @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ |
516 @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ |
501 %! |
517 %! |
523 The function @{text mkeps} is run when the last derivative is nullable, that is |
539 The function @{text mkeps} is run when the last derivative is nullable, that is |
524 the string to be matched is in the language of the regular expression. It generates |
540 the string to be matched is in the language of the regular expression. It generates |
525 a value for how the last derivative can match the empty string. In case |
541 a value for how the last derivative can match the empty string. In case |
526 of @{term "NTIMES r n"} we use the function @{term replicate} in order to generate |
542 of @{term "NTIMES r n"} we use the function @{term replicate} in order to generate |
527 a list of exactly @{term n} copies, which is the length of the list we expect in this |
543 a list of exactly @{term n} copies, which is the length of the list we expect in this |
528 case. The injection function |
544 case. The injection function\footnote{While the character argument @{text c} is not |
|
545 strictly necessary in the @{text inj}-function for the fragment of regular expressions we |
|
546 use in this paper, it is necessary for extended regular expressions. For example for the range regular expression of the form @{text "[a-z]"}. |
|
547 We therefore keep this argument from the original formulation of @{text inj} by Sulzmann and Lu.} |
529 then calculates the corresponding value for each intermediate derivative until |
548 then calculates the corresponding value for each intermediate derivative until |
530 a value for the original regular expression is generated. |
549 a value for the original regular expression is generated. |
531 Graphically the algorithm by |
550 Graphically the algorithm by |
532 Sulzmann and Lu can be illustrated by the following picture %in Figure~\ref{Sulz} |
551 Sulzmann and Lu can be illustrated by the following picture %in Figure~\ref{Sulz} |
533 where the path from the left to the right involving @{term derivatives}/@{const |
552 where the path from the left to the right involving @{term derivatives}/@{const |
534 nullable} is the first phase of the algorithm (calculating successive |
553 nullable} is the first phase of the algorithm (calculating successive |
535 \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to |
554 \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to |
536 left, the second phase. |
555 left, the second phase. |
537 % |
556 % |
538 \begin{center} |
557 \begin{center} |
539 \begin{tikzpicture}[scale=0.99,node distance=9mm, |
558 \begin{tikzpicture}[scale=0.85,node distance=8mm, |
540 every node/.style={minimum size=6mm}] |
559 every node/.style={minimum size=6mm}] |
541 \node (r1) {@{term "r\<^sub>1"}}; |
560 \node (r1) {@{term "r\<^sub>1"}}; |
542 \node (r2) [right=of r1]{@{term "r\<^sub>2"}}; |
561 \node (r2) [right=of r1]{@{term "r\<^sub>2"}}; |
543 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {@{term "der a DUMMY"}}; |
562 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {@{term "der a DUMMY"}}; |
544 \node (r3) [right=of r2]{@{term "r\<^sub>3"}}; |
563 \node (r3) [right=of r2]{@{term "r\<^sub>3"}}; |
655 Sulzmann and Lu describe another algorithm that also generates POSIX |
675 Sulzmann and Lu describe another algorithm that also generates POSIX |
656 values but dispenses with the second phase where characters are |
676 values but dispenses with the second phase where characters are |
657 injected ``back'' into values. For this they annotate bitcodes to |
677 injected ``back'' into values. For this they annotate bitcodes to |
658 regular expressions, which we define in Isabelle/HOL as the datatype\medskip |
678 regular expressions, which we define in Isabelle/HOL as the datatype\medskip |
659 |
679 |
660 %!\begin{center} |
680 \begin{center} |
661 \noindent |
681 %\noindent |
662 \begin{minipage}{1.01\textwidth} |
682 %\begin{minipage}{0.9\textwidth} |
663 \,@{term breg} $\,::=\,$ @{term "AZERO"} |
683 \,@{term breg} $\,::=\,$ @{term "AZERO"} |
664 $\,\mid$ @{term "AONE bs"} |
684 $\,\mid$ @{term "AONE bs"} |
665 $\,\mid$ @{term "ACHAR bs c"} |
685 $\,\mid$ @{term "ACHAR bs c"} |
666 $\,\mid$ @{term "AALTs bs rs"} |
686 $\,\mid$ @{term "AALTs bs rs"} |
667 $\,\mid$ @{term "ASEQ bs r\<^sub>1 r\<^sub>2"} |
687 $\,\mid$ @{term "ASEQ bs r\<^sub>1 r\<^sub>2"} |
668 $\,\mid$ @{term "ASTAR bs r"} |
688 $\,\mid$ @{term "ASTAR bs r"} |
669 $\,\mid$ \mbox{@{term "ANTIMES bs r n"}}\hfill\mbox{} |
689 $\,\mid$ \mbox{@{term "ANTIMES bs r n"}}%\hfill\mbox{} |
670 \end{minipage}\medskip |
690 %\end{minipage}\medskip |
671 %!\end{center} |
691 \end{center} |
672 |
692 |
673 \noindent where @{text bs} stands for bitsequences; @{text r}, |
693 \noindent where @{text bs} stands for bitsequences; @{text r}, |
674 @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular |
694 @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular |
675 expressions; and @{text rs} for lists of bitcoded regular |
695 expressions; and @{text rs} for lists of bitcoded regular |
676 expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"} |
696 expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"} |
700 \end{tabular} |
720 \end{tabular} |
701 \end{tabular} |
721 \end{tabular} |
702 \end{center} |
722 \end{center} |
703 |
723 |
704 \noindent |
724 \noindent |
705 As can be seen, this coding is ``lossy'' in the sense that we do not |
725 As can be seen, this coding is ``lossy'' in the sense that it does not |
706 record explicitly character values and also not sequence values (for |
726 record explicitly character values and also not sequence values (for |
707 them we just append two bitsequences). However, the |
727 them it just appends two bitsequences). However, the |
708 different alternatives for @{text Left}, respectively @{text Right}, are recorded as @{text Z} and |
728 different alternatives for @{text Left}, respectively @{text Right}, are recorded as @{text Z} and |
709 @{text S} followed by some bitsequence. Similarly, we use @{text Z} to indicate |
729 @{text S} followed by some bitsequence. Similarly, we use @{text Z} to indicate |
710 if there is still a value coming in the list of @{text Stars}, whereas @{text S} |
730 if there is still a value coming in the list of @{text Stars}, whereas @{text S} |
711 indicates the end of the list. The lossiness makes the process of |
731 indicates the end of the list. The lossiness makes the process of |
712 decoding a bit more involved, but the point is that if we have a |
732 decoding a bit more involved, but the point is that if we have a |
856 the functions \textit{bnullable} and \textit{bmkeps}(\textit{s}), which are the |
876 the functions \textit{bnullable} and \textit{bmkeps}(\textit{s}), which are the |
857 ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on |
877 ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on |
858 bitcoded regular expressions. |
878 bitcoded regular expressions. |
859 % |
879 % |
860 \begin{center} |
880 \begin{center} |
861 \begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{6mm}}c@ {}} |
881 \begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{10mm}}c@ {}} |
862 \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l} |
882 \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l} |
863 $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{False}$\\ |
883 $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{False}$\\ |
864 $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{True}$\\ |
884 $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{True}$\\ |
865 $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{False}$\\ |
885 $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{False}$\\ |
866 $\textit{bnullable}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &\\ |
886 $\textit{bnullable}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &\\ |
989 is the same as if we first erase the bitcoded regular expression and |
1009 is the same as if we first erase the bitcoded regular expression and |
990 then perform the ``standard'' derivative operation. |
1010 then perform the ``standard'' derivative operation. |
991 |
1011 |
992 \begin{lemma}\label{bnullable}%\mbox{}\smallskip\\ |
1012 \begin{lemma}\label{bnullable}%\mbox{}\smallskip\\ |
993 \textit{(1)} $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\ |
1013 \textit{(1)} $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\ |
994 \mbox{\hspace{17mm}}\textit{(2)} $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\ |
1014 \mbox{\hspace{21.5mm}}\textit{(2)} $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\ |
995 \mbox{\hspace{17mm}}\textit{(3)} $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$ |
1015 \mbox{\hspace{21.5mm}}\textit{(3)} $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$ |
996 %\begin{tabular}{ll} |
1016 %\begin{tabular}{ll} |
997 %\textit{(1)} & $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\ |
1017 %\textit{(1)} & $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\ |
998 %\textit{(2)} & $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\ |
1018 %\textit{(2)} & $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\ |
999 %\textit{(3)} & $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$. |
1019 %\textit{(3)} & $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ \;provided\; $\textit{nullable}(r^\downarrow)$. |
1000 %\end{tabular} |
1020 %\end{tabular} |
1001 \end{lemma} |
1021 \end{lemma} |
1002 |
1022 |
1003 %\begin{proof} |
1023 %\begin{proof} |
1004 % All properties are by induction on annotated regular expressions. |
1024 % All properties are by induction on annotated regular expressions. |
1109 cannot make any further simplifications. This is a problem because |
1129 cannot make any further simplifications. This is a problem because |
1110 the outermost alternatives contains two copies of the same |
1130 the outermost alternatives contains two copies of the same |
1111 regular expression (underlined with $r$). These copies will |
1131 regular expression (underlined with $r$). These copies will |
1112 spawn new copies in later derivative steps and they in turn even more copies. This |
1132 spawn new copies in later derivative steps and they in turn even more copies. This |
1113 destroys any hope of taming the size of the derivatives. But the |
1133 destroys any hope of taming the size of the derivatives. But the |
1114 second copy of $r$ in \eqref{derivex} will never contribute to a |
1134 second copy of $r$ in~\eqref{derivex} will never contribute to a |
1115 value, because POSIX lexing will always prefer matching a string |
1135 value, because POSIX lexing will always prefer matching a string |
1116 with the first copy. So it could be safely removed without affecting the correctness of the algorithm. |
1136 with the first copy. So it could be safely removed without affecting the correctness of the algorithm. |
1117 The issue with the simple-minded |
1137 The issue with the simple-minded |
1118 simplification rules above is that the rule $r + r \Rightarrow r$ |
1138 simplification rules above is that the rule $r + r \Rightarrow r$ |
1119 will never be applicable because as can be seen in this example the |
1139 will never be applicable because as can be seen in this example the |
1188 @{term distinctWith} where @{text eq} is an equivalence that deletes bitsequences from bitcoded regular expressions |
1208 @{term distinctWith} where @{text eq} is an equivalence that deletes bitsequences from bitcoded regular expressions |
1189 before comparing the components. One way to define this in Isabelle/HOL is by the following recursive function from |
1209 before comparing the components. One way to define this in Isabelle/HOL is by the following recursive function from |
1190 bitcoded regular expressions to @{text bool}: |
1210 bitcoded regular expressions to @{text bool}: |
1191 % |
1211 % |
1192 \begin{center} |
1212 \begin{center} |
1193 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}} |
1213 \begin{tabular}{@ {}cc@ {}} |
|
1214 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {\hspace{1mm}}} |
1194 @{thm (lhs) eq1.simps(1)} & $\dn$ & @{thm (rhs) eq1.simps(1)}\\ |
1215 @{thm (lhs) eq1.simps(1)} & $\dn$ & @{thm (rhs) eq1.simps(1)}\\ |
1195 @{thm (lhs) eq1.simps(2)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(2)[of DUMMY DUMMY]}\\ |
1216 @{thm (lhs) eq1.simps(2)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(2)[of DUMMY DUMMY]}\\ |
|
1217 @{thm (lhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]}\\ |
|
1218 @{thm (lhs) eq1.simps(5)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(5)[of DUMMY DUMMY]}\\ |
|
1219 \mbox{} |
|
1220 \end{tabular} |
|
1221 & |
|
1222 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}} |
1196 @{thm (lhs) eq1.simps(3)[of DUMMY "c" DUMMY "d"]} & $\dn$ & @{thm (rhs) eq1.simps(3)[of DUMMY "c" DUMMY "d"]}\\ |
1223 @{thm (lhs) eq1.simps(3)[of DUMMY "c" DUMMY "d"]} & $\dn$ & @{thm (rhs) eq1.simps(3)[of DUMMY "c" DUMMY "d"]}\\ |
1197 @{thm (lhs) eq1.simps(4)[of DUMMY "r\<^sub>1\<^sub>1" "r\<^sub>1\<^sub>2" DUMMY "r\<^sub>2\<^sub>1" "r\<^sub>2\<^sub>2"]} & $\dn$ & |
1224 @{thm (lhs) eq1.simps(4)[of DUMMY "r\<^sub>1\<^sub>1" "r\<^sub>1\<^sub>2" DUMMY "r\<^sub>2\<^sub>1" "r\<^sub>2\<^sub>2"]} & $\dn$ & |
1198 @{thm (rhs) eq1.simps(4)[of DUMMY "r\<^sub>1\<^sub>1" "r\<^sub>1\<^sub>2" DUMMY "r\<^sub>2\<^sub>1" "r\<^sub>2\<^sub>2"]}\\ |
1225 @{thm (rhs) eq1.simps(4)[of DUMMY "r\<^sub>1\<^sub>1" "r\<^sub>1\<^sub>2" DUMMY "r\<^sub>2\<^sub>1" "r\<^sub>2\<^sub>2"]}\\ |
1199 @{thm (lhs) eq1.simps(5)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(5)[of DUMMY DUMMY]}\\ |
|
1200 @{thm (lhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]} & $\dn$ & |
|
1201 @{thm (rhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]}\\ |
|
1202 @{thm (lhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]}\\ |
|
1203 @{thm (lhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]}\\ |
1226 @{thm (lhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]}\\ |
|
1227 @{thm (lhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]} & $\dn$ & \\ |
|
1228 \multicolumn{3}{r}{@{thm (rhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]}} |
|
1229 \end{tabular} |
1204 \end{tabular} |
1230 \end{tabular} |
1205 \end{center} |
1231 \end{center} |
1206 |
1232 |
1207 \noindent |
1233 \noindent |
1208 where all other cases are set to @{text False}. |
1234 where all other cases are set to @{text False}. |
1210 but is needed in order to make the removal of unnecessary copies |
1236 but is needed in order to make the removal of unnecessary copies |
1211 to work properly. |
1237 to work properly. |
1212 |
1238 |
1213 Our simplification function depends on three more helper functions, one is called |
1239 Our simplification function depends on three more helper functions, one is called |
1214 @{text flts} and analyses lists of regular expressions coming from alternatives. |
1240 @{text flts} and analyses lists of regular expressions coming from alternatives. |
1215 It is defined as follows: |
1241 It is defined by four clauses as follows: |
1216 |
1242 |
1217 \begin{center} |
1243 \begin{center} |
1218 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
1244 \begin{tabular}{c} |
1219 \multicolumn{3}{@ {}c}{@{thm (lhs) flts.simps(1)} $\dn$ @{thm (rhs) flts.simps(1)} \qquad\qquad\qquad\qquad |
1245 @{thm (lhs) flts.simps(1)} $\dn$ @{thm (rhs) flts.simps(1)} \qquad |
1220 @{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)}}\smallskip\\ |
1246 @{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)} \qquad |
1221 @{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\ |
1247 @{text "flts ((ALTs bs' rs') :: rs"} |
1222 \multicolumn{3}{@ {}c}{@{text "flts (r :: rs)"} $\dn$ @{text "r :: flts rs"} |
1248 %@{ thm (lhs) flts.simps(3)[of "bs'" "rs'"]} |
1223 \hspace{5mm}(otherwise)} |
1249 $\dn$ @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\smallskip\\ |
|
1250 @{text "flts (r :: rs)"} $\dn$ @{text "r :: flts rs"}\hspace{5mm}(otherwise) |
1224 \end{tabular} |
1251 \end{tabular} |
1225 \end{center} |
1252 \end{center} |
1226 |
1253 |
1227 \noindent |
1254 \noindent |
1228 The second clause of @{text flts} removes all instances of @{text ZERO} in alternatives and |
1255 The second clause of @{text flts} removes all instances of @{text ZERO} in alternatives and |
1229 the third ``de-nests'' alternatives (but retaining the |
1256 the third ``de-nests'' alternatives (but retains the |
1230 bitsequence @{text "bs'"} accumulated in the inner alternative). There are |
1257 bitsequence @{text "bs'"} accumulated in the inner alternative). There are |
1231 some corner cases to be considered when the resulting list inside an alternative is |
1258 some corner cases to be considered when the resulting list inside an alternative is |
1232 empty or a singleton list. We take care of those cases in the |
1259 empty or a singleton list. We take care of those cases in the |
1233 @{text "bsimpALTs"} function; similarly we define a helper function that simplifies |
1260 @{text "bsimpALTs"} function; similarly we define a helper function that simplifies |
1234 sequences according to the usual rules about @{text ZERO}s and @{text ONE}s: |
1261 sequences according to the usual rules about @{text ZERO}s and @{text ONE}s: |
1341 expression anymore. So unless one can somehow |
1370 expression anymore. So unless one can somehow |
1342 synchronise the change in the simplified regular expressions with |
1371 synchronise the change in the simplified regular expressions with |
1343 the original POSIX value, there is no hope of appealing to @{text retrieve} in the |
1372 the original POSIX value, there is no hope of appealing to @{text retrieve} in the |
1344 correctness argument for @{term blexer_simp}. |
1373 correctness argument for @{term blexer_simp}. |
1345 |
1374 |
1346 We found it more helpful to introduce the rewriting systems shown in |
1375 For our proof we found it more helpful to introduce the rewriting systems shown in |
1347 Fig~\ref{SimpRewrites}. The idea is to generate |
1376 Fig~\ref{SimpRewrites}. The idea is to generate |
1348 simplified regular expressions in small steps (unlike the @{text bsimp}-function which |
1377 simplified regular expressions in small steps (unlike the @{text bsimp}-function which |
1349 does the same in a big step), and show that each of |
1378 does the same in a big step), and show that each of |
1350 the small steps preserves the bitcodes that lead to the POSIX value. |
1379 the small steps preserves the bitcodes that lead to the POSIX value. |
1351 The rewrite system is organised such that $\leadsto$ is for bitcoded regular |
1380 The rewrite system is organised such that $\leadsto$ is for bitcoded regular |
1509 where in (1) the set $\textit{Suf}(@{text "r"}_1, s)$ are all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$). |
1538 where in (1) the set $\textit{Suf}(@{text "r"}_1, s)$ are all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$). |
1510 In (3) we know that $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is |
1539 In (3) we know that $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is |
1511 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller |
1540 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller |
1512 than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands |
1541 than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands |
1513 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). |
1542 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). |
1514 We reason similarly for @{text STAR}.\smallskip |
1543 We reason similarly for @{text STAR} and @{text NT}.\smallskip |
1515 |
1544 |
1516 |
1545 |
1517 Clearly we give in this finiteness argument (Step (5)) a very loose bound that is |
1546 Clearly we give in this finiteness argument (Step (5)) a very loose bound that is |
1518 far from the actual bound we can expect. We can do better than this, but this does not improve |
1547 far from the actual bound we can expect. We can do better than this, but this does not improve |
1519 the finiteness property we are proving. If we are interested in a polynomial bound, |
1548 the finiteness property we are proving. If we are interested in a polynomial bound, |
1534 |
1563 |
1535 \noindent |
1564 \noindent |
1536 Essentially it matches the string with the longer @{text "Right"}-alternative in the |
1565 Essentially it matches the string with the longer @{text "Right"}-alternative in the |
1537 first sequence (and then the `rest' with the empty regular expression @{const ONE} from the second sequence). |
1566 first sequence (and then the `rest' with the empty regular expression @{const ONE} from the second sequence). |
1538 If we add the simplification above, then we obtain the following value |
1567 If we add the simplification above, then we obtain the following value |
1539 % |
|
1540 \[ |
|
1541 @{term "Seq (Left (Char a)) (Left (Char b))"} |
1568 @{term "Seq (Left (Char a)) (Left (Char b))"} |
1542 \] |
|
1543 |
|
1544 \noindent |
|
1545 where the @{text "Left"}-alternatives get priority. However, this violates |
1569 where the @{text "Left"}-alternatives get priority. However, this violates |
1546 the POSIX rules and we have not been able to |
1570 the POSIX rules and we have not been able to |
1547 reconcile this problem. Therefore we leave better bounds for future work.%!\\[-6.5mm] |
1571 reconcile this problem. Therefore we leave better bounds for future work.%!\\[-6.5mm] |
1548 |
1572 |
1549 Note that while Antimirov was able to give a bound on the \emph{size} |
1573 Note also that while Antimirov was able to give a bound on the \emph{size} |
1550 of his partial derivatives~\cite{Antimirov95}, Brzozowski gave a bound |
1574 of his partial derivatives~\cite{Antimirov95}, Brzozowski gave a bound |
1551 on the \emph{number} of derivatives, provided they are quotient via |
1575 on the \emph{number} of derivatives, provided they are quotient via |
1552 ACI rules \cite{Brzozowski1964}. Brozozowski's result is crucial when one |
1576 ACI rules \cite{Brzozowski1964}. Brzozowski's result is crucial when one |
1553 uses derivatives for obtaining an automaton (it essentially bounds |
1577 uses his derivatives for obtaining a DFA (it essentially bounds |
1554 the number of states). However, this result does \emph{not} |
1578 the number of states). However, this result does \emph{not} |
1555 transfer to our setting where we are interested in the \emph{size} of the |
1579 transfer to our setting where we are interested in the \emph{size} of the |
1556 derivatives. For example it is not true for our derivatives that the |
1580 derivatives. For example it is \emph{not} true for our derivatives that the |
1557 set of of derivatives $r \backslash s$ for a given $r$ and all strings |
1581 set of derivatives $r \backslash s$ for a given $r$ and all strings |
1558 $s$ is finite (even when simplified). This is because for our annotated regular expressions |
1582 $s$ is finite (even when simplified). This is because for our annotated regular expressions |
1559 the bitcode annotation is dependent on the number of iterations that |
1583 the bitcode annotation is dependent on the number of iterations that |
1560 are needed for STAR-regular expressions. This is not a problem for us: Since we intend to do lexing |
1584 are needed for @{text STAR}-regular expressions. This is not a problem for us: Since we intend to do lexing |
1561 by calculating (as fast as possible) derivatives, the bound on the size |
1585 by calculating (as fast as possible) derivatives, the bound on the size |
1562 of the derivatives is important, not the number of derivatives. |
1586 of the derivatives is important, not their number. % of derivatives. |
1563 |
1587 |
1564 |
1588 |
1565 *} |
1589 *} |
1566 |
1590 |
1567 |
1591 |
1572 We set out in this work to prove in Isabelle/HOL the correctness of |
1596 We set out in this work to prove in Isabelle/HOL the correctness of |
1573 the second POSIX lexing algorithm by Sulzmann and Lu |
1597 the second POSIX lexing algorithm by Sulzmann and Lu |
1574 \cite{Sulzmann2014}. This follows earlier work where we established |
1598 \cite{Sulzmann2014}. This follows earlier work where we established |
1575 the correctness of the first algorithm |
1599 the correctness of the first algorithm |
1576 \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to |
1600 \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to |
1577 introduce our own specification about what POSIX values are, |
1601 introduce our own specification for POSIX values, |
1578 because the informal definition given by Sulzmann and Lu did not |
1602 because the informal definition given by Sulzmann and Lu did not |
1579 stand up to formal proof. Also for the second algorithm we needed |
1603 stand up to formal proof. Also for the second algorithm we needed |
1580 to introduce our own definitions and proof ideas in order to establish the |
1604 to introduce our own definitions and proof ideas in order to |
1581 correctness. Our interest in the second algorithm |
1605 establish the correctness. Our interest in the second algorithm |
1582 lies in the fact that by using bitcoded regular expressions and an aggressive |
1606 lies in the fact that by using bitcoded regular expressions and an |
1583 simplification method there is a chance that the derivatives |
1607 aggressive simplification method there is a chance that the |
1584 can be kept universally small (we established in this paper that |
1608 derivatives can be kept universally small (we established in this |
1585 for a given $r$ they can be kept finitely bounded for all strings). |
1609 paper that for a given $r$ they can be kept finitely bounded for |
|
1610 \emph{all} strings). Our formalisation is approximately 7500 lines of |
|
1611 Isabelle code. A little more than half of this code concerns the finiteness bound |
|
1612 obtained in Section 5. This slight ``bloat'' in the latter part is because |
|
1613 we had to introduce an intermediate datatype for annotated regular expressions and repeat many |
|
1614 definitions for this intermediate datatype. But overall we think this made our |
|
1615 formalisation work smoother. The code of our formalisation can be found at |
|
1616 \textcolor{darkblue}{\url{https://github.com/urbanchr/posix}}. |
1586 %This is important if one is after |
1617 %This is important if one is after |
1587 %an efficient POSIX lexing algorithm based on derivatives. |
1618 %an efficient POSIX lexing algorithm based on derivatives. |
1588 |
1619 |
1589 Having proved the correctness of the POSIX lexing algorithm, which |
1620 Having proved the correctness of the POSIX lexing algorithm, which |
1590 lessons have we learned? Well, we feel this is a very good example |
1621 lessons have we learned? Well, we feel this is a very good example |
1595 obscure, examples. |
1626 obscure, examples. |
1596 %We found that from an implementation |
1627 %We found that from an implementation |
1597 %point-of-view it is really important to have the formal proofs of |
1628 %point-of-view it is really important to have the formal proofs of |
1598 %the corresponding properties at hand. |
1629 %the corresponding properties at hand. |
1599 |
1630 |
1600 With the results reported here, we can of course only make a claim about the correctness |
1631 With the results reported here, we can of course only make a claim |
1601 of the algorithm and the sizes of the |
1632 about the correctness of the algorithm and the sizes of the |
1602 derivatives, not about the efficiency or runtime of our version of |
1633 derivatives, not about the efficiency or runtime of our version of |
1603 Sulzman and Lu's algorithm. But we found the size is an important |
1634 Sulzmann and Lu's algorithm. But we found the size is an important |
1604 first indicator about efficiency: clearly if the derivatives can |
1635 first indicator about efficiency: clearly if the derivatives can |
1605 grow to arbitrarily big sizes and the algorithm needs to traverse |
1636 grow to arbitrarily big sizes and the algorithm needs to traverse |
1606 the derivatives possibly several times, then the algorithm will be |
1637 the derivatives possibly several times, then the algorithm will be |
1607 slow---excruciatingly slow that is. Other works seems to make |
1638 slow---excruciatingly slow that is. Other works seem to make |
1608 stronger claims, but during our work we have developed a healthy |
1639 stronger claims, but during our formalisation work we have |
1609 suspicion when for example experimental data is used to back up |
1640 developed a healthy suspicion when for example experimental data is |
1610 efficiency claims. For instance Sulzmann and Lu write about their |
1641 used to back up efficiency claims. For instance Sulzmann and Lu |
1611 equivalent of @{term blexer_simp} \textit{``...we can incrementally |
1642 write about their equivalent of @{term blexer_simp} \textit{``...we |
1612 compute bitcoded parse trees in linear time in the size of the |
1643 can incrementally compute bitcoded parse trees in linear time in |
1613 input''} \cite[Page 14]{Sulzmann2014}. Given the growth of the |
1644 the size of the input''} \cite[Page 14]{Sulzmann2014}. Given the |
1614 derivatives in some cases even after aggressive simplification, |
1645 growth of the derivatives in some cases even after aggressive |
1615 this is a hard to believe claim. A similar claim about a |
1646 simplification, this is a hard to believe claim. A similar claim |
1616 theoretical runtime of @{text "O(n\<^sup>2)"} is made for the |
1647 about a theoretical runtime of @{text "O(n\<^sup>2)"} for one |
1617 Verbatim lexer, which calculates tokens according to POSIX |
1648 specific list of regular expressions is made for the Verbatim |
|
1649 lexer, which calculates tokens according to POSIX |
1618 rules~\cite{verbatim}. For this, Verbatim uses Brzozowski's |
1650 rules~\cite{verbatim}. For this, Verbatim uses Brzozowski's |
1619 derivatives like in our work. The authors write: \textit{``The |
1651 derivatives like in our work. About their empirical data, the authors write: |
1620 results of our empirical tests [..] confirm that Verbatim has |
1652 \textit{``The results of our empirical tests [..] confirm that |
1621 @{text "O(n\<^sup>2)"} time complexity.''} |
1653 Verbatim has @{text "O(n\<^sup>2)"} time complexity.''} |
1622 \cite[Section~VII]{verbatim}. While their correctness proof for |
1654 \cite[Section~VII]{verbatim}. While their correctness proof for |
1623 Verbatim is formalised in Coq, the claim about the runtime |
1655 Verbatim is formalised in Coq, the claim about the runtime |
1624 complexity is only supported by some emperical evidence obtained by |
1656 complexity is only supported by some empirical evidence obtained by |
1625 using the code extraction facilities of Coq. Given our observation |
1657 using the code extraction facilities of Coq. Given our observation |
1626 with the ``growth problem'' of derivatives, we tried out their |
1658 with the ``growth problem'' of derivatives, this runtime bound |
1627 extracted OCaml code with the example \mbox{@{text "(a + |
1659 is unlikely to hold universally: indeed we tried out their extracted OCaml |
1628 aa)\<^sup>*"}} as a single lexing rule, and it took for us around 5 |
1660 code with the example \mbox{@{text "(a + aa)\<^sup>*"}} as a single |
1629 minutes to tokenise a string of 40 $a$'s and that increased to |
1661 lexing rule, and it took for us around 5 minutes to tokenise a |
1630 approximately 19 minutes when the string is 50 $a$'s long. Taking |
1662 string of 40 $a$'s and that increased to approximately 19 minutes |
1631 into account that derivatives are not simplified in the Verbatim |
1663 when the string is 50 $a$'s long. Taking into account that |
1632 lexer, such numbers are not surprising. Clearly our result of |
1664 derivatives are not simplified in the Verbatim lexer, such numbers |
1633 having finite derivatives might sound rather weak in this context |
1665 are not surprising. Clearly our result of having finite |
1634 but we think such effeciency claims really require further |
1666 derivatives might sound rather weak in this context but we think |
1635 scrutiny. |
1667 such efficiency claims really require further scrutiny. The |
1636 |
1668 contribution of this paper is to make sure derivatives do not grow |
1637 The contribution of this paper is to make sure derivatives do not |
1669 arbitrarily big (universally). In the example \mbox{@{text "(a + |
1638 grow arbitrarily big (universially) In the example \mbox{@{text "(a |
1670 aa)\<^sup>*"}}, \emph{all} derivatives have a size of 17 or |
1639 + aa)\<^sup>*"}}, \emph{all} derivatives have a size of 17 or |
|
1640 less. The result is that lexing a string of, say, 50\,000 a's with |
1671 less. The result is that lexing a string of, say, 50\,000 a's with |
1641 the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes |
1672 the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes |
1642 approximately 10 seconds with our Scala implementation of the |
1673 approximately 10 seconds with our Scala implementation of the |
1643 presented algorithm. |
1674 presented algorithm. |
1644 |
1675 |
1661 derivatives is never greater than 5 in this example. Even in the |
1692 derivatives is never greater than 5 in this example. Even in the |
1662 example from Section 2, where Rust raises an error message, namely |
1693 example from Section 2, where Rust raises an error message, namely |
1663 \mbox{@{text "a\<^bsup>{1000}{100}{5}\<^esup>"}}, the maximum size for |
1694 \mbox{@{text "a\<^bsup>{1000}{100}{5}\<^esup>"}}, the maximum size for |
1664 our derivatives is a moderate 14. |
1695 our derivatives is a moderate 14. |
1665 |
1696 |
|
1697 Let us also return to the example @{text |
|
1698 "a\<^bsup>{0}{4294967295}\<^esup>"} which until recently Rust |
|
1699 deemed acceptable. But this was due to a bug. It turns out that it took Rust |
|
1700 more than 11 minutes to generate an automaton for this regular |
|
1701 expression and then to determine that a string of just one(!) |
|
1702 @{text a} does \emph{not} match this regular expression. Therefore |
|
1703 it is probably a wise choice that in newer versions of Rust's |
|
1704 regular expression library such regular expressions are declared as |
|
1705 ``too big'' and raise an error message. While this is clearly |
|
1706 a contrived example, the safety guaranties Rust wants to provide necessitate |
|
1707 this conservative approach. |
|
1708 However, with the derivatives and simplifications we have shown |
|
1709 in this paper, the example can be solved with ease: |
|
1710 it essentially only involves operations on |
|
1711 integers and our Scala implementation takes only a few seconds to |
|
1712 find out that this string, or even much larger strings, do not match. |
|
1713 |
1666 Let us also compare our work to the verified Verbatim++ lexer where |
1714 Let us also compare our work to the verified Verbatim++ lexer where |
1667 the authors of the Verbatim lexer introduced a number of |
1715 the authors of the Verbatim lexer introduced a number of |
1668 improvements and optimisations, for example memoisation |
1716 improvements and optimisations, for example memoisation |
1669 \cite{verbatimpp}. However, unlike Verbatim, which works with |
1717 \cite{verbatimpp}. However, unlike Verbatim, which works with |
1670 derivatives like in our work, Verbatim++ compiles first a regular |
1718 derivatives like in our work, Verbatim++ compiles first a regular |
1679 translation using the traditional notion of DFAs so that we can improve on this. Therefore we |
1727 translation using the traditional notion of DFAs so that we can improve on this. Therefore we |
1680 prefer to stick with calculating derivatives, but attempt to make |
1728 prefer to stick with calculating derivatives, but attempt to make |
1681 this calculation (in the future) as fast as possible. What we can guaranty |
1729 this calculation (in the future) as fast as possible. What we can guaranty |
1682 with the presented work is that the maximum size of the derivatives |
1730 with the presented work is that the maximum size of the derivatives |
1683 for \mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}$\,\cdot\,$@{term "STAR a"}} is never bigger than 9. This means our Scala |
1731 for \mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}$\,\cdot\,$@{term "STAR a"}} is never bigger than 9. This means our Scala |
1684 implementation only needs a few seconds for this example and matching 50\,000 a's. |
1732 implementation again only needs a few seconds for this example and matching 50\,000 a's, say. |
1685 % |
1733 % |
1686 % |
1734 % |
1687 %Possible ideas are |
1735 %Possible ideas are |
1688 %zippers which have been used in the context of parsing of |
1736 %zippers which have been used in the context of parsing of |
1689 %context-free grammars \cite{zipperparser}. |
1737 %context-free grammars \cite{zipperparser}. |
1690 \medskip |
1738 %\\[-5mm] %\smallskip |
1691 |
1739 |
1692 \noindent |
1740 %\noindent |
1693 Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}. |
1741 %Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}. |
1694 %\\[-10mm] |
1742 %%\\[-10mm] |
1695 |
1743 |
1696 |
1744 |
1697 %%\bibliographystyle{plain} |
1745 %%\bibliographystyle{plain} |
1698 \bibliography{root} |
1746 \bibliography{root} |
1699 *} |
1747 *} |