r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
It can be shown relatively easily that this matcher is correct:
\begin{lemma}
$\textit{match} \; s\; r = \textit{true} \Longleftrightarrow s \in L(r)$
\end{lemma}
\begin{proof}
By the stepwise property of $\backslash$ (\ref{derStepwise}).
\end{proof}
|
\noindent
If we implement the above algorithm naively, however,
it can be excruciatingly slow.
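For reference, this naive matcher corresponds to the following Scala sketch
(a direct transcription; \texttt{der} and \texttt{nullable} are the functions
from earlier in this chapter, and the names \texttt{ders} and \texttt{matcher}
are only chosen for this sketch):
\begin{verbatim}
// successive derivatives w.r.t. all characters of the string
def ders(s: List[Char], r: Rexp) : Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

// the matcher: take derivatives, then test for nullability
def matcher(s: String, r: Rexp) : Boolean = nullable(ders(s.toList, r))
\end{verbatim}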
|
\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={time in secs},
    ymode = log,
    legend entries={Naive Matcher},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
\end{axis}
\end{tikzpicture}
\caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}
\end{figure}
|
\noindent
For this we need to introduce certain
rewrite rules for the intermediate results,
such as $r + r \rightarrow r$,
and make sure those rules do not change the
language of the regular expression.
We use a simplification function that is kept as simple as possible
while still having considerable power in making a regular expression smaller:
|
\begin{verbatim}
def simp(r: Rexp) : Rexp = r match {
  case SEQ(r1, r2) =>
    (simp(r1), simp(r2)) match {
      case (ZERO, _) => ZERO
      case (_, ZERO) => ZERO
      case (ONE, r2s) => r2s
      case (r1s, ONE) => r1s
      case (r1s, r2s) => SEQ(r1s, r2s)
    }
  case ALTS(r1, r2) => {
    (simp(r1), simp(r2)) match {
      case (ZERO, r2s) => r2s
      case (r1s, ZERO) => r1s
      case (r1s, r2s) =>
        if (r1s == r2s) r1s else ALTS(r1s, r2s)
    }
  }
  case r => r
}
\end{verbatim}
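\noindent
As a small example of what this function does (assuming the constructor
\texttt{CHAR} for character regular expressions from earlier),
\texttt{simp(ALTS(ZERO, SEQ(ONE, CHAR('a'))))} evaluates to \texttt{CHAR('a')}:
the unmatchable left alternative is discarded and the superfluous \texttt{ONE}
in the sequence is removed.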
|
If we repeatedly incorporate these
rules during the matching algorithm,
we obtain a matcher with simplification:
\begin{verbatim}
def ders_simp(s: List[Char], r: Rexp) : Rexp = s match {
  case Nil => simp(r)
  case c :: cs => ders_simp(cs, simp(der(c, r)))
}

def simp_matcher(s: String, r: Rexp) : Boolean =
  nullable(ders_simp(s.toList, r))
\end{verbatim}
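As a quick sanity check, this matcher can be exercised as follows (a
hypothetical driver; \texttt{CHAR}, \texttt{SEQ} and \texttt{STAR} are the
\texttt{Rexp} constructors assumed from earlier):
\begin{verbatim}
// (a*)* . b -- the regular expression from the experiment above
val EVIL = SEQ(STAR(STAR(CHAR('a'))), CHAR('b'))

// with simplification this returns false almost instantly
simp_matcher("a" * 100, EVIL)
\end{verbatim}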
|
\noindent
After putting in those rules, the example from Figure~\ref{NaiveMatcher}
is now very tame with respect to the length of the input:


\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={time in secs},
    ymode = log,
    xmode = log,
    legend entries={Matcher With Simp},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
\end{axis}
\end{tikzpicture} \label{fig:BetterMatcher}


\noindent
Note how the $x$-axis is in logarithmic scale.
To summarise: we build derivatives and then test whether the empty string is
in the language of the resulting regular expression, adding simplification
rules where necessary.
So far, so good. But what if we want to
do lexing instead of just getting a YES/NO answer?
\citeauthor{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the definition of the original derivative) solution for this.

\section{Values and the Lexing Algorithm by Sulzmann and Lu}
Here we present the hybrid phases of a regular expression lexing
algorithm using the function $\inj$, as given by Sulzmann and Lu.
They first defined the datatypes for storing the
lexing information called a \emph{value} or
sometimes also \emph{lexical value}. These values and regular
given the same regular expression $r$ and string $s$,
one can always uniquely determine the $\POSIX$ value for it:
\begin{lemma}
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$
\end{lemma}
|
\begin{proof}
By induction on $s$, $r$ and $v_1$. The induction principle is given by
the $\POSIX$ rules. Each case is proven by a combination of
the induction rules for $\POSIX$ values and the inductive hypothesis.
Probably the most cumbersome cases are the sequence and star with non-empty iterations.

We give the reasoning for the sequence case as follows.
When we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$,
we know that there cannot be a longer string $s_1'$ such that $(s_1', r_1) \rightarrow v_1'$
and $(s_2', r_2) \rightarrow v_2'$ and $s_1' @ s_2' = s$ all hold.
For possible values of $s_1'$ and $s_2'$ where $s_1'$ is shorter, they cannot
possibly form a $\POSIX$ value for $s$.
If we have some other values $v_1'$ and $v_2'$ such that
$(s_1, r_1) \rightarrow v_1'$ and $(s_2, r_2) \rightarrow v_2'$,
then by the induction hypothesis $v_1' = v_1$ and $v_2' = v_2$,
which means this ``different'' $\POSIX$ value $\Seq(v_1', v_2')$
is in fact the same as $\Seq(v_1, v_2)$.
\end{proof}
Now that we know what a $\POSIX$ value is, the problem is how to obtain
such a value in a lexing algorithm using derivatives.

\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value
is generated if the regular expression matches the string.
Two functions are involved: $\inj$ and $\mkeps$.
The function $\mkeps$ constructs a value from the last
one of all the successive derivatives:
\begin{ceqn}
that had just been unfolded. This value is followed by the already
matched star iterations we collected before. So we inject the character
back into the first value and form a new value with this latest iteration
added to the previous list of iterations, all under the $\Stars$
top level.
The POSIX value is maintained throughout the process:
\begin{lemma}\label{injPosix}
$(r \backslash c, s) \rightarrow v \implies (r, c :: s) \rightarrow (\inj \; r \; c\; v)$
\end{lemma}

Putting the functions $\inj$, $\mkeps$ and $\backslash$ together,
and taking into consideration the possibility of a non-match,
we have a lexer with the following recursive definition:
\begin{center}
\begin{tabular}{lcl}
$\lexer \; r \; [] $ & $=$ & $\textit{if}\; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\
$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of} $\\
& & $\quad \None \implies \None$\\
& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
\end{tabular}
\end{center}
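This definition can be transcribed into Scala almost verbatim. The sketch
below assumes a datatype \texttt{Val} for values and the functions
\texttt{mkeps} and \texttt{inj} described above (the name \texttt{lexer} and
the use of \texttt{Option} for the possibility of a non-match are choices
made for this sketch):
\begin{verbatim}
def lexer(r: Rexp, s: List[Char]) : Option[Val] = s match {
  case Nil => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lexer(der(c, r), cs) match {
    case None => None
    case Some(v) => Some(inj(r, c, v))
  }
}
\end{verbatim}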
\noindent
The central property of $\lexer$ is that it gives the correct result
according to the $\POSIX$ standard:
By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
The $[]$ case is proven by lemma \ref{mePosix}, and the inductive case
by lemma \ref{injPosix}.
\end{proof}

Pictorially, the algorithm is as follows (for convenience, we employ the
following notations: the regular expression we start with is $r_0$ and the
given string $s$ is composed of the characters $c_0 c_1 \ldots c_{n-1}$; the
values built incrementally by \emph{injecting} back the characters into the
earlier values are $v_n, \ldots, v_0$; corresponding values and regular
expressions always carry the same subscript, i.e.~$\vdash v_i : r_i$):

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}

\noindent
As we did earlier in this chapter for the matcher, one can
introduce simplification on the regular expressions.
However, we now also have a backward phase and need to make sure that
the values align with the regular expressions.
Therefore one has to
be careful not to break the correctness, as the injection
function relies heavily on the structure of the regular expressions and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications in
each run of the injection phase.
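To illustrate the idea, the sketch below simplifies only alternatives and
pairs each simplification with a rectification function that maps a value
for the simplified regular expression back to a value for the original one
(the function name \texttt{simp\_rect} and the value constructors
\texttt{Left} and \texttt{Right} for alternatives are assumptions of this
sketch, not definitions taken from the text):
\begin{verbatim}
def simp_rect(r: Rexp) : (Rexp, Val => Val) = r match {
  case ALTS(r1, r2) =>
    val (r1s, f1) = simp_rect(r1)
    val (r2s, f2) = simp_rect(r2)
    (r1s, r2s) match {
      // left branch unmatchable: values of r2s come from the right branch
      case (ZERO, _) => (r2s, (v: Val) => Right(f2(v)))
      // right branch unmatchable: values of r1s come from the left branch
      case (_, ZERO) => (r1s, (v: Val) => Left(f1(v)))
      // duplicate branches collapsed: rectify towards the left branch
      case _ if r1s == r2s => (r1s, (v: Val) => Left(f1(v)))
      case _ => (ALTS(r1s, r2s), (v: Val) => v match {
        case Left(v1)  => Left(f1(v1))
        case Right(v2) => Right(f2(v2))
      })
    }
  // all other regular expressions are left alone in this sketch
  case _ => (r, (v: Val) => v)
}
\end{verbatim}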
|

\ChristianComment{Do I introduce the lexer with rectification here?}
One can prove that the POSIX value describing how a regular expression
matches a string is not affected by such rectified
simplifications---although this is much harder to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.

However, even with these simplification rules, we can still end up in
trouble when we encounter cases that require more involved and aggressive
simplifications.

\section{A Case Requiring More Aggressive Simplification}
For example, when starting with the regular
expression $(a^* \cdot a^*)^*$ and building just a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 9000 nodes (when viewed as a tree),
even with simplification.
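Such numbers can be reproduced by counting the nodes of a regular expression
with a straightforward \texttt{size} function; the sketch below (the
constructors \texttt{CHAR} and \texttt{STAR} are assumed from earlier, and
the function itself is ours, not taken from the text) illustrates one way to
do this:
\begin{verbatim}
def size(r: Rexp) : Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALTS(r1, r2) => 1 + size(r1) + size(r2)
  case SEQ(r1, r2)  => 1 + size(r1) + size(r2)
  case STAR(r1)     => 1 + size(r1)
}

val r = STAR(SEQ(STAR(CHAR('a')), STAR(CHAR('a'))))  // (a* . a*)*
for (n <- 1 to 10)
  println(size(ders_simp(("a" * n).toList, r)))
\end{verbatim}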
628 Can we not create those intermediate values $v_1,\ldots v_n$, |
\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={size},
    legend entries={Derivative Size},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
\end{axis}
\end{tikzpicture}
\caption{Size of the derivatives of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{fig:BetterWaterloo}
\end{figure}

That is because our lexing algorithm currently keeps a lot of
``useless'' values that will never be used:
the different ways in which a regular expression can match a string
grow exponentially with the string length.

For $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$,
if we do not allow any empty iterations in its lexical values,
there will be $n - 1$ ``splitting points'' in $s$ which we can independently choose to
split at or not, so that each sub-string
segmented by the chosen splitting points forms a different iteration.
Since each splitting point can be chosen independently, this already gives at
least $2^{n-1}$ different lexical values.
For example, when $n=4$:
\begin{center}
\begin{tabular}{lcl}
$aaaa$ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (one iteration; it is further divided by the inner sequence $a^*\cdot a^*$)\\
$a \mid aaa$ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\, v_{iteration \,aaa}]$ (two iterations)\\
$aa \mid aa$ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\, v_{iteration \, aa}]$ (two iterations)\\
$a \mid aa \mid a$ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
$a \mid a \mid a \mid a$ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, a},\, v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\
& $\textit{etc}.$ &
\end{tabular}
\end{center}
\noindent
And for each iteration, there are still multiple ways to split
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
A lexer that keeps all the possible values will naturally
have an exponential runtime on ambiguous regular expressions.
With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all the different values
of a match. This means Sulzmann and Lu's injection-based algorithm
will be exponential by nature.
Somehow one has to determine which
lexical values are $\POSIX$ and need to be kept in a lexing algorithm.

For example, the above $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
$\Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
We want to keep only this value, and remove during lexing all the parts of the
regular expression that do not correspond to it.
For this, a two-phase algorithm with rectification is a bit too fragile.
Can we not create those intermediate values $v_1,\ldots, v_n$,
and instead get the lexing information that is already there while
doing the derivatives, in one pass and without a second injection phase?
In the meantime, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
introducing additional information to the
regular expressions called \emph{bitcodes}.
|
|
|
|