395 } |
376 } |
396 } |
377 } |
397 \end{lstlisting} |
378 \end{lstlisting} |
398 \end{figure} |
379 \end{figure} |
399 \noindent |
380 \noindent |
|
381 We call this lexer $\blexerStrong$. |
|
382 $\blexerStrong$ is able to drastically reduce the |
|
383 internal data structure size which could |
|
384 trigger exponential behaviours in |
|
385 $\blexersimp$. |
|
386 \begin{figure}[H] |
|
387 \centering |
|
388 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
389 \begin{tikzpicture} |
|
390 \begin{axis}[ |
|
391 %xlabel={$n$}, |
|
392 myplotstyle, |
|
393 xlabel={input length}, |
|
394 ylabel={size}, |
|
395 width = 7cm, |
|
396 height = 5cm, |
|
397 ] |
|
398 \addplot[red,mark=*, mark options={fill=white}] table {strongSimpCurve.data}; |
|
399 \end{axis} |
|
400 \end{tikzpicture} |
|
401 & |
|
402 \begin{tikzpicture} |
|
403 \begin{axis}[ |
|
404 %xlabel={$n$}, |
|
405 myplotstyle, |
|
406 xlabel={input length}, |
|
407 ylabel={size}, |
|
408 width = 7cm, |
|
409 height = 5cm, |
|
410 ] |
|
411 \addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data}; |
|
412 \end{axis} |
|
413 \end{tikzpicture}\\ |
|
414 \multicolumn{2}{l}{} |
|
415 \end{tabular} |
|
416 \caption{Runtime for matching |
|
417 $\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings |
|
418 of the form $\protect\underbrace{aa..a}_{n}$.}\label{fig:aaaaaStarStar} |
|
419 \end{figure} |
|
420 \noindent |
400 We would like to preserve the correctness like the one |
421 We would like to preserve the correctness like the one |
401 we had for $\blexersimp$: |
422 we had for $\blexersimp$: |
402 \begin{conjecture}\label{cubicConjecture} |
423 \begin{conjecture}\label{cubicConjecture} |
403 $\blexerStrong \;r \; s = \blexer\; r\;s$ |
424 $\blexerStrong \;r \; s = \blexer\; r\;s$ |
404 \end{conjecture} |
425 \end{conjecture} |
405 \noindent |
426 \noindent |
406 We introduce the new rule \ref{cubicRule} |
427 The idea is to maintain key lemmas in |
407 and augment our proofs in chapter \ref{Bitcoded2}. |
428 chapter \ref{Bitcoded2} like |
408 The idea is to maintain the properties like |
429 $r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$ |
409 $r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$ etc. |
430 with the new rewriting rule \ref{cubicRule} . |
410 |
431 |
411 In the next section, |
432 In the next sub-section, |
412 we will describe why we |
433 we will describe why we |
413 believe a cubic bound can be achieved. |
434 believe a cubic bound can be achieved. |
414 We give an introduction to the |
435 We give an introduction to the |
415 partial derivatives, |
436 partial derivatives, |
416 which was invented by Antimirov \cite{Antimirov95}, |
437 which was invented by Antimirov \cite{Antimirov95}, |
417 and then link it with the result of the function |
438 and then link it with the result of the function |
418 $\bdersStrongs$. |
439 $\bdersStrongs$. |
419 |
440 |
420 \section{Antimirov's partial derivatives} |
441 \subsection{Antimirov's partial derivatives} |
421 This suggests a link towrads "partial derivatives" |
442 Partial derivatives were first introduced by |
422 introduced The idea behind Antimirov's partial derivatives |
443 Antimirov \cite{Antimirov95}. |
423 is to do derivatives in a similar way as suggested by Brzozowski, |
444 It does derivatives in a similar way as suggested by Brzozowski, |
424 but maintain a set of regular expressions instead of a single one: |
445 but splits children of alternative regular expressions into |
425 |
446 multiple independent terms, causing the output to become a |
426 %TODO: antimirov proposition 3.1, needs completion |
447 set of regular expressions: |
427 \begin{center} |
448 \begin{center} |
428 \begin{tabular}{ccc} |
449 \begin{tabular}{lcl} |
429 $\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\ |
450 $\partial_x \; (a \cdot b)$ & |
430 $\partial_x(\ONE)$ & $=$ & $\phi$ |
451 $\dn$ & $\partial_x \; a\cdot b \cup |
431 \end{tabular} |
452 \partial_x \; b \; \textit{if} \; \; \nullable\; a$\\ |
432 \end{center} |
453 & & $\partial_x \; a\cdot b \quad\quad |
433 |
454 \textit{otherwise}$\\ |
|
455 $\partial_x \; r^*$ & $\dn$ & $\partial_x \; r \cdot r^*$\\ |
|
456 $\partial_x \; c $ & $\dn$ & $\textit{if} \; x = c \; |
|
457 \textit{then} \; |
|
458 \{ \ONE\} \;\;\textit{else} \; \varnothing$\\ |
|
459 $\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\ |
|
460 $\partial_x(\ONE)$ & $=$ & $\varnothing$\\ |
|
461 $\partial_x(\ZERO)$ & $\dn$ & $\varnothing$\\ |
|
462 \end{tabular} |
|
463 \end{center} |
|
464 \noindent |
|
465 The $\cdot$ between for example |
|
466 $\partial_x \; a\cdot b $ |
|
467 is a shorthand notation for the cartesian product |
|
468 $\partial_x \; a \times \{ b\}$. |
|
469 %Each element in the set generated by a partial derivative |
|
470 %corresponds to a (potentially partial) match |
|
471 %TODO: define derivatives w.r.t string s |
434 Rather than joining the calculated derivatives $\partial_x a$ and $\partial_x b$ together |
472 Rather than joining the calculated derivatives $\partial_x a$ and $\partial_x b$ together |
435 using the alternatives constructor, Antimirov cleverly chose to put them into |
473 using the $\sum$ constructor, Antimirov put them into |
436 a set instead. This breaks the terms in a derivative regular expression up, |
474 a set. This causes maximum de-duplication to happen, |
437 allowing us to understand what are the "atomic" components of it. |
475 allowing us to understand what are the "atomic" components of it. |
438 For example, To compute what regular expression $x^*(xx + y)^*$'s |
476 For example, To compute what regular expression $x^*(xx + y)^*$'s |
439 derivative against $x$ is made of, one can do a partial derivative |
477 derivative against $x$ is made of, one can do a partial derivative |
440 of it and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$ |
478 of it and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$ |
441 from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$. |
479 from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$. |
442 To get all the "atomic" components of a regular expression's possible derivatives, |
480 |
443 there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes |
481 The set of all possible partial derivatives is defined |
444 whatever character is available at the head of the string inside the language of a |
482 as the union of derivatives w.r.t all the strings in the universe: |
445 regular expression, and gives back the character and the derivative regular expression |
483 \begin{center} |
446 as a pair (which he called "monomial"): |
484 \begin{tabular}{lcl} |
447 \begin{center} |
485 $\textit{PDER}_{UNIV} \; r $ & $\dn $ & $\bigcup_{w \in A^*}\partial_w \; r$ |
448 \begin{tabular}{ccc} |
486 \end{tabular} |
449 $\lf(\ONE)$ & $=$ & $\phi$\\ |
487 \end{center} |
450 $\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\ |
488 \noindent |
451 $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\ |
489 |
452 $\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\ |
490 Back to our |
453 \end{tabular} |
491 \begin{center} |
454 \end{center} |
492 $((a^* + (aa)^* + \ldots + (\underbrace{a\ldots a}_{n a's})^* )^*)^*$ |
455 %TODO: completion |
493 \end{center} |
456 |
494 example, if we denote this regular expression as $A$, |
457 There is a slight difference in the last three clauses compared |
495 we have that |
458 with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular |
496 \begin{center} |
459 expression $r$ with every element inside $\textit{rset}$ to create a set of |
497 $\textit{PDER}_{UNIV} \; A = |
460 sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates |
498 \bigcup_{i=1}^{n}\bigcup_{j=0}^{i-1} \{ |
461 on a set of monomials (which Antimirov called "linear form") and a regular |
499 (\underbrace{a \ldots a}_{\text{j a's}}\cdot |
462 expression, and returns a linear form: |
500 (\underbrace{a \ldots a}_{\text{i a's}})^*)\cdot A \}$, |
463 \begin{center} |
501 \end{center} |
464 \begin{tabular}{ccc} |
502 with exactly $n * (n + 1) / 2$ terms. |
465 $l \bigodot (\ZERO)$ & $=$ & $\phi$\\ |
503 This is in line with our speculation that only $n*(n+1)/2$ terms are |
466 $l \bigodot (\ONE)$ & $=$ & $l$\\ |
504 needed. We conjecture that $\bsimpStrong$ is also able to achieve this |
467 $\phi \bigodot t$ & $=$ & $\phi$\\ |
505 upper limit in general |
468 $\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\ |
506 \begin{conjecture}\label{bsimpStrongInclusionPder} |
469 $\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\ |
507 Using a suitable transformation $f$, we have |
470 $\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\ |
508 \begin{center} |
471 $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\ |
509 $\forall s.\; f \; (r \bdersStrong \; s) \subseteq |
472 $\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\ |
510 \textit{PDER}_{UNIV} \; r$ |
473 \end{tabular} |
511 \end{center} |
474 \end{center} |
512 \end{conjecture} |
475 %TODO: completion |
513 \noindent |
476 |
514 because our \ref{cubicRule} will keep only one copy of each term, |
477 Some degree of simplification is applied when doing $\bigodot$, for example, |
515 where the function $\textit{prune}$ takes care of maintaining |
478 $l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$, |
516 a set like structure similar to partial derivatives. |
479 and $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and |
517 It is anticipated we might need to adjust $\textit{prune}$ |
480 $\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$, |
518 slightly to make sure all duplicate terms are eliminated, |
481 and so on. |
519 which should be doable. |
482 |
520 |
483 With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regular expression $r$ with |
521 Antimirov had proven that the sum of all the partial derivative |
484 an iterative procedure: |
522 terms' sizes is bounded by the cubic of the size of that regular |
485 \begin{center} |
523 expression: |
486 \begin{tabular}{llll} |
524 \begin{property}\label{pderBound} |
487 $\textit{while}$ & $(\Delta_i \neq \phi)$ & & \\ |
525 $\llbracket \textit{PDER}_{UNIV} \; r \rrbracket \leq O((\llbracket r \rrbracket)^3)$ |
488 & $\Delta_{i+1}$ & $ =$ & $\lf(\Delta_i) - \PD_i$ \\ |
526 \end{property} |
489 & $\PD_{i+1}$ & $ =$ & $\Delta_{i+1} \cup \PD_i$ \\ |
527 This property was formalised by Urban, and the details are in the PDERIVS.thy file |
490 $\partial_{UNIV}(r)$ & $=$ & $\PD$ & |
528 in our repository. |
491 \end{tabular} |
529 Once conjecture \ref{bsimpStrongInclusionPder} is proven, then property \ref{pderBound} |
492 \end{center} |
530 would yield us a cubic bound for our $\blexerStrong$ algorithm: |
493 |
531 \begin{conjecture}\label{strongCubic} |
494 |
532 $\llbracket r \bdersStrong\; s \rrbracket \leq \llbracket r \rrbracket^3$ |
495 $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$, |
533 \end{conjecture} |
|
534 |
|
535 |
|
536 %To get all the "atomic" components of a regular expression's possible derivatives, |
|
537 %there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes |
|
538 %whatever character is available at the head of the string inside the language of a |
|
539 %regular expression, and gives back the character and the derivative regular expression |
|
540 %as a pair (which he called "monomial"): |
|
541 % \begin{center} |
|
542 % \begin{tabular}{ccc} |
|
543 % $\lf(\ONE)$ & $=$ & $\phi$\\ |
|
544 %$\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\ |
|
545 % $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\ |
|
546 % $\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\ |
|
547 %\end{tabular} |
|
548 %\end{center} |
|
549 %%TODO: completion |
|
550 % |
|
551 %There is a slight difference in the last three clauses compared |
|
552 %with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular |
|
553 %expression $r$ with every element inside $\textit{rset}$ to create a set of |
|
554 %sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates |
|
555 %on a set of monomials (which Antimirov called "linear form") and a regular |
|
556 %expression, and returns a linear form: |
|
557 % \begin{center} |
|
558 % \begin{tabular}{ccc} |
|
559 % $l \bigodot (\ZERO)$ & $=$ & $\phi$\\ |
|
560 % $l \bigodot (\ONE)$ & $=$ & $l$\\ |
|
561 % $\phi \bigodot t$ & $=$ & $\phi$\\ |
|
562 % $\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\ |
|
563 % $\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\ |
|
564 % $\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\ |
|
565 % $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\ |
|
566 % $\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\ |
|
567 %\end{tabular} |
|
568 %\end{center} |
|
569 %%TODO: completion |
|
570 % |
|
571 % Some degree of simplification is applied when doing $\bigodot$, for example, |
|
572 % $l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$, |
|
573 % and $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and |
|
574 % $\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$, |
|
575 % and so on. |
|
576 % |
|
577 % With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regular expression $r$ with |
|
578 % an iterative procedure: |
|
579 % \begin{center} |
|
580 % \begin{tabular}{llll} |
|
581 %$\textit{while}$ & $(\Delta_i \neq \phi)$ & & \\ |
|
582 % & $\Delta_{i+1}$ & $ =$ & $\lf(\Delta_i) - \PD_i$ \\ |
|
583 % & $\PD_{i+1}$ & $ =$ & $\Delta_{i+1} \cup \PD_i$ \\ |
|
584 %$\partial_{UNIV}(r)$ & $=$ & $\PD$ & |
|
585 %\end{tabular} |
|
586 %\end{center} |
|
587 % |
|
588 % |
|
589 % $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$, |
496 |
590 |
497 |
591 |
498 |
592 |
499 |
593 |
500 %---------------------------------------------------------------------------------------- |
594 %---------------------------------------------------------------------------------------- |
756 \begin{proof} |
902 \begin{proof} |
757 By a reverse induction on $s$. |
903 By a reverse induction on $s$. |
758 For the inductive case, note that if $\cbn \; r$ holds, |
904 For the inductive case, note that if $\cbn \; r$ holds, |
759 then $\cbn \; (r\backslash_r c)$ holds. |
905 then $\cbn \; (r\backslash_r c)$ holds. |
760 \end{proof} |
906 \end{proof} |
|
907 \noindent |
761 In addition, for $\cbn$-shaped regular expressioins, one can flatten |
908 In addition, for $\cbn$-shaped regular expressioins, one can flatten |
762 them: |
909 them: |
763 \begin{lemma}\label{ntimesHfauPushin} |
910 \begin{lemma}\label{ntimesHfauPushin} |
764 If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} = |
911 If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} = |
765 \textit{concat} \; (\map \; \hflataux{\map \; (\_\backslash_r c) \; |
912 \textit{concat} \; (\map \; \hflataux{\map \; (\_\backslash_r c) \; |
766 (\hflataux{r})})$ |
913 (\hflataux{r})})$ |
767 \end{lemma} |
914 \end{lemma} |
768 \begin{proof} |
915 \begin{proof} |
769 By an induction on the inductive cases of $\cbn$. |
916 By an induction on the inductive cases of $\cbn$. |
770 \end{proof} |
917 \end{proof} |
771 |
918 \noindent |
772 |
|
773 |
|
774 This time we do not need to define the flattening functions for NTIMES only, |
919 This time we do not need to define the flattening functions for NTIMES only, |
775 because $\hflat{\_}$ and $\hflataux{\_}$ work on NTIMES already: |
920 because $\hflat{\_}$ and $\hflataux{\_}$ work on NTIMES already. |
776 \begin{lemma}\label{ntimesHfauInduct} |
921 \begin{lemma}\label{ntimesHfauInduct} |
777 $\hflataux{( (r\backslash_r c) \cdot r^{\{n\}}) \backslash_{rsimps} s} = |
922 $\hflataux{( (r\backslash_r c) \cdot r^{\{n\}}) \backslash_{rsimps} s} = |
778 \map \; (\opterm \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$ |
923 \map \; (\opterm \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$ |
779 \end{lemma} |
924 \end{lemma} |
780 \begin{proof} |
925 \begin{proof} |
781 By a reverse induction on $s$. |
926 By a reverse induction on $s$. |
782 The lemma \ref{ntimesDersCbn} is used. |
927 The lemmas \ref{ntimesHfauPushin} and \ref{ntimesDersCbn} are used. |
783 \end{proof} |
928 \end{proof} |
784 \noindent |
929 \noindent |
785 We have a recursive property for NTIMES with $\nupdate$ |
930 We have a recursive property for NTIMES with $\nupdate$ |
786 similar to that for STAR, |
931 similar to that for STAR, |
787 and one for $\nupdates $ as well: |
932 and one for $\nupdates $ as well: |
788 \begin{lemma}\label{nupdateInduct1} |
933 \begin{lemma}\label{nupdateInduct1} |
789 (i) $\textit{concat} \; (\map \; (\hflataux{\_} \circ ( |
934 \mbox{} |
|
935 \begin{itemize} |
|
936 \item |
|
937 \begin{center} |
|
938 $\textit{concat} \; (\map \; (\hflataux{\_} \circ ( |
790 \opterm \; r)) \; Ss) = \map \; (\opterm \; r) \; (\nupdate \; |
939 \opterm \; r)) \; Ss) = \map \; (\opterm \; r) \; (\nupdate \; |
791 c \; r \; Ss)$\\ |
940 c \; r \; Ss)$\\ |
792 (ii) $\textit{concat} \; (\map \; \hflataux{\_}\; |
941 \end{center} |
|
942 holds. |
|
943 \item |
|
944 \begin{center} |
|
945 $\textit{concat} \; (\map \; \hflataux{\_}\; |
793 \map \; (\_\backslash_r x) \; |
946 \map \; (\_\backslash_r x) \; |
794 (\map \; (\opterm \; r) \; (\nupdates \; xs \; r \; Ss))) = |
947 (\map \; (\opterm \; r) \; (\nupdates \; xs \; r \; Ss)))$\\ |
795 \map \; (\opterm \; r) \; (\nupdates \;(xs@[x]) \; r\;Ss)$ holds. |
948 $=$\\ |
|
949 $\map \; (\opterm \; r) \; (\nupdates \;(xs@[x]) \; r\;Ss)$ |
|
950 \end{center} |
|
951 holds. |
|
952 \end{itemize} |
796 \end{lemma} |
953 \end{lemma} |
797 \begin{proof} |
954 \begin{proof} |
798 (i) is by an induction on $Ss$. |
955 (i) is by an induction on $Ss$. |
799 (ii) is by an induction on $xs$. |
956 (ii) is by an induction on $xs$. |
800 \end{proof} |
957 \end{proof} |
801 |
958 \noindent |
802 The $\nString$ predicate is defined for conveniently |
959 The $\nString$ predicate is defined for conveniently |
803 expressing that there are no empty strings in the |
960 expressing that there are no empty strings in the |
804 $\Some \;(s, n)$ elements generated by $\nupdate$: |
961 $\Some \;(s, n)$ elements generated by $\nupdate$: |
805 \begin{center} |
962 \begin{center} |
806 \begin{tabular}{lcl} |
963 \begin{tabular}{lcl} |
916 In addition, we know that the list |
1083 In addition, we know that the list |
917 $\map \; (\optermsimp \; r) \; ( |
1084 $\map \; (\optermsimp \; r) \; ( |
918 \nupdates \; s \; r \; [\Some \; ([c], n)])$'s size is at most |
1085 \nupdates \; s \; r \; [\Some \; ([c], n)])$'s size is at most |
919 $c_N = \textit{card} \; |
1086 $c_N = \textit{card} \; |
920 (\sizeNregex \; ((N + \llbracket r^{\{n\}} \rrbracket) + 1))$. |
1087 (\sizeNregex \; ((N + \llbracket r^{\{n\}} \rrbracket) + 1))$. |
921 This gives us $\llbracket \llbracket r \backslash_{rsimps} \;s \llbracket_r |
1088 This gives us $\llbracket r \backslash_{rsimps} \;s \rrbracket_r |
922 \leq N * c_N$. |
1089 \leq N * c_N$. |
923 \end{proof} |
1090 \end{proof} |
924 |
1091 |
925 |
1092 We aim to formalise the correctness and size bound |
926 Assuming we have that |
1093 for constructs like $r^{\{\ldots n\}}$, $r^{\{n \ldots\}}$ |
927 \begin{center} |
1094 and so on, which is still work in progress. |
928 $\rderssimp{r^*}{s} = \rsimp{(\sum \map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss)}$. |
1095 They should more or less follow the same recipe described in this section. |
929 \end{center} |
1096 Once we know about how to deal with them recursively using suitable auxiliary |
930 holds. |
1097 definitions, we are able to routinely establish the proofs. |
931 The idea of $\starupdate$ and $\starupdates$ |
1098 |
932 is to update $Ss$ when another |
1099 |
933 derivative is taken on $\rderssimp{r^*}{s}$ |
1100 %The closed form for them looks like: |
934 w.r.t a character $c$ and a string $s'$ |
1101 %%\begin{center} |
935 respectively. |
1102 %% \begin{tabular}{llrclll} |
936 Both $\starupdate$ and $\starupdates$ take three arguments as input: |
1103 %% $r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & \\ |
937 the new character $c$ or string $s$ to take derivative with, |
1104 %% $\textit{rsimp}$ & $($ & $ |
938 the regular expression |
1105 %% \sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\ |
939 $r$ under the star $r^*$, and the |
1106 %% & & & & $\textit{nupdates} \;$ & |
940 list of strings $Ss$ for the derivative $r^* \backslash s$ |
1107 %% $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\ |
941 up until this point |
1108 %% & & & & $)$ &\\ |
942 such that |
1109 %% & & $)$ & & &\\ |
943 \begin{center} |
1110 %% & $)$ & & & &\\ |
944 $(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$ |
1111 %% \end{tabular} |
945 \end{center} |
1112 %%\end{center} |
946 is satisfied. |
|
947 |
|
948 Functions $\starupdate$ and $\starupdates$ characterise what the |
|
949 star derivatives will look like once ``straightened out'' into lists. |
|
950 The helper functions for such operations will be similar to |
|
951 $\sflat{\_}$, $\sflataux{\_}$ and $\sflataux{\_}$, which we defined for sequence. |
|
952 We use similar symbols to denote them, with a $*$ subscript to mark the difference. |
|
953 \begin{center} |
|
954 \begin{tabular}{lcl} |
|
955 $\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\ |
|
956 $\hflataux{r}$ & $\dn$ & $[r]$ |
|
957 \end{tabular} |
|
958 \end{center} |
|
959 |
|
960 \begin{center} |
|
961 \begin{tabular}{lcl} |
|
962 $\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\ |
|
963 $\hflat{r}$ & $\dn$ & $r$ |
|
964 \end{tabular} |
|
965 \end{center} |
|
966 \noindent |
|
967 These definitions are tailor-made for dealing with alternatives that have |
|
968 originated from a star's derivatives. |
|
969 A typical star derivative always has the structure of a balanced binary tree: |
|
970 \begin{center} |
|
971 $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + |
|
972 (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ |
|
973 \end{center} |
|
974 All of the nested structures of alternatives |
|
975 generated from derivatives are binary, and therefore |
|
976 $\hflat{\_}$ and $\hflataux{\_}$ only deal with binary alternatives. |
|
977 $\hflat{\_}$ ``untangles'' like the following: |
|
978 \[ |
|
979 \sum ((r_1 + r_2) + (r_3 + r_4)) + \ldots \; |
|
980 \stackrel{\hflat{\_}}{\longrightarrow} \; |
|
981 \RALTS{[r_1, r_2, \ldots, r_n]} |
|
982 \] |
|
983 Here is a lemma stating the recursive property of $\starupdate$ and $\starupdates$, |
|
984 with the helpers $\hflat{\_}$ and $\hflataux{\_}$\footnote{The function $\textit{concat}$ takes a list of lists |
|
985 and merges each of the element lists to form a flattened list.}: |
|
986 \begin{lemma}\label{stupdateInduct1} |
|
987 \mbox |
|
988 For a list of strings $Ss$, the following hold. |
|
989 \begin{itemize} |
|
990 \item |
|
991 If we do a derivative on the terms |
|
992 $r\backslash_r s \cdot r^*$ (where $s$ is taken from the list $Ss$), |
|
993 the result will be the same as if we apply $\starupdate$ to $Ss$. |
|
994 \begin{center} |
|
995 \begin{tabular}{c} |
|
996 $\textit{concat} \; (\map \; (\hflataux{\_} \circ ( (\_\backslash_r x) |
|
997 \circ (\lambda s.\;\; (r \backslash_r s) \cdot r^*)))\; Ss )\; |
|
998 $\\ |
|
999 \\ |
|
1000 $=$ \\ |
|
1001 \\ |
|
1002 $\map \; (\lambda s. (r \backslash_r s) \cdot (r^*)) \; |
|
1003 (\starupdate \; x \; r \; Ss)$ |
|
1004 \end{tabular} |
|
1005 \end{center} |
|
1006 \item |
|
1007 $\starupdates$ is ``composable'' w.r.t a derivative. |
|
1008 It piggybacks the character $x$ to the tail of the string |
|
1009 $xs$. |
|
1010 \begin{center} |
|
1011 \begin{tabular}{c} |
|
1012 $\textit{concat} \; (\map \; \hflataux{\_} \; |
|
1013 (\map \; (\_\backslash_r x) \; |
|
1014 (\map \; (\lambda s.\;\; (r \backslash_r s) \cdot |
|
1015 (r^*) ) \; (\starupdates \; xs \; r \; Ss))))$\\ |
|
1016 \\ |
|
1017 $=$\\ |
|
1018 \\ |
|
1019 $\map \; (\lambda s.\;\; (r\backslash_r s) \cdot (r^*)) \; |
|
1020 (\starupdates \; (xs @ [x]) \; r \; Ss)$ |
|
1021 \end{tabular} |
|
1022 \end{center} |
|
1023 \end{itemize} |
|
1024 \end{lemma} |
|
1025 |
|
1026 \begin{proof} |
|
1027 Part 1 is by induction on $Ss$. |
|
1028 Part 2 is by induction on $xs$, where $Ss$ is left to take arbitrary values. |
|
1029 \end{proof} |
|
1030 |
|
1031 |
|
1032 Like $\textit{createdBySequence}$, we need |
|
1033 a predicate for ``star-created'' regular expressions: |
|
1034 \begin{center} |
|
1035 \begin{mathpar} |
|
1036 \inferrule{\mbox{}}{ \textit{createdByStar}\; \RSEQ{ra}{\RSTAR{rb}} } |
|
1037 |
|
1038 \inferrule{ \textit{createdByStar} \; r_1\; \land \; \textit{createdByStar} \; r_2 }{\textit{createdByStar} \; (r_1 + r_2) } |
|
1039 \end{mathpar} |
|
1040 \end{center} |
|
1041 \noindent |
|
1042 All regular expressions created by taking derivatives of |
|
1043 $r_1 \cdot (r_2)^*$ satisfy the $\textit{createdByStar}$ predicate: |
|
1044 \begin{lemma}\label{starDersCbs} |
|
1045 $\textit{createdByStar} \; ((r_1 \cdot r_2^*) \backslash_r s) $ holds. |
|
1046 \end{lemma} |
|
1047 \begin{proof} |
|
1048 By a reverse induction on $s$. |
|
1049 \end{proof} |
|
1050 If a regular expression conforms to the shape of a star's derivative, |
|
1051 then we can push an application of $\hflataux{\_}$ inside a derivative of it: |
|
1052 \begin{lemma}\label{hfauPushin} |
|
1053 If $\textit{createdByStar} \; r$ holds, then |
|
1054 \begin{center} |
|
1055 $\hflataux{r \backslash_r c} = \textit{concat} \; ( |
|
1056 \map \; \hflataux{\_} (\map \; (\_\backslash_r c) \;(\hflataux{r})))$ |
|
1057 \end{center} |
|
1058 holds. |
|
1059 \end{lemma} |
|
1060 \begin{proof} |
|
1061 By an induction on the inductivev cases of $\textit{createdByStar}$. |
|
1062 \end{proof} |
|
1063 %This is not entirely true for annotated regular expressions: |
|
1064 %%TODO: bsimp bders \neq bderssimp |
|
1065 %\begin{center} |
1113 %\begin{center} |
1066 % $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$ |
1114 % \begin{tabular}{llrcllrllll} |
1067 %\end{center} |
1115 % $r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & &&&&\\ |
1068 %For bit-codes, the order in which simplification is applied |
1116 % &&&&$\textit{rsimp}$ & $($ & $ |
1069 %might cause a difference in the location they are placed. |
1117 % \sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\ |
1070 %If we want something like |
1118 % &&&& & & & & $\;\; \textit{nupdates} \;$ & |
1071 %\begin{center} |
1119 % $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\ |
1072 % $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$ |
1120 % &&&& & & & & $)$ &\\ |
1073 %\end{center} |
1121 % &&&& & & $)$ & & &\\ |
1074 %Some "canonicalization" procedure is required, |
1122 % &&&& & $)$ & & & &\\ |
1075 %which either pushes all the common bitcodes to nodes |
|
1076 %as senior as possible: |
|
1077 %\begin{center} |
|
1078 % $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ |
|
1079 %\end{center} |
|
1080 %or does the reverse. However bitcodes are not of interest if we are talking about |
|
1081 %the $\llbracket r \rrbracket$ size of a regex. |
|
1082 %Therefore for the ease and simplicity of producing a |
|
1083 %proof for a size bound, we are happy to restrict ourselves to |
|
1084 %unannotated regular expressions, and obtain such equalities as |
|
1085 %TODO: rsimp sflat |
|
1086 % The simplification of a flattened out regular expression, provided it comes |
|
1087 %from the derivative of a star, is the same as the one nested. |
|
1088 |
|
1089 |
|
1090 |
|
1091 Now we introduce an inductive property |
|
1092 for $\starupdate$ and $\hflataux{\_}$. |
|
1093 \begin{lemma}\label{starHfauInduct} |
|
1094 If we do derivatives of $r^*$ |
|
1095 with a string that starts with $c$, |
|
1096 then flatten it out, |
|
1097 we obtain a list |
|
1098 of the shape $\sum_{s' \in sS} (r\backslash_r s') \cdot r^*$, |
|
1099 where $sS = \starupdates \; s \; r \; [[c]]$. Namely, |
|
1100 \begin{center} |
|
1101 $\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} = |
|
1102 \map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; |
|
1103 (\starupdates \; s \; r_0 \; [[c]])$ |
|
1104 \end{center} |
|
1105 holds. |
|
1106 \end{lemma} |
|
1107 \begin{proof} |
|
1108 By an induction on $s$, the inductive cases |
|
1109 being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used. |
|
1110 \end{proof} |
|
1111 \noindent |
|
1112 |
|
1113 $\hflataux{\_}$ has a similar effect as $\textit{flatten}$: |
|
1114 \begin{lemma}\label{hflatauxGrewrites} |
|
1115 $a :: rs \grewrites \hflataux{a} @ rs$ |
|
1116 \end{lemma} |
|
1117 \begin{proof} |
|
1118 By induction on $a$. $rs$ is set to take arbitrary values. |
|
1119 \end{proof} |
|
1120 It is also not surprising that $\textit{rsimp}$ will cover |
|
1121 the effect of $\hflataux{\_}$: |
|
1122 \begin{lemma}\label{cbsHfauRsimpeq1} |
|
1123 $\rsimp{(r_1 + r_2)} = \rsimp{(\RALTS{\hflataux{r_1} @ \hflataux{r_2}})}$ |
|
1124 \end{lemma} |
|
1125 |
|
1126 \begin{proof} |
|
1127 By using the rewriting relation $\rightsquigarrow$ |
|
1128 \end{proof} |
|
1129 And from this we obtain a proof that a star's derivative will be the same |
|
1130 as if it had all its nested alternatives created during deriving being flattened out: |
|
1131 For example, |
|
1132 \begin{lemma}\label{hfauRsimpeq2} |
|
1133 $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ |
|
1134 \end{lemma} |
|
1135 \begin{proof} |
|
1136 By structural induction on $r$, where the induction rules |
|
1137 are these of $\createdByStar{\_}$. |
|
1138 Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case. |
|
1139 \end{proof} |
|
1140 |
|
1141 |
|
1142 %Here is a corollary that states the lemma in |
|
1143 %a more intuitive way: |
|
1144 %\begin{corollary} |
|
1145 % $\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot |
|
1146 % (r^*))\; (\starupdates \; c\; r\; [[c]])$ |
|
1147 %\end{corollary} |
|
1148 %\noindent |
|
1149 %Note that this is also agnostic of the simplification |
|
1150 %function we defined, and is therefore of more general interest. |
|
1151 |
|
1152 Together with the rewriting relation |
|
1153 \begin{lemma}\label{starClosedForm6Hrewrites} |
|
1154 We have the following set of rewriting relations or equalities: |
|
1155 \begin{itemize} |
|
1156 \item |
|
1157 $\textit{rsimp} \; (r^* \backslash_r (c::s)) |
|
1158 \sequal |
|
1159 \sum \; ( ( \sum (\lambda s. (r\backslash_r s) \cdot r^*) \; ( |
|
1160 \starupdates \; s \; r \; [ c::[]] ) ) )$ |
|
1161 \item |
|
1162 $r \backslash_{rsimp} (c::s) = \textit{rsimp} \; ( ( |
|
1163 \sum ( (\map \; (\lambda s_1. (r\backslash s_1) \; r^*) \; |
|
1164 (\starupdates \;s \; r \; [ c::[] ])))))$ |
|
1165 \item |
|
1166 $\sum ( (\map \; (\lambda s. (r\backslash s) \; r^*) \; Ss)) |
|
1167 \sequal |
|
1168 \sum ( (\map \; (\lambda s. \textit{rsimp} \; (r\backslash s) \; |
|
1169 r^*) \; Ss) )$ |
|
1170 \item |
|
1171 $\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss |
|
1172 \scfrewrites |
|
1173 \map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$ |
|
1174 \item |
|
1175 $( ( \sum ( ( \map \ (\lambda s. \;\; |
|
1176 (\textit{rsimp} \; (r \backslash_r s)) \cdot r^*) \; (\starupdates \; |
|
1177 s \; r \; [ c::[] ])))))$\\ |
|
1178 $\sequal$\\ |
|
1179 $( ( \sum ( ( \map \ (\lambda s. \;\; |
|
1180 ( r \backslash_{rsimp} s)) \cdot r^*) \; (\starupdates \; |
|
1181 s \; r \; [ c::[] ]))))$\\ |
|
1182 \end{itemize} |
|
1183 \end{lemma} |
|
1184 \begin{proof} |
|
1185 Part 1 leads to part 2. |
|
1186 The rest of them are routine. |
|
1187 \end{proof} |
|
1188 \noindent |
|
1189 Next the closed form for star regular expressions can be derived: |
|
1190 \begin{theorem}\label{starClosedForm} |
|
1191 $\rderssimp{r^*}{c::s} = |
|
1192 \rsimp{ |
|
1193 (\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; |
|
1194 (\starupdates \; s\; r \; [[c]]) |
|
1195 ) |
|
1196 ) |
|
1197 } |
|
1198 $ |
|
1199 \end{theorem} |
|
1200 \begin{proof} |
|
1201 By an induction on $s$. |
|
1202 The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, \ref{starClosedForm6Hrewrites} |
|
1203 and \ref{hfauRsimpeq2} |
|
1204 are used. |
|
1205 In \ref{starClosedForm6Hrewrites}, the equalities are |
|
1206 used to link the LHS and RHS. |
|
1207 \end{proof} |
|
1208 |
|
1209 |
|
1210 |
|
1211 |
|
1212 The closed form for them looks like: |
|
1213 %\begin{center} |
|
1214 % \begin{tabular}{llrclll} |
|
1215 % $r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & \\ |
|
1216 % $\textit{rsimp}$ & $($ & $ |
|
1217 % \sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\ |
|
1218 % & & & & $\textit{nupdates} \;$ & |
|
1219 % $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\ |
|
1220 % & & & & $)$ &\\ |
|
1221 % & & $)$ & & &\\ |
|
1222 % & $)$ & & & &\\ |
|
1223 % \end{tabular} |
1123 % \end{tabular} |
1224 %\end{center} |
1124 %\end{center} |
1225 \begin{center} |
1125 %The $\textit{optermsimp}$ function with the argument $r$ |
1226 \begin{tabular}{llrcllrllll} |
1126 %chooses from two options: $\ZERO$ or |
1227 $r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & &&&&\\ |
1127 %We define for the $r^{\{n\}}$ constructor something similar to $\starupdate$ |
1228 &&&&$\textit{rsimp}$ & $($ & $ |
1128 %and $\starupdates$: |
1229 \sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\ |
1129 %\begin{center} |
1230 &&&& & & & & $\;\; \textit{nupdates} \;$ & |
1130 % \begin{tabular}{lcl} |
1231 $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\ |
1131 % $\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\ |
1232 &&&& & & & & $)$ &\\ |
1132 % $\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\ |
1233 &&&& & & $)$ & & &\\ |
1133 % & & $\textit{if} \; |
1234 &&&& & $)$ & & & &\\ |
1134 % (\rnullable \; (\rders \; r \; s))$ \\ |
1235 \end{tabular} |
1135 % & & $\textit{then} \;\; (s @ [c]) :: [c] :: ( |
1236 \end{center} |
1136 % \starupdate \; c \; r \; Ss)$ \\ |
1237 The $\textit{optermsimp}$ function with the argument $r$ |
1137 % & & $\textit{else} \;\; (s @ [c]) :: ( |
1238 chooses from two options: $\ZERO$ or |
1138 % \starupdate \; c \; r \; Ss)$ |
1239 We define for the $r^{\{n\}}$ constructor something similar to $\starupdate$ |
1139 % \end{tabular} |
1240 and $\starupdates$: |
1140 %\end{center} |
1241 \begin{center} |
1141 %\noindent |
1242 \begin{tabular}{lcl} |
1142 %As a generalisation from characters to strings, |
1243 $\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\ |
1143 %$\starupdates$ takes a string instead of a character |
1244 $\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\ |
1144 %as the first input argument, and is otherwise the same |
1245 & & $\textit{if} \; |
1145 %as $\starupdate$. |
1246 (\rnullable \; (\rders \; r \; s))$ \\ |
1146 %\begin{center} |
1247 & & $\textit{then} \;\; (s @ [c]) :: [c] :: ( |
1147 % \begin{tabular}{lcl} |
1248 \starupdate \; c \; r \; Ss)$ \\ |
1148 % $\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\ |
1249 & & $\textit{else} \;\; (s @ [c]) :: ( |
1149 % $\starupdates \; (c :: cs) \; r \; Ss$ & $=$ & $\starupdates \; cs \; r \; ( |
1250 \starupdate \; c \; r \; Ss)$ |
1150 % \starupdate \; c \; r \; Ss)$ |
1251 \end{tabular} |
1151 % \end{tabular} |
1252 \end{center} |
1152 %\end{center} |
1253 \noindent |
1153 %\noindent |
1254 As a generalisation from characters to strings, |
1154 |
1255 $\starupdates$ takes a string instead of a character |
1155 |
1256 as the first input argument, and is otherwise the same |
1156 |
1257 as $\starupdate$. |
1157 %\section{Zippers} |
1258 \begin{center} |
1158 %Zipper is a data structure designed to operate on |
1259 \begin{tabular}{lcl} |
1159 %and navigate between local parts of a tree. |
1260 $\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\ |
1160 %It was first formally described by Huet \cite{HuetZipper}. |
1261 $\starupdates \; (c :: cs) \; r \; Ss$ & $=$ & $\starupdates \; cs \; r \; ( |
1161 %Typical applications of zippers involve text editor buffers |
1262 \starupdate \; c \; r \; Ss)$ |
1162 %and proof system databases. |
1263 \end{tabular} |
1163 %In our setting, the idea is to compactify the representation |
1264 \end{center} |
1164 %of derivatives with zippers, thereby making our algorithm faster. |
1265 \noindent |
1165 %Some initial results |
1266 |
1166 %We first give a brief introduction to what zippers are, |
1267 |
1167 %and other works |
|
1168 %that apply zippers to derivatives |
|
1169 %When dealing with large trees, it would be a waste to |
|
1170 %traverse the entire tree if |
|
1171 %the operation only |
|
1172 %involves a small fraction of it. |
|
1173 %The idea is to put the focus on that subtree, turning other parts |
|
1174 %of the tree into a context |
|
1175 % |
|
1176 % |
|
1177 %One observation about our derivative-based lexing algorithm is that |
|
1178 %the derivative operation sometimes traverses the entire regular expression |
|
1179 %unnecessarily: |
1268 |
1180 |
1269 |
1181 |
1270 %---------------------------------------------------------------------------------------- |
1182 %---------------------------------------------------------------------------------------- |
1271 % SECTION 1 |
1183 % SECTION 1 |
1272 %---------------------------------------------------------------------------------------- |
1184 %---------------------------------------------------------------------------------------- |