changeset 424 | 2416fdec6396 |
parent 423 | b7199d6c672d |
child 425 | 14c558ae0b09 |
423:b7199d6c672d | 424:2416fdec6396 |
---|---|
62 |
62 |
63 code ("code _" [79] 74) and |
63 code ("code _" [79] 74) and |
64 intern ("_\<^latex>\<open>\\mbox{$^\\uparrow$}\<close>" [900] 80) and |
64 intern ("_\<^latex>\<open>\\mbox{$^\\uparrow$}\<close>" [900] 80) and |
65 erase ("_\<^latex>\<open>\\mbox{$^\\downarrow$}\<close>" [1000] 74) and |
65 erase ("_\<^latex>\<open>\\mbox{$^\\downarrow$}\<close>" [1000] 74) and |
66 bnullable ("bnullable _" [1000] 80) and |
66 bnullable ("bnullable _" [1000] 80) and |
67 bsimp_AALTs ("bsimpALT _ _" [10,1000] 80) and |
|
68 bsimp_ASEQ ("bsimpSEQ _ _ _" [10,1000,1000] 80) and |
|
67 bmkeps ("bmkeps _" [1000] 80) and |
69 bmkeps ("bmkeps _" [1000] 80) and |
68 |
70 |
69 srewrite ("_\<^latex>\<open>\\mbox{$\\,\\stackrel{s}{\\leadsto}$}\<close> _" [71, 71] 80) and |
71 srewrite ("_\<^latex>\<open>\\mbox{$\\,\\stackrel{s}{\\leadsto}$}\<close> _" [71, 71] 80) and |
70 rrewrites ("_ \<^latex>\<open>\\mbox{$\\,\\leadsto^*$}\<close> _" [71, 71] 80) and |
72 rrewrites ("_ \<^latex>\<open>\\mbox{$\\,\\leadsto^*$}\<close> _" [71, 71] 80) and |
71 blexer_simp ("blexer\<^sup>+" 1000) |
73 blexer_simp ("blexer\<^sup>+" 1000) |
145 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow |
147 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow |
146 r$. While such simple-minded simplifications have been proved in our |
148 r$. While such simple-minded simplifications have been proved in our |
147 earlier work to preserve the correctness of Sulzmann and Lu's |
149 earlier work to preserve the correctness of Sulzmann and Lu's |
148 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do |
150 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do |
149 \emph{not} help with limiting the growth of the derivatives shown |
151 \emph{not} help with limiting the growth of the derivatives shown |
150 above: the growth is slowed, but the derivatives can still grow quickly |
152 above: the growth is slowed, but the derivatives can still grow rather |
151 beyond any finite bound. |
153 quickly beyond any finite bound. |
152 |
|
153 |
154 |
154 |
155 |
155 Sulzmann and Lu overcome this ``growth problem'' in a second algorithm |
156 Sulzmann and Lu overcome this ``growth problem'' in a second algorithm |
156 \cite{Sulzmann2014} where they introduce bitcoded |
157 \cite{Sulzmann2014} where they introduce bitcoded |
157 regular expressions. In this version, POSIX values are |
158 regular expressions. In this version, POSIX values are |
212 consisting of just a single character $c$ is written $[c]$. |
213 consisting of just a single character $c$ is written $[c]$. |
213 Our regular expressions are defined as usual as the elements of the following inductive |
214 Our regular expressions are defined as usual as the elements of the following inductive |
214 datatype: |
215 datatype: |
215 |
216 |
216 \begin{center} |
217 \begin{center} |
217 @{text "r :="} |
218 @{text "r ::="} \; |
218 @{const "ZERO"} $\mid$ |
219 @{const "ZERO"} $\mid$ |
219 @{const "ONE"} $\mid$ |
220 @{const "ONE"} $\mid$ |
220 @{term "CH c"} $\mid$ |
221 @{term "CH c"} $\mid$ |
221 @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$ |
222 @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$ |
222 @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$ |
223 @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$ |
300 |
301 |
301 The novel idea of Sulzmann and Lu is to extend this algorithm for |
302 The novel idea of Sulzmann and Lu is to extend this algorithm for |
302 lexing, where it is important to find out which part of the string |
303 lexing, where it is important to find out which part of the string |
303 is matched by which part of the regular expression. |
304 is matched by which part of the regular expression. |
304 For this Sulzmann and Lu presented two lexing algorithms in their paper |
305 For this Sulzmann and Lu presented two lexing algorithms in their paper |
305 \cite{Sulzmann2014}. This first algorithm consists of two phases: first a |
306 \cite{Sulzmann2014}. The first algorithm consists of two phases: first a |
306 matching phase (which is Brzozowski's algorithm) and then a value |
307 matching phase (which is Brzozowski's algorithm) and then a value |
307 construction phase. The values encode \emph{how} a regular expression |
308 construction phase. The values encode \emph{how} a regular expression |
308 matches a string. \emph{Values} are defined as the inductive datatype |
309 matches a string. \emph{Values} are defined as the inductive datatype |
309 |
310 |
310 \begin{center} |
311 \begin{center} |
421 \end{tabular} |
422 \end{tabular} |
422 \end{tabular} |
423 \end{tabular} |
423 \end{center} |
424 \end{center} |
424 |
425 |
425 \noindent |
426 \noindent |
426 The function @{text mkeps} is called when the last derivative is nullable, that is |
427 The function @{text mkeps} is run when the last derivative is nullable, that is |
427 the string to be matched is in the language of the regular expression. It generates |
428 the string to be matched is in the language of the regular expression. It generates |
428 a value for how the last derivative can match the empty string. The injection function |
429 a value for how the last derivative can match the empty string. The injection function |
429 then calculates the corresponding value for each intermediate derivative until |
430 then calculates the corresponding value for each intermediate derivative until |
430 a value for the original regular expression is generated. |
431 a value for the original regular expression is generated. |
431 Graphically the algorithm by |
432 Graphically the algorithm by |
432 Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz} |
433 Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz} |
433 where the path from the left to the right involving @{term derivatives}/@{const |
434 where the path from the left to the right involving @{term derivatives}/@{const |
434 nullable} is the first phase of the algorithm (calculating successive |
435 nullable} is the first phase of the algorithm (calculating successive |
435 \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to |
436 \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to |
436 left, the second phase. This picture shows the steps required when a |
437 left, the second phase. The picture above shows the steps required when a |
437 regular expression, say @{text "r\<^sub>1"}, matches the string @{term |
438 regular expression, say @{text "r\<^sub>1"}, matches the string @{term |
438 "[a,b,c]"}. The lexing algorithm by Sulzmann and Lu can be defined as: |
439 "[a,b,c]"}. The lexing algorithm by Sulzmann and Lu can be defined as: |
439 |
440 |
440 \begin{figure}[t] |
441 \begin{figure}[t] |
441 \begin{center} |
442 \begin{center} |
486 & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"} |
487 & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"} |
487 \end{tabular} |
488 \end{tabular} |
488 \end{center} |
489 \end{center} |
489 |
490 |
490 |
491 |
491 We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that the algorithm by Sulzmann and Lu |
492 We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that |
492 is correct. The cenral property we established relates the derivative operation to the injection function. |
493 this algorithm is correct, that is it generates POSIX values. The |
494 cenral property we established relates the derivative operation to the |
|
495 injection function. |
|
493 |
496 |
494 \begin{proposition}\label{Posix2} |
497 \begin{proposition}\label{Posix2} |
495 \textit{If} $(s,\; r\backslash c) \rightarrow v$ \textit{then} $(c :: s,\; r) \rightarrow$ \textit{inj} $r\; c\; v$. |
498 \textit{If} $(s,\; r\backslash c) \rightarrow v$ \textit{then} $(c :: s,\; r) \rightarrow$ \textit{inj} $r\; c\; v$. |
496 \end{proposition} |
499 \end{proposition} |
497 |
500 |
508 |
511 |
509 \noindent |
512 \noindent |
510 In fact we have shown that in the success case the generated POSIX value $v$ is |
513 In fact we have shown that in the success case the generated POSIX value $v$ is |
511 unique and in the failure case that there is no POSIX value $v$ that satisfies |
514 unique and in the failure case that there is no POSIX value $v$ that satisfies |
512 $(s, r) \rightarrow v$. While the algorithm is correct, it is excrutiatingly |
515 $(s, r) \rightarrow v$. While the algorithm is correct, it is excrutiatingly |
513 slow in examples where the derivatives grow arbitrarily (see example from the |
516 slow in cases where the derivatives grow arbitrarily (see example from the |
514 Introduction). However it can be used as a conveninet reference point for the correctness |
517 Introduction). However it can be used as a convenient reference point for the correctness |
515 proof of the second algorithm by Sulzmann and Lu, which we shall describe next. |
518 proof of the second algorithm by Sulzmann and Lu, which we shall describe next. |
516 |
519 |
517 *} |
520 *} |
518 |
521 |
519 section {* Bitcoded Regular Expressions and Derivatives *} |
522 section {* Bitcoded Regular Expressions and Derivatives *} |
538 |
541 |
539 \noindent where @{text bs} stands for bitsequences; @{text r}, |
542 \noindent where @{text bs} stands for bitsequences; @{text r}, |
540 @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular |
543 @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular |
541 expressions; and @{text rs} for lists of bitcoded regular |
544 expressions; and @{text rs} for lists of bitcoded regular |
542 expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"} |
545 expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"} |
543 is just an abbreviation for @{text "ALTs bs [r\<^sub>1, r\<^sub>2]"}. |
546 is just an abbreviation for \mbox{@{text "ALTs bs [r\<^sub>1, r\<^sub>2]"}}. |
544 For bitsequences we just use lists made up of the |
547 For bitsequences we just use lists made up of the |
545 constants @{text Z} and @{text S}. The idea with bitcoded regular |
548 constants @{text Z} and @{text S}. The idea with bitcoded regular |
546 expressions is to incrementally generate the value information (for |
549 expressions is to incrementally generate the value information (for |
547 example @{text Left} and @{text Right}) as bitsequences. For this |
550 example @{text Left} and @{text Right}) as bitsequences. For this |
548 Sulzmann and Lu define a coding |
551 Sulzmann and Lu define a coding |
665 $\textit{STAR}\;[]\,r^\uparrow$\\ |
668 $\textit{STAR}\;[]\,r^\uparrow$\\ |
666 \end{tabular} |
669 \end{tabular} |
667 \end{center} |
670 \end{center} |
668 |
671 |
669 \noindent |
672 \noindent |
670 There is also an \emph{erase}-function, written $a^\downarrow$, which |
673 There is also an \emph{erase}-function, written $r^\downarrow$, which |
671 transforms a bitcoded regular expression into a (standard) regular |
674 transforms a bitcoded regular expression into a (standard) regular |
672 expression by just erasing the annotated bitsequences. We omit the |
675 expression by just erasing the annotated bitsequences. We omit the |
673 straightforward definition. For defining the algorithm, we also need |
676 straightforward definition. For defining the algorithm, we also need |
674 the functions \textit{bnullable} and \textit{bmkeps}, which are the |
677 the functions \textit{bnullable} and \textit{bmkeps}, which are the |
675 ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on |
678 ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on |
703 \end{tabular} |
706 \end{tabular} |
704 \end{center} |
707 \end{center} |
705 |
708 |
706 |
709 |
707 \noindent |
710 \noindent |
708 The key function in the bitcoded algorithm is the derivative of an |
711 The key function in the bitcoded algorithm is the derivative of a |
709 bitcoded regular expression. This derivative calculates the |
712 bitcoded regular expression. This derivative calculates the |
710 derivative but at the same time also the incremental part of bitsequences |
713 derivative but at the same time also the incremental part of bitsequences |
711 that contribute to constructing a POSIX value. |
714 that contribute to constructing a POSIX value. |
712 |
715 |
713 \begin{center} |
716 \begin{center} |
750 builds the bitcoded derivative according to $s$. If the derivative is |
753 builds the bitcoded derivative according to $s$. If the derivative is |
751 (b)nullable the string is in the language of $r$ and it extracts the bitsequence using the |
754 (b)nullable the string is in the language of $r$ and it extracts the bitsequence using the |
752 $\textit{bmkeps}$ function. Finally it decodes the bitsequence into a value. If |
755 $\textit{bmkeps}$ function. Finally it decodes the bitsequence into a value. If |
753 the derivative is \emph{not} nullable, then $\textit{None}$ is |
756 the derivative is \emph{not} nullable, then $\textit{None}$ is |
754 returned. We can show that this way of calculating a value |
757 returned. We can show that this way of calculating a value |
755 generates the same result as with \textit{lexer}. |
758 generates the same result as \textit{lexer}. |
756 |
759 |
757 Before we can proceed we need to define a helper function, called |
760 Before we can proceed we need to define a helper function, called |
758 \textit{retrieve}, which Sulzmann and Lu introduced for the correctness proof. |
761 \textit{retrieve}, which Sulzmann and Lu introduced for the correctness proof. |
759 |
762 |
760 \begin{center} |
763 \begin{center} |
771 \end{tabular} |
774 \end{tabular} |
772 \end{center} |
775 \end{center} |
773 |
776 |
774 \noindent |
777 \noindent |
775 The idea behind this function is to retrieve a possibly partial |
778 The idea behind this function is to retrieve a possibly partial |
776 bitcode from a bitcoded regular expression, where the retrieval is |
779 bitsequence from a bitcoded regular expression, where the retrieval is |
777 guided by a value. For example if the value is $\Left$ then we |
780 guided by a value. For example if the value is $\Left$ then we |
778 descend into the left-hand side of an alternative in order to |
781 descend into the left-hand side of an alternative in order to |
779 assemble the bitcode. Similarly for |
782 assemble the bitcode. Similarly for |
780 $\Right$. The property we can show is that for a given $v$ and $r$ |
783 $\Right$. The property we can show is that for a given $v$ and $r$ |
781 with $\vdash v : r$, the retrieved bitsequence from the internalised |
784 with $\vdash v : r$, the retrieved bitsequence from the internalised |
804 All properties are by induction on annotated regular expressions. There are no |
807 All properties are by induction on annotated regular expressions. There are no |
805 interesting cases. |
808 interesting cases. |
806 \end{proof} |
809 \end{proof} |
807 |
810 |
808 \noindent |
811 \noindent |
809 The only problem left for the correctness proof is that the bitcoded algorithm |
812 The only difficulty left for the correctness proof is that the bitcoded algorithm |
810 has only a ``forward phase'' where POSIX values are generated incrementally. |
813 has only a ``forward phase'' where POSIX values are generated incrementally. |
811 We can achive the same effect with @{text lexer} by stacking up injection |
814 We can achieve the same effect with @{text lexer} by stacking up injection |
812 functions in the during forward phase. An auxiliary function, called $\textit{flex}$, |
815 functions during the forward phase. An auxiliary function, called $\textit{flex}$, |
813 allows us to recast the rules of $\lexer$ (with its two phases) in terms of a single |
816 allows us to recast the rules of $\lexer$ (with its two phases) in terms of a single |
814 phase. |
817 phase and stacked up injection functions. |
815 |
818 |
816 \begin{center} |
819 \begin{center} |
817 \begin{tabular}{lcl} |
820 \begin{tabular}{lcl} |
818 $\textit{flex}\;r\,f\,[]$ & $\dn$ & $f$\\ |
821 $\textit{flex}\;r\,f\,[]$ & $\dn$ & $f$\\ |
819 $\textit{flex}\;r\,f\,(c\!::\!s)$ & $\dn$ & |
822 $\textit{flex}\;r\,f\,(c\!::\!s)$ & $\dn$ & |
822 \end{center} |
825 \end{center} |
823 |
826 |
824 \noindent |
827 \noindent |
825 The point of this function is that when |
828 The point of this function is that when |
826 reaching the end of the string, we just need to apply the stacked |
829 reaching the end of the string, we just need to apply the stacked |
827 injection functions to the value generated by $\mkeps$. |
830 injection functions to the value generated by @{text mkeps}. |
828 Using this function we can recast the success case in @{text lexer} |
831 Using this function we can recast the success case in @{text lexer} |
829 as follows: |
832 as follows: |
830 |
833 |
831 \begin{proposition}\label{flex} |
834 \begin{proposition}\label{flex} |
832 If @{text "lexer r s = Some v"} \;then\; @{text "v = "}$\,\textit{flex}\,r\,id\,s\, |
835 If @{text "lexer r s = Some v"} \;then\; @{text "v = "}$\,\textit{flex}\,r\,id\,s\, |
837 Note we did not redefine \textit{lexer}, we just established that the |
840 Note we did not redefine \textit{lexer}, we just established that the |
838 value generated by \textit{lexer} can also be obtained by a different |
841 value generated by \textit{lexer} can also be obtained by a different |
839 method. While this different method is not efficient (we essentially |
842 method. While this different method is not efficient (we essentially |
840 need to traverse the string $s$ twice, once for building the |
843 need to traverse the string $s$ twice, once for building the |
841 derivative $r\backslash s$ and another time for stacking up injection |
844 derivative $r\backslash s$ and another time for stacking up injection |
842 functions using \textit{flex}), it helped us with proving |
845 functions using \textit{flex}), it helps us with proving |
843 that incrementally building up values. |
846 that incrementally building up values generates the same result. |
844 |
847 |
845 This brings us to our main lemma in this section: if we build a |
848 This brings us to our main lemma in this section: if we calculate a |
846 derivative, say $r\backslash s$ and have a value, say $v$, inhabited |
849 derivative, say $r\backslash s$ and have a value, say $v$, inhabited |
847 by this derivative, then we can produce the result $\lexer$ generates |
850 by this derivative, then we can produce the result $\lexer$ generates |
848 by applying this value to the stacked-up injection functions |
851 by applying this value to the stacked-up injection functions |
849 $\textit{flex}$ assembles. The lemma establishes that this is the same |
852 that $\textit{flex}$ assembles. The lemma establishes that this is the same |
850 value as if we build the annotated derivative $r^\uparrow\backslash s$ |
853 value as if we build the annotated derivative $r^\uparrow\backslash s$ |
851 and then retrieve the corresponding bitcoded version, followed by a |
854 and then retrieve the corresponding bitcoded version, followed by a |
852 decoding step. |
855 decoding step. |
853 |
856 |
854 \begin{lemma}[Main Lemma]\label{mainlemma}\it |
857 \begin{lemma}[Main Lemma]\label{mainlemma}\it |
891 (s\,@\,[c]))\,v)\,r$ as required. The last rewrite step is possible |
894 (s\,@\,[c]))\,v)\,r$ as required. The last rewrite step is possible |
892 because we generalised over $v$ in our induction. |
895 because we generalised over $v$ in our induction. |
893 \end{proof} |
896 \end{proof} |
894 |
897 |
895 \noindent |
898 \noindent |
896 With this lemma in place, we can prove the correctness of \textit{blexer} such |
899 With this lemma in place, we can prove the correctness of \textit{blexer}---it indeed |
897 that it produces the same result as \textit{lexer}. |
900 produces the same result as \textit{lexer}. |
898 |
901 |
899 |
902 |
900 \begin{theorem} |
903 \begin{theorem} |
901 $\textit{lexer}\,r\,s = \textit{blexer}\,r\,s$ |
904 $\textit{lexer}\,r\,s = \textit{blexer}\,r\,s$ |
902 \end{theorem} |
905 \end{theorem} |
911 \nullable(r\backslash s) |
914 \nullable(r\backslash s) |
912 \] |
915 \] |
913 |
916 |
914 \noindent |
917 \noindent |
915 For the \textit{if}-branch suppose $r_d \dn r^\uparrow\backslash s$ and |
918 For the \textit{if}-branch suppose $r_d \dn r^\uparrow\backslash s$ and |
916 $d \dn r\backslash s$. We have (*) $\nullable\,d$. We can then show |
919 $d \dn r\backslash s$. We have (*) @{text "nullable d"}. We can then show |
917 by Lemma~\ref{bnullable}\textit{(3)} that |
920 by Lemma~\ref{bnullable}\textit{(3)} that |
918 % |
921 % |
919 \[ |
922 \[ |
920 \textit{decode}(\textit{bmkeps}\,r_d)\,r = |
923 \textit{decode}(\textit{bmkeps}\,r_d)\,r = |
921 \textit{decode}(\textit{retrieve}\,a\,(\textit{mkeps}\,d))\,r |
924 \textit{decode}(\textit{retrieve}\,a\,(\textit{mkeps}\,d))\,r |
929 \textit{if}-branches return the same value. In the |
932 \textit{if}-branches return the same value. In the |
930 \textit{else}-branches both \textit{lexer} and \textit{blexer} return |
933 \textit{else}-branches both \textit{lexer} and \textit{blexer} return |
931 \textit{None}. Therefore we can conclude the proof. |
934 \textit{None}. Therefore we can conclude the proof. |
932 \end{proof} |
935 \end{proof} |
933 |
936 |
934 \noindent |
937 \noindent This establishes that the bitcoded algorithm by Sulzmann and |
935 This establishes that the bitcoded algorithm by Sulzmann |
938 Lu \emph{without} simplification produces correct results. This was |
936 and Lu \emph{without} simplification produces correct results. This was |
939 only conjectured by Sulzmann and Lu in their paper |
937 only conjectured in their paper \cite{Sulzmann2014}. The next step |
940 \cite{Sulzmann2014}. The next step is to add simplifications. |
938 is to add simplifications. |
|
939 |
941 |
940 *} |
942 *} |
941 |
943 |
942 |
944 |
943 section {* Simplification *} |
945 section {* Simplification *} |
944 |
946 |
945 text {* |
947 text {* |
946 |
948 |
947 Derivatives as calculated by Brzozowski’s method are usually more |
949 Derivatives as calculated by Brzozowski’s method are usually more |
948 complex regular expressions than the initial one; the result is |
950 complex regular expressions than the initial one; the result is |
949 that the derivative-based matching and lexing algorithms are |
951 that derivative-based matching and lexing algorithms are |
950 often abysmally slow. However, as Sulzmann and Lu wrote, various |
952 often abysmally slow if the ``growth problem'' is not addressed. As Sulzmann and Lu wrote, various |
951 optimisations are possible, such as the simplifications |
953 optimisations are possible, such as the simplifications |
952 $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r \Rightarrow r$, |
954 $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r \Rightarrow r$, |
953 $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow r$. While these |
955 $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow r$. While these |
954 simplifications can speed up the algorithms considerably in many |
956 simplifications can considerably speed up the two algorithms in many |
955 cases, they do not solve fundamentally the ``growth problem'' with |
957 cases, they do not solve fundamentally the growth problem with |
956 derivatives. To see this let us return to the example |
958 derivatives. To see this let us return to the example from the |
959 Introduction that shows the derivatives for \mbox{@{text "(a + aa)\<^sup>*"}}. |
|
960 If we delete in the 3rd step all $\ZERO{}s$ and $\ONE$s according to |
|
961 the simplification rules shown above we obtain |
|
962 % |
|
963 \def\xll{\xrightarrow{\_\backslash{} [a, a, a]}}%% |
|
964 % |
|
965 \begin{equation}\label{derivex} |
|
966 (a + aa)^* \quad\xll\quad |
|
967 \underbrace{\mbox{$(\ONE + \ONE{}a) \cdot (a + aa)^*$}}_{r} \;+\; |
|
968 ((a + aa)^* + \underbrace{\mbox{$(\ONE + \ONE{}a) \cdot (a + aa)^*$}}_{r}) |
|
969 \end{equation} |
|
970 |
|
971 \noindent This is a simpler derivative, but unfortunately we |
|
972 cannot make further simplifications. This is a problem because |
|
973 the outermost alternatives contains two copies of the same |
|
974 regular expression (underlined with $r$). The copies will |
|
975 spawn new copies in later steps and they in turn further copies. This |
|
976 destroys an hope of taming the size of the derivatives. But the |
|
977 second copy of $r$ in \eqref{derivex} will never contribute to a |
|
978 value, because POSIX lexing will always prefer matching a string |
|
979 with the first copy. So in principle it could be removed. |
|
980 The dilemma with the simple-minded |
|
981 simplification rules above is that the rule $r + r \Rightarrow r$ |
|
982 will never be applicable because as can be seen in this example the |
|
983 regular expressions are separated by another sub-regular expression. |
|
984 |
|
985 But here is where Sulzmann and Lu's representation of generalised |
|
986 alternatives in the bitcoded algorithm shines: in @{term |
|
987 "ALTs bs rs"} we can define a more aggressive simplification by |
|
988 recursively simplifying all regular expressions in @{text rs} and |
|
989 then analyse the resulting list and remove any duplicates. |
|
990 Another advantage is that the bitsequences in bitcoded regular |
|
991 expressions can be easily modified such that simplification does not |
|
992 interfere with the value constructions. For example we can ``flatten'', or |
|
993 de-nest, @{text ALTs} as follows |
|
994 % |
|
995 \[ |
|
996 @{term "ALTs bs\<^sub>1 ((ALTs bs\<^sub>2 rs\<^sub>2) # rs\<^sub>1)"} |
|
997 \quad\xrightarrow{bsimp}\quad |
|
998 @{term "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) # rs\<^sub>1)"} |
|
999 \] |
|
1000 |
|
1001 \noindent |
|
1002 where we just need to fuse the bitsequence that has accumulated in @{text "bs\<^sub>2"} |
|
1003 to the alternatives in @{text "rs\<^sub>2"}. As we shall show below this will |
|
1004 ensure that the correct value corresponding to the original (unsimplified) |
|
1005 regular expression can still be extracted. %In this way the value construction |
|
1006 %is not affected by simplification. |
|
1007 |
|
1008 However there is one problem with the definition for the more |
|
1009 aggressive simlification rules by Sulzmann and Lu. Recasting |
|
1010 their definition with our syntax they define the step of removing |
|
1011 duplicates as |
|
1012 % |
|
1013 \[ @{text "bsimp (ALTs bs rs)"} \dn @{text "ALTs |
|
1014 bs (nup (map bsimp rs))"} |
|
1015 \] |
|
1016 |
|
1017 \noindent where they first recursively simplify the regular |
|
1018 expressions in @{text rs} (using @{text map}) and then use |
|
1019 Haskell's @{text nub}-function to remove potential |
|
1020 duplicates. While this makes sense when considering the example |
|
1021 shown in \eqref{derivex}, @{text nub} is the inappropriate |
|
1022 function in the case of bitcoded regular expressions. The reason |
|
1023 is that in general the n elements in @{text rs} will have a |
|
1024 different bitsequence annotated to it and in this way @{text nub} |
|
1025 will never find a duplicate to be removed. The correct way to |
|
1026 handle this situation is to first \emph{erase} the regular |
|
1027 expressions when comparing potential duplicates. This is inspired |
|
1028 by Scala's list functions of the form \mbox{@{text "distinctBy rs f |
|
1029 acc"}} where a function is applied first before two elements |
|
1030 are compared. We define this function in Isabelle/HOL as |
|
1031 |
|
1032 \begin{center} |
|
1033 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
|
1034 @{thm (lhs) distinctBy.simps(1)} & $\dn$ & @{thm (rhs) distinctBy.simps(1)}\\ |
|
1035 @{thm (lhs) distinctBy.simps(2)} & $\dn$ & @{thm (rhs) distinctBy.simps(2)} |
|
1036 \end{tabular} |
|
1037 \end{center} |
|
1038 |
|
1039 \noindent where we scan the list from left to right (because we |
|
1040 have to remove later copies). In this function, @{text f} is a |
|
1041 function and @{text acc} is an accumulator for regular |
|
1042 expressions---essentially a set of elements we have already seen |
|
1043 while scanning the list. Therefore we delete an element, say @{text x}, |
|
1044 from the list provided @{text "f x"} is already in the accumulator; |
|
1045 otherwise we keep @{text x} and scan the rest of the list but now |
|
1046 add @{text "f x"} as another element to @{text acc}. We will use |
|
1047 @{term distinctBy} where @{text f} is our erase functions, @{term "erase (DUMMY)"}, |
|
1048 that deletes bitsequences from bitcoded regular expressions. |
|
1049 This is clearly a computationally more expensive operation, than @{text nub}, |
|
1050 but is needed in order to make the removal of unnecessary copies |
|
1051 to work. |
|
1052 |
|
1053 Our simplification function depends on three helper functions, one is called |
|
1054 @{text flts} and defined as follows: |
|
1055 |
|
1056 \begin{center} |
|
1057 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
|
1058 @{thm (lhs) flts.simps(1)} & $\dn$ & @{thm (rhs) flts.simps(1)}\\ |
|
1059 @{thm (lhs) flts.simps(2)} & $\dn$ & @{thm (rhs) flts.simps(2)}\\ |
|
1060 @{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\ |
|
1061 \end{tabular} |
|
1062 \end{center} |
|
1063 |
|
1064 \noindent |
|
1065 The second clause removes all instances of @{text ZERO} in alternatives and |
|
1066 the second ``spills'' out nested alternatives (but retaining the |
|
1067 bitsequence @{text "bs'"} accumulated in the inner alternative). There are |
|
1068 some corner cases to be considered when the resulting list inside an alternative is |
|
1069 empty or a singleton list. We take care of those cases in the |
|
1070 @{text "bsimpALTs"} function; similarly we define a helper function that simplifies |
|
1071 sequences according to the usual rules about @{text ZERO}s and @{text ONE}s: |
|
1072 |
|
1073 \begin{center} |
|
1074 \begin{tabular}{c@ {\hspace{5mm}}c} |
|
1075 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
|
1076 @{text "bsimpALTs bs []"} & $\dn$ & @{text "ZERO"}\\ |
|
1077 @{text "bsimpALTs bs [r]"} & $\dn$ & @{text "fuse bs r"}\\ |
|
1078 @{text "bsimpALTs bs rs"} & $\dn$ & @{text "ALTs bs rs"}\\ |
|
1079 \mbox{}\\ |
|
1080 \end{tabular} |
|
1081 & |
|
1082 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
|
1083 @{text "bsimpSEQ bs _ ZERO"} & $\dn$ & @{text "ZERO"}\\ |
|
1084 @{text "bsimpSEQ bs ZERO _"} & $\dn$ & @{text "ZERO"}\\ |
|
1085 @{text "bsimpSEQ bs\<^sub>1 (ONE bs\<^sub>2) r\<^sub>2"} |
|
1086 & $\dn$ & @{text "fuse (bs\<^sub>1 @ bs\<^sub>2) r\<^sub>2"}\\ |
|
1087 @{text "bsimpSEQ bs r\<^sub>1 r\<^sub>2"} & $\dn$ & @{text "SEQ bs r\<^sub>1 r\<^sub>2"} |
|
1088 \end{tabular} |
|
1089 \end{tabular} |
|
1090 \end{center} |
|
1091 |
|
1092 \noindent |
|
1093 With this in place we can define our simlification function as |
|
1094 |
|
1095 \begin{center} |
|
1096 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
|
1097 @{thm (lhs) bsimp.simps(1)[of "bs" "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & |
|
1098 @{thm (rhs) bsimp.simps(1)[of "bs" "r\<^sub>1" "r\<^sub>2"]}\\ |
|
1099 @{thm (lhs) bsimp.simps(2)[of "bs" _]} & $\dn$ & @{thm (rhs) bsimp.simps(2)[of "bs" _]}\\ |
|
1100 @{text "bsimp r"} & $\dn$ & @{text r} |
|
1101 \end{tabular} |
|
1102 \end{center} |
|
1103 |
|
1104 \noindent |
|
1105 As far as we can see, our recursive function @{term bsimp} simplifies regular |
|
1106 expressions as intended by Sulzmann and Lu. There is no point to apply the |
|
1107 @{text bsimp} |
|
1108 function repeatedly (like the simplification in their paper which is applied |
|
1109 until a fixpoint is reached), because we can show that it is idempotent, that is |
|
1110 |
|
1111 \begin{proposition} |
|
1112 ??? |
|
1113 \end{proposition} |
|
957 |
1114 |
958 |
1115 |
959 \begin{lemma} |
1116 \begin{lemma} |
960 @{thm[mode=IfThen] bnullable0(1)[of "r\<^sub>1" "r\<^sub>2"]} |
1117 @{thm[mode=IfThen] bnullable0(1)[of "r\<^sub>1" "r\<^sub>2"]} |
961 \end{lemma} |
1118 \end{lemma} |
1018 \end{figure} |
1175 \end{figure} |
1019 *} |
1176 *} |
1020 |
1177 |
1021 section {* Bound - NO *} |
1178 section {* Bound - NO *} |
1022 |
1179 |
1023 section {* Bounded Regex / Not *} |
|
1024 |
1180 |
1025 section {* Conclusion *} |
1181 section {* Conclusion *} |
1026 |
1182 |
1027 text {* |
1183 text {* |
1028 |
1184 |
1029 \cite{AusafDyckhoffUrban2016} |
1185 |
1030 |
1186 |
1031 %%\bibliographystyle{plain} |
1187 %%\bibliographystyle{plain} |
1032 \bibliography{root} |
1188 \bibliography{root} |
1033 *} |
1189 *} |
1034 |
1190 |