thys2/Paper/Paper.thy
changeset 424 2416fdec6396
parent 423 b7199d6c672d
child 425 14c558ae0b09
    62 
    63   code ("code _" [79] 74) and
    64   intern ("_\<^latex>\<open>\\mbox{$^\\uparrow$}\<close>" [900] 80) and
    65   erase ("_\<^latex>\<open>\\mbox{$^\\downarrow$}\<close>" [1000] 74) and
    66   bnullable ("bnullable _" [1000] 80) and
    67   bsimp_AALTs ("bsimpALT _ _" [10,1000] 80) and
    68   bsimp_ASEQ ("bsimpSEQ _ _ _" [10,1000,1000] 80) and
    69   bmkeps ("bmkeps _" [1000] 80) and
    70 
    71   srewrite ("_\<^latex>\<open>\\mbox{$\\,\\stackrel{s}{\\leadsto}$}\<close> _" [71, 71] 80) and
    72   rrewrites ("_ \<^latex>\<open>\\mbox{$\\,\\leadsto^*$}\<close> _" [71, 71] 80) and
    73   blexer_simp ("blexer\<^sup>+" 1000) 
   147 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
   148 r$. While such simple-minded simplifications have been proved in our
   149 earlier work to preserve the correctness of Sulzmann and Lu's
   150 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
   151 \emph{not} help with limiting the growth of the derivatives shown
   152 above: the growth is slowed, but the derivatives can still grow rather
   153 quickly beyond any finite bound.
   154 
   155 
   156 Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
   157 \cite{Sulzmann2014} where they introduce bitcoded
   158 regular expressions. In this version, POSIX values are
   213   consisting of just a single character $c$ is written $[c]$.   
   214   Our regular expressions are defined as usual as the elements of the following inductive
   215   datatype:
   216 
   217   \begin{center}
   218   @{text "r ::="} \;
   219   @{const "ZERO"} $\mid$
   220   @{const "ONE"} $\mid$
   221   @{term "CH c"} $\mid$
   222   @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
   223   @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
   301 
   302 The novel idea of Sulzmann and Lu is to extend this algorithm for 
   303 lexing, where it is important to find out which part of the string
   304 is matched by which part of the regular expression.
   305 For this Sulzmann and Lu presented two lexing algorithms in their paper
   306   \cite{Sulzmann2014}. The first algorithm consists of two phases: first a
   307   matching phase (which is Brzozowski's algorithm) and then a value
   308   construction phase. The values encode \emph{how} a regular expression
   309   matches a string. \emph{Values} are defined as the inductive datatype
   310 
   311   \begin{center}
   422   \end{tabular}
   423   \end{tabular}
   424   \end{center}
   425 
   426   \noindent
   427   The function @{text mkeps} is run when the last derivative is nullable, that is,
   428   the string to be matched is in the language of the regular expression. It generates
   429   a value for how the last derivative can match the empty string. The injection function
   430   then calculates the corresponding value for each intermediate derivative until
   431   a value for the original regular expression is generated.
   432   Graphically the algorithm by
   433   Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz}
   434   where the path from the left to the right involving @{term derivatives}/@{const
   435   nullable} is the first phase of the algorithm (calculating successive
   436   \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
   437   left, the second phase. The picture above shows the steps required when a
   438   regular expression, say @{text "r\<^sub>1"}, matches the string @{term
   439   "[a,b,c]"}. The lexing algorithm by Sulzmann and Lu can be defined as:
   440 
   441   \begin{figure}[t]
   442 \begin{center}
   487                      & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}                          
   488   \end{tabular}
   489   \end{center}
   490 
   491 
   492 We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that
   493 this algorithm is correct, that is, it generates POSIX values. The
   494 central property we established relates the derivative operation to the
   495 injection function.
   496 
   497   \begin{proposition}\label{Posix2}
   498 	\textit{If} $(s,\; r\backslash c) \rightarrow v$ \textit{then} $(c :: s,\; r) \rightarrow$ \textit{inj} $r\; c\; v$. 
   499 \end{proposition}
   500 
   511 
   512   \noindent
   513   In fact we have shown that in the success case the generated POSIX value $v$ is
   514   unique and in the failure case that there is no POSIX value $v$ that satisfies
   515   $(s, r) \rightarrow v$. While the algorithm is correct, it is excruciatingly
   516   slow in cases where the derivatives grow arbitrarily (see the example from the
   517   Introduction). However, it can be used as a convenient reference point for the correctness
   518   proof of the second algorithm by Sulzmann and Lu, which we shall describe next.
   519   
   520 *}
   521 
   522 section {* Bitcoded Regular Expressions and Derivatives *}
   541 
   542   \noindent where @{text bs} stands for bitsequences; @{text r},
   543   @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular
   544   expressions; and @{text rs} for lists of bitcoded regular
   545   expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"}
   546   is just an abbreviation for \mbox{@{text "ALTs bs [r\<^sub>1, r\<^sub>2]"}}. 
   547   For bitsequences we just use lists made up of the
   548   constants @{text Z} and @{text S}.  The idea with bitcoded regular
   549   expressions is to incrementally generate the value information (for
   550   example @{text Left} and @{text Right}) as bitsequences. For this 
   551   Sulzmann and Lu define a coding
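For example, assuming the clauses of this coding emit a @{text Z} for each
@{text Left} and an @{text S} for each @{text Right} (with the ``leaf''
values contributing no bits), the value @{text "Left (Left v)"} would be
coded by prefixing @{text "[Z, Z]"} to the coding of @{text v}.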
   668          $\textit{STAR}\;[]\,r^\uparrow$\\
   669   \end{tabular}    
   670   \end{center}    
   671 
   672   \noindent
   673   There is also an \emph{erase}-function, written $r^\downarrow$, which
   674   transforms a bitcoded regular expression into a (standard) regular
   675   expression by just erasing the annotated bitsequences. We omit the
   676   straightforward definition. For defining the algorithm, we also need
   677   the functions \textit{bnullable} and \textit{bmkeps}, which are the
   678   ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on
   706   \end{tabular}
   707   \end{center}    
   708  
   709 
   710   \noindent
   711   The key function in the bitcoded algorithm is the derivative of a
   712   bitcoded regular expression. This function calculates the
   713   derivative but at the same time also the incremental part of the bitsequences
   714   that contribute to constructing a POSIX value.
   715 
   716   \begin{center}
   753 builds the bitcoded derivative according to $s$. If the derivative is
   754 (b)nullable, the string is in the language of $r$ and the algorithm extracts the bitsequence using the
   755 $\textit{bmkeps}$ function. Finally, it decodes the bitsequence into a value.  If
   756 the derivative is \emph{not} nullable, then $\textit{None}$ is
   757 returned. We can show that this way of calculating a value
   758 generates the same result as \textit{lexer}.
   759 
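In outline, and only as a sketch paraphrasing the definition just described,
the bitcoded lexer can be summarised as
%
\[
\textit{blexer}\;r\,s \;\dn\;
\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}\;
\textit{if}\;\textit{bnullable}\,a\;
\textit{then}\;\textit{Some}\,(\textit{decode}\,(\textit{bmkeps}\,a)\,r)\;
\textit{else}\;\textit{None}
\]
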
   760 Before we can proceed we need to define a helper function, called
   761 \textit{retrieve}, which Sulzmann and Lu introduced for the correctness proof.
   762 
   763 \begin{center}
   774   \end{tabular}
   775   \end{center}
   776 
   777 \noindent
   778 The idea behind this function is to retrieve a possibly partial
   779 bitsequence from a bitcoded regular expression, where the retrieval is
   780 guided by a value.  For example, if the value is $\Left$ then we
   781 descend into the left-hand side of an alternative in order to
   782 assemble the bitcode. Similarly for
   783 $\Right$. The property we can show is that for a given $v$ and $r$
   784 with $\vdash v : r$, the retrieved bitsequence from the internalised
   807   All properties are proved by induction on annotated regular expressions. There are no
   808   interesting cases.
   809 \end{proof}
   810 
   811 \noindent
   812 The only difficulty left for the correctness proof is that the bitcoded algorithm
   813 has only a ``forward phase'' where POSIX values are generated incrementally.
   814 We can achieve the same effect with @{text lexer} by stacking up injection
   815 functions during the forward phase. An auxiliary function, called $\textit{flex}$,
   816 allows us to recast the rules of $\lexer$ (with its two phases) in terms of a single
   817 phase and stacked-up injection functions.
   818 
   819 \begin{center}
   820 \begin{tabular}{lcl}
   821   $\textit{flex}\;r\,f\,[]$ & $\dn$ & $f$\\
   822   $\textit{flex}\;r\,f\,(c\!::\!s)$ & $\dn$ &
   825 \end{center}    
   823 
   826 
   824 \noindent
   827 \noindent
   825 The point of this function is that when
   828 The point of this function is that when
   826 reaching the end of the string, we just need to apply the stacked
   829 reaching the end of the string, we just need to apply the stacked
   827 injection functions to the value generated by $\mkeps$.
   830 injection functions to the value generated by @{text mkeps}.
   828 Using this function we can recast the success case in @{text lexer} 
   831 Using this function we can recast the success case in @{text lexer} 
   829 as follows:
   832 as follows:
   830 
   833 
   831 \begin{proposition}\label{flex}
   834 \begin{proposition}\label{flex}
   832 If @{text "lexer r s = Some v"} \;then\; @{text "v = "}$\,\textit{flex}\,r\,id\,s\,
   835 If @{text "lexer r s = Some v"} \;then\; @{text "v = "}$\,\textit{flex}\,r\,id\,s\,
   837 Note we did not redefine \textit{lexer}, we just established that the
   840 Note we did not redefine \textit{lexer}, we just established that the
   838 value generated by \textit{lexer} can also be obtained by a different
   841 value generated by \textit{lexer} can also be obtained by a different
   839 method. While this different method is not efficient (we essentially
   842 method. While this different method is not efficient (we essentially
   840 need to traverse the string $s$ twice, once for building the
   843 need to traverse the string $s$ twice, once for building the
   841 derivative $r\backslash s$ and another time for stacking up injection
   844 derivative $r\backslash s$ and another time for stacking up injection
   842 functions using \textit{flex}), it helped us with proving
   845 functions using \textit{flex}), it helps us with proving
   843 that incrementally building up values.
   846 that incrementally building up values generates the same result.
   844 
   847 
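To see the stacking at work, consider a short unfolding of $\textit{flex}$
on a two-character string (assuming the recursive clause is
$\textit{flex}\,(r\backslash c)\,(\lambda v.\,f\,(\textit{inj}\,r\,c\,v))\,s$):
%
\[
\textit{flex}\;r\,\textit{id}\,[a,b] \;=\;
\textit{flex}\;(r\backslash a)\,(\lambda v.\,\textit{inj}\,r\,a\,v)\,[b] \;=\;
\lambda v.\,\textit{inj}\,r\,a\,(\textit{inj}\,(r\backslash a)\,b\,v)
\]

\noindent
Applying this stacked-up function to $\textit{mkeps}\,((r\backslash a)\backslash b)$
yields exactly the value \textit{lexer} computes for $r$ and $[a,b]$.
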
   848 This brings us to our main lemma in this section: if we calculate a
   849 derivative, say $r\backslash s$, and have a value, say $v$, inhabited
   850 by this derivative, then we can produce the result $\lexer$ generates
   851 by applying this value to the stacked-up injection functions
   852 that $\textit{flex}$ assembles. The lemma establishes that this is the same
   853 value as if we build the annotated derivative $r^\uparrow\backslash s$
   854 and then retrieve the corresponding bitcoded version, followed by a
   855 decoding step.
   856 
   857 \begin{lemma}[Main Lemma]\label{mainlemma}\it
   894   (s\,@\,[c]))\,v)\,r$ as required. The last rewrite step is possible
   895   because we generalised over $v$ in our induction.
   896 \end{proof}  
   897 
   898 \noindent
   899 With this lemma in place, we can prove the correctness of \textit{blexer}---it indeed
   900 produces the same result as \textit{lexer}.
   901 
   902 
   903 \begin{theorem}
   904 $\textit{lexer}\,r\,s = \textit{blexer}\,r\,s$
   905 \end{theorem}  
   914     \nullable(r\backslash s)
   915   \]
   916 
   917   \noindent
   918   For the \textit{if}-branch suppose $r_d \dn r^\uparrow\backslash s$ and
   919   $d \dn r\backslash s$. We have (*) @{text "nullable d"}. We can then show
   920   by Lemma~\ref{bnullable}\textit{(3)} that
   921   %
   922   \[
   923     \textit{decode}(\textit{bmkeps}\,r_d)\,r =
   924     \textit{decode}(\textit{retrieve}\,a\,(\textit{mkeps}\,d))\,r
   932   \textit{if}-branches return the same value. In the
   933   \textit{else}-branches both \textit{lexer} and \textit{blexer} return
   934   \textit{None}. Therefore we can conclude the proof.
   935 \end{proof}  
   936 
   937 \noindent This establishes that the bitcoded algorithm by Sulzmann and
   938 Lu \emph{without} simplification produces correct results. This was
   939 only conjectured in their paper
   940 \cite{Sulzmann2014}. The next step is to add simplifications.
   941 
   942 *}
   943 
   944 
   945 section {* Simplification *}
   946 
   947 text {*
   948 
   949      Derivatives as calculated by Brzozowski's method are usually more
   950      complex regular expressions than the initial one; the result is
   951      that derivative-based matching and lexing algorithms are
   952      often abysmally slow if the ``growth problem'' is not addressed. As Sulzmann and Lu wrote, various
   953      optimisations are possible, such as the simplifications
   954      $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r \Rightarrow r$,
   955      $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow r$. While these
   956      simplifications can considerably speed up the two algorithms in many
   957      cases, they do not fundamentally solve the growth problem with
   958      derivatives. To see this let us return to the example from the
       
   959      Introduction that shows the derivatives for \mbox{@{text "(a + aa)\<^sup>*"}}.
   960      If, in the 3rd step, we delete all $\ZERO$s and $\ONE$s according to
   961      the simplification rules shown above, we obtain
   962      %
   963      \def\xll{\xrightarrow{\_\backslash{} [a, a, a]}}%%
   964      %
   965      \begin{equation}\label{derivex}
   966      (a + aa)^* \quad\xll\quad
   967       \underbrace{\mbox{$(\ONE + \ONE{}a) \cdot (a + aa)^*$}}_{r} \;+\;
   968      ((a + aa)^* + \underbrace{\mbox{$(\ONE + \ONE{}a) \cdot (a + aa)^*$}}_{r})
   969      \end{equation}
   970 
   971      \noindent This is a simpler derivative, but unfortunately we
   972      cannot make any further simplifications. This is a problem because
   973      the outermost alternative contains two copies of the same
   974      regular expression (marked with $r$). The copies will
   975      spawn new copies in later steps and these in turn further copies. This
   976      destroys any hope of taming the size of the derivatives.  But the
   977      second copy of $r$ in \eqref{derivex} will never contribute to a
   978      value, because POSIX lexing will always prefer matching a string
   979      with the first copy. So in principle it could be removed.
   980      The dilemma with the simple-minded
   981      simplification rules above is that the rule $r + r \Rightarrow r$
   982      will never be applicable because, as can be seen in this example, the
   983      two copies of $r$ are separated by another sub-regular expression.
   984 
       
   985      But here is where Sulzmann and Lu's representation of generalised
   986      alternatives in the bitcoded algorithm shines: in @{term
   987      "ALTs bs rs"} we can define a more aggressive simplification by
   988      recursively simplifying all regular expressions in @{text rs},
   989      then analysing the resulting list and removing any duplicates.
   990      Another advantage is that the bitsequences in bitcoded regular
   991      expressions can be easily modified such that simplification does not
   992      interfere with the value constructions. For example, we can ``flatten'', or
   993      de-nest, @{text ALTs} as follows
   994      %
   995      \[
   996      @{term "ALTs bs\<^sub>1 ((ALTs bs\<^sub>2 rs\<^sub>2) # rs\<^sub>1)"}
   997      \quad\xrightarrow{bsimp}\quad
   998      @{term "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"}
   999      \]
  1000 
  1001      \noindent
  1002      where we just need to fuse the bitsequence that has accumulated in @{text "bs\<^sub>2"}
  1003      to the alternatives in @{text "rs\<^sub>2"}. As we shall show below, this will
  1004      ensure that the correct value corresponding to the original (unsimplified)
  1005      regular expression can still be extracted. %In this way the value construction
  1006      %is not affected by simplification. 
  1007 
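As a small instance of this rule (our own example),
$\textit{ALTs}\,[Z]\,[\textit{ALTs}\,[S]\,[r_1, r_2],\; r_3]$ de-nests in one
step to $\textit{ALTs}\,[Z]\,[\textit{fuse}\,[S]\,r_1,\; \textit{fuse}\,[S]\,r_2,\; r_3]$,
where the bitsequence $[S]$ accumulated in the inner alternative is fused
onto its children.
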
       
  1008      However, there is one problem with the definition for the more
  1009      aggressive simplification rules by Sulzmann and Lu. Recasting
  1010      their definition in our syntax, they define the step of removing
  1011      duplicates as
  1012      %
  1013      \[ @{text "bsimp (ALTs bs rs)"} \dn @{text "ALTs
  1014      bs (nub (map bsimp rs))"}
  1015      \]
  1016 
  1017      \noindent where they first recursively simplify the regular
  1018      expressions in @{text rs} (using @{text map}) and then use
  1019      Haskell's @{text nub}-function to remove potential
  1020      duplicates. While this makes sense when considering the example
  1021      shown in \eqref{derivex}, @{text nub} is the wrong
  1022      function in the case of bitcoded regular expressions. The reason
  1023      is that in general the elements in @{text rs} will have
  1024      different bitsequences annotated to them and in this way @{text nub}
  1025      will never find a duplicate to be removed. The correct way to
  1026      handle this situation is to first \emph{erase} the regular
  1027      expressions when comparing potential duplicates. This is inspired
  1028      by Scala's list functions of the form \mbox{@{text "distinctBy rs f
  1029      acc"}} where a function is applied first before two elements
  1030      are compared. We define this function in Isabelle/HOL as
  1031 
  1032      \begin{center}
  1033      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  1034      @{thm (lhs) distinctBy.simps(1)} & $\dn$ & @{thm (rhs) distinctBy.simps(1)}\\
  1035      @{thm (lhs) distinctBy.simps(2)} & $\dn$ & @{thm (rhs) distinctBy.simps(2)}
  1036      \end{tabular}
  1037      \end{center}
  1038 
  1039      \noindent where we scan the list from left to right (because we
  1040      have to remove later copies). In this function, @{text f} is a
  1041      function and @{text acc} is an accumulator for regular
  1042      expressions---essentially a set of elements we have already seen
  1043      while scanning the list. Therefore we delete an element, say @{text x},
  1044      from the list provided @{text "f x"} is already in the accumulator;
  1045      otherwise we keep @{text x} and scan the rest of the list but now
  1046      add @{text "f x"} as another element to @{text acc}. We will use
  1047      @{term distinctBy} where @{text f} is our erase function, @{term "erase (DUMMY)"},
  1048      which deletes bitsequences from bitcoded regular expressions.
  1049      This is clearly a computationally more expensive operation than @{text nub},
  1050      but it is needed in order to make the removal of unnecessary copies
  1051      work.
  1052 
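To make the description concrete, here is a small executable sketch of this
function in Scala (our own illustration, not the Isabelle definition; in
particular, the default empty accumulator is an assumption of this sketch):

\begin{verbatim}
// keep the first occurrence of each element; duplicates are
// detected only after applying f (in our case the function that
// erases the bitsequences of a bitcoded regular expression)
def distinctBy[A, B](xs: List[A], f: A => B,
                     acc: Set[B] = Set()): List[A] = xs match {
  case Nil => Nil
  case x :: rest =>
    if (acc.contains(f(x))) distinctBy(rest, f, acc)
    else x :: distinctBy(rest, f, acc + f(x))
}
\end{verbatim}

\noindent
For example, \texttt{distinctBy(List(1, 2, -1, -2), (x: Int) => x.abs)}
evaluates to \texttt{List(1, 2)}: the two later elements are deleted because
their images under \texttt{abs} have been seen before.
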
       
  1053      Our simplification function depends on three helper functions; one is called
  1054      @{text flts} and is defined as follows:
  1055 
  1056      \begin{center}
  1057      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  1058      @{thm (lhs) flts.simps(1)} & $\dn$ & @{thm (rhs) flts.simps(1)}\\
  1059      @{thm (lhs) flts.simps(2)} & $\dn$ & @{thm (rhs) flts.simps(2)}\\
  1060      @{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\
  1061      \end{tabular}
  1062      \end{center}
  1063 
  1064      \noindent
  1065      The second clause removes all instances of @{text ZERO} in alternatives and
  1066      the third ``spills'' out nested alternatives (but retains the
  1067      bitsequence @{text "bs'"} accumulated in the inner alternative). There are
  1068      some corner cases to be considered when the resulting list inside an alternative is
  1069      empty or a singleton list. We take care of those cases in the
  1070      @{text "bsimpALTs"} function; similarly we define a helper function that simplifies
  1071      sequences according to the usual rules about @{text ZERO}s and @{text ONE}s:
  1072 
  1073      \begin{center}
  1074      \begin{tabular}{c@ {\hspace{5mm}}c}
  1075      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  1076      @{text "bsimpALTs bs []"}  & $\dn$ & @{text "ZERO"}\\
  1077      @{text "bsimpALTs bs [r]"} & $\dn$ & @{text "fuse bs r"}\\
  1078      @{text "bsimpALTs bs rs"}  & $\dn$ & @{text "ALTs bs rs"}\\
  1079      \mbox{}\\
  1080      \end{tabular}
  1081      &
  1082      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  1083      @{text "bsimpSEQ bs _ ZERO"}  & $\dn$ & @{text "ZERO"}\\
  1084      @{text "bsimpSEQ bs ZERO _"} & $\dn$ & @{text "ZERO"}\\
  1085      @{text "bsimpSEQ bs\<^sub>1 (ONE bs\<^sub>2) r\<^sub>2"}
  1086          & $\dn$ & @{text "fuse (bs\<^sub>1 @ bs\<^sub>2) r\<^sub>2"}\\
  1087      @{text "bsimpSEQ bs r\<^sub>1 r\<^sub>2"} & $\dn$ &  @{text "SEQ bs r\<^sub>1 r\<^sub>2"}
  1088      \end{tabular}
  1089      \end{tabular}
  1090      \end{center}
  1091 
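For example, given the clauses just described,
$\textit{flts}\;[\textit{ZERO},\; \textit{ALTs}\;bs'\,[r_1, r_2]]$ evaluates to
$[\textit{fuse}\;bs'\,r_1,\; \textit{fuse}\;bs'\,r_2]$: the @{text ZERO} is
removed and the inner alternative is spilled out into the surrounding list.
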
       
  1092      \noindent
  1093      With this in place we can define our simplification function as
  1094 
  1095      \begin{center}
  1096      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  1097      @{thm (lhs) bsimp.simps(1)[of "bs" "r\<^sub>1" "r\<^sub>2"]} & $\dn$ &
  1098          @{thm (rhs) bsimp.simps(1)[of "bs" "r\<^sub>1" "r\<^sub>2"]}\\
  1099      @{thm (lhs) bsimp.simps(2)[of "bs" _]} & $\dn$ & @{thm (rhs) bsimp.simps(2)[of "bs" _]}\\
  1100      @{text "bsimp r"} & $\dn$ & @{text r}
  1101      \end{tabular}
  1102      \end{center}
  1103 
  1104      \noindent
  1105      As far as we can see, our recursive function @{term bsimp} simplifies regular
  1106      expressions as intended by Sulzmann and Lu. There is no point in applying the
  1107      @{text bsimp}
  1108      function repeatedly (like the simplification in their paper, which is applied
  1109      until a fixpoint is reached), because we can show that it is idempotent, that is
  1110 
  1111      \begin{proposition}
  1112      $\textit{bsimp}\,(\textit{bsimp}\,r) = \textit{bsimp}\,r$
  1113      \end{proposition}
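\noindent
Returning to the example in \eqref{derivex}: after flattening, the outermost
alternative consists of the list of the three regular expressions $r$,
$(a + aa)^*$ and $r$ (each with its own bitsequence annotation). Erasing the
bitsequences makes the first and third element equal, and therefore
@{term distinctBy} deletes the second copy of $r$---exactly the copy that, as
argued above, can never contribute to a POSIX value.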
  1114 
  1115 
  1116      \begin{lemma}
  1117      @{thm[mode=IfThen] bnullable0(1)[of "r\<^sub>1" "r\<^sub>2"]}
  1118      \end{lemma}
  1175   \end{figure}
  1176 *}
  1177 
  1178 section {* Bound - NO *}
  1179 
  1180 
  1181 section {* Conclusion *}
  1182 
  1183 text {*
  1184 
  1185 
  1186 
  1187 %%\bibliographystyle{plain}
  1188 \bibliography{root}
  1189 *}
  1190 