updated
authorChristian Urban <christian.urban@kcl.ac.uk>
Wed, 09 Feb 2022 00:29:04 +0000
changeset 424 2416fdec6396
parent 423 b7199d6c672d
child 425 14c558ae0b09
child 427 ec08181c1f42
updated
thys2/Paper/Paper.thy
thys2/Paper/document/root.tex
thys2/paper.pdf
--- a/thys2/Paper/Paper.thy	Tue Feb 08 14:29:41 2022 +0000
+++ b/thys2/Paper/Paper.thy	Wed Feb 09 00:29:04 2022 +0000
@@ -64,6 +64,8 @@
   intern ("_\<^latex>\<open>\\mbox{$^\\uparrow$}\<close>" [900] 80) and
   erase ("_\<^latex>\<open>\\mbox{$^\\downarrow$}\<close>" [1000] 74) and
   bnullable ("bnullable _" [1000] 80) and
+  bsimp_AALTs ("bsimpALT _ _" [10,1000] 80) and
+  bsimp_ASEQ ("bsimpSEQ _ _ _" [10,1000,1000] 80) and
   bmkeps ("bmkeps _" [1000] 80) and
 
   srewrite ("_\<^latex>\<open>\\mbox{$\\,\\stackrel{s}{\\leadsto}$}\<close> _" [71, 71] 80) and
@@ -147,9 +149,8 @@
 earlier work to preserve the correctness of Sulzmann and Lu's
 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
 \emph{not} help with limiting the growth of the derivatives shown
-above: the growth is slowed, but the derivatives can still grow quickly
-beyond any finite bound.
-
+above: the growth is slowed, but the derivatives can still grow rather
+quickly beyond any finite bound.
 
 
 Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
@@ -214,7 +215,7 @@
   datatype:
 
   \begin{center}
-  @{text "r :="}
+  @{text "r ::="} \;
   @{const "ZERO"} $\mid$
   @{const "ONE"} $\mid$
   @{term "CH c"} $\mid$
@@ -302,7 +303,7 @@
 lexing, where it is important to find out which part of the string
 is matched by which part of the regular expression.
 For this Sulzmann and Lu presented two lexing algorithms in their paper
-  \cite{Sulzmann2014}. This first algorithm consists of two phases: first a
+  \cite{Sulzmann2014}. The first algorithm consists of two phases: first a
   matching phase (which is Brzozowski's algorithm) and then a value
   construction phase. The values encode \emph{how} a regular expression
   matches a string. \emph{Values} are defined as the inductive datatype
@@ -423,7 +424,7 @@
   \end{center}
 
   \noindent
-  The function @{text mkeps} is called when the last derivative is nullable, that is
+  The function @{text mkeps} is run when the last derivative is nullable, that is
   the string to be matched is in the language of the regular expression. It generates
   a value for how the last derivative can match the empty string. The injection function
   then calculates the corresponding value for each intermediate derivative until
@@ -433,7 +434,7 @@
   where the path from the left to the right involving @{term derivatives}/@{const
   nullable} is the first phase of the algorithm (calculating successive
   \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
-  left, the second phase. This picture shows the steps required when a
+  left, the second phase. The picture above shows the steps required when a
   regular expression, say @{text "r\<^sub>1"}, matches the string @{term
   "[a,b,c]"}. The lexing algorithm by Sulzmann and Lu can be defined as:
 
@@ -488,8 +489,10 @@
   \end{center}
 
 
-We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that the algorithm by Sulzmann and Lu
-is correct. The cenral property we established relates the derivative operation to the injection function.
+We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that
+this algorithm is correct, that is it generates POSIX values. The
+cenral property we established relates the derivative operation to the
+injection function.
 
   \begin{proposition}\label{Posix2}
 	\textit{If} $(s,\; r\backslash c) \rightarrow v$ \textit{then} $(c :: s,\; r) \rightarrow$ \textit{inj} $r\; c\; v$. 
@@ -510,8 +513,8 @@
   In fact we have shown that in the success case the generated POSIX value $v$ is
   unique and in the failure case that there is no POSIX value $v$ that satisfies
   $(s, r) \rightarrow v$. While the algorithm is correct, it is excrutiatingly
-  slow in examples where the derivatives grow arbitrarily (see example from the
-  Introduction). However it can be used as a conveninet reference point for the correctness
+  slow in cases where the derivatives grow arbitrarily (see example from the
+  Introduction). However it can be used as a convenient reference point for the correctness
   proof of the second algorithm by Sulzmann and Lu, which we shall describe next.
   
 *}
@@ -540,7 +543,7 @@
   @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular
   expressions; and @{text rs} for lists of bitcoded regular
   expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"}
-  is just an abbreviation for @{text "ALTs bs [r\<^sub>1, r\<^sub>2]"}. 
+  is just an abbreviation for \mbox{@{text "ALTs bs [r\<^sub>1, r\<^sub>2]"}}. 
   For bitsequences we just use lists made up of the
   constants @{text Z} and @{text S}.  The idea with bitcoded regular
   expressions is to incrementally generate the value information (for
@@ -667,7 +670,7 @@
   \end{center}    
 
   \noindent
-  There is also an \emph{erase}-function, written $a^\downarrow$, which
+  There is also an \emph{erase}-function, written $r^\downarrow$, which
   transforms a bitcoded regular expression into a (standard) regular
   expression by just erasing the annotated bitsequences. We omit the
   straightforward definition. For defining the algorithm, we also need
@@ -705,7 +708,7 @@
  
 
   \noindent
-  The key function in the bitcoded algorithm is the derivative of an
+  The key function in the bitcoded algorithm is the derivative of a
   bitcoded regular expression. This derivative calculates the
   derivative but at the same time also the incremental part of bitsequences
   that contribute to constructing a POSIX value.	
@@ -752,7 +755,7 @@
 $\textit{bmkeps}$ function. Finally it decodes the bitsequence into a value.  If
 the derivative is \emph{not} nullable, then $\textit{None}$ is
 returned. We can show that this way of calculating a value
-generates the same result as with \textit{lexer}.
+generates the same result as \textit{lexer}.
 
 Before we can proceed we need to define a helper function, called
 \textit{retrieve}, which Sulzmann and Lu introduced for the correctness proof.
@@ -773,7 +776,7 @@
 
 \noindent
 The idea behind this function is to retrieve a possibly partial
-bitcode from a bitcoded regular expression, where the retrieval is
+bitsequence from a bitcoded regular expression, where the retrieval is
 guided by a value.  For example if the value is $\Left$ then we
 descend into the left-hand side of an alternative in order to
 assemble the bitcode. Similarly for
@@ -806,12 +809,12 @@
 \end{proof}
 
 \noindent
-The only problem left for the correctness proof is that the bitcoded algorithm
+The only difficulty left for the correctness proof is that the bitcoded algorithm
 has only a ``forward phase'' where POSIX values are generated incrementally.
-We can achive the same effect with @{text lexer} by stacking up injection
-functions in the during forward phase. An auxiliary function, called $\textit{flex}$,
+We can achieve the same effect with @{text lexer} by stacking up injection
+functions during the forward phase. An auxiliary function, called $\textit{flex}$,
 allows us to recast the rules of $\lexer$ (with its two phases) in terms of a single
-phase.
+phase and stacked up injection functions.
 
 \begin{center}
 \begin{tabular}{lcl}
@@ -824,7 +827,7 @@
 \noindent
 The point of this function is that when
 reaching the end of the string, we just need to apply the stacked
-injection functions to the value generated by $\mkeps$.
+injection functions to the value generated by @{text mkeps}.
 Using this function we can recast the success case in @{text lexer} 
 as follows:
 
@@ -839,14 +842,14 @@
 method. While this different method is not efficient (we essentially
 need to traverse the string $s$ twice, once for building the
 derivative $r\backslash s$ and another time for stacking up injection
-functions using \textit{flex}), it helped us with proving
-that incrementally building up values.
+functions using \textit{flex}), it helps us with proving
+that incrementally building up values generates the same result.
 
-This brings us to our main lemma in this section: if we build a
+This brings us to our main lemma in this section: if we calculate a
 derivative, say $r\backslash s$ and have a value, say $v$, inhabited
 by this derivative, then we can produce the result $\lexer$ generates
 by applying this value to the stacked-up injection functions
-$\textit{flex}$ assembles. The lemma establishes that this is the same
+that $\textit{flex}$ assembles. The lemma establishes that this is the same
 value as if we build the annotated derivative $r^\uparrow\backslash s$
 and then retrieve the corresponding bitcoded version, followed by a
 decoding step.
@@ -893,8 +896,8 @@
 \end{proof}  
 
 \noindent
-With this lemma in place, we can prove the correctness of \textit{blexer} such
-that it produces the same result as \textit{lexer}.
+With this lemma in place, we can prove the correctness of \textit{blexer}---it indeed
+produces the same result as \textit{lexer}.
 
 
 \begin{theorem}
@@ -913,7 +916,7 @@
 
   \noindent
   For the \textit{if}-branch suppose $r_d \dn r^\uparrow\backslash s$ and
-  $d \dn r\backslash s$. We have (*) $\nullable\,d$. We can then show
+  $d \dn r\backslash s$. We have (*) @{text "nullable d"}. We can then show
   by Lemma~\ref{bnullable}\textit{(3)} that
   %
   \[
@@ -931,11 +934,10 @@
   \textit{None}. Therefore we can conclude the proof.
 \end{proof}  
 
-\noindent
-This establishes that the bitcoded algorithm by Sulzmann
-and Lu \emph{without} simplification produces correct results. This was
-only conjectured in their paper \cite{Sulzmann2014}. The next step
-is to add simplifications.
+\noindent This establishes that the bitcoded algorithm by Sulzmann and
+Lu \emph{without} simplification produces correct results. This was
+only conjectured by Sulzmann and Lu in their paper
+\cite{Sulzmann2014}. The next step is to add simplifications.
 
 *}
 
@@ -946,14 +948,169 @@
 
      Derivatives as calculated by Brzozowski’s method are usually more
      complex regular expressions than the initial one; the result is
-     that the derivative-based matching and lexing algorithms are
-     often abysmally slow. However, as Sulzmann and Lu wrote, various
+     that derivative-based matching and lexing algorithms are
+     often abysmally slow if the ``growth problem'' is not addressed. As Sulzmann and Lu wrote, various
      optimisations are possible, such as the simplifications
      $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r \Rightarrow r$,
      $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow r$. While these
-     simplifications can speed up the algorithms considerably in many
-     cases, they do not solve fundamentally the ``growth problem'' with
-     derivatives. To see this let us return to the example 
+     simplifications can considerably speed up the two algorithms  in many
+     cases, they do not solve fundamentally the growth problem with
+     derivatives. To see this let us return to the example from the
+     Introduction that shows the derivatives for \mbox{@{text "(a + aa)\<^sup>*"}}.
+     If we delete in the 3rd step all $\ZERO{}s$ and $\ONE$s according to
+     the simplification rules shown above we obtain
+     %
+     \def\xll{\xrightarrow{\_\backslash{} [a, a, a]}}%%
+     %
+     \begin{equation}\label{derivex}
+     (a + aa)^* \quad\xll\quad
+      \underbrace{\mbox{$(\ONE + \ONE{}a) \cdot (a + aa)^*$}}_{r} \;+\;
+     ((a + aa)^* + \underbrace{\mbox{$(\ONE + \ONE{}a) \cdot (a + aa)^*$}}_{r})
+     \end{equation}
+
+     \noindent This is a simpler derivative, but unfortunately we
+     cannot make further simplifications. This is a problem because
+     the outermost alternatives contains two copies of the same
+     regular expression (underlined with $r$). The copies will
+     spawn new copies in later steps and they in turn further copies. This
+     destroys an hope of taming the size of the derivatives.  But the
+     second copy of $r$ in \eqref{derivex} will never contribute to a
+     value, because POSIX lexing will always prefer matching a string
+     with the first copy. So in principle it could be removed.
+     The dilemma with the simple-minded
+     simplification rules above is that the rule $r + r \Rightarrow r$
+     will never be applicable because as can be seen in this example the
+     regular expressions are separated by another sub-regular expression.
+
+     But here is where Sulzmann and Lu's representation of generalised
+     alternatives in the bitcoded algorithm shines: in @{term
+     "ALTs bs rs"} we can define a more aggressive simplification by
+     recursively simplifying all regular expressions in @{text rs} and
+     then analyse the resulting list and remove any duplicates.
+     Another advantage is that the bitsequences in  bitcoded regular
+     expressions can be easily modified such that simplification does not
+     interfere with the value constructions. For example we can ``flatten'', or
+     de-nest, @{text ALTs} as follows
+     %
+     \[
+     @{term "ALTs bs\<^sub>1 ((ALTs bs\<^sub>2 rs\<^sub>2) # rs\<^sub>1)"}
+     \quad\xrightarrow{bsimp}\quad
+     @{term "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) # rs\<^sub>1)"}
+     \]
+
+     \noindent
+     where we just need to fuse the bitsequence that has accumulated in @{text "bs\<^sub>2"}
+     to the alternatives in @{text "rs\<^sub>2"}. As we shall show below this will
+     ensure that the correct value corresponding to the original (unsimplified)
+     regular expression can still be extracted. %In this way the value construction
+     %is not affected by simplification. 
+
+     However there is one problem with the definition for the more
+     aggressive simlification rules by Sulzmann and Lu. Recasting
+     their definition with our syntax they define the step of removing
+     duplicates as
+     %
+     \[ @{text "bsimp (ALTs bs rs)"} \dn @{text "ALTs
+     bs (nup (map bsimp rs))"}
+     \]
+
+     \noindent where they first recursively simplify the regular
+     expressions in @{text rs} (using @{text map}) and then use
+     Haskell's @{text nub}-function to remove potential
+     duplicates. While this makes sense when considering the example
+     shown in \eqref{derivex}, @{text nub} is the inappropriate
+     function in the case of bitcoded regular expressions. The reason
+     is that in general the n elements in @{text rs} will have a
+     different bitsequence annotated to it and in this way @{text nub}
+     will never find a duplicate to be removed. The correct way to
+     handle this situation is to first \emph{erase} the regular
+     expressions when comparing potential duplicates. This is inspired
+     by Scala's list functions of the form \mbox{@{text "distinctBy rs f
+     acc"}} where a function is applied first before two elements
+     are compared. We define this function in Isabelle/HOL as
+
+     \begin{center}
+     \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+     @{thm (lhs) distinctBy.simps(1)} & $\dn$ & @{thm (rhs) distinctBy.simps(1)}\\
+     @{thm (lhs) distinctBy.simps(2)} & $\dn$ & @{thm (rhs) distinctBy.simps(2)}
+     \end{tabular}
+     \end{center}
+
+     \noindent where we scan the list from left to right (because we
+     have to remove later copies). In this function, @{text f} is a
+     function and @{text acc} is an accumulator for regular
+     expressions---essentially a set of elements we have already seen
+     while scanning the list. Therefore we delete an element, say @{text x},
+     from the list provided @{text "f x"} is already in the accumulator;
+     otherwise we keep @{text x} and scan the rest of the list but now
+     add @{text "f x"} as another element to @{text acc}. We will use
+     @{term distinctBy} where @{text f} is our erase functions, @{term "erase (DUMMY)"},
+     that deletes bitsequences from bitcoded regular expressions.
+     This is clearly a computationally more expensive operation, than @{text nub},
+     but is needed in order to make the removal of unnecessary copies
+     to work.
+
+     Our simplification function depends on three helper functions, one is called
+     @{text flts} and defined as follows:
+
+     \begin{center}
+     \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+     @{thm (lhs) flts.simps(1)} & $\dn$ & @{thm (rhs) flts.simps(1)}\\
+     @{thm (lhs) flts.simps(2)} & $\dn$ & @{thm (rhs) flts.simps(2)}\\
+     @{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\
+     \end{tabular}
+     \end{center}
+
+     \noindent
+     The second clause removes all instances of @{text ZERO} in alternatives and
+     the second ``spills'' out nested alternatives (but retaining the
+     bitsequence @{text "bs'"} accumulated in the inner alternative). There are
+     some corner cases to be considered when the resulting list inside an alternative is
+     empty or a singleton list. We take care of those cases in the
+     @{text "bsimpALTs"} function; similarly we define a helper function that simplifies
+     sequences according to the usual rules about @{text ZERO}s and @{text ONE}s:
+
+     \begin{center}
+     \begin{tabular}{c@ {\hspace{5mm}}c}
+     \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+     @{text "bsimpALTs bs []"}  & $\dn$ & @{text "ZERO"}\\
+     @{text "bsimpALTs bs [r]"} & $\dn$ & @{text "fuse bs r"}\\
+     @{text "bsimpALTs bs rs"}  & $\dn$ & @{text "ALTs bs rs"}\\
+     \mbox{}\\
+     \end{tabular}
+     &
+     \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+     @{text "bsimpSEQ bs _ ZERO"}  & $\dn$ & @{text "ZERO"}\\
+     @{text "bsimpSEQ bs ZERO _"} & $\dn$ & @{text "ZERO"}\\
+     @{text "bsimpSEQ bs\<^sub>1 (ONE bs\<^sub>2) r\<^sub>2"}
+         & $\dn$ & @{text "fuse (bs\<^sub>1 @ bs\<^sub>2) r\<^sub>2"}\\
+     @{text "bsimpSEQ bs r\<^sub>1 r\<^sub>2"} & $\dn$ &  @{text "SEQ bs r\<^sub>1 r\<^sub>2"}
+     \end{tabular}
+     \end{tabular}
+     \end{center}
+
+     \noindent
+     With this in place we can define our simlification function as
+
+     \begin{center}
+     \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+     @{thm (lhs) bsimp.simps(1)[of "bs" "r\<^sub>1" "r\<^sub>2"]} & $\dn$ &
+         @{thm (rhs) bsimp.simps(1)[of "bs" "r\<^sub>1" "r\<^sub>2"]}\\
+     @{thm (lhs) bsimp.simps(2)[of "bs" _]} & $\dn$ & @{thm (rhs) bsimp.simps(2)[of "bs" _]}\\
+     @{text "bsimp r"} & $\dn$ & @{text r}
+     \end{tabular}
+     \end{center}
+
+     \noindent
+     As far as we can see, our recursive function @{term bsimp} simplifies regular
+     expressions as intended by Sulzmann and Lu. There is no point to apply the
+     @{text bsimp}
+     function repeatedly (like the simplification in their paper which is applied
+     until a fixpoint is reached), because we can show that it is idempotent, that is
+
+     \begin{proposition}
+     ???
+     \end{proposition}
 
 
      \begin{lemma}
@@ -1020,13 +1177,12 @@
 
 section {* Bound - NO *}
 
-section {* Bounded Regex / Not *}
 
 section {* Conclusion *}
 
 text {*
 
-\cite{AusafDyckhoffUrban2016}
+
 
 %%\bibliographystyle{plain}
 \bibliography{root}
--- a/thys2/Paper/document/root.tex	Tue Feb 08 14:29:41 2022 +0000
+++ b/thys2/Paper/document/root.tex	Wed Feb 09 00:29:04 2022 +0000
@@ -63,7 +63,7 @@
 \author{Chengsong Tan}{King's College London}{chengsong.tan@kcl.ac.uk}{}{}
 \author{Christian Urban}{King's College London}{christian.urban@kcl.ac.uk}{}{}
 \authorrunning{C.~Tan and C.~Urban}
-\keywords{POSIX matching and lexing, Derivatives of Regular Expressions, Isabelle/HOL}
+\keywords{POSIX matching and lexing, derivatives of regular expressions, Isabelle/HOL}
 \category{}
 \ccsdesc[100]{Design and analysis of algorithms}
 \ccsdesc[100]{Formal languages and automata theory}
@@ -88,7 +88,7 @@
   derivatives finite. Without simplification the size of some derivatives can grow
   arbitrarily big resulting in an extremely slow lexing algorithm.  In this
   paper we describe a variant of Sulzmann and Lu's algorithm: Our
-  algorithm is a recursive functional program, whereas Sulzmann
+  variant is a recursive functional program, whereas Sulzmann
   and Lu's version involves a fixpoint construction. We \textit{(i)}
   prove in Isabelle/HOL that our algorithm is correct and generates
   unique POSIX values; we also \textit{(ii)} establish a finite
Binary file thys2/paper.pdf has changed