author Chengsong
Wed, 09 Feb 2022 18:21:15 +0000
changeset 429 afd8eff20402
parent 428 5dcecc92608e (current diff)
parent 426 5b77220fdf01 (diff)
child 430 579caa608a15
--- a/thys2/Paper/Paper.thy	Wed Feb 09 18:21:01 2022 +0000
+++ b/thys2/Paper/Paper.thy	Wed Feb 09 18:21:15 2022 +0000
@@ -27,7 +27,7 @@
   ders_syn ("_\\_" [79, 1000] 76) and
   bder_syn ("_\\_" [79, 1000] 76) and
   bders ("_\\_" [79, 1000] 76) and
-  bders_simp ("_\\\<^sub>s\<^sub>i\<^sub>m\<^sub>p _" [79, 1000] 76) and
+  bders_simp ("_\\\<^sub>b\<^sub>s\<^sub>i\<^sub>m\<^sub>p _" [79, 1000] 76) and
 
   ZERO ("\<^bold>0" 81) and 
   ONE ("\<^bold>1" 81) and 
@@ -70,6 +70,7 @@
 
   srewrite ("_\<^latex>\<open>\\mbox{$\\,\\stackrel{s}{\\leadsto}$}\<close> _" [71, 71] 80) and
   rrewrites ("_ \<^latex>\<open>\\mbox{$\\,\\leadsto^*$}\<close> _" [71, 71] 80) and
+  srewrites ("_ \<^latex>\<open>\\mbox{$\\,\\stackrel{s}{\\leadsto}^*$}\<close> _" [71, 71] 80) and
   blexer_simp ("blexer\<^sup>+" 1000) 
 
 
@@ -172,14 +173,14 @@
   method [..] in combination with the simplification steps [..]
   yields POSIX parse trees. We have tested this claim
   extensively [..] but yet
-  have to work out all proof details.''
+  have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
 \end{quote}  
 
 \noindent{}\textbf{Contributions:} We have implemented in Isabelle/HOL
 the derivative-based lexing algorithm of Sulzmann and Lu
 \cite{Sulzmann2014} where regular expressions are annotated with
 bitsequences. We define the crucial simplification function as a
-recursive function, instead of a fix-point operation. One objective of
+recursive function, without the need for a fix-point operation. One objective of
 the simplification function is to remove duplicates of regular
 expressions.  For this Sulzmann and Lu use in their paper the standard
 @{text nub} function from Haskell's list library, but this function
@@ -188,7 +189,7 @@
 different bitcode annotation---so @{text nub} would never ``fire''.
 Inspired by Scala's library for lists, we shall instead use a @{text
 distinctBy} function that finds duplicates under an erasing function
-that deletes bitcodes.
+which deletes bitcodes.
 We shall also introduce our own argument and definitions for
 establishing the correctness of the bitcoded algorithm when 
 simplifications are included.\medskip
@@ -297,7 +298,7 @@
 \end{proposition}
 
 \noindent
-It is a fun exercise to formaly prove this property in a theorem prover.
+It is a fun exercise to formally prove this property in a theorem prover.
 
 The novel idea of Sulzmann and Lu is to extend this algorithm for 
 lexing, where it is important to find out which part of the string
@@ -358,7 +359,7 @@
 	Unix-like operating systems \cite{POSIX}.} Sometimes these
  informal rules are called \emph{maximal munch rule} and \emph{rule priority}.
   One contribution of our earlier paper is to give a convenient
- specification for what a POSIX value is (the inductive rules are shown in
+ specification for what POSIX values are (the inductive rules are shown in
   Figure~\ref{POSIXrules}).
 
 \begin{figure}[t]
@@ -436,7 +437,7 @@
   \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
   left, the second phase. The picture above shows the steps required when a
   regular expression, say @{text "r\<^sub>1"}, matches the string @{term
-  "[a,b,c]"}. The lexing algorithm by Sulzmann and Lu can be defined as:
+  "[a,b,c]"}. The first lexing algorithm by Sulzmann and Lu can be defined as:
 
   \begin{figure}[t]
 \begin{center}
@@ -491,7 +492,7 @@
 
 We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that
this algorithm is correct, that is, it generates POSIX values. The
-cenral property we established relates the derivative operation to the
+central property we established relates the derivative operation to the
 injection function.
 
   \begin{proposition}\label{Posix2}
@@ -512,8 +513,8 @@
   \noindent
   In fact we have shown that in the success case the generated POSIX value $v$ is
   unique and in the failure case that there is no POSIX value $v$ that satisfies
-  $(s, r) \rightarrow v$. While the algorithm is correct, it is excrutiatingly
-  slow in cases where the derivatives grow arbitrarily (see example from the
+  $(s, r) \rightarrow v$. While the algorithm is correct, it is excruciatingly
+  slow in cases where the derivatives grow arbitrarily (recall the example from the
   Introduction). However it can be used as a convenient reference point for the correctness
   proof of the second algorithm by Sulzmann and Lu, which we shall describe next.
   
@@ -525,7 +526,7 @@
 
   In the second part of their paper \cite{Sulzmann2014},
   Sulzmann and Lu describe another algorithm that also generates POSIX
-  values but dispences with the second phase where characters are
+  values but dispenses with the second phase where characters are
   injected ``back'' into values. For this they annotate bitcodes to
   regular expressions, which we define in Isabelle/HOL as the datatype
 
@@ -544,7 +545,7 @@
   expressions; and @{text rs} for lists of bitcoded regular
   expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"}
   is just an abbreviation for \mbox{@{text "ALTs bs [r\<^sub>1, r\<^sub>2]"}}. 
-  For bitsequences we just use lists made up of the
+  For bitsequences we use lists made up of the
   constants @{text Z} and @{text S}.  The idea with bitcoded regular
   expressions is to incrementally generate the value information (for
   example @{text Left} and @{text Right}) as bitsequences. For this 
@@ -629,7 +630,7 @@
 \end{proof}  
 
   Sulzmann and Lu define the function \emph{internalise}
-  in order to transform standard regular expressions into annotated
+  in order to transform (standard) regular expressions into annotated
   regular expressions. We write this operation as $r^\uparrow$.
   This internalisation uses the following
   \emph{fuse} function.	     
@@ -652,7 +653,7 @@
 
   \noindent
   A regular expression can then be \emph{internalised} into a bitcoded
-  regular expression as follows.
+  regular expression as follows:
 
   \begin{center}
   \begin{tabular}{lcl}
@@ -676,7 +677,7 @@
   straightforward definition. For defining the algorithm, we also need
   the functions \textit{bnullable} and \textit{bmkeps}, which are the
   ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on
-  bitcoded regular expressions, instead of regular expressions.
+  bitcoded regular expressions.
   %
   \begin{center}
   \begin{tabular}{@ {}c@ {}c@ {}}
@@ -710,7 +711,7 @@
   \noindent
   The key function in the bitcoded algorithm is the derivative of a
   bitcoded regular expression. This derivative calculates the
-  derivative but at the same time also the incremental part of bitsequences
+  derivative but at the same time also the incremental part of the bitsequences
   that contribute to constructing a POSIX value.	
 
   \begin{center}
@@ -811,9 +812,9 @@
 \noindent
 The only difficulty left for the correctness proof is that the bitcoded algorithm
 has only a ``forward phase'' where POSIX values are generated incrementally.
-We can achieve the same effect with @{text lexer} by stacking up injection
+We can achieve the same effect with @{text lexer} (which has two phases) by stacking up injection
 functions during the forward phase. An auxiliary function, called $\textit{flex}$,
-allows us to recast the rules of $\lexer$ (with its two phases) in terms of a single
+allows us to recast the rules of $\lexer$ in terms of a single
 phase and stacked up injection functions.
 
 \begin{center}
@@ -826,7 +827,7 @@
 
 \noindent
 The point of this function is that when
-reaching the end of the string, we just need to apply the stacked
+reaching the end of the string, we just need to apply the stacked up
 injection functions to the value generated by @{text mkeps}.
 Using this function we can recast the success case in @{text lexer} 
 as follows:
@@ -843,11 +844,11 @@
 need to traverse the string $s$ twice, once for building the
 derivative $r\backslash s$ and another time for stacking up injection
 functions using \textit{flex}), it helps us with proving
-that incrementally building up values generates the same result.
+that incrementally building up values in @{text blexer} generates the same result.
 
 This brings us to our main lemma in this section: if we calculate a
-derivative, say $r\backslash s$ and have a value, say $v$, inhabited
-by this derivative, then we can produce the result $\lexer$ generates
+derivative, say $r\backslash s$, and have a value, say $v$, inhabited
+by this derivative, then we can produce the result @{text lexer} generates
 by applying this value to the stacked-up injection functions
 that $\textit{flex}$ assembles. The lemma establishes that this is the same
 value as if we build the annotated derivative $r^\uparrow\backslash s$
@@ -900,7 +901,7 @@
 produces the same result as \textit{lexer}.
 
 
-\begin{theorem}
+\begin{theorem}\label{thmone}
 $\textit{lexer}\,r\,s = \textit{blexer}\,r\,s$
 \end{theorem}  
 
@@ -969,26 +970,26 @@
      \end{equation}
 
      \noindent This is a simpler derivative, but unfortunately we
-     cannot make further simplifications. This is a problem because
+     cannot make any further simplifications. This is a problem because
     the outermost alternative contains two copies of the same
-     regular expression (underlined with $r$). The copies will
-     spawn new copies in later steps and they in turn further copies. This
-     destroys an hope of taming the size of the derivatives.  But the
+     regular expression (underlined with $r$). These copies will
+     spawn new copies in later derivative steps and these in turn spawn even more copies. This
+     destroys any hope of taming the size of the derivatives.  But the
      second copy of $r$ in \eqref{derivex} will never contribute to a
      value, because POSIX lexing will always prefer matching a string
-     with the first copy. So in principle it could be removed.
+     with the first copy. So it could be safely removed without affecting the correctness of the algorithm.
      The dilemma with the simple-minded
      simplification rules above is that the rule $r + r \Rightarrow r$
      will never be applicable because as can be seen in this example the
-     regular expressions are separated by another sub-regular expression.
+     regular expressions are not next to each other but separated by another regular expression.
 
      But here is where Sulzmann and Lu's representation of generalised
      alternatives in the bitcoded algorithm shines: in @{term
      "ALTs bs rs"} we can define a more aggressive simplification by
      recursively simplifying all regular expressions in @{text rs} and
     then analysing the resulting list and removing any duplicates.
-     Another advantage is that the bitsequences in  bitcoded regular
-     expressions can be easily modified such that simplification does not
+     Another advantage of the bitsequences in bitcoded regular
+     expressions is that they can be easily modified such that simplification does not
      interfere with the value constructions. For example we can ``flatten'', or
      de-nest, @{text ALTs} as follows
      %
@@ -1006,7 +1007,7 @@
      %is not affected by simplification. 
 
      However there is one problem with the definition for the more
-     aggressive simlification rules by Sulzmann and Lu. Recasting
+     aggressive simplification rules described by Sulzmann and Lu. Recasting
      their definition with our syntax they define the step of removing
      duplicates as
      %
@@ -1020,8 +1021,8 @@
      duplicates. While this makes sense when considering the example
      shown in \eqref{derivex}, @{text nub} is the inappropriate
      function in the case of bitcoded regular expressions. The reason
-     is that in general the n elements in @{text rs} will have a
-     different bitsequence annotated to it and in this way @{text nub}
+     is that in general the elements in @{text rs} will each have a
+     different bitsequence annotation and in this way @{text nub}
      will never find a duplicate to be removed. The correct way to
      handle this situation is to first \emph{erase} the regular
      expressions when comparing potential duplicates. This is inspired
@@ -1037,21 +1038,22 @@
      \end{center}
 
      \noindent where we scan the list from left to right (because we
-     have to remove later copies). In this function, @{text f} is a
+     have to remove later copies). In @{text distinctBy}, @{text f} is a
      function and @{text acc} is an accumulator for regular
-     expressions---essentially a set of elements we have already seen
+     expressions---essentially a set of regular expressions that we have already seen
      while scanning the list. Therefore we delete an element, say @{text x},
      from the list provided @{text "f x"} is already in the accumulator;
-     otherwise we keep @{text x} and scan the rest of the list but now
-     add @{text "f x"} as another element to @{text acc}. We will use
-     @{term distinctBy} where @{text f} is our erase functions, @{term "erase (DUMMY)"},
+     otherwise we keep @{text x} and scan the rest of the list but 
+     add @{text "f x"} as another ``seen'' element to @{text acc}. We will use
+     @{term distinctBy} where @{text f} is the erase function, @{term "erase (DUMMY)"},
      that deletes bitsequences from bitcoded regular expressions.
     This is clearly a computationally more expensive operation than @{text nub},
      but is needed in order to make the removal of unnecessary copies
-     to work.
+     work properly.
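+
+     For readers who prefer code, the idea behind @{text distinctBy} can be
+     sketched in Scala roughly as follows (this is only our own illustration,
+     not part of the formalisation; in the algorithm @{text f} is instantiated
+     with the erase function):
+
+\begin{verbatim}
+// keeps the first occurrence of every element; two elements count as
+// "equal" whenever they coincide after applying the projection f
+def distinctBy[A, B](xs: List[A], f: A => B, acc: Set[B] = Set()): List[A] =
+  xs match {
+    case Nil => Nil
+    case x :: rest =>
+      if (acc.contains(f(x))) distinctBy(rest, f, acc)  // duplicate: drop x
+      else x :: distinctBy(rest, f, acc + f(x))         // new: keep x, record f(x)
+  }
+// e.g. distinctBy(List(1, 2, -1, 3), (n: Int) => n.abs) == List(1, 2, 3)
+\end{verbatim}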
 
      Our simplification function depends on three helper functions, one is called
-     @{text flts} and defined as follows:
+     @{text flts} and analyses lists of regular expressions coming from alternatives.
+     It is defined as follows:
 
      \begin{center}
      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
@@ -1062,7 +1064,7 @@
      \end{center}
 
      \noindent
-     The second clause removes all instances of @{text ZERO} in alternatives and
+     The second clause of @{text flts} removes all instances of @{text ZERO} in alternatives and
     the third ``spills'' out nested alternatives (but retaining the
      bitsequence @{text "bs'"} accumulated in the inner alternative). There are
      some corner cases to be considered when the resulting list inside an alternative is
@@ -1090,7 +1092,7 @@
      \end{center}
 
      \noindent
-     With this in place we can define our simlification function as
+     With this in place we can define our simplification function as
 
      \begin{center}
      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
@@ -1103,85 +1105,328 @@
 
      \noindent
      As far as we can see, our recursive function @{term bsimp} simplifies regular
-     expressions as intended by Sulzmann and Lu. There is no point to apply the
-     @{text bsimp}
-     function repeatedly (like the simplification in their paper which is applied
-     until a fixpoint is reached), because we can show that it is idempotent, that is
+     expressions as intended by Sulzmann and Lu. There is no point in applying the
+     @{text bsimp} function repeatedly (like the simplification in their paper which needs to be
+     applied until a fixpoint is reached) because we can show that @{term bsimp} is idempotent,
+     that is
 
      \begin{proposition}
-     ???
+     @{term "bsimp (bsimp r) = bsimp r"}
      \end{proposition}
 
+     \noindent
+     This can be proved by induction on @{text r}, but requires a detailed analysis
+     showing that the de-nesting of alternatives always results in a flat list of regular
+     expressions. We omit the details since they do not concern the correctness proof.
+     
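+     To get an intuition for the effect of @{term bsimp}, consider (ignoring
+     the bitcode annotations) the regular expression $(a + b) + (a + c)$:
+     de-nesting the alternatives with @{text flts} yields the list $[a, b, a, c]$
+     and @{text distinctBy} then removes the second copy of $a$, so that the
+     result corresponds to $a + b + c$; applying @{term bsimp} once more clearly
+     changes nothing, illustrating the proposition above.
+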
+     Next we can include simplification after each derivative step leading to the
+     following notion of bitcoded derivatives:
+     
+     \begin{center}
+      \begin{tabular}{cc}
+      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+      @{thm (lhs) bders_simp.simps(1)} & $\dn$ & @{thm (rhs) bders_simp.simps(1)}
+      \end{tabular}
+      &
+      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+      @{thm (lhs) bders_simp.simps(2)} & $\dn$ & @{thm (rhs) bders_simp.simps(2)}
+      \end{tabular}
+      \end{tabular}
+      \end{center}
+
+      \noindent
+      and use it in the improved lexing algorithm defined as
+
+     \begin{center}
+\begin{tabular}{lcl}
+  $\textit{blexer}^+\;r\,s$ & $\dn$ &
+      $\textit{let}\;r_{der} = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\                
+  & & $\;\;\;\;\textit{if}\; \textit{bnullable}(r_{der}) \;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r
+       \;\;\textit{else}\;\textit{None}$
+\end{tabular}
+\end{center}
+
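+     \noindent
+     As an aside, the overall control flow of this simplified lexer can be
+     sketched in Scala roughly as follows. This is only an illustration: the
+     functions @{text internalise}, @{text bder}, @{text bsimp}, @{text bnullable},
+     @{text bmkeps} and @{text decode} are taken as parameters and stand for the
+     corresponding definitions above (bitsequences are represented as lists of
+     Booleans in this sketch).
+
+\begin{verbatim}
+// R = plain regular expressions, A = bitcoded regular expressions, V = values
+def blexerSimp[R, A, V](
+    internalise: R => A,              // r with an up-arrow
+    bder:        (Char, A) => A,      // bitcoded derivative
+    bsimp:       A => A,              // simplification
+    bnullable:   A => Boolean,
+    bmkeps:      A => List[Boolean],  // bitsequence of the mkeps-value
+    decode:      (List[Boolean], R) => V
+)(r: R, s: List[Char]): Option[V] = {
+  // derivative w.r.t. a string, simplifying after every character
+  def bdersSimp(a: A, cs: List[Char]): A = cs match {
+    case Nil      => a
+    case c :: cs1 => bdersSimp(bsimp(bder(c, a)), cs1)
+  }
+  val rder = bdersSimp(internalise(r), s)
+  if (bnullable(rder)) Some(decode(bmkeps(rder), r)) else None
+}
+\end{verbatim}
+
+     \noindent
+     Note that @{text bsimp} is applied after every single derivative step; this
+     interleaving of derivatives and simplification is exactly what the argument
+     below has to justify.
+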
+       \noindent The remaining task is to show that @{term blexer} and
+       @{term "blexer_simp"} generate the same answers.
+
+       When we first
+       attempted this proof we encountered a problem with the proof idea
+       sketched in Sulzmann and Lu's paper, which seems to be to appeal
+       again to the @{text retrieve}-function defined for the unsimplified version
+       of the algorithm. But
+       this does not work, because desirable properties such as
+     %
+     \[
+     @{text "retrieve r v = retrieve (bsimp r) v"}
+     \]
+
+     \noindent do not hold under simplification---this property
+     essentially purports that we can retrieve the same value from a
+     simplified version of the regular expression. To start with, @{text retrieve}
+     depends on the fact that the value @{text v} corresponds to the
+     structure of the regular expression---but the whole point of simplification
+     is to ``destroy'' this structure by making the regular expression simpler.
+     To see this consider the regular expression @{text "r = r' + 0"} and a corresponding
+     value @{text "v = Left v'"}. If we annotate bitcodes to @{text "r"}, then 
+     we can use @{text retrieve} and @{text v} in order to extract a corresponding
+     bitsequence. The reason that this works is that @{text r} is an alternative
+     regular expression and @{text v} a corresponding value. However, if we simplify
+     @{text r}, then @{text v} does not correspond to the shape of the regular 
+     expression anymore. So unless one can somehow
+     synchronise the change in the simplified regular expressions with
+     the original POSIX value, there is no hope of appealing to @{text retrieve} in the
+     correctness argument for @{term blexer_simp}.
+
+     We found it more helpful to introduce the rewriting systems shown in
+     Figure~\ref{SimpRewrites}. The idea is to generate 
+     simplified regular expressions in small steps (unlike the @{text bsimp}-function which
+     does the same in a big step), and show that each of
+     the small steps preserves the bitcodes that lead to the final POSIX value.
+     The rewrite system is organised such that $\leadsto$ is for bitcoded regular
+     expressions and $\stackrel{s}{\leadsto}$ for lists of bitcoded regular
+     expressions. The former essentially implements the simplifications of
+     @{text "bsimpSEQ"} and @{text flts}; while the latter implements the
+     simplifications in @{text "bsimpALTs"}. We can show that any bitcoded
+     regular expression reduces in one or more steps to the simplified
+     regular expression generated by @{text bsimp}:
+
+     \begin{lemma}\label{lemone}
+     @{thm[mode=IfThen] rewrites_to_bsimp}
+     \end{lemma}
+
+     \begin{proof}
+     By induction on @{text r}. For this we can use the properties
+     @{thm fltsfrewrites} and @{thm ss6_stronger}. The latter uses
+     repeated applications of the $LD$ rule which allows the removal
+     of duplicates that can recognise the same strings. 
+     \end{proof}
+
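+     \noindent
+     To get a feeling for these rules, consider (ignoring the bitcodes) a list of
+     alternatives such as $[r, \ZERO, r]$: one rewrite step removes the $\ZERO$
+     and an $LD$-step then removes the second copy of $r$, so the list reduces
+     under $\stackrel{s}{\leadsto}^*$ to just $[r]$.
+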
+     \noindent
+     We can show that this rewrite system preserves @{term bnullable}, that
+     is, simplification does not affect nullability:
 
      \begin{lemma}
      @{thm[mode=IfThen] bnullable0(1)[of "r\<^sub>1" "r\<^sub>2"]}
      \end{lemma}
 
+     \begin{proof}
+     Straightforward mutual induction on the definition of $\leadsto$ and $\stackrel{s}{\leadsto}$.
+     The only interesting case is the rule $LD$ where the property holds since by the side-conditions of that rule the empty string will
+     be in both @{text "L (rs\<^sub>a @ [r\<^sub>1] @ rs\<^sub>b @ [r\<^sub>2] @ rs\<^sub>c)"} and
+     @{text "L (rs\<^sub>a @ [r\<^sub>1] @ rs\<^sub>b @ rs\<^sub>c)"}.
+     \end{proof}
 
-     \begin{lemma}
+     \noindent
+     From this, we can show that @{text bmkeps} will produce the same bitsequence
+     as long as one of the bitcoded regular expressions in $\leadsto$ is nullable (this lemma
+     establishes the missing fact we were not able to establish using @{text retrieve}, as suggested
+     in the paper by Sulzmann and Lu). 
+
+
+     \begin{lemma}\label{lemthree}
      @{thm[mode=IfThen] rewrite_bmkeps_aux(1)[of "r\<^sub>1" "r\<^sub>2"]}
      \end{lemma}
 
-     \begin{lemma}
-     @{thm[mode=IfThen] rewrites_to_bsimp}
-     \end{lemma}
+     \begin{proof}
+     By straightforward mutual induction on the definition of $\leadsto$ and $\stackrel{s}{\leadsto}$.
+     Again the only interesting case is the rule $LD$ where we need to ensure that
+     \[
+     @{text "bmkeps (rs\<^sub>a @ [r\<^sub>1] @ rs\<^sub>b @ [r\<^sub>2] @ rs\<^sub>c) =
+        bmkeps (rs\<^sub>a @ [r\<^sub>1] @ rs\<^sub>b @ rs\<^sub>c)"}	
+     \]
+
+     \noindent holds. This is indeed the case because according to the POSIX rules the
+     generated bitsequence is determined by the first alternative that can match the
+     string (in this case the empty string, that is, by the first nullable alternative).
+     \end{proof}
+
+     \noindent
+     Crucially, derivative steps and simplification steps can be interleaved, which follows
+     from the fact that $\leadsto$ is preserved under derivatives.
 
      \begin{lemma}
      @{thm[mode=IfThen] rewrite_preserves_bder(1)[of "r\<^sub>1" "r\<^sub>2"]}
      \end{lemma}
 
-     \begin{lemma}
+     \begin{proof}
+     By straightforward mutual induction on the definition of $\leadsto$ and $\stackrel{s}{\leadsto}$.
+     The case for $LD$ holds because @{term "L (erase (bder c r\<^sub>2)) \<subseteq> L (erase (bder c r\<^sub>1))"}
+     if and only if @{term "L (erase (r\<^sub>2)) \<subseteq> L (erase (r\<^sub>1))"}.
+     \end{proof}
+
+
+     \noindent
+     Using this fact together with Lemma~\ref{lemone} allows us to prove the central lemma
+     that the unsimplified
+     derivative (with a string @{term s}) reduces to the simplified derivative (with the same string).
+     
+     \begin{lemma}\label{lemtwo}
      @{thm[mode=IfThen] central}
      \end{lemma}
 
+     \begin{proof}
+     By reverse induction on @{term s} generalising over @{text r}.
+     \end{proof}
+
+     \noindent
+     With these lemmas in place we can finally establish that @{term "blexer_simp"} and @{term "blexer"}
+     generate the same value and, using Theorem~\ref{thmone} from the previous section, that this value
+     is indeed the POSIX value.
+     
      \begin{theorem}
      @{thm[mode=IfThen] main_blexer_simp}
      \end{theorem}
 
-     Sulzmann \& Lu apply simplification via a fixpoint operation
-
-     ; also does not use erase to filter out duplicates.
-  
-   not direct correspondence with PDERs, because of example
-   problem with retrieve 
-
-   correctness
-
-   
-    
-
-
+     \begin{proof}
+     By unfolding the definitions and using Lemmas~\ref{lemtwo} and \ref{lemthree}. 	
+     \end{proof}
+     
+     \noindent
+     This completes the correctness proof for the second POSIX lexing algorithm by Sulzmann and Lu.
+     The interesting point of this algorithm is that the sizes of derivatives do not grow arbitrarily, which
+     we shall show next.
 
    \begin{figure}[t]
   \begin{center}
   \begin{tabular}{c}
-  @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}\qquad
-  @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}\qquad
-  @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}\\
-  @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}\qquad
-  @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}\\
-  @{thm[mode=Axiom] bs6}\qquad
-  @{thm[mode=Axiom] bs7}\\
-  @{thm[mode=Rule] bs8[of "rs\<^sub>1" "rs\<^sub>2"]}\\
-  %@ { t hm[mode=Axiom] ss1}\qquad
-  @{thm[mode=Rule] ss2[of "rs\<^sub>1" "rs\<^sub>2"]}\qquad
-  @{thm[mode=Rule] ss3[of "r\<^sub>1" "r\<^sub>2"]}\\
-  @{thm[mode=Axiom] ss4}\qquad
-  @{thm[mode=Axiom] ss5[of "bs" "rs\<^sub>1" "rs\<^sub>2"]}\\
-  @{thm[mode=Rule] ss6[of "r\<^sub>2" "r\<^sub>1" "rs\<^sub>1" "rs\<^sub>2" "rs\<^sub>3"]}\\
+  @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}$S\ZERO{}_l$\qquad
+  @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}$S\ZERO{}_r$\\
+  @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}$S\ONE$\\
+  @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}$SL$\qquad
+  @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}$SR$\\
+  @{thm[mode=Axiom] bs6}$A0$\qquad
+  @{thm[mode=Axiom] bs7}$A1$\\
+  @{thm[mode=Rule] bs8[of "rs\<^sub>1" "rs\<^sub>2"]}$AL$\\
+  @{thm[mode=Rule] ss2[of "rs\<^sub>1" "rs\<^sub>2"]}$LH$\qquad
+  @{thm[mode=Rule] ss3[of "r\<^sub>1" "r\<^sub>2"]}$LT$\\
+  @{thm[mode=Axiom] ss4}$L\ZERO$\qquad
+  @{thm[mode=Axiom] ss5[of "bs" "rs\<^sub>1" "rs\<^sub>2"]}$LS$\medskip\\
+  @{thm[mode=Rule] ss6[of "r\<^sub>2" "r\<^sub>1" "rs\<^sub>1" "rs\<^sub>2" "rs\<^sub>3"]}$LD$\\
   \end{tabular}
   \end{center}
-  \caption{???}\label{SimpRewrites}
+  \caption{The rewrite rules that generate simplified regular expressions
+  in small steps: @{term "rrewrite r\<^sub>1 r\<^sub>2"} is for bitcoded regular
+  expressions and @{term "srewrite rs\<^sub>1 rs\<^sub>2"} for \emph{lists} of bitcoded
+  regular expressions. Interesting is the $LD$ rule that allows copies of regular
+  expressions to be removed provided a regular expression earlier in the list can
+  match the same strings.}\label{SimpRewrites}
   \end{figure}
 *}
 
-section {* Bound - NO *}
+section {* Finiteness of Derivatives *}
+
+text {*
+
+In this section let us sketch our argument for why the size of the simplified
+derivatives with the aggressive simplification function stays finitely bounded. Suppose
+we have a size function for bitcoded regular expressions, written
+@{text "|r|"}, which counts the number of nodes if we regard $r$ as a tree
+(we omit the precise definition). For this we show that for every $r$
+there exists a bound $N$
+such that 
+
+\begin{center}
+$\forall s. \; |@{term "bders_simp r s"}| < N$
+\end{center}
+
+\noindent
+We prove this by induction on $r$. The base cases for @{term AZERO},
+@{term "AONE bs"} and @{term "ACHAR bs c"} are straightforward. The interesting case is
+for sequences of the form @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}. In this case our induction
+hypotheses state $\forall s. \; |@{term "bders_simp r\<^sub>1 s"}| < N_1$ and
+$\forall s. \; |@{term "bders_simp r\<^sub>2 s"}| < N_2$. We can reason as follows
+
+\begin{center}
+\begin{tabular}{lcll}
+& & $ |@{term "bders_simp (ASEQ bs r\<^sub>1 r\<^sub>2) s"}|$\\
+& $ = $ & $|bsimp(ALTs\;bs\;((@{term "bders_simp r\<^sub>1 s"}) \cdot r_2) ::
+    [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| $ & (1) \\
+& $\leq$ &
+    $|distinctBy\,(flts\,((@{term "bders_simp r\<^sub>1 s "}) \cdot r_2) ::
+    [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| + 1 $ & (2) \\
+& $\leq$ & $|(@{term "bders_simp r\<^sub>1 s"}) \cdot r_2| +
+             |distinctBy\,(flts\,   [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| + 1 $ & (3) \\
+& $\leq$ & $N_1 + |r_2| + 2 + |distinctBy\,(flts\,   [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])|$ & (4)\\
+& $\leq$ & $N_1 + |r_2| + 2 + l_{N_{2}} * N_{2}$ & (5)
+\end{tabular}
+\end{center}
+
+% tell Chengsong about Indian paper of closed forms of derivatives
+
+\noindent
+where in (1) $Suf\!fix(s)$ contains those suffixes $s'$ of $s$ where @{term "bders_simp r\<^sub>1 s''"} is nullable for
+@{text "s = s'' @ s'"}. In (3) we know that $|(@{term "bders_simp r\<^sub>1 s"}) \cdot r_2|$ is 
+bounded by $N_1 + |r_2|$. In (5) we know the list comprehension contains only regular expressions of size smaller
+than $N_2$. The list length after @{text distinctBy} is bounded by a number, which we call $l_{N_2}$. It stands
+for the number of distinct regular expressions with a maximum size $N_2$ (there can only be finitely many of them).
+We reason similarly in the @{text Star}-case.\medskip
+
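+\noindent
+The size measure @{text "|r|"} is deliberately left unspecified above; one
+possible definition, rendered here as a Scala sketch (the datatype below is our
+own Scala version of the bitcoded regular expressions, with bits represented as
+Booleans), simply counts the nodes of the regular expression viewed as a tree:
+
+\begin{verbatim}
+sealed trait ARexp
+case object AZERO extends ARexp
+case class AONE(bs: List[Boolean]) extends ARexp
+case class ACHAR(bs: List[Boolean], c: Char) extends ARexp
+case class AALTs(bs: List[Boolean], rs: List[ARexp]) extends ARexp
+case class ASEQ(bs: List[Boolean], r1: ARexp, r2: ARexp) extends ARexp
+case class ASTAR(bs: List[Boolean], r: ARexp) extends ARexp
+
+// number of nodes of the regular expression seen as a tree
+def asize(r: ARexp): Int = r match {
+  case AZERO           => 1
+  case AONE(_)         => 1
+  case ACHAR(_, _)     => 1
+  case AALTs(_, rs)    => 1 + rs.map(asize).sum
+  case ASEQ(_, r1, r2) => 1 + asize(r1) + asize(r2)
+  case ASTAR(_, r1)    => 1 + asize(r1)
+}
+\end{verbatim}
+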
+\noindent
+Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
+far from the actual bound we can expect. We can do better than this, but this does not improve
+the finiteness property we are proving. If we are interested in a polynomial bound,
+we would hope to obtain a similarly tight bound as for partial
+derivatives introduced by Antimirov \cite{Antimirov95}. After all, the idea with
+@{text distinctBy} is to maintain a ``set'' of alternatives (like the sets in
+partial derivatives). Unfortunately, to obtain the exact same bound would mean
+we need to introduce simplifications such as
+%
+\[ (r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)
+\]
+
+\noindent
+which exist for partial derivatives. However, if we introduce them in our
+setting we would lose the POSIX property of our calculated values. We leave better
+bounds for future work.
+
+*}
 
 
 section {* Conclusion *}
 
 text {*
 
+   We set out in this work to prove in Isabelle/HOL the correctness of
+   the second POSIX lexing algorithm by Sulzmann and Lu
+   \cite{Sulzmann2014}. This follows earlier work where we established
+   the correctness of the first algorithm
+   \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
+   introduce our own specification about what POSIX values are,
+   because the informal definition given by Sulzmann and Lu did not
+   stand up to a formal proof. Also for the second algorithm we needed
+   to introduce our own definitions and proof ideas in order to establish the
+   correctness.  Our interest in the second algorithm 
+   lies in the fact that by using bitcoded regular expressions and an aggressive
+   simplification method there is a chance that the derivatives
+   can be kept universally small (we established in this paper that
+   their size can be kept finitely bounded for any string). This is important if one is after
+   an efficient POSIX lexing algorithm.
+
+   Having proved the correctness of the POSIX lexing algorithm, what
+   lessons have we learned? Well, we feel this is a very good example
+   where formal proofs give further insight into the matter at
+   hand. For example it is very hard to see a problem with @{text nub}
+   vs @{text distinctBy} with only experimental data---one would still
+   see the correct result but find that simplification does not simplify in well-chosen, but not
+   obscure, examples. We found that from an implementation
+   point-of-view it is really important to have the formal proofs of
+   the corresponding properties at hand. We have also developed a
+   healthy suspicion when experimental data is used to back up
+   efficiency claims. For example Sulzmann and Lu write about their
+   equivalent of @{term blexer_simp} ``...we can incrementally compute
+   bitcoded parse trees in linear time in the size of the input''
+   \cite[Page 14]{Sulzmann2014}. 
+   Given the growth of the
+   derivatives in some cases even after aggressive simplification, this
+   is a claim that is hard to believe. A similar claim about a theoretical runtime
+   of @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which calculates POSIX matches and is based on
+   derivatives \cite{verbatim}. In this case derivatives are not simplified.
+   Clearly our result of having finite
+   derivatives is rather weak in this context but we think such efficiency claims
+   require further scrutiny.\medskip
+
+   \noindent
+   Our Isabelle/HOL code is available under \url{https://github.com/urbanchr/posix}.
 
 
 %%\bibliographystyle{plain}
--- a/thys2/Paper/document/root.bib	Wed Feb 09 18:21:01 2022 +0000
+++ b/thys2/Paper/document/root.bib	Wed Feb 09 18:21:15 2022 +0000
@@ -351,3 +351,11 @@
 }
 
 
+@INPROCEEDINGS{verbatim,
+  author={D.~Egolf and S.~Lasser and K.~Fisher},
+  booktitle={2021 IEEE Security and Privacy Workshops (SPW)}, 
+  title={{V}erbatim: {A} {V}erified {L}exer {G}enerator}, 
+  year={2021},
+  volume={},
+  number={},
+  pages={92--100}}
--- a/thys2/SizeBound4.thy	Wed Feb 09 18:21:01 2022 +0000
+++ b/thys2/SizeBound4.thy	Wed Feb 09 18:21:15 2022 +0000
@@ -467,7 +467,6 @@
 | "flts (r1 # rs) = r1 # flts rs"
 
 
-
 fun bsimp_ASEQ :: "bit list \<Rightarrow> arexp \<Rightarrow> arexp \<Rightarrow> arexp"
   where
   "bsimp_ASEQ _ AZERO _ = AZERO"
@@ -695,7 +694,7 @@
   using rs_in_rstar by force
 
 lemma ss6_stronger:
-  shows "rs1 s\<leadsto>* distinctBy rs1 erase {}"
+  shows "rs s\<leadsto>* distinctBy rs erase {}"
   using ss6_stronger_aux[of "[]" _] by auto
 
 lemma rewrite_preserves_fuse: 
Binary file thys2/paper.pdf has changed