lexing: changeset 569:5af61c89f51e

--- a/thys3/Paper.thy	Sat Jul 16 18:34:46 2022 +0100
+++ b/thys3/Paper.thy	Sun Jul 17 13:07:05 2022 +0100
@@ -1,6 +1,7 @@
 (*<*)
 theory Paper
   imports
+   "Posix.LexerSimp" 
    "Posix.FBound" 
    "HOL-Library.LaTeXsugar"
 begin
@@ -29,17 +30,17 @@
   ZERO ("\<^bold>0" 81) and 
   ONE ("\<^bold>1" 81) and 
   CH ("_" [1000] 80) and
-  ALT ("_ + _" [77,77] 78) and
-  SEQ ("_ \<cdot> _" [77,77] 78) and
+  ALT ("_ + _" [76,76] 77) and
+  SEQ ("_ \<cdot> _" [78,78] 78) and
   STAR ("_\<^sup>*" [79] 78) and
-  NTIMES ("_\<^sup>{\<^sup>_\<^sup>}" [79] 78) and
+  NTIMES ("_\<^sup>{\<^bsup>_\<^esup>\<^sup>}" [79] 78) and
 
   val.Void ("Empty" 78) and
   val.Char ("Char _" [1000] 78) and
   val.Left ("Left _" [79] 78) and
   val.Right ("Right _" [1000] 78) and
   val.Seq ("Seq _ _" [79,79] 78) and
-  val.Stars ("Stars _" [79] 78) and
+  val.Stars ("Stars _" [1000] 78) and
 
   Prf ("\<turnstile> _ : _" [75,75] 75) and  
   Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and
@@ -57,6 +58,7 @@
   AALTs ("ALTs _ _" [77,77] 78) and
   ASEQ ("SEQ _ _ _" [79, 79,79] 78) and
   ASTAR ("STAR _ _" [79, 79] 78) and
+  ANTIMES ("NT _ _ _" [79, 79, 79] 78) and
 
   code ("code _" [79] 74) and
   intern ("_\<^latex>\<open>\\mbox{$^\\uparrow$}\<close>" [900] 80) and
@@ -81,8 +83,10 @@
   by (metis list.exhaust retrieve.simps(5))
 
 
-
-
+lemma better_retrieve2:
+  shows "retrieve (ANTIMES bs r (n + 1)) (Stars (v#vs)) = 
+     bs @ [Z] @ retrieve r v @ retrieve (ANTIMES [] r n) (Stars vs)"
+by auto
 (*>*)
 
 section {* Introduction *}
@@ -91,27 +95,24 @@
 
 In the last fifteen or so years, Brzozowski's derivatives of regular
 expressions have sparked quite a bit of interest in the functional
-programming and theorem prover communities.  The beauty of
-Brzozowski's derivatives \cite{Brzozowski1964} is that they are neatly
-expressible in any functional language, and easily definable and
-reasoned about in theorem provers---the definitions just consist of
-inductive datatypes and simple recursive functions. Another neat
-feature of derivatives is that they can be easily extended to bounded
-regular expressions, such as @{term "NTIMES r n"}, where numbers or
-intervals specify how many times a regular expression should be used
-during matching.
-
+programming and theorem prover communities.
 Derivatives of a
 regular expression, written @{term "der c r"}, give a simple solution
 to the problem of matching a string @{term s} with a regular
 expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
 succession) all the characters of the string matches the empty string,
-then @{term r} matches @{term s} (and {\em vice versa}).  We are aware
-of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
-Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
-of the work by Krauss and Nipkow~\cite{Krauss2011}.  And another one
-in Coq is given by Coquand and Siles \cite{Coquand2012}.
-Also Ribeiro and Du Bois give one in Agda~\cite{RibeiroAgda2017}.
+then @{term r} matches @{term s} (and {\em vice versa}).  
+The beauty of
+Brzozowski's derivatives \cite{Brzozowski1964} is that they are neatly
+expressible in any functional language, and easily definable and
+reasoned about in theorem provers---the definitions just consist of
+inductive datatypes and simple recursive functions. Another attractive
+feature of derivatives is that they can be easily extended to \emph{bounded}
+regular expressions, such as @{term "r"}$^{\{@{text n}\}}$ or @{term r}$^{\{..@{text n}\}}$, where numbers or
+intervals of numbers specify how many times a regular expression should be used
+during matching.
+
+
 
 
 However, there are two difficulties with derivative-based matchers:
@@ -159,11 +160,11 @@
 earlier work to preserve the correctness of Sulzmann and Lu's
 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
 \emph{not} help with limiting the growth of the derivatives shown
-above: the growth is slowed, but the derivatives can still grow rather
+above: the growth is slowed, but some derivatives can still grow rather
 quickly beyond any finite bound.
 
 
-Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
+Sulzmann and Lu address this ``growth problem'' in a second algorithm
 \cite{Sulzmann2014} where they introduce bitcoded
 regular expressions. In this version, POSIX values are
 represented as bitsequences and such sequences are incrementally generated
@@ -189,7 +190,7 @@
 the derivative-based lexing algorithm of Sulzmann and Lu
 \cite{Sulzmann2014} where regular expressions are annotated with
 bitsequences. We define the crucial simplification function as a
-recursive function, without the need of a fix-point operation. One objective of
+recursive function, without the need of a fixpoint operation. One objective of
 the simplification function is to remove duplicates of regular
 expressions.  For this Sulzmann and Lu use in their paper the standard
 @{text nub} function from Haskell's list library, but this function
@@ -199,7 +200,7 @@
 Inspired by Scala's library for lists, we shall instead use a @{text
 distinctWith} function that finds duplicates under an ``erasing'' function
 which deletes bitcodes before comparing regular expressions.
-We shall also introduce our own argument and definitions for
+We shall also introduce our \emph{own} argument and definitions for
 establishing the correctness of the bitcoded algorithm when 
 simplifications are included. Finally we
 establish that the size of derivatives can be finitely bounded.\medskip%\footnote{ In this paper, we shall first briefly introduce the basic notions
@@ -228,7 +229,7 @@
   concatenation is $\_ \,@\, \_\,$. We often use the usual
   bracket notation for lists also for strings; for example a string
   consisting of just a single character $c$ is written $[c]$.   
-  Our regular expressions are defined as usual as the following inductive
+  Our regular expressions are defined as the following inductive
   datatype:\\[-4mm]
   %
   \begin{center}
@@ -242,20 +243,55 @@
   @{term "NTIMES r n"}
   \end{center}
 
-  \noindent where @{const ZERO} stands for the regular expression that does
-  not match any string, @{const ONE} for the regular expression that matches
-  only the empty string and @{term c} for matching a character literal.
-  The constructors $+$ and $\cdot$ represent alternatives and sequences, respectively.
-  We sometimes omit the $\cdot$ in a sequence regular expression for brevity. 
-  In our work here we also add to the usual ``basic'' regular expressions 
-  the bounded regular expression @{term "NTIMES r n"} where the @{term n}
-  specifies that @{term r} should match exactly @{term n}-times. For
-  brevity we omit the other bounded regular expressions
-  @{text "r"}$^{\{..n\}}$, @{text "r"}$^{\{n..\}}$ and @{text "r"}$^{\{n..m\}}$
-  which specify an interval for how many times @{text r} should match. Our
-  results extend straightforwardly also to them. The
-  \emph{language} of a regular expression, written $L(r)$, is defined as usual
-  and we omit giving the definition here (see for example \cite{AusafDyckhoffUrban2016}).
+  \noindent where @{const ZERO} stands for the regular expression that
+  does not match any string, @{const ONE} for the regular expression
+  that matches only the empty string and @{term c} for matching a
+  character literal.  The constructors $+$ and $\cdot$ represent
+  alternatives and sequences, respectively.  We sometimes omit the
+  $\cdot$ in a sequence regular expression for brevity.  The
+  \emph{language} of a regular expression, written $L(r)$, is defined
+  as usual and we omit giving the definition here (see for example
+  \cite{AusafDyckhoffUrban2016}).
+  
+  In our work here we also add to the usual ``basic'' regular
+  expressions the \emph{bounded} regular expression @{term "NTIMES r
+  n"} where the @{term n} specifies that @{term r} should match
+  exactly @{term n}-times. Again for brevity we omit the other bounded
+  regular expressions @{text "r"}$^{\{..n\}}$, @{text "r"}$^{\{n..\}}$
+  and @{text "r"}$^{\{n..m\}}$ which specify intervals for how many
+  times @{text r} should match. The results presented in this paper
+  extend straightforwardly to them too. The importance of the bounded
+  regular expressions is that they are often used in practical
+  applications, such as Snort (a system for detecting network
+  intrusion) and also in XML Schema definitions. According to Bj\"{o}rklund et
+  al~\cite{BjorklundMartensTimm2015}, bounded regular expressions 
+  occur frequently in the latter and can have counters of up to
+  ten million.  The problem is that tools based on the classic notion
+  of automata need to expand @{text "r"}$^{\{n\}}$ into @{text n}
+  connected copies of the automaton for @{text r}. This leads to very
+  inefficient matching algorithms or algorithms that consume large
+  amounts of memory.  A classic example is the regular expression
+  \mbox{@{term "SEQ (SEQ (STAR (ALT a b)) a) (NTIMES (ALT a b) n)"}}
+  where the minimal DFA requires at least $2^{n + 1}$ states (see
+  \cite{CountingSet2020}). Therefore regular expression matching
+  libraries that rely on the classic notion of DFAs often impose 
+  adhoc limits for bounded regular expressions: For example in the
+  regular expression matching library in the Go language the regular expression
+  @{term "NTIMES a 1001"} is not permitted, because no counter can be
+  above 1000, and in the built-in Rust regular expression library
+  expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error
+  message for being too big.  These problems can of course be solved in matching
+  algorithms where automata go beyond the classic notion and for
+  instance include explicit counters (see~\cite{CountingSet2020}).
+  The point here is that Brzozowski derivatives and the algorithms by
+  Sulzmann and Lu can be straightforwardly extended to deal with
+  bounded regular expressions and moreover the resulting code
+  still consists of only simple recursive functions and inductive
+  datatypes. Finally, bounded regular expressions 
+  do not destroy our finite boundedness property, which we shall
+  prove later on.%, because during the lexing process counters will only be
+  %decremented.
+  
 
   Central to Brzozowski's regular expression matcher are two functions
   called @{text nullable} and \emph{derivative}. The latter is written
@@ -264,7 +300,7 @@
   regular expressions.  
 %
 \begin{center}
-\begin{tabular}{cc}
+\begin{tabular}{c @ {\hspace{-1mm}}c}
   \begin{tabular}{r@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
   @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
   @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
@@ -274,7 +310,8 @@
   & & @{text "then"} @{term "ALT (SEQ (der c r\<^sub>1) r\<^sub>2) (der c r\<^sub>2)"}\\
   & & @{text "else"} @{term "SEQ (der c r\<^sub>1) r\<^sub>2"}\\
   % & & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
-  @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}
+  @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}\\
+  @{thm (lhs) der.simps(7)} & $\dn$ & @{thm (rhs) der.simps(7)}
   \end{tabular}
   &
   \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
@@ -283,7 +320,10 @@
   @{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\
   @{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
   @{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
-  @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}%\medskip\\
+  @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\\
+  @{thm (lhs) nullable.simps(7)} & $\dn$ & @{text "if"} @{text "n = 0"}\\
+  & & @{text "then"} @{const "True"}\\
+  & & @{text "else"} @{term "nullable r"}
   \end{tabular}  
 \end{tabular}  
 \end{center}
@@ -307,6 +347,12 @@
 
 \noindent
 It is a fun exercise to formally prove this property in a theorem prover.
+We are aware
+of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
+Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
+of the work by Krauss and Nipkow~\cite{Krauss2011}.  And another one
+in Coq is given by Coquand and Siles \cite{Coquand2012}.
+Also Ribeiro and Du Bois give one in Agda~\cite{RibeiroAgda2017}.
 
 The novel idea of Sulzmann and Lu is to extend this algorithm for 
 lexing, where it is important to find out which part of the string
@@ -334,26 +380,38 @@
 
 
   Sulzmann and Lu also define inductively an
-  inhabitation relation that associates values to regular expressions. This
-  is defined by the following six rules for the values:
+  inhabitation relation that associates values to regular expressions. Our
+  version of this relation is defined the following six rules for the values:
   %
   \begin{center}
-  \begin{tabular}{@ {\hspace{-12mm}}c@ {}}
   \begin{tabular}{@ {}c@ {}}
-  \\[-8mm]
-  @{thm[mode=Axiom] Prf.intros(4)}\\
-  @{thm[mode=Axiom] Prf.intros(5)[of "c"]}
-  \end{tabular}
-  \quad
-  @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\quad
-  @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>2" "r\<^sub>1"]}\quad
-  @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\quad
-  @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\mbox{}\hspace{-10mm}\mbox{}
+  @{thm[mode=Axiom] Prf.intros(4)}\qquad
+  @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\qquad
+  @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\qquad
+  @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>2" "r\<^sub>1"]}\qquad
+  @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\
+  @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\qquad
+  $\mprset{flushleft}\inferrule{
+  @{thm (prem 1) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}\\\\
+  @{thm (prem 2) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}\quad
+  @{thm (prem 3) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}
+  }
+  {@{thm (concl) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}}
+  $
   \end{tabular}
   \end{center}
 
   \noindent Note that no values are associated with the regular expression
-  @{term ZERO}, since it cannot match any string.
+  @{term ZERO}, since it cannot match any string. Interesting is our version
+  of the rule for @{term "STAR r"} where we require that each value
+  in  @{term vs} flattens to a non-empty string. This means if @{term "STAR r"} ``fires''
+  one or more times, then each copy needs to match a non-empty string.
+  Similarly, in the rule
+  for @{term "NTIMES r n"} we require that the length of the list
+  @{term "vs\<^sub>1 @ vs\<^sub>2"} equals @{term n} (meaning the regular expression
+  @{text r} matches @{text n}-times) and that the first segment of this list
+  contains values that flatten to non-empty strings followed by a segment that
+  only contains values that flatten to the empty string. 
   It is routine to establish how values ``inhabiting'' a regular
   expression correspond to the language of a regular expression, namely
   @{thm L_flat_Prf}.
@@ -378,7 +436,7 @@
 
 \begin{figure}[t]
   \begin{center}\small%
-  \begin{tabular}{@ {}c@ {}}
+  \begin{tabular}{@ {\hspace{-2mm}}c@ {}}
   \\[-9mm]
   @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad
   @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad
@@ -397,13 +455,23 @@
     @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
     @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
     @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
-   {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\\[-4mm]
+   {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\medskip\smallskip\\
+  \mprset{sep=4mm} 
+  @{thm[mode=Rule] Posix.intros(9)}\<open>Pn[]\<close>\quad 
+  $\mprset{flushleft}
+   \inferrule
+   {@{thm (prem 1) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
+    @{thm (prem 2) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
+    @{thm (prem 3) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \\\\
+    @{thm (prem 4) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]}}
+   {@{thm (concl) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]}}$\<open>Pn+\<close>
+  \\[-4mm]   
   \end{tabular}
   \end{center}
   \caption{The inductive definition of POSIX values taken from our earlier paper \cite{AusafDyckhoffUrban2016}. The ternary relation, written $(s, r) \rightarrow v$, formalises the notion
   of given a string $s$ and a regular
   expression $r$ what is the unique value $v$ that satisfies the informal POSIX constraints for
-  regular expression matching.}\label{POSIXrules}
+  regular expression matching.\smallskip}\label{POSIXrules}
   \end{figure}
 
   The clever idea by Sulzmann and Lu \cite{Sulzmann2014} in their first algorithm is to define
@@ -419,7 +487,8 @@
   @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
   @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
   @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
-  \end{tabular}\smallskip\\
+  @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\
+  \end{tabular}\medskip\\
 
   \begin{tabular}{@ {}cc@ {}}
   \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}}
@@ -429,16 +498,18 @@
   @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & 
       @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
   @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ 
-      & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}
+      & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\
+  @{thm (lhs) injval.simps(8)[of "r" "n" "c" "v" "vs"]} & $\dn$ 
+      & @{thm (rhs) injval.simps(8)[of "r" "n" "c" "v" "vs"]}
   \end{tabular}
   &
-  \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}}
+  \begin{tabular}{@ {\hspace{-3mm}}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}}
   @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ 
       & @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
   @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ 
       & @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
   @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$\\ 
-  \multicolumn{3}{@ {}r@ {}}{@{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}}
+  \multicolumn{3}{@ {}r@ {}}{@{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}}\\ \mbox{}\\ \mbox{}
   \end{tabular}
   \end{tabular}
   \end{tabular}\smallskip
@@ -447,7 +518,10 @@
   \noindent
   The function @{text mkeps} is run when the last derivative is nullable, that is
   the string to be matched is in the language of the regular expression. It generates
-  a value for how the last derivative can match the empty string. The injection function
+  a value for how the last derivative can match the empty string. In case
+  of @{term "NTIMES r n"} we use the function @{term replicate} in order to generate
+  a list of exactly @{term n} copies, which is the length of the list we expect in this
+  case.  The injection function
   then calculates the corresponding value for each intermediate derivative until
   a value for the original regular expression is generated.
   Graphically the algorithm by
@@ -586,6 +660,7 @@
                $\;\mid\;$ @{term "AALTs bs rs"}
                $\;\mid\;$ @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}
                $\;\mid\;$ @{term "ASTAR bs r"}
+	       $\;\mid\;$ @{term "ANTIMES bs r n"}
   \end{center}
 
   \noindent where @{text bs} stands for bitsequences; @{text r},
@@ -654,12 +729,17 @@
         $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
   & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$
         \hspace{2mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
-  $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
-  $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & 
+  $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
+  $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ & 
          $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
   & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$
-        \hspace{2mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\medskip\\
-  \multicolumn{3}{l}{$\textit{decode}\,bs\,r$ $\dn$ 
+        \hspace{2mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\\
+  $\textit{decode}'\,(\S\!::\!bs)\,(r^{\{n\}})$ & $\dn$ & $(\Stars\,[], bs)$\\
+  $\textit{decode}'\,(\Z\!::\!bs)\,(r^{\{n\}})$ & $\dn$ & 
+         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
+  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^{\{n - 1\}}$
+        \hspace{2mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\medskip\\	
+  \multicolumn{3}{@ {}l}{$\textit{decode}\,bs\,r$ $\dn$ 
      $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$
      $\;\,\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
        \textit{else}\;\textit{None}$}\\[-4mm]   
@@ -704,7 +784,9 @@
   $\textit{fuse}\,bs\,(\textit{SEQ}\,bs'\,r_1\,r_2)$ & $\dn$ &
      $\textit{SEQ}\,(bs\,@\,bs')\,r_1\,r_2$\\
   $\textit{fuse}\,bs\,(\textit{STAR}\,bs'\,r)$ & $\dn$ &
-     $\textit{STAR}\,(bs\,@\,bs')\,r$
+     $\textit{STAR}\,(bs\,@\,bs')\,r$\\
+  $\textit{fuse}\,bs\,(\textit{NT}\,bs'\,r\,n)$ & $\dn$ &
+     $\textit{NT}\,(bs\,@\,bs')\,r\,n$   
   \end{tabular}
   \end{tabular}
   \end{center}    
@@ -722,7 +804,9 @@
   &
   \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
   $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
-  $(r^*)^\uparrow$ & $\dn$ & $\textit{STAR}\;[]\,r^\uparrow$
+  $(r^*)^\uparrow$ & $\dn$ & $\textit{STAR}\;[]\,r^\uparrow$\\
+  $(r^{\{n\}})^\uparrow$ & $\dn$ &
+        $\textit{NT}\;[]\,r^\uparrow\,n$
   \end{tabular}
   &
   \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
@@ -745,8 +829,8 @@
   bitcoded regular expressions.
   %
   \begin{center}
-  \begin{tabular}{@ {}c@ {\hspace{2mm}}c@ {}}
-  \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+  \begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{0mm}}c@ {}}
+  \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l}
   $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{False}$\\
   $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{True}$\\
   $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{False}$\\
@@ -755,21 +839,27 @@
   $\textit{bnullable}\,(\textit{SEQ}\,bs\,r_1\,r_2)$ & $\dn$ &
      $\textit{bnullable}\,r_1\wedge \textit{bnullable}\,r_2$\\
   $\textit{bnullable}\,(\textit{STAR}\,bs\,r)$ & $\dn$ &
-     $\textit{True}$
+     $\textit{True}$\\
+  $\textit{bnullable}\,(\textit{NT}\,bs\,r\,n)$ & $\dn$ &\\
+  \multicolumn{3}{l}{$\textit{if}\;n = 0\;\textit{then}\;\textit{True}\;
+  \textit{else}\;\textit{bnullable}\,r$}\\
   \end{tabular}
   &
-  \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
+  \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l@ {}}
   $\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$\\
   $\textit{bmkeps}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &
   $bs\,@\,\textit{bmkepss}\,\rs$\\
   $\textit{bmkeps}\,(\textit{SEQ}\,bs\,r_1\,r_2)$ & $\dn$ &\\
-  \multicolumn{3}{r}{$bs \,@\,\textit{bmkeps}\,r_1\,@\, \textit{bmkeps}\,r_2$}\\
+  \multicolumn{3}{l}{$bs \,@\,\textit{bmkeps}\,r_1\,@\, \textit{bmkeps}\,r_2$}\\
   $\textit{bmkeps}\,(\textit{STAR}\,bs\,r)$ & $\dn$ &
      $bs \,@\, [\S]$\\
-  $\textit{bmkepss}\,(r\!::\!\rs)$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,r$\\
-  & &$\textit{then}\;\textit{bmkeps}\,r$\\
-  & &$\textit{else}\;\textit{bmkepss}\,\rs$
+   $\textit{bmkeps}\,(\textit{NT}\,bs\,r\,n)$ & $\dn$ &\\
+   \multicolumn{3}{l@ {}}{$\textit{if}\;n=0\;\textit{then}\;bs \,@\, [\S]$}\\
+   \multicolumn{3}{l@ {}}{$\textit{else}\,bs \,@\, [\Z] \,@\, \textit{bmkeps}\,r\,@\,
+        \textit{bmkeps}\,(@{term "ANTIMES [] r (n - 1)"})$}\\   
+  $\textit{bmkepss}\,(r\!::\!\rs)$ & $\dn$ &\\
+  \multicolumn{3}{l}{$\textit{if}\;\textit{bnullable}\,r\,\textit{then}\;\textit{bmkeps}\,r\;
+  \textit{else}\;\textit{bmkepss}\,\rs$}
   \end{tabular}
   \end{tabular}
   \end{center}    
@@ -783,8 +873,8 @@
   % 
   \begin{center}
   \begin{tabular}{@ {}lcl@ {}}
-  \multicolumn{3}{@ {}l}{$(\textit{ZERO})\backslash c$ $\;\dn\;$ $\textit{ZERO}$ \quad\qquad  
-  $(\textit{ONE}\;bs)\backslash c$ $\;\dn\;$ $\textit{ZERO}$}\\  
+  $(\textit{ZERO})\backslash c$ & $\;\dn\;$ & $\textit{ZERO}$\\ 
+  $(\textit{ONE}\;bs)\backslash c$ & $\;\dn\;$ & $\textit{ZERO}$\\  
   $(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ &
         $\textit{if}\;c=d\; \;\textit{then}\;
          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
@@ -796,8 +886,12 @@
   $\;(\textit{fuse}\,(\textit{bmkeps}\,r_1)\,(r_2\backslash c))$\\
   & &$\textit{else}\;\textit{SEQ}\,bs\,(r_1\backslash c)\,r_2$\\
   $(\textit{STAR}\,bs\,r)\backslash c$ & $\dn$ &
-      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\backslash c))\,
-       (\textit{STAR}\,[]\,r)$
+      $\textit{SEQ}\;(bs\,@\,[\Z])\,(r\backslash c)\,
+       (\textit{STAR}\,[]\,r)$\\
+  $(\textit{NT}\,bs\,r\,n)\backslash c$ & $\dn$ &
+      $\textit{if}\;n = 0\; \textit{then}\;\textit{ZERO}\;\textit{else}\;
+      \textit{SEQ}\;(bs\,@\,[\Z])\,(r\backslash c)\,
+       (\textit{NT}\,[]\,r\,(n - 1))$     
   \end{tabular}    
   \end{center}
 
@@ -838,7 +932,9 @@
   @{thm (lhs) retrieve.simps(6)[of _ "r\<^sub>1" "r\<^sub>2" "v\<^sub>1" "v\<^sub>2"]}
       & $\dn$ & @{thm (rhs) retrieve.simps(6)[of _ "r\<^sub>1" "r\<^sub>2" "v\<^sub>1" "v\<^sub>2"]}\\
   @{thm (lhs) retrieve.simps(7)} & $\dn$ & @{thm (rhs) retrieve.simps(7)}\\
-  @{thm (lhs) retrieve.simps(8)} & $\dn$ & @{thm (rhs) retrieve.simps(8)}
+  @{thm (lhs) retrieve.simps(8)} & $\dn$ & @{thm (rhs) retrieve.simps(8)}\\
+  @{thm (lhs) retrieve.simps(9)} & $\dn$ & @{thm (rhs) retrieve.simps(9)}\\
+  @{thm (lhs) better_retrieve2} & $\dn$ & @{thm (rhs) better_retrieve2}
   \end{tabular}
   \end{center}
 
@@ -957,7 +1053,7 @@
 
 text {*
 
-     Derivatives as calculated by Brzozowski’s method are usually more
+     Derivatives as calculated by Brzozowski's method are usually more
      complex regular expressions than the initial one; the result is
      that derivative-based matching and lexing algorithms are
      often abysmally slow if the ``growth problem'' is not addressed. As Sulzmann and Lu wrote, various
@@ -1006,7 +1102,7 @@
      \[
      @{term "ALTs bs\<^sub>1 ((ALTs bs\<^sub>2 rs\<^sub>2) # rs\<^sub>1)"}
      \quad\xrightarrow{bsimp}\quad
-     @{term "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) # rs\<^sub>1)"}
+     @{term "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"}
      \]
 
      \noindent
@@ -1073,6 +1169,7 @@
      @{thm (lhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]} & $\dn$ &
      @{thm (rhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]}\\
      @{thm (lhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]}\\
+     @{thm (lhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]}\\
      \end{tabular}
      \end{center}
 
@@ -1089,7 +1186,7 @@
      \begin{center}
      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
      \multicolumn{3}{@ {}c}{@{thm (lhs) flts.simps(1)} $\dn$ @{thm (rhs) flts.simps(1)} \qquad\qquad\qquad\qquad
-     @{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)}}\\
+     @{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)}}\smallskip\\
      @{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\
      \end{tabular}
      \end{center}
@@ -1135,7 +1232,7 @@
      \end{center}
 
      \noindent
-     We believe our recursive function @{term bsimp} simplifies regular
+     We believe our recursive function @{term bsimp} simplifies bitcoded regular
      expressions as intended by Sulzmann and Lu. There is no point in applying the
      @{text bsimp} function repeatedly (like the simplification in their paper which needs to be
      applied until a fixpoint is reached) because we can show that @{term bsimp} is idempotent,
@@ -1149,7 +1246,9 @@
      This can be proved by induction on @{text r} but requires a detailed analysis
      that the de-nesting of alternatives always results in a flat list of regular
      expressions. We omit the details since it does not concern the correctness proof.
-     
+     %It might be interesting to not that we do not simplify inside @{term STAR} and
+     %@{text NT}: the reason is that we want to keep the
+
      Next we can include simplification after each derivative step leading to the
      following notion of bitcoded derivatives:
      
@@ -1228,7 +1327,7 @@
      
 
      \noindent
-     We can show that this rewrite system preserves @{term bnullable}, that 
+     We can also show that this rewrite system preserves @{term bnullable}, that 
      is simplification does not affect nullability:
 
      \begin{lemma}\label{lembnull}
@@ -1309,7 +1408,7 @@
   @{thm[mode=Axiom] bs6}$A0$\quad
   @{thm[mode=Axiom] bs7}$A1$\quad
   @{thm[mode=Rule] bs10[of "rs\<^sub>1" "rs\<^sub>2"]}$AL$\\
-  @{thm[mode=Rule] ss2[of "rs\<^sub>1" "rs\<^sub>2"]}$LT$\quad
+  @{thm[mode=Rule] rrewrite_srewrite.ss2[of "rs\<^sub>1" "rs\<^sub>2"]}$LT$\quad
   @{thm[mode=Rule] ss3[of "r\<^sub>1" "r\<^sub>2"]}$LH$\quad
   @{thm[mode=Axiom] ss4}$L\ZERO$\quad
   @{thm[mode=Axiom] ss5[of "bs" "rs\<^sub>1" "rs\<^sub>2"]}$LS$\medskip\\
@@ -1342,7 +1441,8 @@
 \end{center}
 
 \noindent
-We prove this by induction on $r$. The base cases for @{term AZERO},
+Note that the bound $N$ is a bound for \emph{all} strings, no matter how long they are.
+We establish this bound by induction on $r$. The base cases for @{term AZERO},
 @{term "AONE bs"} and @{term "ACHAR bs c"} are straightforward. The interesting case is
 for sequences of the form @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}. In this case our induction
 hypotheses state $\exists N_1. \forall s. \; \llbracket{}@{term "bders_simp r\<^sub>1 s"}\rrbracket \leq N_1$ and
@@ -1385,8 +1485,27 @@
 we need to introduce simplifications, such as
  $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
 which exist for partial derivatives. However, if we introduce them in our
-setting we would lose the POSIX property of our calculated values. We leave better
-bounds for future work.\\[-6.5mm]
+setting we would lose the POSIX property of our calculated values. For example
+given the regular expressions @{term "SEQ (ALT a ab) (ALT b ONE)"} and the string $[a, b]$, then our
+algorithm generates the following correct POSIX value
+%
+\[
+@{term "Seq (Right (Seq (Char a) (Char b))) (Right Empty)"}
+\]
+
+\noindent
+Essentially it matches the string with the longer @{text "Right"}-alternative in the
+first sequence (and then the `rest' with the empty regular expression @{const ONE} from the second sequence). 
+If we add the simplification above, then we obtain the following value
+%
+\[
+@{term "Seq (Left (Char a)) (Left (Char b))"}
+\]
+
+\noindent
+where the @{text "Left"}-alternatives get priority. However, this violates
+the POSIX rules and we have not been able to
+reconcile this problem. Therefore we leave better bounds for future work.\\[-6.5mm]
 
 *}
 
@@ -1423,52 +1542,112 @@
    %point-of-view it is really important to have the formal proofs of
    %the corresponding properties at hand.
 
-   We have also developed a
-   healthy suspicion when experimental data is used to back up
+   We can of course only make a claim about the correctness and the sizes of the
+   derivatives, not about the efficiency or runtime of our version of
+   Sulzman and Lu's algorithm. But we found the size is an important
+   first indicator about efficiency: clearly if the derivatives can
+   grow to arbitrarily big sizes and the algorithm needs to traverse
+   the derivatives possibly several times, then the algorithm will be
+   slow---excruciatingly slow that is. Other works seems to make
+   stronger claims, but during our work we have developed a healthy
+   suspicion when for example experimental data is used to back up
    efficiency claims. For example Sulzmann and Lu write about their
-   equivalent of @{term blexer_simp} \textit{``...we can incrementally compute
-   bitcoded parse trees in linear time in the size of the input''}
-   \cite[Page 14]{Sulzmann2014}. 
-   Given the growth of the
-   derivatives in some cases even after aggressive simplification, this
-   is a hard to believe claim. A similar claim about a theoretical runtime
-   of @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which calculates
-   tokens according to POSIX rules~\cite{verbatim}. For this Verbatim uses Brzozowski's
-   derivatives like in our work. 
-   The authors write: \textit{``The results of our empirical tests [..] confirm that Verbatim has
-   @{text "O(n\<^sup>2)"} time complexity.''} \cite[Section~VII]{verbatim}.
-   While their correctness proof for Verbatim is formalised in Coq, the claim about
-   the runtime complexity is only supported by some emperical evidence obtained
-   by using the code extraction facilities of Coq.
-   Given our observation with the ``growth problem'' of derivatives,
-   we
-   tried out their extracted OCaml code with the example
-   \mbox{@{text "(a + aa)\<^sup>*"}} as a single lexing rule, and it took for us around 5 minutes to tokenise a
-   string of 40 $a$'s and that increased to approximately 19 minutes when the
-   string is 50 $a$'s long. Taking into account that derivatives are not simplified in the Verbatim
-   lexer, such numbers are not surprising. 
-   Clearly our result of having finite
-   derivatives might sound rather weak in this context but we think such effeciency claims
-   really require further scrutiny.
+   equivalent of @{term blexer_simp} \textit{``...we can incrementally
+   compute bitcoded parse trees in linear time in the size of the
+   input''} \cite[Page 14]{Sulzmann2014}.  Given the growth of the
+   derivatives in some cases even after aggressive simplification,
+   this is a hard to believe claim. A similar claim about a
+   theoretical runtime of @{text "O(n\<^sup>2)"} is made for the
+   Verbatim lexer, which calculates tokens according to POSIX
+   rules~\cite{verbatim}. For this, Verbatim uses Brzozowski's
+   derivatives like in our work.  The authors write: \textit{``The
+   results of our empirical tests [..] confirm that Verbatim has
+   @{text "O(n\<^sup>2)"} time complexity.''}
+   \cite[Section~VII]{verbatim}.  While their correctness proof for
+   Verbatim is formalised in Coq, the claim about the runtime
+   complexity is only supported by some emperical evidence obtained by
+   using the code extraction facilities of Coq.  Given our observation
+   with the ``growth problem'' of derivatives, we tried out their
+   extracted OCaml code with the example \mbox{@{text "(a +
+   aa)\<^sup>*"}} as a single lexing rule, and it took for us around 5
+   minutes to tokenise a string of 40 $a$'s and that increased to
+   approximately 19 minutes when the string is 50 $a$'s long. Taking
+   into account that derivatives are not simplified in the Verbatim
+   lexer, such numbers are not surprising.  Clearly our result of
+   having finite derivatives might sound rather weak in this context
+   but we think such effeciency claims really require further
+   scrutiny.
+
+   The contribution of this paper is to make sure derivatives do not
+   grow arbitrarily big (universially) In the example \mbox{@{text "(a
+   + aa)\<^sup>*"}}, \emph{all} derivatives have a size of 17 or
+   less. The result is that lexing a string of, say, 50\,000 a's with
+   the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes
+   approximately 10 seconds with our Scala implementation of the
+   presented algorithm.
 
-   The contribution of this paper is to make sure
-   derivatives do not grow arbitrarily big (universially) In the example \mbox{@{text "(a + aa)\<^sup>*"}},
-   \emph{all} derivatives have a size of 17 or less. The result is that
-   lexing a string of, say, 50\,000 a's with the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes approximately
-   10 seconds with our Scala implementation
-   of the presented algorithm. Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
+   Finally, let us come back to the point about bounded regular
+   expressions. We have in this paper only shown that @{term "NTIMES r
+   n"} can be included, but all our results extend straightforwardly
+   also to the other bounded regular expressions. We find bounded
+   regular expressions fit naturally into the setting of Brzozowski
+   derivatives and the bitcoded regular expressions by Sulzmann and Lu.
+   In contrast bounded regular expressions are often the Achilles'
+   heel in regular expression matchers that use the traditional
+   automata-based approach to lexing, primarily because they need to expand the
+   counters of bounded regular expressions into $n$-connected copies
+   of an automaton. This is not needed in Sulzmann and Lu's algorithm.
+   To see the difference consider for example the regular expression @{term "SEQ (NTIMES a
+   1001) (STAR a)"}, which is not permitted in the Go language because
+   the counter is too big. In contrast we have no problem with
+   matching this regular expression with, say 50\,000 a's, because the
+   counters can be kept compact. In fact, the overall size of the
+   derivatives is never greater than 5 in this example. Even in the
+   example from Section 2, where Rust raises an error message, namely
+   \mbox{@{text "a\<^bsup>{1000}{100}{5}\<^esup>"}}, the maximum size for
+   our derivatives is a moderate 14.
+
+   Let us also compare our work to the verified Verbatim++ lexer where
+   the authors of the Verbatim lexer introduced a number of
+   improvements and optimisations, for example memoisation
+   \cite{verbatimpp}. However, unlike Verbatim, which works with
+   derivatives like in our work, Verbatim++ compiles first a regular
+   expression into a DFA. While this makes lexing fast in many cases,
+   with examples of bounded regular expressions like
+   \mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}}
+   one needs to represent them as
+   sequences of $a \cdot a \cdot \ldots \cdot a$ (500 a's in sequence). We have run
+   their extracted code with such a regular expression as a
+   single lexing rule and a string of 50\,000 a's---lexing in this case
+   takes approximately 5~minutes. We are not aware of any better
+   translation using the traditional notion of DFAs. Therefore we
+   prefer to stick with calculating derivatives, but attempt to make
+   this calculation (in the future) as fast as possible. What we can guaranty
+   with the presented work is that the maximum size of the derivatives
+   for this example is not bigger than 9. This means our Scala
+   implementation only needs a few seconds for this example.
+   %
+   %
+   %Possible ideas are
+   %zippers which have been used in the context of parsing of
+   %context-free grammars \cite{zipperparser}.
+   \medskip
+
+   \noindent
+   Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
    %\\[-10mm]
 
 
 %%\bibliographystyle{plain}
 \bibliography{root}
 
+\newpage
 \appendix
 
 \section{Some Proofs}
 
 While we have formalised \emph{all} results in Isabelle/HOL, below we shall give some
-rough details of our reasoning in ``English''.
+rough details of our reasoning in ``English'' if this is helpful for reviewing.
 
 \begin{proof}[Proof of Lemma~\ref{codedecode}]
   This follows from the property that

--- a/thys3/document/root.bib	Sat Jul 16 18:34:46 2022 +0100
+++ b/thys3/document/root.bib	Sun Jul 17 13:07:05 2022 +0100
@@ -8,6 +8,27 @@
   pages = {46--90}
 }
 
+@InProceedings{BjorklundMartensTimm2015,
+  author = {H.~Bj\"{o}rklund and W.~Martens and T.~Timm},
+  title = {{E}fficient {I}ncremental {E}valuation of {S}uccinct
+           {R}egular {E}xpressions},
+  booktitle = {Proc.~of the 24th ACM Conf.~on Information and Knowledge Management (CIKM)},	   
+  year = {2015},
+  pages = {1541--1550}
+}
+
+@Article{CountingSet2020,
+author = {L.~Turo\v{n}ov\'{a} and L.~Hol\'{\i}k
+          and O.~Leng\'{a}l and O.~Saarikivi
+	  and M.~Veanes and T.~Vojnar},
+title = {{R}egex {M}atching with {C}ounting-{S}et {A}utomata},
+year = {2020},
+journal = {Proceedings of the ACM on Programming Languages (PACMPL)},
+publisher = {ACM},
+volume = {4},
+pages     = {218:1--218:30}
+}
+
 
 @Misc{POSIX,
   title =     {{T}he {O}pen {G}roup {B}ase {S}pecification {I}ssue 6 {IEEE} {S}td 1003.1 2004 {E}dition},
@@ -359,3 +380,26 @@
   volume={},
   number={},
   pages={92--100}}
+
+
+@inproceedings{verbatimpp,
+  author    = {D.~Egolf and S.~Lasser and K.~Fisher},
+  title     = {{V}erbatim++: {V}erified, {O}ptimized, and
+               {S}emantically {R}ich {L}exing with
+               {D}derivatives},
+  booktitle = {Proc.~of the 11th {ACM} {SIGPLAN} Conference on Certified
+               Programs and Proofs (CPP)},
+  pages     = {27--39},
+  publisher = {{ACM}},
+  year      = {2022}
+}
+
+@article{zipperparser,
+ author = {P.~Darragh and M.~D.~Adams},
+ title = {{P}arsing with {Z}ippers ({F}unctional {P}earl)},
+ year = {2020},
+ issue_date = {August 2020},
+ publisher = {Association for Computing Machinery},
+ volume = {4},
+ journal = {Proc. ACM Program. Lang.},
+ number = {ICFP} }
\ No newline at end of file

--- a/thys3/document/root.tex	Sat Jul 16 18:34:46 2022 +0100
+++ b/thys3/document/root.tex	Sun Jul 17 13:07:05 2022 +0100
@@ -18,8 +18,6 @@
 %\setmainfont[Ligatures=TeX]{Palatino Linotype}
 
 
-
-
 \urlstyle{rm}
 \isabellestyle{it}
 \renewcommand{\isastyleminor}{\it}%

--- a/thys3/src/Blexer.thy	Sat Jul 16 18:34:46 2022 +0100
+++ b/thys3/src/Blexer.thy	Sun Jul 17 13:07:05 2022 +0100
@@ -50,7 +50,6 @@
                                     in (Stars_add v vs, bs''))"
 | "decode' [] (NTIMES r n) = (Void, [])"
 | "decode' (S # bs) (NTIMES r n) = (Stars [], bs)"
-(*| "decode' (Z # bs) (NTIMES r 0) = (undefined, bs)"*)
 | "decode' (Z # bs) (NTIMES r n) = (let (v, bs') = decode' bs r in
                                     let (vs, bs'') = decode' bs' (NTIMES r (n - 1)) 
                                     in (Stars_add v vs, bs''))"
@@ -215,8 +214,8 @@
 | "bmkeps(AALTs bs (r#rs)) = 
     (if bnullable(r) then (bs @ bmkeps r) else (bmkeps (AALTs bs rs)))"
 | "bmkeps(ASTAR bs r) = bs @ [S]"
-| "bmkeps(ANTIMES bs r 0) = bs @ [S]"
-| "bmkeps(ANTIMES bs r (Suc n)) = bs @ [Z] @ (bmkeps r) @ bmkeps(ANTIMES [] r n)"
+| "bmkeps(ANTIMES bs r n) = 
+    (if n = 0 then bs @ [S] else bs @ [Z] @ (bmkeps r) @ bmkeps(ANTIMES [] r (n - 1)))"
 apply(pat_completeness)
 apply(auto)
 done
@@ -428,7 +427,8 @@
         apply(auto)
   apply (simp add: retrieve_AALTs_bnullable1)
   using retrieve_AALTs_bnullable1 apply force
-  by (metis retrieve_AALTs_bnullable2)  
+  apply(metis retrieve_AALTs_bnullable2)
+  by (metis Cons_eq_appendI One_nat_def Suc_diff_1 eq_Nil_appendI replicate_Suc retrieve.simps(10))  
   
 
 lemma bder_retrieve:

--- a/thys3/src/BlexerSimp.thy	Sat Jul 16 18:34:46 2022 +0100
+++ b/thys3/src/BlexerSimp.thy	Sun Jul 17 13:07:05 2022 +0100
@@ -106,8 +106,6 @@
   apply(induct r rule: bnullable.induct) 
         apply(auto)
   apply (metis append.assoc bmkeps_retrieve bmkeps_retrieve_AALTs bnullable.simps(4))  
-   apply(case_tac n)
-  apply(auto)
   done
 
 lemma bmkepss_fuse:

--- a/thys3/src/Lexer.thy	Sat Jul 16 18:34:46 2022 +0100
+++ b/thys3/src/Lexer.thy	Sun Jul 17 13:07:05 2022 +0100
@@ -258,7 +258,6 @@
         "v = Seq v1 (Stars vs)" "s = s1 @ s2" 
         "s1 \<in> der c r \<rightarrow> v1" "s2 \<in> (NTIMES r (n - 1)) \<rightarrow> (Stars vs)" "0 < n"
         "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s2 \<and> s1 @ s\<^sub>3 \<in> L (der c r) \<and> s\<^sub>4 \<in> L (NTIMES r (n - 1)))" 
-    
     apply(auto elim: Posix_elims simp add: der_correctness Der_def intro: Posix.intros split: if_splits)
     apply(erule Posix_elims)
     apply(simp)
@@ -287,14 +286,8 @@
           then have "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s2 \<and> (c # s1) @ s\<^sub>3 \<in> L r \<and> s\<^sub>4 \<in> L (NTIMES r (n - 1)))"
             by (simp add: der_correctness Der_def)
         ultimately 
-        have "((c # s1) @ s2) \<in> NTIMES r n \<rightarrow> Stars (injval r c v1 # vs)" 
-           apply (rule_tac Posix.intros)
-               apply(simp_all)
-              apply(case_tac n)
-            apply(simp)
-           using Posix_elims(1) NTIMES.prems apply auto[1]
-             apply(simp)
-             done
+        have "((c # s1) @ s2) \<in> NTIMES r n \<rightarrow> Stars (injval r c v1 # vs)"
+          by (metis One_nat_def Posix_NTIMES1 Suc_eq_plus1 Suc_pred cons(5))
         then show "(c # s) \<in> NTIMES r n \<rightarrow> injval (NTIMES r n) c v" using cons by(simp)
       qed

--- a/thys3/src/PosixSpec.thy	Sat Jul 16 18:34:46 2022 +0100
+++ b/thys3/src/PosixSpec.thy	Sun Jul 17 13:07:05 2022 +0100
@@ -437,9 +437,9 @@
     \<not>(\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s2 \<and> (s1 @ s\<^sub>3) \<in> L r \<and> s\<^sub>4 \<in> L (STAR r))\<rbrakk>
     \<Longrightarrow> (s1 @ s2) \<in> STAR r \<rightarrow> Stars (v # vs)"
 | Posix_STAR2: "[] \<in> STAR r \<rightarrow> Stars []"
-| Posix_NTIMES1: "\<lbrakk>s1 \<in> r \<rightarrow> v; s2 \<in> NTIMES r (n - 1) \<rightarrow> Stars vs; flat v \<noteq> []; 0 < n;
-    \<not>(\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s2 \<and> (s1 @ s\<^sub>3) \<in> L r \<and> s\<^sub>4 \<in> L (NTIMES r (n - 1)))\<rbrakk>
-    \<Longrightarrow> (s1 @ s2) \<in> NTIMES r n \<rightarrow> Stars (v # vs)"
+| Posix_NTIMES1: "\<lbrakk>s1 \<in> r \<rightarrow> v; s2 \<in> NTIMES r n \<rightarrow> Stars vs; flat v \<noteq> [];
+    \<not>(\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s2 \<and> (s1 @ s\<^sub>3) \<in> L r \<and> s\<^sub>4 \<in> L (NTIMES r n))\<rbrakk>
+    \<Longrightarrow> (s1 @ s2) \<in> NTIMES r (n + 1) \<rightarrow> Stars (v # vs)"
 | Posix_NTIMES2: "\<lbrakk>\<forall>v \<in> set vs. [] \<in> r \<rightarrow> v; length vs = n\<rbrakk>
     \<Longrightarrow> [] \<in> NTIMES r n \<rightarrow> Stars vs"  
 
@@ -458,7 +458,6 @@
 using assms
   apply(induct s r v rule: Posix.induct)
   apply(auto simp add: pow_empty_iff)
-  apply (metis Suc_pred concI lang_pow.simps(2))
   by (meson ex_in_conv set_empty)
 
 lemma Posix1a:
@@ -566,17 +565,17 @@
     by (meson in_set_zipE)
 next
   case (Posix_NTIMES1 s1 r v s2 n vs)
-  have "(s1 @ s2) \<in> NTIMES r n \<rightarrow> v2" 
-       "s1 \<in> r \<rightarrow> v" "s2 \<in> NTIMES r (n - 1) \<rightarrow> Stars vs" "flat v \<noteq> []"
-       "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s2 \<and> s1 @ s\<^sub>3 \<in> L r \<and> s\<^sub>4 \<in> L (NTIMES r (n - 1 )))" by fact+
-  then obtain v' vs' where "v2 = Stars (v' # vs')" "s1 \<in> r \<rightarrow> v'" "s2 \<in> (NTIMES r (n - 1)) \<rightarrow> (Stars vs')"
+  have "(s1 @ s2) \<in> NTIMES r (n + 1) \<rightarrow> v2" 
+       "s1 \<in> r \<rightarrow> v" "s2 \<in> NTIMES r n \<rightarrow> Stars vs" "flat v \<noteq> []"
+       "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s2 \<and> s1 @ s\<^sub>3 \<in> L r \<and> s\<^sub>4 \<in> L (NTIMES r n))" by fact+
+  then obtain v' vs' where "v2 = Stars (v' # vs')" "s1 \<in> r \<rightarrow> v'" "s2 \<in> (NTIMES r n) \<rightarrow> (Stars vs')"
   apply(cases) apply (auto simp add: append_eq_append_conv2)
     using Posix1(1) apply fastforce
-    apply (metis One_nat_def Posix1(1) Posix_NTIMES1.hyps(7) append.right_neutral append_self_conv2)
+    apply (metis L.simps(7) Posix1(1) append.left_neutral append.right_neutral)
   using Posix1(2) by blast
   moreover
   have IHs: "\<And>v2. s1 \<in> r \<rightarrow> v2 \<Longrightarrow> v = v2"
-            "\<And>v2. s2 \<in> NTIMES r (n - 1) \<rightarrow> v2 \<Longrightarrow> Stars vs = v2" by fact+
+            "\<And>v2. s2 \<in> NTIMES r n \<rightarrow> v2 \<Longrightarrow> Stars vs = v2" by fact+
   ultimately show "Stars (v # vs) = v2" by auto
 qed
 
@@ -590,8 +589,8 @@
   shows "v \<in> LV r s"
   using assms unfolding LV_def
   apply(induct rule: Posix.induct)
-   apply(auto simp add: intro!: Prf.intros elim!: Prf_elims Posix1a)
-   apply (smt (verit, best) One_nat_def Posix1a Posix_NTIMES1 L.simps(7))
+          apply(auto simp add: intro!: Prf.intros elim!: Prf_elims Posix1a)
+  apply (smt (verit, ccfv_SIG) L.simps(7) Posix1a Posix_NTIMES1 Suc_eq_plus1)
   using Posix1a Posix_NTIMES2 by blast

author	Christian Urban <christian.urban@kcl.ac.uk>
	Sun, 17 Jul 2022 13:07:05 +0100
changeset 569	5af61c89f51e
parent 568	7a579f5533f8
child 570	3ed514e2d93c

thys3/Paper.thy		file \| annotate \| diff \| comparison \| revisions
thys3/document/root.bib		file \| annotate \| diff \| comparison \| revisions
thys3/document/root.tex		file \| annotate \| diff \| comparison \| revisions
thys3/src/Blexer.thy		file \| annotate \| diff \| comparison \| revisions
thys3/src/BlexerSimp.thy		file \| annotate \| diff \| comparison \| revisions
thys3/src/Lexer.thy		file \| annotate \| diff \| comparison \| revisions
thys3/src/PosixSpec.thy		file \| annotate \| diff \| comparison \| revisions