  bders_simp ("_\\\<^sub>b\<^sub>s\<^sub>i\<^sub>m\<^sub>p _" [79, 1000] 76) and

  ZERO ("\<^bold>0" 81) and
  ONE ("\<^bold>1" 81) and
  CH ("_" [1000] 80) and
  ALT ("_ + _" [76,76] 77) and
  SEQ ("_ \<cdot> _" [78,78] 78) and
  STAR ("_\<^sup>*" [79] 78) and
  NTIMES ("_\<^sup>{\<^bsup>_\<^esup>\<^sup>}" [79] 78) and

  val.Void ("Empty" 78) and
  val.Char ("Char _" [1000] 78) and
  val.Left ("Left _" [79] 78) and
  val.Right ("Right _" [1000] 78) and
  val.Seq ("Seq _ _" [79,79] 78) and
  val.Stars ("Stars _" [1000] 78) and

  Prf ("\<turnstile> _ : _" [75,75] 75) and
  Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and

  flat ("|_|" [75] 74) and
  and "rs \<noteq> Nil ==> retrieve (AALTs bs (r#rs)) (Right v) = bs @ retrieve (AALTs [] rs) v"
  apply (metis list.exhaust retrieve.simps(4))
  by (metis list.exhaust retrieve.simps(5))


lemma better_retrieve2:
  shows "retrieve (ANTIMES bs r (n + 1)) (Stars (v#vs)) =
           bs @ [Z] @ retrieve r v @ retrieve (ANTIMES [] r n) (Stars vs)"
  by auto
(*>*)

section {* Introduction *}

text {*

In the last fifteen or so years, Brzozowski's derivatives of regular
expressions have sparked quite a bit of interest in the functional
programming and theorem prover communities.
Derivatives of a
regular expression, written @{term "der c r"}, give a simple solution
to the problem of matching a string @{term s} with a regular
expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then @{term r} matches @{term s} (and {\em vice versa}).
The beauty of
Brzozowski's derivatives \cite{Brzozowski1964} is that they are neatly
expressible in any functional language, and easily definable and
reasoned about in theorem provers---the definitions just consist of
inductive datatypes and simple recursive functions. Another attractive
feature of derivatives is that they can be easily extended to \emph{bounded}
regular expressions, such as @{term "r"}$^{\{@{text n}\}}$ or @{term r}$^{\{..@{text n}\}}$, where numbers or
intervals of numbers specify how many times a regular expression should be used
during matching.


However, there are two difficulties with derivative-based matchers:
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not. This is too

\noindent{}\textbf{Contributions:} We have implemented in Isabelle/HOL
the derivative-based lexing algorithm of Sulzmann and Lu
\cite{Sulzmann2014} where regular expressions are annotated with
bitsequences. We define the crucial simplification function as a
recursive function, without the need of a fixpoint operation. One objective of
the simplification function is to remove duplicates of regular
expressions. For this Sulzmann and Lu use in their paper the standard
@{text nub} function from Haskell's list library, but this function
does not achieve the intended objective with bitcoded regular expressions. The
reason is that in the bitcoded setting, each copy generally has a
different bitcode annotation---so @{text nub} would never ``fire''.
Inspired by Scala's library for lists, we shall instead use a @{text
distinctWith} function that finds duplicates under an ``erasing'' function
which deletes bitcodes before comparing regular expressions.
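To convey the idea, a possible @{text distinctWith} can be sketched in
Scala as follows; this is only an illustration with generic types (the
@{text erase} argument stands for a function that deletes bitcodes) and
is not the definition used in our formalisation:

\begin{verbatim}
// keep the first copy of each element; two elements count as equal
// whenever the erasing function maps them to the same result
def distinctWith[A, B](xs: List[A], erase: A => B,
                       seen: Set[B] = Set[B]()): List[A] =
  xs match {
    case Nil => Nil
    case x :: rest =>
      val e = erase(x)
      if (seen.contains(e)) distinctWith(rest, erase, seen)
      else x :: distinctWith(rest, erase, seen + e)
  }
\end{verbatim}
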
We shall also introduce our \emph{own} argument and definitions for
establishing the correctness of the bitcoded algorithm when
simplifications are included. Finally we
establish that the size of derivatives can be finitely bounded.\medskip%\footnote{ In this paper, we shall first briefly introduce the basic notions
%of regular expressions and describe the definition
%of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves
@{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
@{term "STAR r"} $\mid$
@{term "NTIMES r n"}
\end{center}

\noindent where @{const ZERO} stands for the regular expression that
does not match any string, @{const ONE} for the regular expression
that matches only the empty string and @{term c} for matching a
character literal. The constructors $+$ and $\cdot$ represent
alternatives and sequences, respectively. We sometimes omit the
$\cdot$ in a sequence regular expression for brevity. The
\emph{language} of a regular expression, written $L(r)$, is defined
as usual and we omit giving the definition here (see for example
\cite{AusafDyckhoffUrban2016}).
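
For readers who prefer code, these regular expressions can be
represented, for example, by the following Scala datatype (only an
illustrative sketch; the constructor names mirror the grammar above,
including the bounded constructor discussed next):

\begin{verbatim}
sealed trait Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches the empty string
case class CH(c: Char) extends Rexp              // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // alternative  r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // sequence     r1 . r2
case class STAR(r: Rexp) extends Rexp            // star         r*
case class NTIMES(r: Rexp, n: Int) extends Rexp  // bounded      r{n}
\end{verbatim}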

In our work here we also add to the usual ``basic'' regular
expressions the \emph{bounded} regular expression @{term "NTIMES r
n"} where the @{term n} specifies that @{term r} should match
exactly @{term n}-times. Again for brevity we omit the other bounded
regular expressions @{text "r"}$^{\{..n\}}$, @{text "r"}$^{\{n..\}}$
and @{text "r"}$^{\{n..m\}}$ which specify intervals for how many
times @{text r} should match. The results presented in this paper
extend straightforwardly to them too. The importance of the bounded
regular expressions is that they are often used in practical
applications, such as Snort (a system for detecting network
intrusion) and also in XML Schema definitions. According to Bj\"{o}rklund et
al.~\cite{BjorklundMartensTimm2015}, bounded regular expressions
occur frequently in the latter and can have counters of up to
ten million. The problem is that tools based on the classic notion
of automata need to expand @{text "r"}$^{\{n\}}$ into @{text n}
connected copies of the automaton for @{text r}. This leads to very
inefficient matching algorithms or algorithms that consume large
amounts of memory. A classic example is the regular expression
\mbox{@{term "SEQ (SEQ (STAR (ALT a b)) a) (NTIMES (ALT a b) n)"}}
where the minimal DFA requires at least $2^{n + 1}$ states (see
\cite{CountingSet2020}). Therefore regular expression matching
libraries that rely on the classic notion of DFAs often impose
ad hoc limits for bounded regular expressions: For example, in the
regular expression matching library in the Go language, the regular expression
@{term "NTIMES a 1001"} is not permitted, because no counter can be
above 1000, and in the built-in Rust regular expression library
expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error
message for being too big. These problems can of course be solved in matching
algorithms where automata go beyond the classic notion and for
instance include explicit counters (see~\cite{CountingSet2020}).
The point here is that Brzozowski derivatives and the algorithms by
Sulzmann and Lu can be straightforwardly extended to deal with
bounded regular expressions and moreover the resulting code
still consists of only simple recursive functions and inductive
datatypes. Finally, bounded regular expressions
do not destroy our finite boundedness property, which we shall
prove later on.%, because during the lexing process counters will only be
%decremented.


Central to Brzozowski's regular expression matcher are two functions
called @{text nullable} and \emph{derivative}. The latter is written
$r\backslash c$ for the derivative of the regular expression $r$
w.r.t.~the character $c$. Both functions are defined by recursion over
regular expressions.
%
\begin{center}
\begin{tabular}{c @ {\hspace{-1mm}}c}
\begin{tabular}{r@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
@{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
@{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
@{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
@{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{text "if"} @{term "nullable(r\<^sub>1)"}\\
& & @{text "then"} @{term "ALT (SEQ (der c r\<^sub>1) r\<^sub>2) (der c r\<^sub>2)"}\\
& & @{text "else"} @{term "SEQ (der c r\<^sub>1) r\<^sub>2"}\\
% & & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}\\
@{thm (lhs) der.simps(7)} & $\dn$ & @{thm (rhs) der.simps(7)}
\end{tabular}
&
\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
@{thm (lhs) nullable.simps(1)} & $\dn$ & @{thm (rhs) nullable.simps(1)}\\
@{thm (lhs) nullable.simps(2)} & $\dn$ & @{thm (rhs) nullable.simps(2)}\\
@{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\
@{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\\
@{thm (lhs) nullable.simps(7)} & $\dn$ & @{text "if"} @{text "n = 0"}\\
& & @{text "then"} @{const "True"}\\
& & @{text "else"} @{term "nullable r"}
\end{tabular}
\end{tabular}
\end{center}
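
\noindent
Both functions carry over almost verbatim into a functional
programming language. The following Scala code is only an
illustrative sketch (it is neither our Isabelle definition nor our
Scala implementation); it reuses the @{text Rexp} datatype from the
earlier sketch, and its clause for the bounded constructor
decrements the counter, just like the bitcoded derivative shown
later:

\begin{verbatim}
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CH(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
  case NTIMES(r1, n) => if (n == 0) true else nullable(r1)
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CH(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
  case NTIMES(r1, n) =>
    if (n == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, n - 1))
}

// the matcher: successive derivatives followed by a nullability test
def ders(s: List[Char], r: Rexp): Rexp =
  s.foldLeft(r)((r1, c) => der(c, r1))
def matcher(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
\end{verbatim}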

\noindent
function, written @{term "flat DUMMY"}. It traverses a value and
collects the characters contained in it \cite{AusafDyckhoffUrban2016}.
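
A Scala rendering of the values and of the flatten function might
look as follows (again only an illustrative sketch; the constructors
@{text Char} and @{text Seq} are called @{text Chr} and @{text Sequ}
here merely to avoid clashes with Scala built-ins):

\begin{verbatim}
sealed trait Val
case object Empty extends Val                  // value for ONE
case class Chr(c: Char) extends Val            // value for CH(c)
case class Left(v: Val) extends Val            // left alternative
case class Right(v: Val) extends Val           // right alternative
case class Sequ(v1: Val, v2: Val) extends Val  // value for SEQ
case class Stars(vs: List[Val]) extends Val    // value for STAR and NTIMES

// the flatten function |_|: collect the characters in a value
def flat(v: Val): String = v match {
  case Empty => ""
  case Chr(c) => c.toString
  case Left(v1) => flat(v1)
  case Right(v1) => flat(v1)
  case Sequ(v1, v2) => flat(v1) + flat(v2)
  case Stars(vs) => vs.map(flat).mkString
}
\end{verbatim}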


Sulzmann and Lu also define inductively an
inhabitation relation that associates values to regular expressions. Our
version of this relation is defined by the following seven rules for the values:
%
\begin{center}
\begin{tabular}{@ {}c@ {}}
@{thm[mode=Axiom] Prf.intros(4)}\qquad
@{thm[mode=Axiom] Prf.intros(5)[of "c"]}\qquad
@{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\qquad
@{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>2" "r\<^sub>1"]}\qquad
@{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\
@{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\qquad
$\mprset{flushleft}\inferrule{
@{thm (prem 1) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}\\\\
@{thm (prem 2) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}\quad
@{thm (prem 3) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}
}
{@{thm (concl) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}}
$
\end{tabular}
\end{center}

\noindent Note that no values are associated with the regular expression
@{term ZERO}, since it cannot match any string. Of interest is our version
of the rule for @{term "STAR r"}, where we require that each value
in @{term vs} flattens to a non-empty string. This means that if @{term "STAR r"} ``fires''
one or more times, then each copy needs to match a non-empty string.
Similarly, in the rule
for @{term "NTIMES r n"} we require that the length of the list
@{term "vs\<^sub>1 @ vs\<^sub>2"} equals @{term n} (meaning the regular expression
@{text r} matches @{text n}-times) and that the first segment of this list
contains values that flatten to non-empty strings followed by a segment that
only contains values that flatten to the empty string.
It is routine to establish how values ``inhabiting'' a regular
expression correspond to the language of a regular expression, namely
@{thm L_flat_Prf}.

In general there is more than one value inhabited by a regular
\inferrule
{@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
@{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
@{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
@{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
{@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\medskip\smallskip\\
\mprset{sep=4mm}
@{thm[mode=Rule] Posix.intros(9)}\<open>Pn[]\<close>\quad
$\mprset{flushleft}
\inferrule
{@{thm (prem 1) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
@{thm (prem 2) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
@{thm (prem 3) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \\\\
@{thm (prem 4) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]}}
{@{thm (concl) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]}}$\<open>Pn+\<close>
\\[-4mm]
\end{tabular}
\end{center}
\caption{The inductive definition of POSIX values taken from our earlier paper \cite{AusafDyckhoffUrban2016}.
The ternary relation, written $(s, r) \rightarrow v$, formalises, for a given string $s$ and
regular expression $r$, which value $v$ uniquely satisfies the informal POSIX constraints for
regular expression matching.\smallskip}\label{POSIXrules}
\end{figure}
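For example, for the bounded regular expression @{text "a"}$^{\{2\}}$ and
the string @{text "aa"}, the rule \<open>Pn+\<close> applies twice (once for each
copy of @{text a}) and the rule \<open>Pn[]\<close> closes off the empty remainder,
so the POSIX value in this case is @{text "Stars [Char a, Char a]"}.
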
408 |
476 |
409 The clever idea by Sulzmann and Lu \cite{Sulzmann2014} in their first algorithm is to define |
477 The clever idea by Sulzmann and Lu \cite{Sulzmann2014} in their first algorithm is to define |
410 an injection function on values that mirrors (but inverts) the |
478 an injection function on values that mirrors (but inverts) the |
411 construction of the derivative on regular expressions. Essentially it |
479 construction of the derivative on regular expressions. Essentially it |
417 \begin{tabular}{@ {}lcl@ {}} |
485 \begin{tabular}{@ {}lcl@ {}} |
418 @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\ |
486 @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\ |
419 @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
487 @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
420 @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
488 @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
421 @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\ |
489 @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\ |
422 \end{tabular}\smallskip\\ |
490 @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\ |
|
491 \end{tabular}\medskip\\ |
423 |
492 |
424 \begin{tabular}{@ {}cc@ {}} |
493 \begin{tabular}{@ {}cc@ {}} |
425 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}} |
494 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}} |
426 @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\ |
495 @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\ |
427 @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & |
496 @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & |
428 @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\ |
497 @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\ |
429 @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & |
498 @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & |
430 @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ |
499 @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ |
431 @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ |
500 @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ |
432 & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]} |
501 & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\ |
|
502 @{thm (lhs) injval.simps(8)[of "r" "n" "c" "v" "vs"]} & $\dn$ |
|
503 & @{thm (rhs) injval.simps(8)[of "r" "n" "c" "v" "vs"]} |
433 \end{tabular} |
504 \end{tabular} |
434 & |
505 & |
435 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}} |
506 \begin{tabular}{@ {\hspace{-3mm}}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}} |
436 @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ |
507 @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ |
437 & @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\ |
508 & @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\ |
438 @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ |
509 @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ |
439 & @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\ |
510 & @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\ |
440 @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$\\ |
511 @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$\\ |
441 \multicolumn{3}{@ {}r@ {}}{@{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}} |
512 \multicolumn{3}{@ {}r@ {}}{@{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}}\\ \mbox{}\\ \mbox{} |
442 \end{tabular} |
513 \end{tabular} |
443 \end{tabular} |
514 \end{tabular} |
444 \end{tabular}\smallskip |
515 \end{tabular}\smallskip |
445 \end{center} |
516 \end{center} |
446 |
517 |
447 \noindent |
518 \noindent |
448 The function @{text mkeps} is run when the last derivative is nullable, that is |
519 The function @{text mkeps} is run when the last derivative is nullable, that is |
449 the string to be matched is in the language of the regular expression. It generates |
520 the string to be matched is in the language of the regular expression. It generates |
450 a value for how the last derivative can match the empty string. The injection function |
521 a value for how the last derivative can match the empty string. In case |
|
522 of @{term "NTIMES r n"} we use the function @{term replicate} in order to generate |
|
523 a list of exactly @{term n} copies, which is the length of the list we expect in this |
|
524 case. The injection function |
451 then calculates the corresponding value for each intermediate derivative until |
525 then calculates the corresponding value for each intermediate derivative until |
452 a value for the original regular expression is generated. |
526 a value for the original regular expression is generated. |
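
In Scala, @{text mkeps} might be sketched as follows (illustrative
only; it reuses the @{text Rexp}, value and @{text nullable} sketches
from earlier, with @{text "List.fill"} playing the role of @{term replicate}):

\begin{verbatim}
// mkeps is only applied to nullable regular expressions
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
  // exactly n copies; for n = 0 this is the empty list
  case NTIMES(r1, n) => Stars(List.fill(n)(mkeps(r1)))
}
\end{verbatim}
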
Graphically the algorithm by
Sulzmann and Lu can be illustrated by the following picture %in Figure~\ref{Sulz}
where the path from the left to the right involving @{term derivatives}/@{const
(\Right\,v, bs_1)$\\
$\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
$\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
& & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$
\hspace{2mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
$\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
$\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
& & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$
\hspace{2mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\\
$\textit{decode}'\,(\S\!::\!bs)\,(r^{\{n\}})$ & $\dn$ & $(\Stars\,[], bs)$\\
$\textit{decode}'\,(\Z\!::\!bs)\,(r^{\{n\}})$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
& & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^{\{n - 1\}}$
\hspace{2mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\medskip\\
\multicolumn{3}{@ {}l}{$\textit{decode}\,bs\,r$ $\dn$
$\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$
$\;\,\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
\textit{else}\;\textit{None}$}\\[-4mm]
\end{tabular}
\end{center}
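
For illustration, the decoding function can also be sketched in
Scala. The sketch below uses the @{text Rexp} and value datatypes
from the earlier sketches and writes @{text decodeAux} for the primed
decoding function above; the clauses for @{text ONE}, @{text CH} and
the alternatives are not shown in the table above and are filled in
here in the standard way, which should be read as an assumption of
this sketch:

\begin{verbatim}
sealed trait Bit
case object Z extends Bit
case object S extends Bit

// rebuild a value from a bitsequence, returning the value together
// with the bits that have not been consumed yet
def decodeAux(bs: List[Bit], r: Rexp): (Val, List[Bit]) = (bs, r) match {
  case (_, ONE) => (Empty, bs)
  case (_, CH(d)) => (Chr(d), bs)
  case (Z :: rest, ALT(r1, _)) =>
    val (v, bs1) = decodeAux(rest, r1); (Left(v), bs1)
  case (S :: rest, ALT(_, r2)) =>
    val (v, bs1) = decodeAux(rest, r2); (Right(v), bs1)
  case (_, SEQ(r1, r2)) =>
    val (v1, bs1) = decodeAux(bs, r1)
    val (v2, bs2) = decodeAux(bs1, r2)
    (Sequ(v1, v2), bs2)
  case (S :: rest, STAR(_)) => (Stars(Nil), rest)
  case (Z :: rest, STAR(r1)) =>
    val (v, bs1) = decodeAux(rest, r1)
    val (Stars(vs), bs2) = decodeAux(bs1, STAR(r1))  // always a Stars value
    (Stars(v :: vs), bs2)
  case (S :: rest, NTIMES(_, _)) => (Stars(Nil), rest)
  case (Z :: rest, NTIMES(r1, n)) =>
    val (v, bs1) = decodeAux(rest, r1)
    val (Stars(vs), bs2) = decodeAux(bs1, NTIMES(r1, n - 1))
    (Stars(v :: vs), bs2)
}

def decode(bs: List[Bit], r: Rexp): Option[Val] = {
  val (v, rest) = decodeAux(bs, r)
  if (rest.isEmpty) Some(v) else None
}
\end{verbatim}
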
$\textit{fuse}\,bs\,(\textit{ALTs}\,bs'\,rs)$ & $\dn$ &
$\textit{ALTs}\,(bs\,@\,bs')\,rs$\\
$\textit{fuse}\,bs\,(\textit{SEQ}\,bs'\,r_1\,r_2)$ & $\dn$ &
$\textit{SEQ}\,(bs\,@\,bs')\,r_1\,r_2$\\
$\textit{fuse}\,bs\,(\textit{STAR}\,bs'\,r)$ & $\dn$ &
$\textit{STAR}\,(bs\,@\,bs')\,r$\\
$\textit{fuse}\,bs\,(\textit{NT}\,bs'\,r\,n)$ & $\dn$ &
$\textit{NT}\,(bs\,@\,bs')\,r\,n$
\end{tabular}
\end{tabular}
\end{center}
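
As an illustration, the bitcoded regular expressions and @{text fuse}
can be rendered in Scala as below. The constructor names are prefixed
with @{text A} only to keep them apart from the unannotated sketch
given earlier, the @{text Bit} type is the one from the decoding
sketch, and the clauses of @{text fuse} for @{text ZERO}, @{text ONE}
and @{text CHAR}, which are not part of the table above, are filled
in in the obvious way (an assumption of this sketch):

\begin{verbatim}
sealed trait ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALTS(bs: List[Bit], rs: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp
case class ANT(bs: List[Bit], r: ARexp, n: Int) extends ARexp  // bounded r{n}

// fuse attaches a bitsequence to the top-most annotation
def fuse(bs: List[Bit], r: ARexp): ARexp = r match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ++ bs1)
  case ACHAR(bs1, c) => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, rs) => AALTS(bs ++ bs1, rs)
  case ASEQ(bs1, r1, r2) => ASEQ(bs ++ bs1, r1, r2)
  case ASTAR(bs1, r1) => ASTAR(bs ++ bs1, r1)
  case ANT(bs1, r1, n) => ANT(bs ++ bs1, r1, n)
}
\end{verbatim}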

\noindent
the functions \textit{bnullable} and \textit{bmkeps}(\textit{s}), which are the
``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on
bitcoded regular expressions.
%
\begin{center}
\begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{0mm}}c@ {}}
\begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l}
$\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{False}$\\
$\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{True}$\\
$\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{False}$\\
$\textit{bnullable}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &
$\exists\, r \in \rs. \,\textit{bnullable}\,r$\\
$\textit{bnullable}\,(\textit{SEQ}\,bs\,r_1\,r_2)$ & $\dn$ &
$\textit{bnullable}\,r_1\wedge \textit{bnullable}\,r_2$\\
$\textit{bnullable}\,(\textit{STAR}\,bs\,r)$ & $\dn$ &
$\textit{True}$\\
$\textit{bnullable}\,(\textit{NT}\,bs\,r\,n)$ & $\dn$ &\\
\multicolumn{3}{l}{$\textit{if}\;n = 0\;\textit{then}\;\textit{True}\;
\textit{else}\;\textit{bnullable}\,r$}\\
\end{tabular}
&
\begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l@ {}}
$\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$\\
$\textit{bmkeps}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &
$bs\,@\,\textit{bmkepss}\,\rs$\\
$\textit{bmkeps}\,(\textit{SEQ}\,bs\,r_1\,r_2)$ & $\dn$ &\\
\multicolumn{3}{l}{$bs \,@\,\textit{bmkeps}\,r_1\,@\, \textit{bmkeps}\,r_2$}\\
$\textit{bmkeps}\,(\textit{STAR}\,bs\,r)$ & $\dn$ &
$bs \,@\, [\S]$\\
$\textit{bmkeps}\,(\textit{NT}\,bs\,r\,n)$ & $\dn$ &\\
\multicolumn{3}{l@ {}}{$\textit{if}\;n=0\;\textit{then}\;bs \,@\, [\S]$}\\
\multicolumn{3}{l@ {}}{$\textit{else}\,bs \,@\, [\Z] \,@\, \textit{bmkeps}\,r\,@\,
\textit{bmkeps}\,(@{term "ANTIMES [] r (n - 1)"})$}\\
$\textit{bmkepss}\,(r\!::\!\rs)$ & $\dn$ &\\
\multicolumn{3}{l}{$\textit{if}\;\textit{bnullable}\,r\,\textit{then}\;\textit{bmkeps}\,r\;
\textit{else}\;\textit{bmkepss}\,\rs$}
\end{tabular}
\end{tabular}
\end{center}
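
Transcribed into Scala on the bitcoded datatype from the sketch
above, these functions read as follows (again only an illustration;
the @{text Nil} clause of @{text bmkepss} is an assumption added to
make the sketch total):

\begin{verbatim}
def bnullable(r: ARexp): Boolean = r match {
  case AZERO => false
  case AONE(_) => true
  case ACHAR(_, _) => false
  case AALTS(_, rs) => rs.exists(bnullable)
  case ASEQ(_, r1, r2) => bnullable(r1) && bnullable(r2)
  case ASTAR(_, _) => true
  case ANT(_, r1, n) => if (n == 0) true else bnullable(r1)
}

// bmkeps is only applied to bnullable regular expressions
def bmkeps(r: ARexp): List[Bit] = r match {
  case AONE(bs) => bs
  case AALTS(bs, rs) => bs ++ bmkepss(rs)
  case ASEQ(bs, r1, r2) => bs ++ bmkeps(r1) ++ bmkeps(r2)
  case ASTAR(bs, _) => bs ++ List(S)
  case ANT(bs, r1, n) =>
    if (n == 0) bs ++ List(S)
    else bs ++ List(Z) ++ bmkeps(r1) ++ bmkeps(ANT(Nil, r1, n - 1))
}

def bmkepss(rs: List[ARexp]): List[Bit] = rs match {
  case Nil => Nil  // not reached when the enclosing ALTs is bnullable
  case r :: rest => if (bnullable(r)) bmkeps(r) else bmkepss(rest)
}
\end{verbatim}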

$\textit{if}\;\textit{bnullable}\,r_1$\\
& &$\textit{then}\;\textit{ALT}\,bs\;(\textit{SEQ}\,[]\,(r_1\backslash c)\,r_2)$
$\;(\textit{fuse}\,(\textit{bmkeps}\,r_1)\,(r_2\backslash c))$\\
& &$\textit{else}\;\textit{SEQ}\,bs\,(r_1\backslash c)\,r_2$\\
$(\textit{STAR}\,bs\,r)\backslash c$ & $\dn$ &
$\textit{SEQ}\;(bs\,@\,[\Z])\,(r\backslash c)\,
(\textit{STAR}\,[]\,r)$\\
$(\textit{NT}\,bs\,r\,n)\backslash c$ & $\dn$ &
$\textit{if}\;n = 0\; \textit{then}\;\textit{ZERO}\;\textit{else}\;
\textit{SEQ}\;(bs\,@\,[\Z])\,(r\backslash c)\,
(\textit{NT}\,[]\,r\,(n - 1))$
\end{tabular}
\end{center}
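
A Scala rendering of the bitcoded derivative is sketched below
(illustrative only; the clauses for @{text ZERO}, @{text ONE},
@{text CHAR} and @{text ALTs} are not part of the excerpt above and
are the standard ones, so they should be read as assumptions of this
sketch; the binary alternative is represented with the list-based
@{text ALTs} constructor):

\begin{verbatim}
def bder(c: Char, r: ARexp): ARexp = r match {
  case AZERO => AZERO
  case AONE(_) => AZERO
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, rs) => AALTS(bs, rs.map(bder(c, _)))
  case ASEQ(bs, r1, r2) =>
    if (bnullable(r1))
      AALTS(bs, List(ASEQ(Nil, bder(c, r1), r2),
                     fuse(bmkeps(r1), bder(c, r2))))
    else ASEQ(bs, bder(c, r1), r2)
  case ASTAR(bs, r1) =>
    // a Z bit records the unfolded copy
    ASEQ(bs ++ List(Z), bder(c, r1), ASTAR(Nil, r1))
  case ANT(bs, r1, n) =>
    // the counter is decremented; n = 0 cannot make any further progress
    if (n == 0) AZERO
    else ASEQ(bs ++ List(Z), bder(c, r1), ANT(Nil, r1, n - 1))
}
\end{verbatim}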


\noindent
obscure, examples.
%We found that from an implementation
%point-of-view it is really important to have the formal proofs of
%the corresponding properties at hand.

We can of course only make a claim about the correctness and the sizes of the
derivatives, not about the efficiency or runtime of our version of
Sulzmann and Lu's algorithm. But we found the size is an important
first indicator about efficiency: clearly if the derivatives can
grow to arbitrarily big sizes and the algorithm needs to traverse
the derivatives possibly several times, then the algorithm will be
slow---excruciatingly slow, that is. Other works seem to make
stronger claims, but during our work we have developed a healthy
suspicion when, for example, experimental data is used to back up
efficiency claims. For example, Sulzmann and Lu write about their
equivalent of @{term blexer_simp} \textit{``...we can incrementally
compute bitcoded parse trees in linear time in the size of the
input''} \cite[Page 14]{Sulzmann2014}. Given the growth of the
derivatives in some cases even after aggressive simplification,
this is a hard-to-believe claim. A similar claim about a
theoretical runtime of @{text "O(n\<^sup>2)"} is made for the
Verbatim lexer, which calculates tokens according to POSIX
rules~\cite{verbatim}. For this, Verbatim uses Brzozowski's
derivatives like in our work. The authors write: \textit{``The
results of our empirical tests [..] confirm that Verbatim has
@{text "O(n\<^sup>2)"} time complexity.''}
\cite[Section~VII]{verbatim}. While their correctness proof for
Verbatim is formalised in Coq, the claim about the runtime
complexity is only supported by some empirical evidence obtained by
using the code extraction facilities of Coq. Given our observation
with the ``growth problem'' of derivatives, we tried out their
extracted OCaml code with the example \mbox{@{text "(a +
aa)\<^sup>*"}} as a single lexing rule, and it took us around 5
minutes to tokenise a string of 40 $a$'s and that increased to
approximately 19 minutes when the string is 50 $a$'s long. Taking
into account that derivatives are not simplified in the Verbatim
lexer, such numbers are not surprising. Clearly our result of
having finite derivatives might sound rather weak in this context
but we think such efficiency claims really require further
scrutiny.

The contribution of this paper is to make sure derivatives do not
grow arbitrarily big (universally). In the example \mbox{@{text "(a
+ aa)\<^sup>*"}}, \emph{all} derivatives have a size of 17 or
less. The result is that lexing a string of, say, 50\,000 a's with
the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes
approximately 10 seconds with our Scala implementation of the
presented algorithm.

Finally, let us come back to the point about bounded regular
expressions. We have in this paper only shown that @{term "NTIMES r
n"} can be included, but all our results extend straightforwardly
also to the other bounded regular expressions. We find bounded
regular expressions fit naturally into the setting of Brzozowski
derivatives and the bitcoded regular expressions by Sulzmann and Lu.
In contrast, bounded regular expressions are often the Achilles'
heel in regular expression matchers that use the traditional
automata-based approach to lexing, primarily because they need to expand the
counters of bounded regular expressions into $n$ connected copies
of an automaton. This is not needed in Sulzmann and Lu's algorithm.
To see the difference, consider for example the regular expression @{term "SEQ (NTIMES a
1001) (STAR a)"}, which is not permitted in the Go language because
the counter is too big. In contrast we have no problem with
matching this regular expression with, say, 50\,000 a's, because the
counters can be kept compact. In fact, the overall size of the
derivatives is never greater than 5 in this example. Even in the
example from Section 2, where Rust raises an error message, namely
\mbox{@{text "a\<^bsup>{1000}{100}{5}\<^esup>"}}, the maximum size for
our derivatives is a moderate 14.

Let us also compare our work to the verified Verbatim++ lexer where
the authors of the Verbatim lexer introduced a number of
improvements and optimisations, for example memoisation
\cite{verbatimpp}. However, unlike Verbatim, which works with
derivatives like in our work, Verbatim++ first compiles a regular
expression into a DFA. While this makes lexing fast in many cases,
with examples of bounded regular expressions like
\mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}}
one needs to represent them as
sequences of $a \cdot a \cdot \ldots \cdot a$ (500 a's in sequence). We have run
their extracted code with such a regular expression as a
single lexing rule and a string of 50\,000 a's---lexing in this case
takes approximately 5~minutes. We are not aware of any better
translation using the traditional notion of DFAs. Therefore we
prefer to stick with calculating derivatives, but attempt to make
this calculation (in the future) as fast as possible. What we can guarantee
with the presented work is that the maximum size of the derivatives
for this example is not bigger than 9. This means our Scala
implementation only needs a few seconds for this example.
%
%
%Possible ideas are
%zippers which have been used in the context of parsing of
%context-free grammars \cite{zipperparser}.
\medskip

\noindent
Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
%\\[-10mm]


%%\bibliographystyle{plain}
\bibliography{root}

\newpage
\appendix

\section{Some Proofs}

While we have formalised \emph{all} results in Isabelle/HOL, below we shall give some
rough details of our reasoning in ``English'' if this is helpful for reviewing.

\begin{proof}[Proof of Lemma~\ref{codedecode}]
This follows from the property that
$\textit{decode}'\,((\textit{code}\,v) \,@\, bs)\,r = (v, bs)$ holds
for any bit-sequence $bs$ and $\vdash v : r$. This property can be