--- a/thys3/Paper.thy Sat Jul 16 18:34:46 2022 +0100
+++ b/thys3/Paper.thy Sun Jul 17 13:07:05 2022 +0100
@@ -1,6 +1,7 @@
(*<*)
theory Paper
imports
+ "Posix.LexerSimp"
"Posix.FBound"
"HOL-Library.LaTeXsugar"
begin
@@ -29,17 +30,17 @@
ZERO ("\<^bold>0" 81) and
ONE ("\<^bold>1" 81) and
CH ("_" [1000] 80) and
- ALT ("_ + _" [77,77] 78) and
- SEQ ("_ \<cdot> _" [77,77] 78) and
+ ALT ("_ + _" [76,76] 77) and
+ SEQ ("_ \<cdot> _" [78,78] 78) and
STAR ("_\<^sup>*" [79] 78) and
- NTIMES ("_\<^sup>{\<^sup>_\<^sup>}" [79] 78) and
+ NTIMES ("_\<^sup>{\<^bsup>_\<^esup>\<^sup>}" [79] 78) and
val.Void ("Empty" 78) and
val.Char ("Char _" [1000] 78) and
val.Left ("Left _" [79] 78) and
val.Right ("Right _" [1000] 78) and
val.Seq ("Seq _ _" [79,79] 78) and
- val.Stars ("Stars _" [79] 78) and
+ val.Stars ("Stars _" [1000] 78) and
Prf ("\<turnstile> _ : _" [75,75] 75) and
Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and
@@ -57,6 +58,7 @@
AALTs ("ALTs _ _" [77,77] 78) and
ASEQ ("SEQ _ _ _" [79, 79,79] 78) and
ASTAR ("STAR _ _" [79, 79] 78) and
+ ANTIMES ("NT _ _ _" [79, 79, 79] 78) and
code ("code _" [79] 74) and
intern ("_\<^latex>\<open>\\mbox{$^\\uparrow$}\<close>" [900] 80) and
@@ -81,8 +83,10 @@
by (metis list.exhaust retrieve.simps(5))
-
-
+lemma better_retrieve2:
+ shows "retrieve (ANTIMES bs r (n + 1)) (Stars (v#vs)) =
+ bs @ [Z] @ retrieve r v @ retrieve (ANTIMES [] r n) (Stars vs)"
+by auto
(*>*)
section {* Introduction *}
@@ -91,27 +95,24 @@
In the last fifteen or so years, Brzozowski's derivatives of regular
expressions have sparked quite a bit of interest in the functional
-programming and theorem prover communities. The beauty of
-Brzozowski's derivatives \cite{Brzozowski1964} is that they are neatly
-expressible in any functional language, and easily definable and
-reasoned about in theorem provers---the definitions just consist of
-inductive datatypes and simple recursive functions. Another neat
-feature of derivatives is that they can be easily extended to bounded
-regular expressions, such as @{term "NTIMES r n"}, where numbers or
-intervals specify how many times a regular expression should be used
-during matching.
-
+programming and theorem prover communities.
Derivatives of a
regular expression, written @{term "der c r"}, give a simple solution
to the problem of matching a string @{term s} with a regular
expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
succession) all the characters of the string matches the empty string,
-then @{term r} matches @{term s} (and {\em vice versa}). We are aware
-of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
-Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
-of the work by Krauss and Nipkow~\cite{Krauss2011}. And another one
-in Coq is given by Coquand and Siles \cite{Coquand2012}.
-Also Ribeiro and Du Bois give one in Agda~\cite{RibeiroAgda2017}.
+then @{term r} matches @{term s} (and {\em vice versa}).
+The beauty of
+Brzozowski's derivatives \cite{Brzozowski1964} is that they are neatly
+expressible in any functional language, and easily definable and
+reasoned about in theorem provers---the definitions just consist of
+inductive datatypes and simple recursive functions. Another attractive
+feature of derivatives is that they can be easily extended to \emph{bounded}
+regular expressions, such as @{term "r"}$^{\{@{text n}\}}$ or @{term r}$^{\{..@{text n}\}}$, where numbers or
+intervals of numbers specify how many times a regular expression should be used
+during matching.
+
+
However, there are two difficulties with derivative-based matchers:
@@ -159,11 +160,11 @@
earlier work to preserve the correctness of Sulzmann and Lu's
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
\emph{not} help with limiting the growth of the derivatives shown
-above: the growth is slowed, but the derivatives can still grow rather
+above: the growth is slowed, but some derivatives can still grow rather
quickly beyond any finite bound.
-Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
+Sulzmann and Lu address this ``growth problem'' in a second algorithm
\cite{Sulzmann2014} where they introduce bitcoded
regular expressions. In this version, POSIX values are
represented as bitsequences and such sequences are incrementally generated
@@ -189,7 +190,7 @@
the derivative-based lexing algorithm of Sulzmann and Lu
\cite{Sulzmann2014} where regular expressions are annotated with
bitsequences. We define the crucial simplification function as a
-recursive function, without the need of a fix-point operation. One objective of
+recursive function, without the need of a fixpoint operation. One objective of
the simplification function is to remove duplicates of regular
expressions. For this Sulzmann and Lu use in their paper the standard
@{text nub} function from Haskell's list library, but this function
@@ -199,7 +200,7 @@
Inspired by Scala's library for lists, we shall instead use a @{text
distinctWith} function that finds duplicates under an ``erasing'' function
which deletes bitcodes before comparing regular expressions.
-We shall also introduce our own argument and definitions for
+We shall also introduce our \emph{own} argument and definitions for
establishing the correctness of the bitcoded algorithm when
simplifications are included. Finally we
establish that the size of derivatives can be finitely bounded.\medskip%\footnote{ In this paper, we shall first briefly introduce the basic notions
@@ -228,7 +229,7 @@
concatenation is $\_ \,@\, \_\,$. We often use the usual
bracket notation for lists also for strings; for example a string
consisting of just a single character $c$ is written $[c]$.
- Our regular expressions are defined as usual as the following inductive
+ Our regular expressions are defined as the following inductive
datatype:\\[-4mm]
%
\begin{center}
@@ -242,20 +243,55 @@
@{term "NTIMES r n"}
\end{center}
- \noindent where @{const ZERO} stands for the regular expression that does
- not match any string, @{const ONE} for the regular expression that matches
- only the empty string and @{term c} for matching a character literal.
- The constructors $+$ and $\cdot$ represent alternatives and sequences, respectively.
- We sometimes omit the $\cdot$ in a sequence regular expression for brevity.
- In our work here we also add to the usual ``basic'' regular expressions
- the bounded regular expression @{term "NTIMES r n"} where the @{term n}
- specifies that @{term r} should match exactly @{term n}-times. For
- brevity we omit the other bounded regular expressions
- @{text "r"}$^{\{..n\}}$, @{text "r"}$^{\{n..\}}$ and @{text "r"}$^{\{n..m\}}$
- which specify an interval for how many times @{text r} should match. Our
- results extend straightforwardly also to them. The
- \emph{language} of a regular expression, written $L(r)$, is defined as usual
- and we omit giving the definition here (see for example \cite{AusafDyckhoffUrban2016}).
+ \noindent where @{const ZERO} stands for the regular expression that
+ does not match any string, @{const ONE} for the regular expression
+ that matches only the empty string and @{term c} for matching a
+ character literal. The constructors $+$ and $\cdot$ represent
+ alternatives and sequences, respectively. We sometimes omit the
+ $\cdot$ in a sequence regular expression for brevity. The
+ \emph{language} of a regular expression, written $L(r)$, is defined
+ as usual and we omit giving the definition here (see for example
+ \cite{AusafDyckhoffUrban2016}).
+
+ In our work here we also add to the usual ``basic'' regular
+ expressions the \emph{bounded} regular expression @{term "NTIMES r
+ n"} where the @{term n} specifies that @{term r} should match
+ exactly @{term n}-times. Again for brevity we omit the other bounded
+ regular expressions @{text "r"}$^{\{..n\}}$, @{text "r"}$^{\{n..\}}$
+ and @{text "r"}$^{\{n..m\}}$ which specify intervals for how many
+ times @{text r} should match. The results presented in this paper
+ extend straightforwardly to them too. The importance of the bounded
+ regular expressions is that they are often used in practical
+ applications, such as Snort (a system for detecting network
+ intrusion) and also in XML Schema definitions. According to Bj\"{o}rklund et
+ al~\cite{BjorklundMartensTimm2015}, bounded regular expressions
+ occur frequently in the latter and can have counters of up to
+ ten million. The problem is that tools based on the classic notion
+ of automata need to expand @{text "r"}$^{\{n\}}$ into @{text n}
+ connected copies of the automaton for @{text r}. This leads to very
+ inefficient matching algorithms or algorithms that consume large
+ amounts of memory. A classic example is the regular expression
+ \mbox{@{term "SEQ (SEQ (STAR (ALT a b)) a) (NTIMES (ALT a b) n)"}}
+ where the minimal DFA requires at least $2^{n + 1}$ states (see
+ \cite{CountingSet2020}). Therefore regular expression matching
+ libraries that rely on the classic notion of DFAs often impose
+ adhoc limits for bounded regular expressions: For example in the
+ regular expression matching library in the Go language the regular expression
+ @{term "NTIMES a 1001"} is not permitted, because no counter can be
+ above 1000, and in the built-in Rust regular expression library
+ expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error
+ message for being too big. These problems can of course be solved in matching
+ algorithms where automata go beyond the classic notion and for
+ instance include explicit counters (see~\cite{CountingSet2020}).
+ The point here is that Brzozowski derivatives and the algorithms by
+ Sulzmann and Lu can be straightforwardly extended to deal with
+ bounded regular expressions and moreover the resulting code
+ still consists of only simple recursive functions and inductive
+ datatypes. Finally, bounded regular expressions
+ do not destroy our finite boundedness property, which we shall
+ prove later on.%, because during the lexing process counters will only be
+ %decremented.
+
Central to Brzozowski's regular expression matcher are two functions
called @{text nullable} and \emph{derivative}. The latter is written
@@ -264,7 +300,7 @@
regular expressions.
%
\begin{center}
-\begin{tabular}{cc}
+\begin{tabular}{c @ {\hspace{-1mm}}c}
\begin{tabular}{r@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
@{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
@{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
@@ -274,7 +310,8 @@
& & @{text "then"} @{term "ALT (SEQ (der c r\<^sub>1) r\<^sub>2) (der c r\<^sub>2)"}\\
& & @{text "else"} @{term "SEQ (der c r\<^sub>1) r\<^sub>2"}\\
% & & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
- @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}
+ @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}\\
+ @{thm (lhs) der.simps(7)} & $\dn$ & @{thm (rhs) der.simps(7)}
\end{tabular}
&
\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
@@ -283,7 +320,10 @@
@{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\
@{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
- @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}%\medskip\\
+ @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\\
+ @{thm (lhs) nullable.simps(7)} & $\dn$ & @{text "if"} @{text "n = 0"}\\
+ & & @{text "then"} @{const "True"}\\
+ & & @{text "else"} @{term "nullable r"}
\end{tabular}
\end{tabular}
\end{center}
@@ -307,6 +347,12 @@
\noindent
It is a fun exercise to formally prove this property in a theorem prover.
+We are aware
+of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
+Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
+of the work by Krauss and Nipkow~\cite{Krauss2011}. And another one
+in Coq is given by Coquand and Siles \cite{Coquand2012}.
+Also Ribeiro and Du Bois give one in Agda~\cite{RibeiroAgda2017}.
The novel idea of Sulzmann and Lu is to extend this algorithm for
lexing, where it is important to find out which part of the string
@@ -334,26 +380,38 @@
Sulzmann and Lu also define inductively an
- inhabitation relation that associates values to regular expressions. This
- is defined by the following six rules for the values:
+ inhabitation relation that associates values to regular expressions. Our
+ version of this relation is defined the following six rules for the values:
%
\begin{center}
- \begin{tabular}{@ {\hspace{-12mm}}c@ {}}
\begin{tabular}{@ {}c@ {}}
- \\[-8mm]
- @{thm[mode=Axiom] Prf.intros(4)}\\
- @{thm[mode=Axiom] Prf.intros(5)[of "c"]}
- \end{tabular}
- \quad
- @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\quad
- @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>2" "r\<^sub>1"]}\quad
- @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\quad
- @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\mbox{}\hspace{-10mm}\mbox{}
+ @{thm[mode=Axiom] Prf.intros(4)}\qquad
+ @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\qquad
+ @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\qquad
+ @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>2" "r\<^sub>1"]}\qquad
+ @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\
+ @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\qquad
+ $\mprset{flushleft}\inferrule{
+ @{thm (prem 1) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}\\\\
+ @{thm (prem 2) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}\quad
+ @{thm (prem 3) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}
+ }
+ {@{thm (concl) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}}
+ $
\end{tabular}
\end{center}
\noindent Note that no values are associated with the regular expression
- @{term ZERO}, since it cannot match any string.
+ @{term ZERO}, since it cannot match any string. Interesting is our version
+ of the rule for @{term "STAR r"} where we require that each value
+ in @{term vs} flattens to a non-empty string. This means if @{term "STAR r"} ``fires''
+ one or more times, then each copy needs to match a non-empty string.
+ Similarly, in the rule
+ for @{term "NTIMES r n"} we require that the length of the list
+ @{term "vs\<^sub>1 @ vs\<^sub>2"} equals @{term n} (meaning the regular expression
+ @{text r} matches @{text n}-times) and that the first segment of this list
+ contains values that flatten to non-empty strings followed by a segment that
+ only contains values that flatten to the empty string.
It is routine to establish how values ``inhabiting'' a regular
expression correspond to the language of a regular expression, namely
@{thm L_flat_Prf}.
@@ -378,7 +436,7 @@
\begin{figure}[t]
\begin{center}\small%
- \begin{tabular}{@ {}c@ {}}
+ \begin{tabular}{@ {\hspace{-2mm}}c@ {}}
\\[-9mm]
@{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad
@{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad
@@ -397,13 +455,23 @@
@{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
@{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
@{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
- {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\\[-4mm]
+ {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\medskip\smallskip\\
+ \mprset{sep=4mm}
+ @{thm[mode=Rule] Posix.intros(9)}\<open>Pn[]\<close>\quad
+ $\mprset{flushleft}
+ \inferrule
+ {@{thm (prem 1) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
+ @{thm (prem 2) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
+ @{thm (prem 3) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \\\\
+ @{thm (prem 4) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]}}
+ {@{thm (concl) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]}}$\<open>Pn+\<close>
+ \\[-4mm]
\end{tabular}
\end{center}
\caption{The inductive definition of POSIX values taken from our earlier paper \cite{AusafDyckhoffUrban2016}. The ternary relation, written $(s, r) \rightarrow v$, formalises the notion
of given a string $s$ and a regular
expression $r$ what is the unique value $v$ that satisfies the informal POSIX constraints for
- regular expression matching.}\label{POSIXrules}
+ regular expression matching.\smallskip}\label{POSIXrules}
\end{figure}
The clever idea by Sulzmann and Lu \cite{Sulzmann2014} in their first algorithm is to define
@@ -419,7 +487,8 @@
@{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
- \end{tabular}\smallskip\\
+ @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\
+ \end{tabular}\medskip\\
\begin{tabular}{@ {}cc@ {}}
\begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}}
@@ -429,16 +498,18 @@
@{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ &
@{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
@{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$
- & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}
+ & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\
+ @{thm (lhs) injval.simps(8)[of "r" "n" "c" "v" "vs"]} & $\dn$
+ & @{thm (rhs) injval.simps(8)[of "r" "n" "c" "v" "vs"]}
\end{tabular}
&
- \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}}
+ \begin{tabular}{@ {\hspace{-3mm}}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}}
@{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$
& @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
@{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$
& @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
@{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$\\
- \multicolumn{3}{@ {}r@ {}}{@{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}}
+ \multicolumn{3}{@ {}r@ {}}{@{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}}\\ \mbox{}\\ \mbox{}
\end{tabular}
\end{tabular}
\end{tabular}\smallskip
@@ -447,7 +518,10 @@
\noindent
The function @{text mkeps} is run when the last derivative is nullable, that is
the string to be matched is in the language of the regular expression. It generates
- a value for how the last derivative can match the empty string. The injection function
+ a value for how the last derivative can match the empty string. In case
+ of @{term "NTIMES r n"} we use the function @{term replicate} in order to generate
+ a list of exactly @{term n} copies, which is the length of the list we expect in this
+ case. The injection function
then calculates the corresponding value for each intermediate derivative until
a value for the original regular expression is generated.
Graphically the algorithm by
@@ -586,6 +660,7 @@
$\;\mid\;$ @{term "AALTs bs rs"}
$\;\mid\;$ @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}
$\;\mid\;$ @{term "ASTAR bs r"}
+ $\;\mid\;$ @{term "ANTIMES bs r n"}
\end{center}
\noindent where @{text bs} stands for bitsequences; @{text r},
@@ -654,12 +729,17 @@
$\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
& & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$
\hspace{2mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
- $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
- $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ &
+ $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
+ $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
& & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$
- \hspace{2mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\medskip\\
- \multicolumn{3}{l}{$\textit{decode}\,bs\,r$ $\dn$
+ \hspace{2mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\\
+ $\textit{decode}'\,(\S\!::\!bs)\,(r^{\{n\}})$ & $\dn$ & $(\Stars\,[], bs)$\\
+ $\textit{decode}'\,(\Z\!::\!bs)\,(r^{\{n\}})$ & $\dn$ &
+ $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
+ & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^{\{n - 1\}}$
+ \hspace{2mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\medskip\\
+ \multicolumn{3}{@ {}l}{$\textit{decode}\,bs\,r$ $\dn$
$\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$
$\;\,\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
\textit{else}\;\textit{None}$}\\[-4mm]
@@ -704,7 +784,9 @@
$\textit{fuse}\,bs\,(\textit{SEQ}\,bs'\,r_1\,r_2)$ & $\dn$ &
$\textit{SEQ}\,(bs\,@\,bs')\,r_1\,r_2$\\
$\textit{fuse}\,bs\,(\textit{STAR}\,bs'\,r)$ & $\dn$ &
- $\textit{STAR}\,(bs\,@\,bs')\,r$
+ $\textit{STAR}\,(bs\,@\,bs')\,r$\\
+ $\textit{fuse}\,bs\,(\textit{NT}\,bs'\,r\,n)$ & $\dn$ &
+ $\textit{NT}\,(bs\,@\,bs')\,r\,n$
\end{tabular}
\end{tabular}
\end{center}
@@ -722,7 +804,9 @@
&
\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
$(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
- $(r^*)^\uparrow$ & $\dn$ & $\textit{STAR}\;[]\,r^\uparrow$
+ $(r^*)^\uparrow$ & $\dn$ & $\textit{STAR}\;[]\,r^\uparrow$\\
+ $(r^{\{n\}})^\uparrow$ & $\dn$ &
+ $\textit{NT}\;[]\,r^\uparrow\,n$
\end{tabular}
&
\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
@@ -745,8 +829,8 @@
bitcoded regular expressions.
%
\begin{center}
- \begin{tabular}{@ {}c@ {\hspace{2mm}}c@ {}}
- \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+ \begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{0mm}}c@ {}}
+ \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l}
$\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{False}$\\
$\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{True}$\\
$\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{False}$\\
@@ -755,21 +839,27 @@
$\textit{bnullable}\,(\textit{SEQ}\,bs\,r_1\,r_2)$ & $\dn$ &
$\textit{bnullable}\,r_1\wedge \textit{bnullable}\,r_2$\\
$\textit{bnullable}\,(\textit{STAR}\,bs\,r)$ & $\dn$ &
- $\textit{True}$
+ $\textit{True}$\\
+ $\textit{bnullable}\,(\textit{NT}\,bs\,r\,n)$ & $\dn$ &\\
+ \multicolumn{3}{l}{$\textit{if}\;n = 0\;\textit{then}\;\textit{True}\;
+ \textit{else}\;\textit{bnullable}\,r$}\\
\end{tabular}
&
- \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
+ \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l@ {}}
$\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$\\
$\textit{bmkeps}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &
$bs\,@\,\textit{bmkepss}\,\rs$\\
$\textit{bmkeps}\,(\textit{SEQ}\,bs\,r_1\,r_2)$ & $\dn$ &\\
- \multicolumn{3}{r}{$bs \,@\,\textit{bmkeps}\,r_1\,@\, \textit{bmkeps}\,r_2$}\\
+ \multicolumn{3}{l}{$bs \,@\,\textit{bmkeps}\,r_1\,@\, \textit{bmkeps}\,r_2$}\\
$\textit{bmkeps}\,(\textit{STAR}\,bs\,r)$ & $\dn$ &
$bs \,@\, [\S]$\\
- $\textit{bmkepss}\,(r\!::\!\rs)$ & $\dn$ &
- $\textit{if}\;\textit{bnullable}\,r$\\
- & &$\textit{then}\;\textit{bmkeps}\,r$\\
- & &$\textit{else}\;\textit{bmkepss}\,\rs$
+ $\textit{bmkeps}\,(\textit{NT}\,bs\,r\,n)$ & $\dn$ &\\
+ \multicolumn{3}{l@ {}}{$\textit{if}\;n=0\;\textit{then}\;bs \,@\, [\S]$}\\
+ \multicolumn{3}{l@ {}}{$\textit{else}\,bs \,@\, [\Z] \,@\, \textit{bmkeps}\,r\,@\,
+ \textit{bmkeps}\,(@{term "ANTIMES [] r (n - 1)"})$}\\
+ $\textit{bmkepss}\,(r\!::\!\rs)$ & $\dn$ &\\
+ \multicolumn{3}{l}{$\textit{if}\;\textit{bnullable}\,r\,\textit{then}\;\textit{bmkeps}\,r\;
+ \textit{else}\;\textit{bmkepss}\,\rs$}
\end{tabular}
\end{tabular}
\end{center}
@@ -783,8 +873,8 @@
%
\begin{center}
\begin{tabular}{@ {}lcl@ {}}
- \multicolumn{3}{@ {}l}{$(\textit{ZERO})\backslash c$ $\;\dn\;$ $\textit{ZERO}$ \quad\qquad
- $(\textit{ONE}\;bs)\backslash c$ $\;\dn\;$ $\textit{ZERO}$}\\
+ $(\textit{ZERO})\backslash c$ & $\;\dn\;$ & $\textit{ZERO}$\\
+ $(\textit{ONE}\;bs)\backslash c$ & $\;\dn\;$ & $\textit{ZERO}$\\
$(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ &
$\textit{if}\;c=d\; \;\textit{then}\;
\textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
@@ -796,8 +886,12 @@
$\;(\textit{fuse}\,(\textit{bmkeps}\,r_1)\,(r_2\backslash c))$\\
& &$\textit{else}\;\textit{SEQ}\,bs\,(r_1\backslash c)\,r_2$\\
$(\textit{STAR}\,bs\,r)\backslash c$ & $\dn$ &
- $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\backslash c))\,
- (\textit{STAR}\,[]\,r)$
+ $\textit{SEQ}\;(bs\,@\,[\Z])\,(r\backslash c)\,
+ (\textit{STAR}\,[]\,r)$\\
+ $(\textit{NT}\,bs\,r\,n)\backslash c$ & $\dn$ &
+ $\textit{if}\;n = 0\; \textit{then}\;\textit{ZERO}\;\textit{else}\;
+ \textit{SEQ}\;(bs\,@\,[\Z])\,(r\backslash c)\,
+ (\textit{NT}\,[]\,r\,(n - 1))$
\end{tabular}
\end{center}
@@ -838,7 +932,9 @@
@{thm (lhs) retrieve.simps(6)[of _ "r\<^sub>1" "r\<^sub>2" "v\<^sub>1" "v\<^sub>2"]}
& $\dn$ & @{thm (rhs) retrieve.simps(6)[of _ "r\<^sub>1" "r\<^sub>2" "v\<^sub>1" "v\<^sub>2"]}\\
@{thm (lhs) retrieve.simps(7)} & $\dn$ & @{thm (rhs) retrieve.simps(7)}\\
- @{thm (lhs) retrieve.simps(8)} & $\dn$ & @{thm (rhs) retrieve.simps(8)}
+ @{thm (lhs) retrieve.simps(8)} & $\dn$ & @{thm (rhs) retrieve.simps(8)}\\
+ @{thm (lhs) retrieve.simps(9)} & $\dn$ & @{thm (rhs) retrieve.simps(9)}\\
+ @{thm (lhs) better_retrieve2} & $\dn$ & @{thm (rhs) better_retrieve2}
\end{tabular}
\end{center}
@@ -957,7 +1053,7 @@
text {*
- Derivatives as calculated by Brzozowski’s method are usually more
+ Derivatives as calculated by Brzozowski's method are usually more
complex regular expressions than the initial one; the result is
that derivative-based matching and lexing algorithms are
often abysmally slow if the ``growth problem'' is not addressed. As Sulzmann and Lu wrote, various
@@ -1006,7 +1102,7 @@
\[
@{term "ALTs bs\<^sub>1 ((ALTs bs\<^sub>2 rs\<^sub>2) # rs\<^sub>1)"}
\quad\xrightarrow{bsimp}\quad
- @{term "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) # rs\<^sub>1)"}
+ @{term "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"}
\]
\noindent
@@ -1073,6 +1169,7 @@
@{thm (lhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]} & $\dn$ &
@{thm (rhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]}\\
@{thm (lhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]}\\
+ @{thm (lhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]}\\
\end{tabular}
\end{center}
@@ -1089,7 +1186,7 @@
\begin{center}
\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
\multicolumn{3}{@ {}c}{@{thm (lhs) flts.simps(1)} $\dn$ @{thm (rhs) flts.simps(1)} \qquad\qquad\qquad\qquad
- @{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)}}\\
+ @{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)}}\smallskip\\
@{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\
\end{tabular}
\end{center}
@@ -1135,7 +1232,7 @@
\end{center}
\noindent
- We believe our recursive function @{term bsimp} simplifies regular
+ We believe our recursive function @{term bsimp} simplifies bitcoded regular
expressions as intended by Sulzmann and Lu. There is no point in applying the
@{text bsimp} function repeatedly (like the simplification in their paper which needs to be
applied until a fixpoint is reached) because we can show that @{term bsimp} is idempotent,
@@ -1149,7 +1246,9 @@
This can be proved by induction on @{text r} but requires a detailed analysis
that the de-nesting of alternatives always results in a flat list of regular
expressions. We omit the details since it does not concern the correctness proof.
-
+ %It might be interesting to not that we do not simplify inside @{term STAR} and
+ %@{text NT}: the reason is that we want to keep the
+
Next we can include simplification after each derivative step leading to the
following notion of bitcoded derivatives:
@@ -1228,7 +1327,7 @@
\noindent
- We can show that this rewrite system preserves @{term bnullable}, that
+ We can also show that this rewrite system preserves @{term bnullable}, that
is simplification does not affect nullability:
\begin{lemma}\label{lembnull}
@@ -1309,7 +1408,7 @@
@{thm[mode=Axiom] bs6}$A0$\quad
@{thm[mode=Axiom] bs7}$A1$\quad
@{thm[mode=Rule] bs10[of "rs\<^sub>1" "rs\<^sub>2"]}$AL$\\
- @{thm[mode=Rule] ss2[of "rs\<^sub>1" "rs\<^sub>2"]}$LT$\quad
+ @{thm[mode=Rule] rrewrite_srewrite.ss2[of "rs\<^sub>1" "rs\<^sub>2"]}$LT$\quad
@{thm[mode=Rule] ss3[of "r\<^sub>1" "r\<^sub>2"]}$LH$\quad
@{thm[mode=Axiom] ss4}$L\ZERO$\quad
@{thm[mode=Axiom] ss5[of "bs" "rs\<^sub>1" "rs\<^sub>2"]}$LS$\medskip\\
@@ -1342,7 +1441,8 @@
\end{center}
\noindent
-We prove this by induction on $r$. The base cases for @{term AZERO},
+Note that the bound $N$ is a bound for \emph{all} strings, no matter how long they are.
+We establish this bound by induction on $r$. The base cases for @{term AZERO},
@{term "AONE bs"} and @{term "ACHAR bs c"} are straightforward. The interesting case is
for sequences of the form @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}. In this case our induction
hypotheses state $\exists N_1. \forall s. \; \llbracket{}@{term "bders_simp r\<^sub>1 s"}\rrbracket \leq N_1$ and
@@ -1385,8 +1485,27 @@
we need to introduce simplifications, such as
$(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
which exist for partial derivatives. However, if we introduce them in our
-setting we would lose the POSIX property of our calculated values. We leave better
-bounds for future work.\\[-6.5mm]
+setting we would lose the POSIX property of our calculated values. For example
+given the regular expressions @{term "SEQ (ALT a ab) (ALT b ONE)"} and the string $[a, b]$, then our
+algorithm generates the following correct POSIX value
+%
+\[
+@{term "Seq (Right (Seq (Char a) (Char b))) (Right Empty)"}
+\]
+
+\noindent
+Essentially it matches the string with the longer @{text "Right"}-alternative in the
+first sequence (and then the `rest' with the empty regular expression @{const ONE} from the second sequence).
+If we add the simplification above, then we obtain the following value
+%
+\[
+@{term "Seq (Left (Char a)) (Left (Char b))"}
+\]
+
+\noindent
+where the @{text "Left"}-alternatives get priority. However, this violates
+the POSIX rules and we have not been able to
+reconcile this problem. Therefore we leave better bounds for future work.\\[-6.5mm]
*}
@@ -1423,52 +1542,112 @@
%point-of-view it is really important to have the formal proofs of
%the corresponding properties at hand.
- We have also developed a
- healthy suspicion when experimental data is used to back up
+ We can of course only make a claim about the correctness and the sizes of the
+ derivatives, not about the efficiency or runtime of our version of
+ Sulzman and Lu's algorithm. But we found the size is an important
+ first indicator about efficiency: clearly if the derivatives can
+ grow to arbitrarily big sizes and the algorithm needs to traverse
+ the derivatives possibly several times, then the algorithm will be
+ slow---excruciatingly slow that is. Other works seems to make
+ stronger claims, but during our work we have developed a healthy
+ suspicion when for example experimental data is used to back up
efficiency claims. For example Sulzmann and Lu write about their
- equivalent of @{term blexer_simp} \textit{``...we can incrementally compute
- bitcoded parse trees in linear time in the size of the input''}
- \cite[Page 14]{Sulzmann2014}.
- Given the growth of the
- derivatives in some cases even after aggressive simplification, this
- is a hard to believe claim. A similar claim about a theoretical runtime
- of @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which calculates
- tokens according to POSIX rules~\cite{verbatim}. For this Verbatim uses Brzozowski's
- derivatives like in our work.
- The authors write: \textit{``The results of our empirical tests [..] confirm that Verbatim has
- @{text "O(n\<^sup>2)"} time complexity.''} \cite[Section~VII]{verbatim}.
- While their correctness proof for Verbatim is formalised in Coq, the claim about
- the runtime complexity is only supported by some emperical evidence obtained
- by using the code extraction facilities of Coq.
- Given our observation with the ``growth problem'' of derivatives,
- we
- tried out their extracted OCaml code with the example
- \mbox{@{text "(a + aa)\<^sup>*"}} as a single lexing rule, and it took for us around 5 minutes to tokenise a
- string of 40 $a$'s and that increased to approximately 19 minutes when the
- string is 50 $a$'s long. Taking into account that derivatives are not simplified in the Verbatim
- lexer, such numbers are not surprising.
- Clearly our result of having finite
- derivatives might sound rather weak in this context but we think such effeciency claims
- really require further scrutiny.
+ equivalent of @{term blexer_simp} \textit{``...we can incrementally
+ compute bitcoded parse trees in linear time in the size of the
+ input''} \cite[Page 14]{Sulzmann2014}. Given the growth of the
+ derivatives in some cases even after aggressive simplification,
+ this is a hard to believe claim. A similar claim about a
+ theoretical runtime of @{text "O(n\<^sup>2)"} is made for the
+ Verbatim lexer, which calculates tokens according to POSIX
+ rules~\cite{verbatim}. For this, Verbatim uses Brzozowski's
+ derivatives like in our work. The authors write: \textit{``The
+ results of our empirical tests [..] confirm that Verbatim has
+ @{text "O(n\<^sup>2)"} time complexity.''}
+ \cite[Section~VII]{verbatim}. While their correctness proof for
+ Verbatim is formalised in Coq, the claim about the runtime
+ complexity is only supported by some emperical evidence obtained by
+ using the code extraction facilities of Coq. Given our observation
+ with the ``growth problem'' of derivatives, we tried out their
+ extracted OCaml code with the example \mbox{@{text "(a +
+ aa)\<^sup>*"}} as a single lexing rule, and it took for us around 5
+ minutes to tokenise a string of 40 $a$'s and that increased to
+ approximately 19 minutes when the string is 50 $a$'s long. Taking
+ into account that derivatives are not simplified in the Verbatim
+ lexer, such numbers are not surprising. Clearly our result of
+ having finite derivatives might sound rather weak in this context
+ but we think such effeciency claims really require further
+ scrutiny.
+
+ The contribution of this paper is to make sure derivatives do not
+ grow arbitrarily big (universially) In the example \mbox{@{text "(a
+ + aa)\<^sup>*"}}, \emph{all} derivatives have a size of 17 or
+ less. The result is that lexing a string of, say, 50\,000 a's with
+ the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes
+ approximately 10 seconds with our Scala implementation of the
+ presented algorithm.
- The contribution of this paper is to make sure
- derivatives do not grow arbitrarily big (universially) In the example \mbox{@{text "(a + aa)\<^sup>*"}},
- \emph{all} derivatives have a size of 17 or less. The result is that
- lexing a string of, say, 50\,000 a's with the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes approximately
- 10 seconds with our Scala implementation
- of the presented algorithm. Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
+ Finally, let us come back to the point about bounded regular
+ expressions. We have in this paper only shown that @{term "NTIMES r
+ n"} can be included, but all our results extend straightforwardly
+ also to the other bounded regular expressions. We find bounded
+ regular expressions fit naturally into the setting of Brzozowski
+ derivatives and the bitcoded regular expressions by Sulzmann and Lu.
+ In contrast bounded regular expressions are often the Achilles'
+ heel in regular expression matchers that use the traditional
+ automata-based approach to lexing, primarily because they need to expand the
+ counters of bounded regular expressions into $n$-connected copies
+ of an automaton. This is not needed in Sulzmann and Lu's algorithm.
+ To see the difference consider for example the regular expression @{term "SEQ (NTIMES a
+ 1001) (STAR a)"}, which is not permitted in the Go language because
+ the counter is too big. In contrast we have no problem with
+ matching this regular expression with, say 50\,000 a's, because the
+ counters can be kept compact. In fact, the overall size of the
+ derivatives is never greater than 5 in this example. Even in the
+ example from Section 2, where Rust raises an error message, namely
+ \mbox{@{text "a\<^bsup>{1000}{100}{5}\<^esup>"}}, the maximum size for
+ our derivatives is a moderate 14.
+
+ Let us also compare our work to the verified Verbatim++ lexer where
+ the authors of the Verbatim lexer introduced a number of
+ improvements and optimisations, for example memoisation
+ \cite{verbatimpp}. However, unlike Verbatim, which works with
+ derivatives like in our work, Verbatim++ compiles first a regular
+ expression into a DFA. While this makes lexing fast in many cases,
+ with examples of bounded regular expressions like
+ \mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}}
+ one needs to represent them as
+ sequences of $a \cdot a \cdot \ldots \cdot a$ (500 a's in sequence). We have run
+ their extracted code with such a regular expression as a
+ single lexing rule and a string of 50\,000 a's---lexing in this case
+ takes approximately 5~minutes. We are not aware of any better
+ translation using the traditional notion of DFAs. Therefore we
+ prefer to stick with calculating derivatives, but attempt to make
+ this calculation (in the future) as fast as possible. What we can guaranty
+ with the presented work is that the maximum size of the derivatives
+ for this example is not bigger than 9. This means our Scala
+ implementation only needs a few seconds for this example.
+ %
+ %
+ %Possible ideas are
+ %zippers which have been used in the context of parsing of
+ %context-free grammars \cite{zipperparser}.
+ \medskip
+
+ \noindent
+ Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
%\\[-10mm]
%%\bibliographystyle{plain}
\bibliography{root}
+\newpage
\appendix
\section{Some Proofs}
While we have formalised \emph{all} results in Isabelle/HOL, below we shall give some
-rough details of our reasoning in ``English''.
+rough details of our reasoning in ``English'' if this is helpful for reviewing.
\begin{proof}[Proof of Lemma~\ref{codedecode}]
This follows from the property that