diff -r 7a579f5533f8 -r 5af61c89f51e thys3/Paper.thy --- a/thys3/Paper.thy Sat Jul 16 18:34:46 2022 +0100 +++ b/thys3/Paper.thy Sun Jul 17 13:07:05 2022 +0100 @@ -1,6 +1,7 @@ (*<*) theory Paper imports + "Posix.LexerSimp" "Posix.FBound" "HOL-Library.LaTeXsugar" begin @@ -29,17 +30,17 @@ ZERO ("\<^bold>0" 81) and ONE ("\<^bold>1" 81) and CH ("_" [1000] 80) and - ALT ("_ + _" [77,77] 78) and - SEQ ("_ \ _" [77,77] 78) and + ALT ("_ + _" [76,76] 77) and + SEQ ("_ \ _" [78,78] 78) and STAR ("_\<^sup>*" [79] 78) and - NTIMES ("_\<^sup>{\<^sup>_\<^sup>}" [79] 78) and + NTIMES ("_\<^sup>{\<^bsup>_\<^esup>\<^sup>}" [79] 78) and val.Void ("Empty" 78) and val.Char ("Char _" [1000] 78) and val.Left ("Left _" [79] 78) and val.Right ("Right _" [1000] 78) and val.Seq ("Seq _ _" [79,79] 78) and - val.Stars ("Stars _" [79] 78) and + val.Stars ("Stars _" [1000] 78) and Prf ("\ _ : _" [75,75] 75) and Posix ("'(_, _') \ _" [63,75,75] 75) and @@ -57,6 +58,7 @@ AALTs ("ALTs _ _" [77,77] 78) and ASEQ ("SEQ _ _ _" [79, 79,79] 78) and ASTAR ("STAR _ _" [79, 79] 78) and + ANTIMES ("NT _ _ _" [79, 79, 79] 78) and code ("code _" [79] 74) and intern ("_\<^latex>\\\mbox{$^\\uparrow$}\" [900] 80) and @@ -81,8 +83,10 @@ by (metis list.exhaust retrieve.simps(5)) - - +lemma better_retrieve2: + shows "retrieve (ANTIMES bs r (n + 1)) (Stars (v#vs)) = + bs @ [Z] @ retrieve r v @ retrieve (ANTIMES [] r n) (Stars vs)" +by auto (*>*) section {* Introduction *} @@ -91,27 +95,24 @@ In the last fifteen or so years, Brzozowski's derivatives of regular expressions have sparked quite a bit of interest in the functional -programming and theorem prover communities. The beauty of -Brzozowski's derivatives \cite{Brzozowski1964} is that they are neatly -expressible in any functional language, and easily definable and -reasoned about in theorem provers---the definitions just consist of -inductive datatypes and simple recursive functions. Another neat -feature of derivatives is that they can be easily extended to bounded -regular expressions, such as @{term "NTIMES r n"}, where numbers or -intervals specify how many times a regular expression should be used -during matching. - +programming and theorem prover communities. Derivatives of a regular expression, written @{term "der c r"}, give a simple solution to the problem of matching a string @{term s} with a regular expression @{term r}: if the derivative of @{term r} w.r.t.\ (in succession) all the characters of the string matches the empty string, -then @{term r} matches @{term s} (and {\em vice versa}). We are aware -of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by -Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part -of the work by Krauss and Nipkow~\cite{Krauss2011}. And another one -in Coq is given by Coquand and Siles \cite{Coquand2012}. -Also Ribeiro and Du Bois give one in Agda~\cite{RibeiroAgda2017}. +then @{term r} matches @{term s} (and {\em vice versa}). +The beauty of +Brzozowski's derivatives \cite{Brzozowski1964} is that they are neatly +expressible in any functional language, and easily definable and +reasoned about in theorem provers---the definitions just consist of +inductive datatypes and simple recursive functions. Another attractive +feature of derivatives is that they can be easily extended to \emph{bounded} +regular expressions, such as @{term "r"}$^{\{@{text n}\}}$ or @{term r}$^{\{..@{text n}\}}$, where numbers or +intervals of numbers specify how many times a regular expression should be used +during matching. + + However, there are two difficulties with derivative-based matchers: @@ -159,11 +160,11 @@ earlier work to preserve the correctness of Sulzmann and Lu's algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do \emph{not} help with limiting the growth of the derivatives shown -above: the growth is slowed, but the derivatives can still grow rather +above: the growth is slowed, but some derivatives can still grow rather quickly beyond any finite bound. -Sulzmann and Lu overcome this ``growth problem'' in a second algorithm +Sulzmann and Lu address this ``growth problem'' in a second algorithm \cite{Sulzmann2014} where they introduce bitcoded regular expressions. In this version, POSIX values are represented as bitsequences and such sequences are incrementally generated @@ -189,7 +190,7 @@ the derivative-based lexing algorithm of Sulzmann and Lu \cite{Sulzmann2014} where regular expressions are annotated with bitsequences. We define the crucial simplification function as a -recursive function, without the need of a fix-point operation. One objective of +recursive function, without the need of a fixpoint operation. One objective of the simplification function is to remove duplicates of regular expressions. For this Sulzmann and Lu use in their paper the standard @{text nub} function from Haskell's list library, but this function @@ -199,7 +200,7 @@ Inspired by Scala's library for lists, we shall instead use a @{text distinctWith} function that finds duplicates under an ``erasing'' function which deletes bitcodes before comparing regular expressions. -We shall also introduce our own argument and definitions for +We shall also introduce our \emph{own} argument and definitions for establishing the correctness of the bitcoded algorithm when simplifications are included. Finally we establish that the size of derivatives can be finitely bounded.\medskip%\footnote{ In this paper, we shall first briefly introduce the basic notions @@ -228,7 +229,7 @@ concatenation is $\_ \,@\, \_\,$. We often use the usual bracket notation for lists also for strings; for example a string consisting of just a single character $c$ is written $[c]$. - Our regular expressions are defined as usual as the following inductive + Our regular expressions are defined as the following inductive datatype:\\[-4mm] % \begin{center} @@ -242,20 +243,55 @@ @{term "NTIMES r n"} \end{center} - \noindent where @{const ZERO} stands for the regular expression that does - not match any string, @{const ONE} for the regular expression that matches - only the empty string and @{term c} for matching a character literal. - The constructors $+$ and $\cdot$ represent alternatives and sequences, respectively. - We sometimes omit the $\cdot$ in a sequence regular expression for brevity. - In our work here we also add to the usual ``basic'' regular expressions - the bounded regular expression @{term "NTIMES r n"} where the @{term n} - specifies that @{term r} should match exactly @{term n}-times. For - brevity we omit the other bounded regular expressions - @{text "r"}$^{\{..n\}}$, @{text "r"}$^{\{n..\}}$ and @{text "r"}$^{\{n..m\}}$ - which specify an interval for how many times @{text r} should match. Our - results extend straightforwardly also to them. The - \emph{language} of a regular expression, written $L(r)$, is defined as usual - and we omit giving the definition here (see for example \cite{AusafDyckhoffUrban2016}). + \noindent where @{const ZERO} stands for the regular expression that + does not match any string, @{const ONE} for the regular expression + that matches only the empty string and @{term c} for matching a + character literal. The constructors $+$ and $\cdot$ represent + alternatives and sequences, respectively. We sometimes omit the + $\cdot$ in a sequence regular expression for brevity. The + \emph{language} of a regular expression, written $L(r)$, is defined + as usual and we omit giving the definition here (see for example + \cite{AusafDyckhoffUrban2016}). + + In our work here we also add to the usual ``basic'' regular + expressions the \emph{bounded} regular expression @{term "NTIMES r + n"} where the @{term n} specifies that @{term r} should match + exactly @{term n}-times. Again for brevity we omit the other bounded + regular expressions @{text "r"}$^{\{..n\}}$, @{text "r"}$^{\{n..\}}$ + and @{text "r"}$^{\{n..m\}}$ which specify intervals for how many + times @{text r} should match. The results presented in this paper + extend straightforwardly to them too. The importance of the bounded + regular expressions is that they are often used in practical + applications, such as Snort (a system for detecting network + intrusion) and also in XML Schema definitions. According to Bj\"{o}rklund et + al~\cite{BjorklundMartensTimm2015}, bounded regular expressions + occur frequently in the latter and can have counters of up to + ten million. The problem is that tools based on the classic notion + of automata need to expand @{text "r"}$^{\{n\}}$ into @{text n} + connected copies of the automaton for @{text r}. This leads to very + inefficient matching algorithms or algorithms that consume large + amounts of memory. A classic example is the regular expression + \mbox{@{term "SEQ (SEQ (STAR (ALT a b)) a) (NTIMES (ALT a b) n)"}} + where the minimal DFA requires at least $2^{n + 1}$ states (see + \cite{CountingSet2020}). Therefore regular expression matching + libraries that rely on the classic notion of DFAs often impose + adhoc limits for bounded regular expressions: For example in the + regular expression matching library in the Go language the regular expression + @{term "NTIMES a 1001"} is not permitted, because no counter can be + above 1000, and in the built-in Rust regular expression library + expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error + message for being too big. These problems can of course be solved in matching + algorithms where automata go beyond the classic notion and for + instance include explicit counters (see~\cite{CountingSet2020}). + The point here is that Brzozowski derivatives and the algorithms by + Sulzmann and Lu can be straightforwardly extended to deal with + bounded regular expressions and moreover the resulting code + still consists of only simple recursive functions and inductive + datatypes. Finally, bounded regular expressions + do not destroy our finite boundedness property, which we shall + prove later on.%, because during the lexing process counters will only be + %decremented. + Central to Brzozowski's regular expression matcher are two functions called @{text nullable} and \emph{derivative}. The latter is written @@ -264,7 +300,7 @@ regular expressions. % \begin{center} -\begin{tabular}{cc} +\begin{tabular}{c @ {\hspace{-1mm}}c} \begin{tabular}{r@ {\hspace{2mm}}c@ {\hspace{2mm}}l} @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\ @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\ @@ -274,7 +310,8 @@ & & @{text "then"} @{term "ALT (SEQ (der c r\<^sub>1) r\<^sub>2) (der c r\<^sub>2)"}\\ & & @{text "else"} @{term "SEQ (der c r\<^sub>1) r\<^sub>2"}\\ % & & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\ - @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)} + @{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}\\ + @{thm (lhs) der.simps(7)} & $\dn$ & @{thm (rhs) der.simps(7)} \end{tabular} & \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} @@ -283,7 +320,10 @@ @{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\ @{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\ @{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\ - @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}%\medskip\\ + @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\\ + @{thm (lhs) nullable.simps(7)} & $\dn$ & @{text "if"} @{text "n = 0"}\\ + & & @{text "then"} @{const "True"}\\ + & & @{text "else"} @{term "nullable r"} \end{tabular} \end{tabular} \end{center} @@ -307,6 +347,12 @@ \noindent It is a fun exercise to formally prove this property in a theorem prover. +We are aware +of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by +Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part +of the work by Krauss and Nipkow~\cite{Krauss2011}. And another one +in Coq is given by Coquand and Siles \cite{Coquand2012}. +Also Ribeiro and Du Bois give one in Agda~\cite{RibeiroAgda2017}. The novel idea of Sulzmann and Lu is to extend this algorithm for lexing, where it is important to find out which part of the string @@ -334,26 +380,38 @@ Sulzmann and Lu also define inductively an - inhabitation relation that associates values to regular expressions. This - is defined by the following six rules for the values: + inhabitation relation that associates values to regular expressions. Our + version of this relation is defined the following six rules for the values: % \begin{center} - \begin{tabular}{@ {\hspace{-12mm}}c@ {}} \begin{tabular}{@ {}c@ {}} - \\[-8mm] - @{thm[mode=Axiom] Prf.intros(4)}\\ - @{thm[mode=Axiom] Prf.intros(5)[of "c"]} - \end{tabular} - \quad - @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\quad - @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>2" "r\<^sub>1"]}\quad - @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\quad - @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\mbox{}\hspace{-10mm}\mbox{} + @{thm[mode=Axiom] Prf.intros(4)}\qquad + @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\qquad + @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\qquad + @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>2" "r\<^sub>1"]}\qquad + @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\ + @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\qquad + $\mprset{flushleft}\inferrule{ + @{thm (prem 1) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}\\\\ + @{thm (prem 2) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}\quad + @{thm (prem 3) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]} + } + {@{thm (concl) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}} + $ \end{tabular} \end{center} \noindent Note that no values are associated with the regular expression - @{term ZERO}, since it cannot match any string. + @{term ZERO}, since it cannot match any string. Interesting is our version + of the rule for @{term "STAR r"} where we require that each value + in @{term vs} flattens to a non-empty string. This means if @{term "STAR r"} ``fires'' + one or more times, then each copy needs to match a non-empty string. + Similarly, in the rule + for @{term "NTIMES r n"} we require that the length of the list + @{term "vs\<^sub>1 @ vs\<^sub>2"} equals @{term n} (meaning the regular expression + @{text r} matches @{text n}-times) and that the first segment of this list + contains values that flatten to non-empty strings followed by a segment that + only contains values that flatten to the empty string. It is routine to establish how values ``inhabiting'' a regular expression correspond to the language of a regular expression, namely @{thm L_flat_Prf}. @@ -378,7 +436,7 @@ \begin{figure}[t] \begin{center}\small% - \begin{tabular}{@ {}c@ {}} + \begin{tabular}{@ {\hspace{-2mm}}c@ {}} \\[-9mm] @{thm[mode=Axiom] Posix.intros(1)}\P\@{term "ONE"} \quad @{thm[mode=Axiom] Posix.intros(2)}\P\@{term "c"}\quad @@ -397,13 +455,23 @@ @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\ @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}} - {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\P\\\\[-4mm] + {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\P\\\medskip\smallskip\\ + \mprset{sep=4mm} + @{thm[mode=Rule] Posix.intros(9)}\Pn[]\\quad + $\mprset{flushleft} + \inferrule + {@{thm (prem 1) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad + @{thm (prem 2) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad + @{thm (prem 3) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \\\\ + @{thm (prem 4) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]}} + {@{thm (concl) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]}}$\Pn+\ + \\[-4mm] \end{tabular} \end{center} \caption{The inductive definition of POSIX values taken from our earlier paper \cite{AusafDyckhoffUrban2016}. The ternary relation, written $(s, r) \rightarrow v$, formalises the notion of given a string $s$ and a regular expression $r$ what is the unique value $v$ that satisfies the informal POSIX constraints for - regular expression matching.}\label{POSIXrules} + regular expression matching.\smallskip}\label{POSIXrules} \end{figure} The clever idea by Sulzmann and Lu \cite{Sulzmann2014} in their first algorithm is to define @@ -419,7 +487,8 @@ @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\ @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\ @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\ - \end{tabular}\smallskip\\ + @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\ + \end{tabular}\medskip\\ \begin{tabular}{@ {}cc@ {}} \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}} @@ -429,16 +498,18 @@ @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ - & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]} + & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\ + @{thm (lhs) injval.simps(8)[of "r" "n" "c" "v" "vs"]} & $\dn$ + & @{thm (rhs) injval.simps(8)[of "r" "n" "c" "v" "vs"]} \end{tabular} & - \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}} + \begin{tabular}{@ {\hspace{-3mm}}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}} @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\ @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\ @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$\\ - \multicolumn{3}{@ {}r@ {}}{@{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}} + \multicolumn{3}{@ {}r@ {}}{@{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}}\\ \mbox{}\\ \mbox{} \end{tabular} \end{tabular} \end{tabular}\smallskip @@ -447,7 +518,10 @@ \noindent The function @{text mkeps} is run when the last derivative is nullable, that is the string to be matched is in the language of the regular expression. It generates - a value for how the last derivative can match the empty string. The injection function + a value for how the last derivative can match the empty string. In case + of @{term "NTIMES r n"} we use the function @{term replicate} in order to generate + a list of exactly @{term n} copies, which is the length of the list we expect in this + case. The injection function then calculates the corresponding value for each intermediate derivative until a value for the original regular expression is generated. Graphically the algorithm by @@ -586,6 +660,7 @@ $\;\mid\;$ @{term "AALTs bs rs"} $\;\mid\;$ @{term "ASEQ bs r\<^sub>1 r\<^sub>2"} $\;\mid\;$ @{term "ASTAR bs r"} + $\;\mid\;$ @{term "ANTIMES bs r n"} \end{center} \noindent where @{text bs} stands for bitsequences; @{text r}, @@ -654,12 +729,17 @@ $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\ & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$ \hspace{2mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\ - $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\ - $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & + $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\ + $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ & $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\ & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$ - \hspace{2mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\medskip\\ - \multicolumn{3}{l}{$\textit{decode}\,bs\,r$ $\dn$ + \hspace{2mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\\ + $\textit{decode}'\,(\S\!::\!bs)\,(r^{\{n\}})$ & $\dn$ & $(\Stars\,[], bs)$\\ + $\textit{decode}'\,(\Z\!::\!bs)\,(r^{\{n\}})$ & $\dn$ & + $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\ + & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^{\{n - 1\}}$ + \hspace{2mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\medskip\\ + \multicolumn{3}{@ {}l}{$\textit{decode}\,bs\,r$ $\dn$ $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$ $\;\,\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; \textit{else}\;\textit{None}$}\\[-4mm] @@ -704,7 +784,9 @@ $\textit{fuse}\,bs\,(\textit{SEQ}\,bs'\,r_1\,r_2)$ & $\dn$ & $\textit{SEQ}\,(bs\,@\,bs')\,r_1\,r_2$\\ $\textit{fuse}\,bs\,(\textit{STAR}\,bs'\,r)$ & $\dn$ & - $\textit{STAR}\,(bs\,@\,bs')\,r$ + $\textit{STAR}\,(bs\,@\,bs')\,r$\\ + $\textit{fuse}\,bs\,(\textit{NT}\,bs'\,r\,n)$ & $\dn$ & + $\textit{NT}\,(bs\,@\,bs')\,r\,n$ \end{tabular} \end{tabular} \end{center} @@ -722,7 +804,9 @@ & \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\ - $(r^*)^\uparrow$ & $\dn$ & $\textit{STAR}\;[]\,r^\uparrow$ + $(r^*)^\uparrow$ & $\dn$ & $\textit{STAR}\;[]\,r^\uparrow$\\ + $(r^{\{n\}})^\uparrow$ & $\dn$ & + $\textit{NT}\;[]\,r^\uparrow\,n$ \end{tabular} & \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} @@ -745,8 +829,8 @@ bitcoded regular expressions. % \begin{center} - \begin{tabular}{@ {}c@ {\hspace{2mm}}c@ {}} - \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} + \begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{0mm}}c@ {}} + \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l} $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{False}$\\ $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{True}$\\ $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{False}$\\ @@ -755,21 +839,27 @@ $\textit{bnullable}\,(\textit{SEQ}\,bs\,r_1\,r_2)$ & $\dn$ & $\textit{bnullable}\,r_1\wedge \textit{bnullable}\,r_2$\\ $\textit{bnullable}\,(\textit{STAR}\,bs\,r)$ & $\dn$ & - $\textit{True}$ + $\textit{True}$\\ + $\textit{bnullable}\,(\textit{NT}\,bs\,r\,n)$ & $\dn$ &\\ + \multicolumn{3}{l}{$\textit{if}\;n = 0\;\textit{then}\;\textit{True}\; + \textit{else}\;\textit{bnullable}\,r$}\\ \end{tabular} & - \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}} + \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l@ {}} $\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$\\ $\textit{bmkeps}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ & $bs\,@\,\textit{bmkepss}\,\rs$\\ $\textit{bmkeps}\,(\textit{SEQ}\,bs\,r_1\,r_2)$ & $\dn$ &\\ - \multicolumn{3}{r}{$bs \,@\,\textit{bmkeps}\,r_1\,@\, \textit{bmkeps}\,r_2$}\\ + \multicolumn{3}{l}{$bs \,@\,\textit{bmkeps}\,r_1\,@\, \textit{bmkeps}\,r_2$}\\ $\textit{bmkeps}\,(\textit{STAR}\,bs\,r)$ & $\dn$ & $bs \,@\, [\S]$\\ - $\textit{bmkepss}\,(r\!::\!\rs)$ & $\dn$ & - $\textit{if}\;\textit{bnullable}\,r$\\ - & &$\textit{then}\;\textit{bmkeps}\,r$\\ - & &$\textit{else}\;\textit{bmkepss}\,\rs$ + $\textit{bmkeps}\,(\textit{NT}\,bs\,r\,n)$ & $\dn$ &\\ + \multicolumn{3}{l@ {}}{$\textit{if}\;n=0\;\textit{then}\;bs \,@\, [\S]$}\\ + \multicolumn{3}{l@ {}}{$\textit{else}\,bs \,@\, [\Z] \,@\, \textit{bmkeps}\,r\,@\, + \textit{bmkeps}\,(@{term "ANTIMES [] r (n - 1)"})$}\\ + $\textit{bmkepss}\,(r\!::\!\rs)$ & $\dn$ &\\ + \multicolumn{3}{l}{$\textit{if}\;\textit{bnullable}\,r\,\textit{then}\;\textit{bmkeps}\,r\; + \textit{else}\;\textit{bmkepss}\,\rs$} \end{tabular} \end{tabular} \end{center} @@ -783,8 +873,8 @@ % \begin{center} \begin{tabular}{@ {}lcl@ {}} - \multicolumn{3}{@ {}l}{$(\textit{ZERO})\backslash c$ $\;\dn\;$ $\textit{ZERO}$ \quad\qquad - $(\textit{ONE}\;bs)\backslash c$ $\;\dn\;$ $\textit{ZERO}$}\\ + $(\textit{ZERO})\backslash c$ & $\;\dn\;$ & $\textit{ZERO}$\\ + $(\textit{ONE}\;bs)\backslash c$ & $\;\dn\;$ & $\textit{ZERO}$\\ $(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ & $\textit{if}\;c=d\; \;\textit{then}\; \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\ @@ -796,8 +886,12 @@ $\;(\textit{fuse}\,(\textit{bmkeps}\,r_1)\,(r_2\backslash c))$\\ & &$\textit{else}\;\textit{SEQ}\,bs\,(r_1\backslash c)\,r_2$\\ $(\textit{STAR}\,bs\,r)\backslash c$ & $\dn$ & - $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\backslash c))\, - (\textit{STAR}\,[]\,r)$ + $\textit{SEQ}\;(bs\,@\,[\Z])\,(r\backslash c)\, + (\textit{STAR}\,[]\,r)$\\ + $(\textit{NT}\,bs\,r\,n)\backslash c$ & $\dn$ & + $\textit{if}\;n = 0\; \textit{then}\;\textit{ZERO}\;\textit{else}\; + \textit{SEQ}\;(bs\,@\,[\Z])\,(r\backslash c)\, + (\textit{NT}\,[]\,r\,(n - 1))$ \end{tabular} \end{center} @@ -838,7 +932,9 @@ @{thm (lhs) retrieve.simps(6)[of _ "r\<^sub>1" "r\<^sub>2" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) retrieve.simps(6)[of _ "r\<^sub>1" "r\<^sub>2" "v\<^sub>1" "v\<^sub>2"]}\\ @{thm (lhs) retrieve.simps(7)} & $\dn$ & @{thm (rhs) retrieve.simps(7)}\\ - @{thm (lhs) retrieve.simps(8)} & $\dn$ & @{thm (rhs) retrieve.simps(8)} + @{thm (lhs) retrieve.simps(8)} & $\dn$ & @{thm (rhs) retrieve.simps(8)}\\ + @{thm (lhs) retrieve.simps(9)} & $\dn$ & @{thm (rhs) retrieve.simps(9)}\\ + @{thm (lhs) better_retrieve2} & $\dn$ & @{thm (rhs) better_retrieve2} \end{tabular} \end{center} @@ -957,7 +1053,7 @@ text {* - Derivatives as calculated by Brzozowski’s method are usually more + Derivatives as calculated by Brzozowski's method are usually more complex regular expressions than the initial one; the result is that derivative-based matching and lexing algorithms are often abysmally slow if the ``growth problem'' is not addressed. As Sulzmann and Lu wrote, various @@ -1006,7 +1102,7 @@ \[ @{term "ALTs bs\<^sub>1 ((ALTs bs\<^sub>2 rs\<^sub>2) # rs\<^sub>1)"} \quad\xrightarrow{bsimp}\quad - @{term "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) # rs\<^sub>1)"} + @{term "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"} \] \noindent @@ -1073,6 +1169,7 @@ @{thm (lhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]}\\ @{thm (lhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]}\\ + @{thm (lhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]}\\ \end{tabular} \end{center} @@ -1089,7 +1186,7 @@ \begin{center} \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} \multicolumn{3}{@ {}c}{@{thm (lhs) flts.simps(1)} $\dn$ @{thm (rhs) flts.simps(1)} \qquad\qquad\qquad\qquad - @{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)}}\\ + @{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)}}\smallskip\\ @{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\ \end{tabular} \end{center} @@ -1135,7 +1232,7 @@ \end{center} \noindent - We believe our recursive function @{term bsimp} simplifies regular + We believe our recursive function @{term bsimp} simplifies bitcoded regular expressions as intended by Sulzmann and Lu. There is no point in applying the @{text bsimp} function repeatedly (like the simplification in their paper which needs to be applied until a fixpoint is reached) because we can show that @{term bsimp} is idempotent, @@ -1149,7 +1246,9 @@ This can be proved by induction on @{text r} but requires a detailed analysis that the de-nesting of alternatives always results in a flat list of regular expressions. We omit the details since it does not concern the correctness proof. - + %It might be interesting to not that we do not simplify inside @{term STAR} and + %@{text NT}: the reason is that we want to keep the + Next we can include simplification after each derivative step leading to the following notion of bitcoded derivatives: @@ -1228,7 +1327,7 @@ \noindent - We can show that this rewrite system preserves @{term bnullable}, that + We can also show that this rewrite system preserves @{term bnullable}, that is simplification does not affect nullability: \begin{lemma}\label{lembnull} @@ -1309,7 +1408,7 @@ @{thm[mode=Axiom] bs6}$A0$\quad @{thm[mode=Axiom] bs7}$A1$\quad @{thm[mode=Rule] bs10[of "rs\<^sub>1" "rs\<^sub>2"]}$AL$\\ - @{thm[mode=Rule] ss2[of "rs\<^sub>1" "rs\<^sub>2"]}$LT$\quad + @{thm[mode=Rule] rrewrite_srewrite.ss2[of "rs\<^sub>1" "rs\<^sub>2"]}$LT$\quad @{thm[mode=Rule] ss3[of "r\<^sub>1" "r\<^sub>2"]}$LH$\quad @{thm[mode=Axiom] ss4}$L\ZERO$\quad @{thm[mode=Axiom] ss5[of "bs" "rs\<^sub>1" "rs\<^sub>2"]}$LS$\medskip\\ @@ -1342,7 +1441,8 @@ \end{center} \noindent -We prove this by induction on $r$. The base cases for @{term AZERO}, +Note that the bound $N$ is a bound for \emph{all} strings, no matter how long they are. +We establish this bound by induction on $r$. The base cases for @{term AZERO}, @{term "AONE bs"} and @{term "ACHAR bs c"} are straightforward. The interesting case is for sequences of the form @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}. In this case our induction hypotheses state $\exists N_1. \forall s. \; \llbracket{}@{term "bders_simp r\<^sub>1 s"}\rrbracket \leq N_1$ and @@ -1385,8 +1485,27 @@ we need to introduce simplifications, such as $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$, which exist for partial derivatives. However, if we introduce them in our -setting we would lose the POSIX property of our calculated values. We leave better -bounds for future work.\\[-6.5mm] +setting we would lose the POSIX property of our calculated values. For example +given the regular expressions @{term "SEQ (ALT a ab) (ALT b ONE)"} and the string $[a, b]$, then our +algorithm generates the following correct POSIX value +% +\[ +@{term "Seq (Right (Seq (Char a) (Char b))) (Right Empty)"} +\] + +\noindent +Essentially it matches the string with the longer @{text "Right"}-alternative in the +first sequence (and then the `rest' with the empty regular expression @{const ONE} from the second sequence). +If we add the simplification above, then we obtain the following value +% +\[ +@{term "Seq (Left (Char a)) (Left (Char b))"} +\] + +\noindent +where the @{text "Left"}-alternatives get priority. However, this violates +the POSIX rules and we have not been able to +reconcile this problem. Therefore we leave better bounds for future work.\\[-6.5mm] *} @@ -1423,52 +1542,112 @@ %point-of-view it is really important to have the formal proofs of %the corresponding properties at hand. - We have also developed a - healthy suspicion when experimental data is used to back up + We can of course only make a claim about the correctness and the sizes of the + derivatives, not about the efficiency or runtime of our version of + Sulzman and Lu's algorithm. But we found the size is an important + first indicator about efficiency: clearly if the derivatives can + grow to arbitrarily big sizes and the algorithm needs to traverse + the derivatives possibly several times, then the algorithm will be + slow---excruciatingly slow that is. Other works seems to make + stronger claims, but during our work we have developed a healthy + suspicion when for example experimental data is used to back up efficiency claims. For example Sulzmann and Lu write about their - equivalent of @{term blexer_simp} \textit{``...we can incrementally compute - bitcoded parse trees in linear time in the size of the input''} - \cite[Page 14]{Sulzmann2014}. - Given the growth of the - derivatives in some cases even after aggressive simplification, this - is a hard to believe claim. A similar claim about a theoretical runtime - of @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which calculates - tokens according to POSIX rules~\cite{verbatim}. For this Verbatim uses Brzozowski's - derivatives like in our work. - The authors write: \textit{``The results of our empirical tests [..] confirm that Verbatim has - @{text "O(n\<^sup>2)"} time complexity.''} \cite[Section~VII]{verbatim}. - While their correctness proof for Verbatim is formalised in Coq, the claim about - the runtime complexity is only supported by some emperical evidence obtained - by using the code extraction facilities of Coq. - Given our observation with the ``growth problem'' of derivatives, - we - tried out their extracted OCaml code with the example - \mbox{@{text "(a + aa)\<^sup>*"}} as a single lexing rule, and it took for us around 5 minutes to tokenise a - string of 40 $a$'s and that increased to approximately 19 minutes when the - string is 50 $a$'s long. Taking into account that derivatives are not simplified in the Verbatim - lexer, such numbers are not surprising. - Clearly our result of having finite - derivatives might sound rather weak in this context but we think such effeciency claims - really require further scrutiny. + equivalent of @{term blexer_simp} \textit{``...we can incrementally + compute bitcoded parse trees in linear time in the size of the + input''} \cite[Page 14]{Sulzmann2014}. Given the growth of the + derivatives in some cases even after aggressive simplification, + this is a hard to believe claim. A similar claim about a + theoretical runtime of @{text "O(n\<^sup>2)"} is made for the + Verbatim lexer, which calculates tokens according to POSIX + rules~\cite{verbatim}. For this, Verbatim uses Brzozowski's + derivatives like in our work. The authors write: \textit{``The + results of our empirical tests [..] confirm that Verbatim has + @{text "O(n\<^sup>2)"} time complexity.''} + \cite[Section~VII]{verbatim}. While their correctness proof for + Verbatim is formalised in Coq, the claim about the runtime + complexity is only supported by some emperical evidence obtained by + using the code extraction facilities of Coq. Given our observation + with the ``growth problem'' of derivatives, we tried out their + extracted OCaml code with the example \mbox{@{text "(a + + aa)\<^sup>*"}} as a single lexing rule, and it took for us around 5 + minutes to tokenise a string of 40 $a$'s and that increased to + approximately 19 minutes when the string is 50 $a$'s long. Taking + into account that derivatives are not simplified in the Verbatim + lexer, such numbers are not surprising. Clearly our result of + having finite derivatives might sound rather weak in this context + but we think such effeciency claims really require further + scrutiny. + + The contribution of this paper is to make sure derivatives do not + grow arbitrarily big (universially) In the example \mbox{@{text "(a + + aa)\<^sup>*"}}, \emph{all} derivatives have a size of 17 or + less. The result is that lexing a string of, say, 50\,000 a's with + the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes + approximately 10 seconds with our Scala implementation of the + presented algorithm. - The contribution of this paper is to make sure - derivatives do not grow arbitrarily big (universially) In the example \mbox{@{text "(a + aa)\<^sup>*"}}, - \emph{all} derivatives have a size of 17 or less. The result is that - lexing a string of, say, 50\,000 a's with the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes approximately - 10 seconds with our Scala implementation - of the presented algorithm. Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}. + Finally, let us come back to the point about bounded regular + expressions. We have in this paper only shown that @{term "NTIMES r + n"} can be included, but all our results extend straightforwardly + also to the other bounded regular expressions. We find bounded + regular expressions fit naturally into the setting of Brzozowski + derivatives and the bitcoded regular expressions by Sulzmann and Lu. + In contrast bounded regular expressions are often the Achilles' + heel in regular expression matchers that use the traditional + automata-based approach to lexing, primarily because they need to expand the + counters of bounded regular expressions into $n$-connected copies + of an automaton. This is not needed in Sulzmann and Lu's algorithm. + To see the difference consider for example the regular expression @{term "SEQ (NTIMES a + 1001) (STAR a)"}, which is not permitted in the Go language because + the counter is too big. In contrast we have no problem with + matching this regular expression with, say 50\,000 a's, because the + counters can be kept compact. In fact, the overall size of the + derivatives is never greater than 5 in this example. Even in the + example from Section 2, where Rust raises an error message, namely + \mbox{@{text "a\<^bsup>{1000}{100}{5}\<^esup>"}}, the maximum size for + our derivatives is a moderate 14. + + Let us also compare our work to the verified Verbatim++ lexer where + the authors of the Verbatim lexer introduced a number of + improvements and optimisations, for example memoisation + \cite{verbatimpp}. However, unlike Verbatim, which works with + derivatives like in our work, Verbatim++ compiles first a regular + expression into a DFA. While this makes lexing fast in many cases, + with examples of bounded regular expressions like + \mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}} + one needs to represent them as + sequences of $a \cdot a \cdot \ldots \cdot a$ (500 a's in sequence). We have run + their extracted code with such a regular expression as a + single lexing rule and a string of 50\,000 a's---lexing in this case + takes approximately 5~minutes. We are not aware of any better + translation using the traditional notion of DFAs. Therefore we + prefer to stick with calculating derivatives, but attempt to make + this calculation (in the future) as fast as possible. What we can guaranty + with the presented work is that the maximum size of the derivatives + for this example is not bigger than 9. This means our Scala + implementation only needs a few seconds for this example. + % + % + %Possible ideas are + %zippers which have been used in the context of parsing of + %context-free grammars \cite{zipperparser}. + \medskip + + \noindent + Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}. %\\[-10mm] %%\bibliographystyle{plain} \bibliography{root} +\newpage \appendix \section{Some Proofs} While we have formalised \emph{all} results in Isabelle/HOL, below we shall give some -rough details of our reasoning in ``English''. +rough details of our reasoning in ``English'' if this is helpful for reviewing. \begin{proof}[Proof of Lemma~\ref{codedecode}] This follows from the property that