lexing: comparison thys3/Paper.thy

-:56057198e4f5
+:70c10dc41606
 In the last fifteen or so years, Brzozowski's derivatives of regular
 expressions have sparked quite a bit of interest in the functional
 programming and theorem prover communities.
 Derivatives of a
-regular expression, written @{term "der c r"}, give a simple solution
+regular expressions, written @{term "der c r"}, give a simple solution
 to the problem of matching a string @{term s} with a regular
 expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
 succession) all the characters of the string matches the empty string,
 then @{term r} matches @{term s} (and {\em vice versa}).
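The matching procedure just described can be sketched in a few lines. The following Python fragment is only an illustration under our own assumptions: the encoding of regular expressions as nested tuples and the names \texttt{nullable}, \texttt{der} and \texttt{matches} are chosen for this sketch and are not the Isabelle/HOL definitions of the formalisation.

```python
# Regular expressions as nested tuples (an encoding assumed for this sketch):
#   ("ZERO",)  ("ONE",)  ("CHAR", c)  ("ALT", r1, r2)  ("SEQ", r1, r2)  ("STAR", r)
ZERO = ("ZERO",)
ONE = ("ONE",)


def nullable(r):
    """Return True iff r can match the empty string."""
    tag = r[0]
    if tag in ("ONE", "STAR"):
        return True
    if tag == "ALT":
        return nullable(r[1]) or nullable(r[2])
    if tag == "SEQ":
        return nullable(r[1]) and nullable(r[2])
    return False  # ZERO and CHAR


def der(c, r):
    """Brzozowski derivative of r with respect to the character c."""
    tag = r[0]
    if tag == "CHAR":
        return ONE if r[1] == c else ZERO
    if tag == "ALT":
        return ("ALT", der(c, r[1]), der(c, r[2]))
    if tag == "SEQ":
        left = ("SEQ", der(c, r[1]), r[2])
        # if r1 can match the empty string, c may also be consumed by r2
        return ("ALT", left, der(c, r[2])) if nullable(r[1]) else left
    if tag == "STAR":
        return ("SEQ", der(c, r[1]), r)
    return ZERO  # ZERO and ONE


def matches(r, s):
    """Take derivatives for all characters of s, then test nullability."""
    for c in s:
        r = der(c, r)
    return nullable(r)
```

For instance, with \texttt{r} encoding $(a + b)^* \cdot a$, the call \texttt{matches(r, "aba")} succeeds whereas \texttt{matches(r, "ab")} fails.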
 The beauty of
 \noindent In this paper, we shall first briefly introduce the basic notions
 of regular expressions and describe the definition
 of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves
 as a reference point for what correctness means in our Isabelle/HOL proofs. We shall then prove
 the correctness for the bitcoded algorithm without simplification, and
-after that extend the proof to include simplification.\mbox{}\\[-6mm]
+after that extend the proof to include simplification.
+Our Isabelle code including the results from Sec.~5 is available
+from \textcolor{darkblue}{\url{https://github.com/urbanchr/posix}}.
+\mbox{}\\[-6mm]
 *}
 section {* Background *}
 \cite{AusafDyckhoffUrban2016}).
 In our work here we also add to the usual ``basic'' regular
 expressions the \emph{bounded} regular expression @{term "NTIMES r
 n"} where the @{term n} specifies that @{term r} should match
-exactly @{term n}-times. Again for brevity we omit the other bounded
+exactly @{term n}-times (it is not included in Sulzmann and Lu's original work). For brevity we omit the other bounded
-regular expressions @{text "r"}$^{\{..n\}}$, @{text "r"}$^{\{n..\}}$
+regular expressions @{text "r"}$^{\{..\textit{n}\}}$,
-and @{text "r"}$^{\{n..m\}}$ which specify intervals for how many
+@{text "r"}$^{\{\textit{n}..\}}$
+and @{text "r"}$^{\{\textit{n}..\textit{m}\}}$ which specify intervals for how many
 times @{text r} should match. The results presented in this paper
 extend straightforwardly to them too. The importance of the bounded
 regular expressions is that they are often used in practical
 applications, such as Snort (a system for detecting network
 intrusions) and also in XML Schema definitions. According to Bj\"{o}rklund et
 al.~\cite{BjorklundMartensTimm2015}, bounded regular expressions
 occur frequently in the latter and can have counters of up to
 ten million.  The problem is that tools based on the classic notion
-of automata need to expand @{text "r"}$^{\{n\}}$ into @{text n}
+of automata need to expand @{text "r"}$^{\{\textit{n}\}}$ into @{text n}
 connected copies of the automaton for @{text r}. This leads to very
 inefficient matching algorithms or algorithms that consume large
 amounts of memory.  A classic example is the regular expression
 \mbox{@{term "SEQ (SEQ (STAR (ALT a b)) a) (NTIMES (ALT a b) n)"}}
 where the minimal DFA requires at least $2^{n + 1}$ states (see
 \cite{CountingSet2020}). Therefore regular expression matching
 libraries that rely on the classic notion of DFAs often impose
 ad hoc limits for bounded regular expressions: For example in the
 regular expression matching library in the Go language and also in Google's RE2 library the regular expression
 @{term "NTIMES a 1001"} is not permitted, because no counter can be
-above 1000; and in the built-in regular expression library in Rust
+above 1000; and in the regular expression library in Rust
 expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error
-message for being too big.  These problems can of course be solved in matching
+message for being too big. Up until recently,\footnote{Up until version 1.5.4 of the regex
+library in Rust; see also CVE-2022-24713.} Rust
+however happily generated automata for regular expressions such as
+@{text "a\<^bsup>{0}{4294967295}\<^esup>"}. This was due to a bug
+in the algorithm that decides when a regular expression is acceptable
+or too big according to Rust's classification (it did not account for the fact that @{text "a\<^bsup>{0}\<^esup>"} and similar examples can match the empty string). We shall come back to
+this example later in the paper.
+These problems can of course be solved in matching
 algorithms where automata go beyond the classic notion and for
-instance include explicit counters (see~\cite{CountingSet2020}).
+instance include explicit counters (e.g.~\cite{CountingSet2020}).
 The point here is that Brzozowski derivatives and the algorithms by
 Sulzmann and Lu can be straightforwardly extended to deal with
 bounded regular expressions and moreover the resulting code
 still consists of only simple recursive functions and inductive
 datatypes. Finally, bounded regular expressions
 do not destroy our finite boundedness property, which we shall
-prove later on.%, because during the lexing process counters will only be
+prove later on.
+%, because during the lexing process counters will only be
 %decremented.
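To illustrate how little is needed for this extension, the following Python sketch adds a bounded repetition to a derivative-based matcher. It is an illustration under our own assumptions: the tuple encoding and the function names are hypothetical, and the two \texttt{NTIMES} clauses are the usual ones from the literature (the counter is decremented, the automaton is never expanded), not a transcription of the Isabelle/HOL definitions.

```python
# Tuple encoding assumed for this sketch:
#   ("ZERO",) ("ONE",) ("CHAR", c) ("ALT", r1, r2) ("SEQ", r1, r2)
#   ("STAR", r) ("NTIMES", r, n)
ZERO = ("ZERO",)
ONE = ("ONE",)


def nullable(r):
    tag = r[0]
    if tag in ("ONE", "STAR"):
        return True
    if tag == "ALT":
        return nullable(r[1]) or nullable(r[2])
    if tag == "SEQ":
        return nullable(r[1]) and nullable(r[2])
    if tag == "NTIMES":
        # zero repetitions match the empty string; otherwise r itself must
        return r[2] == 0 or nullable(r[1])
    return False  # ZERO and CHAR


def der(c, r):
    tag = r[0]
    if tag == "CHAR":
        return ONE if r[1] == c else ZERO
    if tag == "ALT":
        return ("ALT", der(c, r[1]), der(c, r[2]))
    if tag == "SEQ":
        left = ("SEQ", der(c, r[1]), r[2])
        return ("ALT", left, der(c, r[2])) if nullable(r[1]) else left
    if tag == "STAR":
        return ("SEQ", der(c, r[1]), r)
    if tag == "NTIMES":
        if r[2] == 0:
            return ZERO
        # consume c with one copy of r and decrement the counter
        return ("SEQ", der(c, r[1]), ("NTIMES", r[1], r[2] - 1))
    return ZERO  # ZERO and ONE


def matches(r, s):
    for c in s:
        r = der(c, r)
    return nullable(r)
```

With \texttt{ab} standing for $a + b$, the regular expression $(a + b)^* \cdot a \cdot (a + b)^{\{2\}}$ from above accepts exactly the strings whose third-last character is $a$, for example \texttt{"aab"} but not \texttt{"bab"}. Note that this sketch performs no simplification, so the derivatives grow with the length of the input.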
 Central to Brzozowski's regular expression matcher are two functions
 called @{text nullable} and \emph{derivative}. The latter is written
 inhabitation relation that associates values to regular expressions. Our
 version of this relation is defined by the following six rules:
 %
 \begin{center}
 \begin{tabular}{@ {}l@ {}}
-@{thm[mode=Axiom] Prf.intros(4)}\quad
+@{thm[mode=Axiom] Prf.intros(4)}\qquad
-@{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\quad
+@{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\qquad
-@{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>2" "r\<^sub>1"]}\quad
+@{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>2" "r\<^sub>1"]}\qquad
 @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\
 @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\qquad
 @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\qquad
 $\mprset{flushleft}\inferrule{
 @{thm (prem 1) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}\\\\
-@{thm (prem 2) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}\\\\
+@{thm (prem 2) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}\quad
 @{thm (prem 3) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}
 }
 {@{thm (concl) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}}
 $
 \end{tabular}
 Fig.~\ref{POSIXrules}).
 \begin{figure}[t]
 \begin{center}\small%
 \begin{tabular}{@ {\hspace{-2mm}}c@ {}}
-\\[-4.5mm]
+\\[-8.5mm]
 @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad
 @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad
 @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad
 @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\
 $\mprset{flushleft}
 \inferrule
 {@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
-@{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\
+@{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
 @{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}
-{@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\<open>PS\<close>\medskip\smallskip\\
+{@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\<open>PS\<close>\medskip\\
 @{thm[mode=Axiom] Posix.intros(7)}\<open>P[]\<close>\qquad
 $\mprset{flushleft}
 \inferrule
 {@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
 @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
 @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
 @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
-{@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\medskip\smallskip\\
+{@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\medskip\\
 \mprset{sep=4mm}
-@{thm[mode=Rule] Posix.intros(9)}\<open>Pn[]\<close>\quad
+@{thm[mode=Rule] Posix.intros(9)}\<open>Pn[]\<close>\quad\;
 $\mprset{flushleft}
 \inferrule
 {@{thm (prem 1) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
 @{thm (prem 2) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
 @{thm (prem 3) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \\\\
 \begin{tabular}{@ {}lcl@ {}}
 @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\
 @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
 @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
 @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
-@{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\
+@{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}
 \end{tabular}
-\medskip\\
+%\end{tabular}
+%\end{center}
-%!\begin{tabular}{@ {}cc@ {}}
+\smallskip\\
+%\begin{center}
+%\begin{tabular}{@ {}cc@ {}}
 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}}
-@{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
+@{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)[of "c"]}\\
 @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ &
 @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
 @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ &
 @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
 %!
 The function @{text mkeps} is run when the last derivative is nullable, that is
 the string to be matched is in the language of the regular expression. It generates
 a value for how the last derivative can match the empty string. In case
 of @{term "NTIMES r n"} we use the function @{term replicate} in order to generate
 a list of exactly @{term n} copies, which is the length of the list we expect in this
-case.  The injection function
+case.  The injection function\footnote{While the character argument @{text c} is not
+strictly necessary in the @{text inj}-function for the fragment of regular expressions we
+use in this paper, it is necessary for extended regular expressions, for example for range regular expressions of the form @{text "[a-z]"}.
+We therefore keep this argument from the original formulation of @{text inj} by Sulzmann and Lu.}
 then calculates the corresponding value for each intermediate derivative until
 a value for the original regular expression is generated.
 Graphically the algorithm by
 Sulzmann and Lu can be illustrated by the following picture %in Figure~\ref{Sulz}
 where the path from the left to the right involving @{term derivatives}/@{const
 nullable} is the first phase of the algorithm (calculating successive
 \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
 left, the second phase.
 %
 \begin{center}
-\begin{tikzpicture}[scale=0.99,node distance=9mm,
+\begin{tikzpicture}[scale=0.85,node distance=8mm,
 every node/.style={minimum size=6mm}]
 \node (r1)  {@{term "r\<^sub>1"}};
 \node (r2) [right=of r1]{@{term "r\<^sub>2"}};
 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {@{term "der a DUMMY"}};
 \node (r3) [right=of r2]{@{term "r\<^sub>3"}};
 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {\<open>inj r\<^sub>3 c\<close>};
 \node (v2) [left=of v3]{@{term "v\<^sub>2"}};
 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {\<open>inj r\<^sub>2 b\<close>};
 \node (v1) [left=of v2] {@{term "v\<^sub>1"}};
 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {\<open>inj r\<^sub>1 a\<close>};
-\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{@{term "mkeps"}}};
+\draw (r4) node[anchor=north west] {\;\raisebox{-5mm}{@{term "mkeps"}}};
 \end{tikzpicture}
 \end{center}
 %
 \noindent
 The picture shows the steps required when a
 \begin{center}
 \begin{tabular}{lcl}
 @{thm (lhs) lexer.simps(1)} & $\dn$ & @{thm (rhs) lexer.simps(1)}\\
 @{thm (lhs) lexer.simps(2)} & $\dn$ & @{text "case"} @{term "lexer (der c r) s"} @{text of}
-@{term "None"}  @{text "\<Rightarrow>"} @{term None}\\
+@{term "None"}  @{text "\<Rightarrow>"} @{term None}
-& & \hspace{27mm}$|\;$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}
+%%& & \hspace{27mm}
+		     $|\;$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}
 \end{tabular}
 \end{center}
 We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that
 \noindent
 With this in place we were able to prove:
 \begin{proposition}\mbox{}\label{lexercorrect}
-\textrm{(1)} @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}.\\
+\textrm{(1)}\; @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}.\\
-\mbox{\hspace{23.5mm}}\textrm{(2)}\; @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}.
+\mbox{\hspace{29.5mm}}\textrm{(2)}\; @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}.
 %
 % \smallskip\\
 %\begin{tabular}{ll}
-%(1) & @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}\\
+%(1) & @{thm (lhs) lexer_correct_None} \;if and only if\; @{thm (rhs) lexer_correct_None}\\
-%(2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\
+%(2) & @{thm (lhs) lexer_correct_Some} \;if and only if\; @{thm (rhs) lexer_correct_Some}\\
 %\end{tabular}
 \end{proposition}
 \noindent
 In fact we have shown that, in the success case, the generated POSIX value $v$ is
 Sulzmann and Lu describe another algorithm that also generates POSIX
 values but dispenses with the second phase where characters are
 injected ``back'' into values. For this they annotate bitcodes to
 regular expressions, which we define in Isabelle/HOL as the datatype\medskip
-%!\begin{center}
+\begin{center}
-\noindent
+%\noindent
-\begin{minipage}{1.01\textwidth}
+%\begin{minipage}{0.9\textwidth}
 \,@{term breg} $\,::=\,$ @{term "AZERO"}
 $\,\mid$ @{term "AONE bs"}
 $\,\mid$ @{term "ACHAR bs c"}
 $\,\mid$ @{term "AALTs bs rs"}
 $\,\mid$ @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}
 $\,\mid$ @{term "ASTAR bs r"}
-	       $\,\mid$ \mbox{@{term "ANTIMES bs r n"}}\hfill\mbox{}
+	       $\,\mid$ \mbox{@{term "ANTIMES bs r n"}}%\hfill\mbox{}
-\end{minipage}\medskip
+%\end{minipage}\medskip
-%!\end{center}
+\end{center}
 \noindent where @{text bs} stands for bitsequences; @{text r},
 @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular
 expressions; and @{text rs} for lists of bitcoded regular
 expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"}
 \end{tabular}
 \end{tabular}
 \end{center}
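The coding of values into bitsequences can also be sketched in Python. The tuple encoding of values and the name \texttt{code} are assumptions made for this illustration; the clauses follow the informal description of the coding given in this section.

```python
# Values: ("Empty",) ("Char", c) ("Left", v) ("Right", v)
#         ("Seq", v1, v2) ("Stars", [v, ...])
Z, S = "Z", "S"


def code(v):
    """Encode a value as a list of bits, dropping characters and Seq markers."""
    tag = v[0]
    if tag in ("Empty", "Char"):
        return []                          # nothing is recorded
    if tag == "Left":
        return [Z] + code(v[1])
    if tag == "Right":
        return [S] + code(v[1])
    if tag == "Seq":
        return code(v[1]) + code(v[2])     # just append both bitsequences
    if tag == "Stars":
        bits = []
        for w in v[1]:
            bits += [Z] + code(w)          # Z: another value is coming
        return bits + [S]                  # S: end of the list
    raise ValueError("unknown value")
```

For example the value \texttt{Stars [Left (Char a), Right (Char b)]} is coded as the bitsequence \texttt{Z Z Z S S}.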
 \noindent
-As can be seen, this coding is ``lossy'' in the sense that we do not
+As can be seen, this coding is ``lossy'' in the sense that it does not
 record explicitly character values and also not sequence values (for
-them we just append two bitsequences). However, the
+them it just appends two bitsequences). However, the
 different alternatives for @{text Left}, respectively @{text Right}, are recorded as @{text Z} and
 @{text S} followed by some bitsequence. Similarly, we use @{text Z} to indicate
 if there is still a value coming in the list of @{text Stars}, whereas @{text S}
 indicates the end of the list. The lossiness makes the process of
 decoding a bit more involved, but the point is that if we have a
 the functions \textit{bnullable} and \textit{bmkeps}(\textit{s}), which are the
 ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on
 bitcoded regular expressions.
 %
 \begin{center}
-\begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{6mm}}c@ {}}
+\begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{10mm}}c@ {}}
 \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l}
 $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{False}$\\
 $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{True}$\\
 $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{False}$\\
 $\textit{bnullable}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &\\
 can define Sulzmann and Lu's bitcoded lexer, which we call \textit{blexer}:
 %
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
 $\textit{blexer}\;r\,s$ & $\dn$ &
 $\textit{let}\;r_{der} = (r^\uparrow)\backslash s\;\textit{in}$\\
-& & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\
+& & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$
-& & $\;\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$\\
+$\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$
-& & $\;\;\;\textit{else}\;\textit{None}$
+$\;\textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
 \noindent
 This bitcoded lexer first internalises the regular expression $r$ and then
 is the same as if we first erase the bitcoded regular expression and
 then perform the ``standard'' derivative operation.
 \begin{lemma}\label{bnullable}%\mbox{}\smallskip\\
 \textit{(1)} $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\
-\mbox{\hspace{17mm}}\textit{(2)} $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
+\mbox{\hspace{21.5mm}}\textit{(2)} $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
-\mbox{\hspace{17mm}}\textit{(3)} $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$
+\mbox{\hspace{21.5mm}}\textit{(3)} $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$
 %\begin{tabular}{ll}
 %\textit{(1)} & $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\
 %\textit{(2)} & $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
-%\textit{(3)} & $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$.
+%\textit{(3)} & $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ \;provided\; $\textit{nullable}(r^\downarrow)$.
 %\end{tabular}
 \end{lemma}
 %\begin{proof}
 %  All properties are by induction on annotated regular expressions.
 Using this function we can recast the success case in @{text lexer}
 as follows:
 \begin{lemma}\label{flex}
 If @{text "lexer r s = Some v"} \;then\; @{text "v = "}$\,\textit{flex}\,r\,id\,s\,
-(\mkeps (r\backslash s))$.
+(\textit{mkeps}\,(r\backslash s))$.
 \end{lemma}
 \noindent
 Note we did not redefine \textit{lexer}, we just established that the
 value generated by \textit{lexer} can also be obtained by a different
 cannot make any further simplifications. This is a problem because
 the outermost alternative contains two copies of the same
 regular expression (underlined with $r$). These copies will
 spawn new copies in later derivative steps and they in turn even more copies. This
 destroys any hope of taming the size of the derivatives.  But the
-second copy of $r$ in \eqref{derivex} will never contribute to a
+second copy of $r$ in~\eqref{derivex} will never contribute to a
 value, because POSIX lexing will always prefer matching a string
 with the first copy. So it could be safely removed without affecting the correctness of the algorithm.
 The issue with the simple-minded
 simplification rules above is that the rule $r + r \Rightarrow r$
 will never be applicable because as can be seen in this example the
 expressions is that they can be easily modified such that simplification does not
 interfere with the value constructions. For example we can ``flatten'', or
 de-nest, or spill out, @{text ALTs} as follows
 %
 \[
-@{term "ALTs bs\<^sub>1 (((ALTs bs\<^sub>2 rs\<^sub>2)) # rs\<^sub>1)"}
+@{text "ALTs bs\<^sub>1 ((ALTs bs\<^sub>2 rs\<^sub>2) :: rs\<^sub>1)"}
 \quad\xrightarrow{bsimp}\quad
 @{text "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"}
 \]
 \noindent
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
 @{thm (lhs) distinctWith.simps(1)} & $\dn$ & @{thm (rhs) distinctWith.simps(1)}\\
 @{thm (lhs) distinctWith.simps(2)} & $\dn$ &
-@{text "if (\<exists> y \<in> acc. eq x y)"} \\
+@{text "if (\<exists> y \<in> acc. eq x y)"}
-& & @{text "then distinctWith xs eq acc"}\\
+@{text "then distinctWith xs eq acc"}\\
 & & @{text "else x :: distinctWith xs eq ({x} \<union> acc)"}
 \end{tabular}
 \end{center}
 \noindent where we scan the list from left to right (because we
 @{term distinctWith} where @{text eq} is an equivalence that deletes bitsequences from bitcoded regular expressions
 before comparing the components. One way to define this in Isabelle/HOL is by the following recursive function from
 bitcoded regular expressions to @{text bool}:
 %
 \begin{center}
-\begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}}
+\begin{tabular}{@ {}cc@ {}}
+\begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {\hspace{1mm}}}
 @{thm (lhs) eq1.simps(1)} & $\dn$ & @{thm (rhs) eq1.simps(1)}\\
 @{thm (lhs) eq1.simps(2)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(2)[of DUMMY DUMMY]}\\
+@{thm (lhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]}\\
+@{thm (lhs) eq1.simps(5)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(5)[of DUMMY DUMMY]}\\
+\mbox{}
+\end{tabular}
+&
+\begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}}
 @{thm (lhs) eq1.simps(3)[of DUMMY "c" DUMMY "d"]} & $\dn$ & @{thm (rhs) eq1.simps(3)[of DUMMY "c" DUMMY "d"]}\\
 @{thm (lhs) eq1.simps(4)[of DUMMY "r\<^sub>1\<^sub>1" "r\<^sub>1\<^sub>2" DUMMY "r\<^sub>2\<^sub>1" "r\<^sub>2\<^sub>2"]} & $\dn$ &
 @{thm (rhs) eq1.simps(4)[of DUMMY "r\<^sub>1\<^sub>1" "r\<^sub>1\<^sub>2" DUMMY "r\<^sub>2\<^sub>1" "r\<^sub>2\<^sub>2"]}\\
-@{thm (lhs) eq1.simps(5)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(5)[of DUMMY DUMMY]}\\
-@{thm (lhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]} & $\dn$ &
-@{thm (rhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]}\\
-@{thm (lhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]}\\
 @{thm (lhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]}\\
+@{thm (lhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]} & $\dn$ & \\
+\multicolumn{3}{r}{@{thm (rhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]}}
+\end{tabular}
 \end{tabular}
 \end{center}
 \noindent
 where all other cases are set to @{text False}.
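A Python sketch of this duplicate removal is given below. It is an illustration under our own assumptions: bitcoded regular expressions are encoded as tuples whose second component is the bitsequence, the accumulator is a tuple instead of a set (the tuples used here contain lists and are therefore not hashable), and the names \texttt{eq1} and \texttt{distinctWith} merely mirror the ones used in the text.

```python
# Bitcoded regular expressions as tuples (encoding assumed for this sketch):
#   ("AZERO",) ("AONE", bs) ("ACHAR", bs, c) ("AALTs", bs, rs)
#   ("ASEQ", bs, r1, r2) ("ASTAR", bs, r) ("ANTIMES", bs, r, n)


def eq1(r1, r2):
    """Equality of bitcoded regular expressions that ignores all bitsequences."""
    if r1[0] != r2[0]:
        return False  # all other cases are False
    tag = r1[0]
    if tag in ("AZERO", "AONE"):
        return True
    if tag == "ACHAR":
        return r1[2] == r2[2]
    if tag == "ASEQ":
        return eq1(r1[2], r2[2]) and eq1(r1[3], r2[3])
    if tag == "AALTs":
        return len(r1[2]) == len(r2[2]) and all(
            eq1(x, y) for x, y in zip(r1[2], r2[2]))
    if tag == "ASTAR":
        return eq1(r1[2], r2[2])
    if tag == "ANTIMES":
        return r1[3] == r2[3] and eq1(r1[2], r2[2])
    return False


def distinctWith(xs, eq, acc=()):
    """Keep the first occurrence of each element of xs up to eq, left to right."""
    if not xs:
        return []
    x, rest = xs[0], xs[1:]
    if any(eq(x, y) for y in acc):
        return distinctWith(rest, eq, acc)
    return [x] + distinctWith(rest, eq, acc + (x,))
```

Applied to the list consisting of \texttt{ACHAR [Z] a}, \texttt{ACHAR [S] a} and \texttt{ACHAR [] b}, the sketch keeps the first and the third element: the second differs from the first only in its bitsequence.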
 but is needed in order to make the removal of unnecessary copies
 work properly.
 Our simplification function depends on three more helper functions, one is called
 @{text flts} and analyses lists of regular expressions coming from alternatives.
-It is defined as follows:
+It is defined by four clauses as follows:
 \begin{center}
-\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+\begin{tabular}{c}
-\multicolumn{3}{@ {}c}{@{thm (lhs) flts.simps(1)} $\dn$ @{thm (rhs) flts.simps(1)} \qquad\qquad\qquad\qquad
+@{thm (lhs) flts.simps(1)} $\dn$ @{thm (rhs) flts.simps(1)} \qquad
-@{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)}}\smallskip\\
+@{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)} \qquad
-@{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\
+@{text "flts ((ALTs bs' rs') :: rs)"}
-\multicolumn{3}{@ {}c}{@{text "flts (r :: rs)"} $\dn$ @{text "r :: flts rs"}
+%@{ thm (lhs) flts.simps(3)[of "bs'" "rs'"]}
-\hspace{5mm}(otherwise)}
+$\dn$ @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\smallskip\\
+@{text "flts (r :: rs)"} $\dn$ @{text "r :: flts rs"}\hspace{5mm}(otherwise)
 \end{tabular}
 \end{center}
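The four clauses can be read as the following Python sketch. It is an illustration under our own assumptions: bitcoded regular expressions are tuples whose second component is the bitsequence, and \texttt{fuse} is taken to prepend a bitsequence to the top-level bitsequence of a regular expression (leaving \texttt{AZERO} unchanged).

```python
# Bitcoded regular expressions as tuples (encoding assumed for this sketch):
#   ("AZERO",) ("AONE", bs) ("ACHAR", bs, c) ("AALTs", bs, rs)
#   ("ASEQ", bs, r1, r2) ("ASTAR", bs, r) ("ANTIMES", bs, r, n)


def fuse(bs, r):
    """Prepend the bits bs to the top-level bitsequence of r."""
    if r[0] == "AZERO":
        return r
    return (r[0], bs + r[1]) + r[2:]


def flts(rs):
    """Drop AZEROs and de-nest alternatives by one level, keeping their bits."""
    out = []
    for r in rs:
        if r[0] == "AZERO":
            continue                                   # second clause
        if r[0] == "AALTs":
            out.extend(fuse(r[1], x) for x in r[2])    # third clause
        else:
            out.append(r)                              # fourth clause
    return out
```

For example \texttt{flts} applied to the list \texttt{AZERO}, \texttt{AALTs [Z] [ACHAR [] a, AONE [S]]}, \texttt{ACHAR [] b} yields \texttt{ACHAR [Z] a}, \texttt{AONE [Z, S]}, \texttt{ACHAR [] b}.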
 \noindent
 The second clause of @{text flts} removes all instances of @{text ZERO} in alternatives and
-the third ``de-nests'' alternatives (but retaining the
+the third ``de-nests'' alternatives (but retains the
 bitsequence @{text "bs'"} accumulated in the inner alternative). There are
 some corner cases to be considered when the resulting list inside an alternative is
 empty or a singleton list. We take care of those cases in the
 @{text "bsimpALTs"} function; similarly we define a helper function that simplifies
 sequences according to the usual rules about @{text ZERO}s and @{text ONE}s:
 \end{tabular}
 \end{center}
 \noindent
 We believe our recursive function @{term bsimp} simplifies bitcoded regular
-expressions as intended by Sulzmann and Lu. There is no point in applying the
+expressions as intended by Sulzmann and Lu with the small addition of removing ``useless'' @{text ONE}s
+in sequence regular
+expressions. There is no point in applying the
 @{text bsimp} function repeatedly (like the simplification in their paper which needs to be
 applied until a fixpoint is reached) because we can show that @{term bsimp} is idempotent,
 that is
 \begin{proposition}
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
 $\textit{blexer}^+\;r\,s$ & $\dn$ &
 $\textit{let}\;r_{der} = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\
-& & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\
+& & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$
-& & $\;\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$ $\;\;\textit{else}\;\textit{None}$
+$\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$ $\;\textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
 \noindent
 Note that in $\textit{blexer}^+$ the derivative $r_{der}$ is calculated
 expression anymore. So unless one can somehow
 synchronise the change in the simplified regular expressions with
 the original POSIX value, there is no hope of appealing to @{text retrieve} in the
 correctness argument for @{term blexer_simp}.
-We found it more helpful to introduce the rewriting systems shown in
+For our proof we found it more helpful to introduce the rewriting systems shown in
 Fig.~\ref{SimpRewrites}. The idea is to generate
 simplified regular expressions in small steps (unlike the @{text bsimp}-function which
 does the same in a big step), and show that each of
 the small steps preserves the bitcodes that lead to the POSIX value.
 The rewrite system is organised such that $\leadsto$ is for bitcoded regular
 we shall show next.
 \begin{figure}[t]
 \begin{center}
 \begin{tabular}{@ {\hspace{-8mm}}c@ {}}
-\\[-7mm]
+\\[-8mm]
 @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}$S\ZERO{}_l$\quad
 @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}$S\ZERO{}_r$\quad
 @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}$S\ONE$\\
 @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}SL\qquad
 @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}SR\\
 where in (1) the set $\textit{Suf}(@{text "r"}_1, s)$ are all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$).
 In (3) we know that  $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is
 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
 than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands
 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
-We reason similarly for @{text STAR}.\smallskip
+We reason similarly for @{text STAR} and @{text NT}.\smallskip
 Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
 far from the actual bound we can expect. We can do better than this, but this does not improve
 the finiteness property we are proving. If we are interested in a polynomial bound,
 \noindent
 Essentially it matches the string with the longer @{text "Right"}-alternative in the
 first sequence (and then the `rest' with the empty regular expression @{const ONE} from the second sequence).
 If we add the simplification above, then we obtain the following value
-%
-\[
 @{term "Seq (Left (Char a)) (Left (Char b))"}
-\]
-\noindent
 where the @{text "Left"}-alternatives get priority. However, this violates
 the POSIX rules and we have not been able to
 reconcile this problem. Therefore we leave better bounds for future work.%!\\[-6.5mm]
-Note that while Antimirov was able to give a bound on the \emph{size}
+Note also that while Antimirov was able to give a bound on the \emph{size}
 of his partial derivatives~\cite{Antimirov95}, Brzozowski gave a bound
 on the \emph{number} of derivatives, provided they are quotiented via
-ACI rules \cite{Brzozowski1964}. Brozozowski's result is crucial when one
+ACI rules \cite{Brzozowski1964}. Brzozowski's result is crucial when one
-uses derivatives for obtaining an automaton (it essentially bounds
+uses his derivatives for obtaining a DFA (it essentially bounds
 the number of states). However, this result does \emph{not}
 transfer to our setting where we are interested in the \emph{size} of the
-derivatives. For example it is not true for our derivatives that the
+derivatives. For example it is \emph{not} true for our derivatives that the
-set of of derivatives $r \backslash s$ for a given $r$ and all strings
+set of derivatives $r \backslash s$ for a given $r$ and all strings
 $s$ is finite (even when simplified). This is because for our annotated regular expressions
 the bitcode annotation is dependent on the number of iterations that
-are needed for STAR-regular expressions. This is not a problem for us: Since we intend to do lexing
+are needed for @{text STAR}-regular expressions. This is not a problem for us: Since we intend to do lexing
 by calculating (as fast as possible) derivatives, the bound on the size
-of the derivatives is important, not the number of derivatives.
+of the derivatives is important, not their number. % of derivatives.
 *}
 We set out in this work to prove in Isabelle/HOL the correctness of
 the second POSIX lexing algorithm by Sulzmann and Lu
 \cite{Sulzmann2014}. This follows earlier work where we established
 the correctness of the first algorithm
 \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
-introduce our own specification about what POSIX values are,
+introduce our own specification for POSIX values,
 because the informal definition given by Sulzmann and Lu did not
 stand up to formal proof. Also for the second algorithm we needed
-to introduce our own definitions and proof ideas in order to establish the
+to introduce our own definitions and proof ideas in order to
-correctness.  Our interest in the second algorithm
+establish the correctness.  Our interest in the second algorithm
-lies in the fact that by using bitcoded regular expressions and an aggressive
+lies in the fact that by using bitcoded regular expressions and an
-simplification method there is a chance that the derivatives
+aggressive simplification method there is a chance that the
-can be kept universally small  (we established in this paper that
+derivatives can be kept universally small (we established in this
-for a given $r$ they can be kept finitely bounded for all strings).
+paper that for a given $r$ they can be kept finitely bounded for
+\emph{all} strings).  Our formalisation is approximately 7500 lines of
+Isabelle code. A little more than half of this code concerns the finiteness bound
+obtained in Section 5. This slight ``bloat'' in the latter part is because
+we had to introduce an intermediate datatype for annotated regular expressions and repeat many
+definitions for this intermediate datatype. But overall we think this made our
+formalisation work smoother. The code of our formalisation can be found at
+\textcolor{darkblue}{\url{https://github.com/urbanchr/posix}}.
 %This is important if one is after
 %an efficient POSIX lexing algorithm based on derivatives.
 Having proved the correctness of the POSIX lexing algorithm, which
 lessons have we learned? Well, we feel this is a very good example
 obscure, examples.
 %We found that from an implementation
 %point-of-view it is really important to have the formal proofs of
 %the corresponding properties at hand.
-With the results reported here, we can of course only make a claim about the correctness
+With the results reported here, we can of course only make a claim
-of the algorithm and the sizes of the
+about the correctness of the algorithm and the sizes of the
 derivatives, not about the efficiency or runtime of our version of
-Sulzman and Lu's algorithm. But we found the size is an important
+Sulzmann and Lu's algorithm. But we found the size is an important
 first indicator of efficiency: clearly, if the derivatives can
 grow to arbitrarily big sizes and the algorithm needs to traverse
 the derivatives possibly several times, then the algorithm will be
-slow---excruciatingly slow that is. Other works seems to make
+slow---excruciatingly slow that is. Other works seem to make
-stronger claims, but during our work we have developed a healthy
+stronger claims, but during our formalisation work we have
-suspicion when for example experimental data is used to back up
+developed a healthy suspicion when for example experimental data is
-efficiency claims. For instance Sulzmann and Lu write about their
+used to back up efficiency claims. For instance Sulzmann and Lu
-equivalent of @{term blexer_simp} \textit{``...we can incrementally
+write about their equivalent of @{term blexer_simp} \textit{``...we
-compute bitcoded parse trees in linear time in the size of the
+can incrementally compute bitcoded parse trees in linear time in
-input''} \cite[Page 14]{Sulzmann2014}.  Given the growth of the
+the size of the input''} \cite[Page 14]{Sulzmann2014}.  Given the
-derivatives in some cases even after aggressive simplification,
+growth of the derivatives in some cases even after aggressive
-this is a hard to believe claim. A similar claim about a
+simplification, this claim is hard to believe. A similar claim
-theoretical runtime of @{text "O(n\<^sup>2)"} is made for the
+about a theoretical runtime of @{text "O(n\<^sup>2)"} for one
-Verbatim lexer, which calculates tokens according to POSIX
+specific list of regular expressions is made for the Verbatim
+lexer, which calculates tokens according to POSIX
 rules~\cite{verbatim}. For this, Verbatim uses Brzozowski's
-derivatives like in our work.  The authors write: \textit{``The
+derivatives like in our work.  About their empirical data, the authors write:
-results of our empirical tests [..] confirm that Verbatim has
+\textit{``The results of our empirical tests [..] confirm that
-@{text "O(n\<^sup>2)"} time complexity.''}
+Verbatim has @{text "O(n\<^sup>2)"} time complexity.''}
 \cite[Section~VII]{verbatim}.  While their correctness proof for
 Verbatim is formalised in Coq, the claim about the runtime
-complexity is only supported by some emperical evidence obtained by
+complexity is only supported by some empirical evidence obtained by
 using the code extraction facilities of Coq.  Given our observation
-with the ``growth problem'' of derivatives, we tried out their
+with the ``growth problem'' of derivatives, this runtime bound
-extracted OCaml code with the example \mbox{@{text "(a +
+is unlikely to hold universally: indeed we tried out their extracted OCaml
-aa)\<^sup>*"}} as a single lexing rule, and it took for us around 5
+code with the example \mbox{@{text "(a + aa)\<^sup>*"}} as a single
-minutes to tokenise a string of 40 $a$'s and that increased to
+lexing rule, and it took us around 5 minutes to tokenise a
-approximately 19 minutes when the string is 50 $a$'s long. Taking
+string of 40 $a$'s and that increased to approximately 19 minutes
-into account that derivatives are not simplified in the Verbatim
+when the string is 50 $a$'s long. Taking into account that
-lexer, such numbers are not surprising.  Clearly our result of
+derivatives are not simplified in the Verbatim lexer, such numbers
-having finite derivatives might sound rather weak in this context
+are not surprising.  Clearly our result of having finite
-but we think such effeciency claims really require further
+derivatives might sound rather weak in this context, but we think
-scrutiny.
+such efficiency claims really require further scrutiny.  The
+contribution of this paper is to make sure derivatives do not grow
-The contribution of this paper is to make sure derivatives do not
+arbitrarily big (universally). In the example \mbox{@{text "(a +
-grow arbitrarily big (universially) In the example \mbox{@{text "(a
+aa)\<^sup>*"}}, \emph{all} derivatives have a size of 17 or
-+ aa)\<^sup>*"}}, \emph{all} derivatives have a size of 17 or
 less. The result is that lexing a string of, say, 50\,000 a's with
 the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes
 approximately 10 seconds with our Scala implementation of the
 presented algorithm.
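The ``growth problem'' for \mbox{@{text "(a + aa)\<^sup>*"}} can be reproduced with a small sketch. The Python code below is only an illustration, not our Isabelle or Scala code: it implements plain Brzozowski derivatives \emph{without} any simplification and without bitcodes, and the names (\texttt{der}, \texttt{size}, \texttt{derivative\_sizes}) as well as the node-counting size measure are assumptions made for this example.

```python
from dataclasses import dataclass


class Rexp:
    """Base class for (unannotated) regular expressions."""


@dataclass(frozen=True)
class Zero(Rexp):
    pass


@dataclass(frozen=True)
class One(Rexp):
    pass


@dataclass(frozen=True)
class Char(Rexp):
    c: str


@dataclass(frozen=True)
class Alt(Rexp):
    r1: Rexp
    r2: Rexp


@dataclass(frozen=True)
class Seq(Rexp):
    r1: Rexp
    r2: Rexp


@dataclass(frozen=True)
class Star(Rexp):
    r: Rexp


def nullable(r):
    """True iff r matches the empty string."""
    if isinstance(r, (Zero, Char)):
        return False
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    return nullable(r.r1) and nullable(r.r2)  # Seq


def der(c, r):
    """Brzozowski derivative of r w.r.t. c, with no simplification."""
    if isinstance(r, (Zero, One)):
        return Zero()
    if isinstance(r, Char):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        left = Seq(der(c, r.r1), r.r2)
        return Alt(left, der(c, r.r2)) if nullable(r.r1) else left
    return Seq(der(c, r.r), r)  # Star


def size(r):
    """Number of nodes in r (an assumed size measure for this sketch)."""
    if isinstance(r, (Zero, One, Char)):
        return 1
    if isinstance(r, Star):
        return 1 + size(r.r)
    return 1 + size(r.r1) + size(r.r2)


def derivative_sizes(r, s):
    """Sizes of r and of its successive derivatives along the string s."""
    sizes = [size(r)]
    for c in s:
        r = der(c, r)
        sizes.append(size(r))
    return sizes


# (a + aa)* and the sizes of its derivatives w.r.t. a, aa, aaa, ...
R = Star(Alt(Char("a"), Seq(Char("a"), Char("a"))))
SIZES = derivative_sizes(R, "a" * 10)
```

In this sketch every derivative is strictly bigger than the previous one, because each derivative contains the previous one as a subterm. This is the behaviour that simplification is meant to prevent; the sketch itself says nothing about the bound we proved for the bitcoded algorithm with simplification.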
 derivatives is never greater than 5 in this example. Even in the
 example from Section 2, where Rust raises an error message, namely
 \mbox{@{text "a\<^bsup>{1000}{100}{5}\<^esup>"}}, the maximum size for
 our derivatives is a moderate 14.
+Let us also return to the example @{text
+"a\<^bsup>{0}{4294967295}\<^esup>"} which until recently Rust
+deemed acceptable. But this was due to a bug. It turns out that it took Rust
+more than 11 minutes to generate an automaton for this regular
+expression and then to determine that a string of just one(!)
+@{text a} does \emph{not} match this regular expression.  Therefore
+it is probably a wise choice that in newer versions of Rust's
+regular expression library such regular expressions are declared as
+``too big'' and raise an error message. While this is clearly
+a contrived example, the safety guarantees Rust wants to provide necessitate
+this conservative approach.
+However, with the derivatives and simplifications we have shown
+in this paper, the example can be solved with ease:
+it essentially only involves operations on
+integers and our Scala implementation takes only a few seconds to
+find out that this string, and even much larger strings, do not match.
 Let us also compare our work to the verified Verbatim++ lexer where
 the authors of the Verbatim lexer introduced a number of
 improvements and optimisations, for example memoisation
 \cite{verbatimpp}. However, unlike Verbatim, which works with
 derivatives like in our work, Verbatim++ compiles first a regular
 translation using the traditional notion of DFAs so that we can improve on this. Therefore we
 prefer to stick with calculating derivatives, but attempt to make
 this calculation (in the future) as fast as possible. What we can guarantee
 with the presented work is that the maximum size of the derivatives
 for \mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}$\,\cdot\,$@{term "STAR a"}} is never bigger than 9. This means our Scala
-implementation only needs a few seconds for this example and matching 50\,000 a's.
+implementation again only needs a few seconds for this example when matching, say, 50\,000 a's.
 %
 %
 %Possible ideas are
 %zippers which have been used in the context of parsing of
 %context-free grammars \cite{zipperparser}.
-\medskip
+%\\[-5mm]  %\smallskip
-\noindent
+%\noindent
-Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
+%Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
-%\\[-10mm]
+%%\\[-10mm]
 %%\bibliographystyle{plain}
 \bibliography{root}
 *}

changeset 647	70c10dc41606
parent 644	9f984ff20020