--- a/thys3/Paper.thy Wed Feb 15 11:52:22 2023 +0000
+++ b/thys3/Paper.thy Thu Feb 16 23:23:22 2023 +0000
@@ -97,7 +97,7 @@
expressions have sparked quite a bit of interest in the functional
programming and theorem prover communities.
-Derivatives of a
-regular expression, written @{term "der c r"}, give a simple solution
+Derivatives of
+regular expressions, written @{term "der c r"}, give a simple solution
to the problem of matching a string @{term s} with a regular
expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
succession) all the characters of the string matches the empty string,
@@ -280,16 +280,25 @@
@{term "NTIMES a 1001"} is not permitted, because no counter can be
above 1000; and in the built-in regular expression library in Rust
expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error
- message for being too big. These problems can of course be solved in matching
+ message for being too big. Until recently,\footnote{Up until version 1.5.4 of the regex
+ library in Rust; see also CVE-2022-24713.} however, Rust happily
+ generated an automaton for the regular expression
+ @{text "a\<^bsup>{0}{4294967295}\<^esup>"}. This was due to a bug
+ in the algorithm that decides when a regular expression is acceptable
+ or too big according to Rust's classification. We shall come back to
+ this example later in the paper.
+ These problems can of course be solved in matching
algorithms where automata go beyond the classic notion and for
- instance include explicit counters (see~\cite{CountingSet2020}).
+ instance include explicit counters (see for example~\cite{CountingSet2020}).
The point here is that Brzozowski derivatives and the algorithms by
Sulzmann and Lu can be straightforwardly extended to deal with
bounded regular expressions and moreover the resulting code
still consists of only simple recursive functions and inductive
datatypes. Finally, bounded regular expressions
do not destroy our finite boundedness property, which we shall
- prove later on.%, because during the lexing process counters will only be
+ prove later on.
+
+ %, because during the lexing process counters will only be
%decremented.
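+
+ To make this last point concrete, the following Scala sketch shows one way of
+ extending the usual derivative functions with a counter for bounded
+ repetitions. It is only an illustration: the constructor and function names are
+ ours and the clauses follow the standard definitions rather than the exact
+ Isabelle formulation used later in the paper.
+
+ \begin{verbatim}
+ // regular expressions with bounded repetition r{n}
+ abstract class Rexp
+ case object ZERO extends Rexp
+ case object ONE extends Rexp
+ case class CHAR(c: Char) extends Rexp
+ case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+ case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+ case class STAR(r: Rexp) extends Rexp
+ case class NTIMES(r: Rexp, n: Int) extends Rexp
+
+ def nullable(r: Rexp): Boolean = r match {
+   case ZERO => false
+   case ONE => true
+   case CHAR(_) => false
+   case ALT(r1, r2) => nullable(r1) || nullable(r2)
+   case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+   case STAR(_) => true
+   case NTIMES(r1, n) => n == 0 || nullable(r1)   // the counter is only inspected
+ }
+
+ def der(c: Char, r: Rexp): Rexp = r match {
+   case ZERO => ZERO
+   case ONE => ZERO
+   case CHAR(d) => if (c == d) ONE else ZERO
+   case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+   case SEQ(r1, r2) =>
+     if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+     else SEQ(der(c, r1), r2)
+   case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+   case NTIMES(r1, n) =>                           // the counter is only decremented
+     if (n == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, n - 1))
+ }
+
+ def matcher(r: Rexp, s: String): Boolean =
+   nullable(s.foldLeft(r)((r, c) => der(c, r)))
+
+ // matcher(NTIMES(CHAR('a'), 3), "aaa")  ==> true
+ // matcher(NTIMES(CHAR('a'), 3), "aa")   ==> false
+ \end{verbatim}
+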
@@ -393,7 +402,7 @@
@{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\qquad
$\mprset{flushleft}\inferrule{
@{thm (prem 1) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}\\\\
- @{thm (prem 2) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}\\\\
+ @{thm (prem 2) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}\quad
@{thm (prem 3) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}
}
{@{thm (concl) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}}
@@ -437,7 +446,7 @@
\begin{figure}[t]
\begin{center}\small%
\begin{tabular}{@ {\hspace{-2mm}}c@ {}}
- \\[-4.5mm]
+ \\[-8.5mm]
@{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad
@{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad
@{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad
@@ -445,9 +454,9 @@
$\mprset{flushleft}
\inferrule
{@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
- @{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\
+ @{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
@{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}
- {@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\<open>PS\<close>\medskip\smallskip\\
+ {@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\<open>PS\<close>\medskip\\
@{thm[mode=Axiom] Posix.intros(7)}\<open>P[]\<close>\qquad
$\mprset{flushleft}
\inferrule
@@ -455,9 +464,9 @@
@{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
@{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
@{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
- {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\medskip\smallskip\\
+ {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\medskip\\
\mprset{sep=4mm}
- @{thm[mode=Rule] Posix.intros(9)}\<open>Pn[]\<close>\quad
+ @{thm[mode=Rule] Posix.intros(9)}\<open>Pn[]\<close>\quad\;
$\mprset{flushleft}
\inferrule
{@{thm (prem 1) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
@@ -487,13 +496,16 @@
@{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
- @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\
+ @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}
\end{tabular}
- \medskip\\
+ %\end{tabular}
+ %\end{center}
+ \smallskip\\
- %!\begin{tabular}{@ {}cc@ {}}
+ %\begin{center}
+ %\begin{tabular}{@ {}cc@ {}}
\begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}}
- @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
+ @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)[of "c"]}\\
@{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ &
@{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
@{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ &
@@ -525,7 +537,10 @@
a value for how the last derivative can match the empty string. In case
of @{term "NTIMES r n"} we use the function @{term replicate} in order to generate
a list of exactly @{term n} copies, which is the length of the list we expect in this
- case. The injection function
+ case. The injection function\footnote{While the character argument @{text c} is not
+ strictly necessary in the @{text inj}-function for the fragment of regular expressions we
+ use in this paper, it is necessary for extended regular expressions, for example for range regular expressions of the form @{text "[a-z]"}.
+ We therefore keep this argument from the original formulation of @{text inj} by Sulzmann and Lu.}
then calculates the corresponding value for each intermediate derivative until
a value for the original regular expression is generated.
Graphically the algorithm by
@@ -536,7 +551,7 @@
left, the second phase.
%
\begin{center}
-\begin{tikzpicture}[scale=0.99,node distance=9mm,
+\begin{tikzpicture}[scale=0.85,node distance=8mm,
every node/.style={minimum size=6mm}]
\node (r1) {@{term "r\<^sub>1"}};
\node (r2) [right=of r1]{@{term "r\<^sub>2"}};
@@ -554,7 +569,7 @@
\draw[->,line width=1mm](v3)--(v2) node[below,midway] {\<open>inj r\<^sub>2 b\<close>};
\node (v1) [left=of v2] {@{term "v\<^sub>1"}};
\draw[->,line width=1mm](v2)--(v1) node[below,midway] {\<open>inj r\<^sub>1 a\<close>};
-\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{@{term "mkeps"}}};
+\draw (r4) node[anchor=north west] {\;\raisebox{-5mm}{@{term "mkeps"}}};
\end{tikzpicture}
\end{center}
%
@@ -608,8 +623,9 @@
\begin{tabular}{lcl}
@{thm (lhs) lexer.simps(1)} & $\dn$ & @{thm (rhs) lexer.simps(1)}\\
@{thm (lhs) lexer.simps(2)} & $\dn$ & @{text "case"} @{term "lexer (der c r) s"} @{text of}
- @{term "None"} @{text "\<Rightarrow>"} @{term None}\\
- & & \hspace{27mm}$|\;$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}
+ @{term "None"} @{text "\<Rightarrow>"} @{term None}
+ %%& & \hspace{27mm}
+ $|\;$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}
\end{tabular}
\end{center}
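+
+ For readers who prefer code over inference rules, the two phases can be spelled
+ out in the following self-contained Scala sketch. It covers only the basic
+ regular expressions (bounded repetitions are omitted for brevity), the names are
+ illustrative and the code is not taken verbatim from our implementation.
+
+ \begin{verbatim}
+ abstract class Rexp
+ case object ZERO extends Rexp
+ case object ONE extends Rexp
+ case class CHAR(c: Char) extends Rexp
+ case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+ case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+ case class STAR(r: Rexp) extends Rexp
+
+ // values recording how a regular expression matched a string;
+ // note: Left/Right here are our value constructors, not Scala's Either
+ abstract class Val
+ case object Empty extends Val
+ case class Chr(c: Char) extends Val
+ case class Sequ(v1: Val, v2: Val) extends Val
+ case class Left(v: Val) extends Val
+ case class Right(v: Val) extends Val
+ case class Stars(vs: List[Val]) extends Val
+
+ def nullable(r: Rexp): Boolean = r match {
+   case ZERO => false
+   case ONE => true
+   case CHAR(_) => false
+   case ALT(r1, r2) => nullable(r1) || nullable(r2)
+   case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+   case STAR(_) => true
+ }
+
+ def der(c: Char, r: Rexp): Rexp = r match {
+   case ZERO => ZERO
+   case ONE => ZERO
+   case CHAR(d) => if (c == d) ONE else ZERO
+   case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+   case SEQ(r1, r2) =>
+     if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+     else SEQ(der(c, r1), r2)
+   case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+ }
+
+ // mkeps: a value showing how a nullable r matches the empty string
+ // (partial by design: only defined for nullable regular expressions)
+ def mkeps(r: Rexp): Val = r match {
+   case ONE => Empty
+   case ALT(r1, r2) => if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
+   case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
+   case STAR(_) => Stars(Nil)
+ }
+
+ // inj: turn a value for der(c, r) back into a value for r
+ // (partial by design: only values that can actually arise are covered)
+ def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
+   case (CHAR(d), Empty) => Chr(d)
+   case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
+   case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
+   case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
+   case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
+   case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
+   case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
+ }
+
+ def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
+   case Nil => if (nullable(r)) Some(mkeps(r)) else None
+   case c :: cs => lexer(der(c, r), cs) match {
+     case None => None
+     case Some(v) => Some(inj(r, c, v))
+   }
+ }
+
+ // lexer(SEQ(CHAR('a'), STAR(CHAR('b'))), "abb".toList)
+ //   ==> Some(Sequ(Chr(a), Stars(List(Chr(b), Chr(b)))))
+ \end{verbatim}
+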
@@ -627,13 +643,13 @@
With this in place we were able to prove:
\begin{proposition}\mbox{}\label{lexercorrect}
- \textrm{(1)} @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}.\\
- \mbox{\hspace{23.5mm}}\textrm{(2)}\; @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}.
+ \textrm{(1)}\; @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}.\\
+ \mbox{\hspace{29.5mm}}\textrm{(2)}\; @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}.
%
% \smallskip\\
%\begin{tabular}{ll}
- %(1) & @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}\\
- %(2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\
+ %(1) & @{thm (lhs) lexer_correct_None} \;if and only if\; @{thm (rhs) lexer_correct_None}\\
+ %(2) & @{thm (lhs) lexer_correct_Some} \;if and only if\; @{thm (rhs) lexer_correct_Some}\\
%\end{tabular}
\end{proposition}
@@ -657,18 +673,18 @@
injected ``back'' into values. For this they annotate bitcodes to
regular expressions, which we define in Isabelle/HOL as the datatype\medskip
- %!\begin{center}
- \noindent
- \begin{minipage}{1.01\textwidth}
+ \begin{center}
+ %\noindent
+ %\begin{minipage}{0.9\textwidth}
\,@{term breg} $\,::=\,$ @{term "AZERO"}
$\,\mid$ @{term "AONE bs"}
$\,\mid$ @{term "ACHAR bs c"}
$\,\mid$ @{term "AALTs bs rs"}
$\,\mid$ @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}
$\,\mid$ @{term "ASTAR bs r"}
- $\,\mid$ \mbox{@{term "ANTIMES bs r n"}}\hfill\mbox{}
- \end{minipage}\medskip
- %!\end{center}
+ $\,\mid$ \mbox{@{term "ANTIMES bs r n"}}%\hfill\mbox{}
+ %\end{minipage}\medskip
+ \end{center}
\noindent where @{text bs} stands for bitsequences; @{text r},
@{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular
@@ -702,9 +718,9 @@
\end{center}
\noindent
- As can be seen, this coding is ``lossy'' in the sense that we do not
+ As can be seen, this coding is ``lossy'' in the sense that it does not
record explicitly character values and also not sequence values (for
- them we just append two bitsequences). However, the
+ them it just appends two bitsequences). However, the
different alternatives for @{text Left}, respectively @{text Right}, are recorded as @{text Z} and
@{text S} followed by some bitsequence. Similarly, we use @{text Z} to indicate
if there is still a value coming in the list of @{text Stars}, whereas @{text S}
@@ -933,10 +949,10 @@
\begin{center}
\begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
$\textit{blexer}\;r\,s$ & $\dn$ &
- $\textit{let}\;r_{der} = (r^\uparrow)\backslash s\;\textit{in}$\\
- & & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\
- & & $\;\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$\\
- & & $\;\;\;\textit{else}\;\textit{None}$
+ $\textit{let}\;r_{der} = (r^\uparrow)\backslash s\;\textit{in}$\\
+ & & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$
+ $\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$
+ $\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
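+
+ The main ingredients of @{text blexer} can be sketched in Scala as follows;
+ @{text internalise} (written $(r^\uparrow)$), @{text bder}, @{text decode} and
+ the clauses for bounded repetitions are omitted, and the constructor names as
+ well as the details of the clauses are only illustrative.
+
+ \begin{verbatim}
+ abstract class Bit
+ case object Z extends Bit
+ case object S extends Bit
+
+ // bitcoded (annotated) regular expressions
+ abstract class ARexp
+ case object AZERO extends ARexp
+ case class AONE(bs: List[Bit]) extends ARexp
+ case class ACHAR(bs: List[Bit], c: Char) extends ARexp
+ case class AALTS(bs: List[Bit], rs: List[ARexp]) extends ARexp
+ case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
+ case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp
+
+ def bnullable(r: ARexp): Boolean = r match {
+   case AZERO => false
+   case AONE(_) => true
+   case ACHAR(_, _) => false
+   case AALTS(_, rs) => rs.exists(bnullable)
+   case ASEQ(_, r1, r2) => bnullable(r1) && bnullable(r2)
+   case ASTAR(_, _) => true
+ }
+
+ // bmkeps collects the bits that record how a bnullable derivative
+ // matches the empty string (S marks the end of the star iterations)
+ def bmkeps(r: ARexp): List[Bit] = r match {
+   case AONE(bs) => bs
+   case AALTS(bs, rs) => bs ++ bmkeps(rs.find(bnullable).get)
+   case ASEQ(bs, r1, r2) => bs ++ bmkeps(r1) ++ bmkeps(r2)
+   case ASTAR(bs, _) => bs ++ List(S)
+ }
+
+ // blexer itself then follows the definition displayed above:
+ //   blexer r s = let rd = bders(internalise(r), s)
+ //                in if bnullable(rd) then Some(decode(bmkeps(rd), r)) else None
+ \end{verbatim}
+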
@@ -991,12 +1007,12 @@
\begin{lemma}\label{bnullable}%\mbox{}\smallskip\\
\textit{(1)} $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\
-\mbox{\hspace{17mm}}\textit{(2)} $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
-\mbox{\hspace{17mm}}\textit{(3)} $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$
+\mbox{\hspace{21.5mm}}\textit{(2)} $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
+\mbox{\hspace{21.5mm}}\textit{(3)} $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$
%\begin{tabular}{ll}
%\textit{(1)} & $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\
%\textit{(2)} & $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
-%\textit{(3)} & $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$.
+%\textit{(3)} & $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ \;provided\; $\textit{nullable}(r^\downarrow)$.
%\end{tabular}
\end{lemma}
@@ -1030,7 +1046,7 @@
\begin{lemma}\label{flex}
 If @{text "lexer r s = Some v"} \;then\; @{text "v = "}$\,\textit{flex}\,r\,id\,s\,
-  (\mkeps (r\backslash s))$.
+  (\textit{mkeps}\,(r\backslash s))$.
\end{lemma}
\noindent
@@ -1111,7 +1127,7 @@
regular expression (underlined with $r$). These copies will
spawn new copies in later derivative steps and they in turn even more copies. This
destroys any hope of taming the size of the derivatives. But the
- second copy of $r$ in \eqref{derivex} will never contribute to a
+ second copy of $r$ in~\eqref{derivex} will never contribute to a
value, because POSIX lexing will always prefer matching a string
with the first copy. So it could be safely removed without affecting the correctness of the algorithm.
The issue with the simple-minded
@@ -1130,7 +1146,7 @@
de-nest, or spill out, @{text ALTs} as follows
%
\[
- @{term "ALTs bs\<^sub>1 (((ALTs bs\<^sub>2 rs\<^sub>2)) # rs\<^sub>1)"}
+ @{text "ALTs bs\<^sub>1 ((ALTs bs\<^sub>2 rs\<^sub>2) :: rs\<^sub>1)"}
\quad\xrightarrow{bsimp}\quad
@{text "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"}
\]
@@ -1171,8 +1187,8 @@
\begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
@{thm (lhs) distinctWith.simps(1)} & $\dn$ & @{thm (rhs) distinctWith.simps(1)}\\
@{thm (lhs) distinctWith.simps(2)} & $\dn$ &
- @{text "if (\<exists> y \<in> acc. eq x y)"} \\
- & & @{text "then distinctWith xs eq acc"}\\
+ @{text "if (\<exists> y \<in> acc. eq x y)"}
+ @{text "then distinctWith xs eq acc"}\\
& & @{text "else x :: distinctWith xs eq ({x} \<union> acc)"}
\end{tabular}
\end{center}
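+
+ In Scala, @{text distinctWith} can be rendered as the following generic
+ function; this is only a sketch, with @{text eq} later instantiated to the
+ biequality on bitcoded regular expressions given shortly.
+
+ \begin{verbatim}
+ // drop any element that is already "equal" (according to eq)
+ // to something in the accumulator acc
+ def distinctWith[A](xs: List[A], eq: (A, A) => Boolean, acc: Set[A]): List[A] =
+   xs match {
+     case Nil => Nil
+     case x :: rest =>
+       if (acc.exists(y => eq(x, y))) distinctWith(rest, eq, acc)
+       else x :: distinctWith(rest, eq, acc + x)
+   }
+
+ // example: keeping integers that are distinct up to their sign
+ //   distinctWith(List(1, -1, 2, 2, -3), (a, b) => math.abs(a) == math.abs(b), Set())
+ //     ==> List(1, 2, -3)
+ \end{verbatim}
+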
@@ -1190,17 +1206,23 @@
bitcoded regular expressions to @{text bool}:
%
\begin{center}
- \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}}
+ \begin{tabular}{@ {}cc@ {}}
+ \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {\hspace{1mm}}}
@{thm (lhs) eq1.simps(1)} & $\dn$ & @{thm (rhs) eq1.simps(1)}\\
@{thm (lhs) eq1.simps(2)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(2)[of DUMMY DUMMY]}\\
+ @{thm (lhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]}\\
+ @{thm (lhs) eq1.simps(5)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(5)[of DUMMY DUMMY]}\\
+ \mbox{}
+ \end{tabular}
+ &
+ \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}}
@{thm (lhs) eq1.simps(3)[of DUMMY "c" DUMMY "d"]} & $\dn$ & @{thm (rhs) eq1.simps(3)[of DUMMY "c" DUMMY "d"]}\\
@{thm (lhs) eq1.simps(4)[of DUMMY "r\<^sub>1\<^sub>1" "r\<^sub>1\<^sub>2" DUMMY "r\<^sub>2\<^sub>1" "r\<^sub>2\<^sub>2"]} & $\dn$ &
@{thm (rhs) eq1.simps(4)[of DUMMY "r\<^sub>1\<^sub>1" "r\<^sub>1\<^sub>2" DUMMY "r\<^sub>2\<^sub>1" "r\<^sub>2\<^sub>2"]}\\
- @{thm (lhs) eq1.simps(5)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(5)[of DUMMY DUMMY]}\\
- @{thm (lhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]} & $\dn$ &
- @{thm (rhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]}\\
- @{thm (lhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]}\\
@{thm (lhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]}\\
+ @{thm (lhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]} & $\dn$ & \\
+ \multicolumn{3}{r}{@{thm (rhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]}}
+ \end{tabular}
\end{tabular}
\end{center}
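+
+ Transcribed into Scala (with the datatype of annotated regular expressions
+ repeated from the earlier sketch so that the snippet stands alone, and bounded
+ repetitions again omitted), the biequality might look as follows:
+
+ \begin{verbatim}
+ abstract class Bit; case object Z extends Bit; case object S extends Bit
+ abstract class ARexp
+ case object AZERO extends ARexp
+ case class AONE(bs: List[Bit]) extends ARexp
+ case class ACHAR(bs: List[Bit], c: Char) extends ARexp
+ case class AALTS(bs: List[Bit], rs: List[ARexp]) extends ARexp
+ case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
+ case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp
+
+ // structural equality that ignores all bitsequence annotations
+ def eq1(r1: ARexp, r2: ARexp): Boolean = (r1, r2) match {
+   case (AZERO, AZERO) => true
+   case (AONE(_), AONE(_)) => true
+   case (ACHAR(_, c), ACHAR(_, d)) => c == d
+   case (ASEQ(_, r11, r12), ASEQ(_, r21, r22)) => eq1(r11, r21) && eq1(r12, r22)
+   case (AALTS(_, Nil), AALTS(_, Nil)) => true
+   case (AALTS(bs1, x :: xs), AALTS(bs2, y :: ys)) =>
+     eq1(x, y) && eq1(AALTS(bs1, xs), AALTS(bs2, ys))
+   case (ASTAR(_, a), ASTAR(_, b)) => eq1(a, b)
+   case _ => false
+ }
+ \end{verbatim}
+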
@@ -1212,21 +1234,22 @@
Our simplification function depends on three more helper functions, one is called
@{text flts} and analyses lists of regular expressions coming from alternatives.
- It is defined as follows:
+ It is defined by four clauses as follows:
\begin{center}
- \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
- \multicolumn{3}{@ {}c}{@{thm (lhs) flts.simps(1)} $\dn$ @{thm (rhs) flts.simps(1)} \qquad\qquad\qquad\qquad
- @{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)}}\smallskip\\
- @{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\
- \multicolumn{3}{@ {}c}{@{text "flts (r :: rs)"} $\dn$ @{text "r :: flts rs"}
- \hspace{5mm}(otherwise)}
+ \begin{tabular}{c}
+ @{thm (lhs) flts.simps(1)} $\dn$ @{thm (rhs) flts.simps(1)} \qquad
+ @{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)} \qquad
+ @{text "flts ((ALTs bs' rs') :: rs"}
+ %@{ thm (lhs) flts.simps(3)[of "bs'" "rs'"]}
+ $\dn$ @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\smallskip\\
+ @{text "flts (r :: rs)"} $\dn$ @{text "r :: flts rs"}\hspace{5mm}(otherwise)
\end{tabular}
\end{center}
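+
+ A Scala rendering of the four clauses is given below; the datatype is repeated
+ from the earlier sketch and the auxiliary function @{text fuse}, which attaches
+ a bitsequence to the front of an annotated regular expression, is included so
+ that the snippet compiles on its own. As before, the details are only
+ illustrative.
+
+ \begin{verbatim}
+ abstract class Bit; case object Z extends Bit; case object S extends Bit
+ abstract class ARexp
+ case object AZERO extends ARexp
+ case class AONE(bs: List[Bit]) extends ARexp
+ case class ACHAR(bs: List[Bit], c: Char) extends ARexp
+ case class AALTS(bs: List[Bit], rs: List[ARexp]) extends ARexp
+ case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
+ case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp
+
+ def fuse(bs: List[Bit], r: ARexp): ARexp = r match {
+   case AZERO => AZERO
+   case AONE(cs) => AONE(bs ++ cs)
+   case ACHAR(cs, c) => ACHAR(bs ++ cs, c)
+   case AALTS(cs, rs) => AALTS(bs ++ cs, rs)
+   case ASEQ(cs, r1, r2) => ASEQ(bs ++ cs, r1, r2)
+   case ASTAR(cs, r1) => ASTAR(bs ++ cs, r1)
+ }
+
+ // flatten a list of alternatives: drop AZEROs and spill out nested AALTS
+ def flts(rs: List[ARexp]): List[ARexp] = rs match {
+   case Nil => Nil
+   case AZERO :: rest => flts(rest)
+   case AALTS(bs, rs1) :: rest => rs1.map(fuse(bs, _)) ++ flts(rest)
+   case r :: rest => r :: flts(rest)
+ }
+
+ // flts(List(AZERO, AALTS(List(Z), List(AONE(Nil))), ACHAR(Nil, 'a')))
+ //   ==> List(AONE(List(Z)), ACHAR(Nil, 'a'))
+ \end{verbatim}
+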
\noindent
The second clause of @{text flts} removes all instances of @{text ZERO} in alternatives and
- the third ``de-nests'' alternatives (but retaining the
+ the third ``de-nests'' alternatives (but retains the
bitsequence @{text "bs'"} accumulated in the inner alternative). There are
some corner cases to be considered when the resulting list inside an alternative is
empty or a singleton list. We take care of those cases in the
@@ -1266,7 +1289,9 @@
\noindent
We believe our recursive function @{term bsimp} simplifies bitcoded regular
- expressions as intended by Sulzmann and Lu. There is no point in applying the
+  expressions as intended by Sulzmann and Lu, with the small addition of removing ``useless'' @{text ONE}s
+ in sequence regular
+ expressions. There is no point in applying the
@{text bsimp} function repeatedly (like the simplification in their paper which needs to be
applied until a fixpoint is reached) because we can show that @{term bsimp} is idempotent,
that is
@@ -1304,8 +1329,8 @@
\begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
$\textit{blexer}^+\;r\,s$ & $\dn$ &
$\textit{let}\;r_{der} = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\
- & & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\
- & & $\;\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$ $\;\;\textit{else}\;\textit{None}$
+ & & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$
+ $\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$ $\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
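+
+ The only difference to @{text blexer} is thus the derivative phase, which now
+ interleaves a simplification step after each character. Generically, and only
+ as a sketch with @{text bder} and @{text bsimp} passed as parameters since
+ their definitions are not repeated here, this phase is:
+
+ \begin{verbatim}
+ // derivatives with an interleaved simplification step
+ def bdersSimp[R](s: List[Char], r: R, bder: (Char, R) => R, bsimp: R => R): R =
+   s.foldLeft(r)((r, c) => bsimp(bder(c, r)))
+ \end{verbatim}
+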
@@ -1343,7 +1368,7 @@
the original POSIX value, there is no hope of appealing to @{text retrieve} in the
correctness argument for @{term blexer_simp}.
- We found it more helpful to introduce the rewriting systems shown in
+ For our proof we found it more helpful to introduce the rewriting systems shown in
Fig~\ref{SimpRewrites}. The idea is to generate
simplified regular expressions in small steps (unlike the @{text bsimp}-function which
does the same in a big step), and show that each of
@@ -1511,7 +1536,7 @@
bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands
for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
-We reason similarly for @{text STAR}.\smallskip
+We reason similarly for @{text STAR} and @{text NT}.\smallskip
Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
@@ -1536,12 +1561,7 @@
Essentially it matches the string with the longer @{text "Right"}-alternative in the
first sequence (and then the `rest' with the empty regular expression @{const ONE} from the second sequence).
If we add the simplification above, then we obtain the following value
-%
-\[
@{term "Seq (Left (Char a)) (Left (Char b))"}
-\]
-
-\noindent
where the @{text "Left"}-alternatives get priority. However, this violates
the POSIX rules and we have not been able to
reconcile this problem. Therefore we leave better bounds for future work.%!\\[-6.5mm]
@@ -1550,16 +1570,16 @@
of his partial derivatives~\cite{Antimirov95}, Brzozowski gave a bound
on the \emph{number} of derivatives, provided they are quotient via
-ACI rules \cite{Brzozowski1964}. Brozozowski's result is crucial when one
+ACI rules \cite{Brzozowski1964}. Brzozowski's result is crucial when one
-uses derivatives for obtaining an automaton (it essentially bounds
+uses his derivatives for obtaining a DFA (it essentially bounds
the number of states). However, this result does \emph{not}
transfer to our setting where we are interested in the \emph{size} of the
-derivatives. For example it is not true for our derivatives that the
-set of of derivatives $r \backslash s$ for a given $r$ and all strings
+derivatives. For example it is \emph{not} true for our derivatives that the
+set of derivatives $r \backslash s$ for a given $r$ and all strings
$s$ is finite (even when simplified). This is because for our annotated regular expressions
the bitcode annotation is dependent on the number of iterations that
-are needed for STAR-regular expressions. This is not a problem for us: Since we intend to do lexing
+are needed for @{text STAR}-regular expressions. This is not a problem for us: Since we intend to do lexing
by calculating (as fast as possible) derivatives, the bound on the size
-of the derivatives is important, not the number of derivatives.
+of the derivatives is important, not their number. % of derivatives.
*}
@@ -1577,12 +1597,18 @@
introduce our own specification about what POSIX values are,
because the informal definition given by Sulzmann and Lu did not
stand up to formal proof. Also for the second algorithm we needed
- to introduce our own definitions and proof ideas in order to establish the
- correctness. Our interest in the second algorithm
- lies in the fact that by using bitcoded regular expressions and an aggressive
- simplification method there is a chance that the derivatives
- can be kept universally small (we established in this paper that
- for a given $r$ they can be kept finitely bounded for all strings).
+ to introduce our own definitions and proof ideas in order to
+ establish the correctness. Our interest in the second algorithm
+ lies in the fact that by using bitcoded regular expressions and an
+ aggressive simplification method there is a chance that the
+ derivatives can be kept universally small (we established in this
+ paper that for a given $r$ they can be kept finitely bounded for
+ all strings). Our formalisation is approximately 7500 lines of
+ Isabelle code. A little more than half of this code concerns the finiteness bound
+ obtained in Section 5. This slight ``bloat'' in the latter part is because
+ we had to introduce an intermediate datatype for annotated regular expressions and repeat many
+ definitions for this intermediate datatype. But overall this made our
+ formalisation work smoother.
%This is important if one is after
%an efficient POSIX lexing algorithm based on derivatives.
@@ -1597,46 +1623,46 @@
%point-of-view it is really important to have the formal proofs of
%the corresponding properties at hand.
- With the results reported here, we can of course only make a claim about the correctness
- of the algorithm and the sizes of the
+ With the results reported here, we can of course only make a claim
+ about the correctness of the algorithm and the sizes of the
derivatives, not about the efficiency or runtime of our version of
- Sulzman and Lu's algorithm. But we found the size is an important
+ Sulzmann and Lu's algorithm. But we found the size is an important
first indicator about efficiency: clearly if the derivatives can
grow to arbitrarily big sizes and the algorithm needs to traverse
the derivatives possibly several times, then the algorithm will be
- slow---excruciatingly slow that is. Other works seems to make
- stronger claims, but during our work we have developed a healthy
- suspicion when for example experimental data is used to back up
- efficiency claims. For instance Sulzmann and Lu write about their
- equivalent of @{term blexer_simp} \textit{``...we can incrementally
- compute bitcoded parse trees in linear time in the size of the
- input''} \cite[Page 14]{Sulzmann2014}. Given the growth of the
- derivatives in some cases even after aggressive simplification,
- this is a hard to believe claim. A similar claim about a
- theoretical runtime of @{text "O(n\<^sup>2)"} is made for the
- Verbatim lexer, which calculates tokens according to POSIX
+ slow---excruciatingly slow that is. Other works seem to make
+ stronger claims, but during our formalisation work we have
+ developed a healthy suspicion when for example experimental data is
+ used to back up efficiency claims. For instance Sulzmann and Lu
+ write about their equivalent of @{term blexer_simp} \textit{``...we
+ can incrementally compute bitcoded parse trees in linear time in
+ the size of the input''} \cite[Page 14]{Sulzmann2014}. Given the
+ growth of the derivatives in some cases even after aggressive
+ simplification, this is a hard-to-believe claim. A similar claim
+ about a theoretical runtime of @{text "O(n\<^sup>2)"} for one
+ specific list of regular expressions is made for the Verbatim
+ lexer, which calculates tokens according to POSIX
rules~\cite{verbatim}. For this, Verbatim uses Brzozowski's
- derivatives like in our work. The authors write: \textit{``The
- results of our empirical tests [..] confirm that Verbatim has
- @{text "O(n\<^sup>2)"} time complexity.''}
+ derivatives like in our work. About their empirical data, the authors write:
+ \textit{``The results of our empirical tests [..] confirm that
+ Verbatim has @{text "O(n\<^sup>2)"} time complexity.''}
\cite[Section~VII]{verbatim}. While their correctness proof for
Verbatim is formalised in Coq, the claim about the runtime
- complexity is only supported by some emperical evidence obtained by
+ complexity is only supported by some empirical evidence obtained by
using the code extraction facilities of Coq. Given our observation
- with the ``growth problem'' of derivatives, we tried out their
- extracted OCaml code with the example \mbox{@{text "(a +
- aa)\<^sup>*"}} as a single lexing rule, and it took for us around 5
- minutes to tokenise a string of 40 $a$'s and that increased to
- approximately 19 minutes when the string is 50 $a$'s long. Taking
- into account that derivatives are not simplified in the Verbatim
- lexer, such numbers are not surprising. Clearly our result of
- having finite derivatives might sound rather weak in this context
- but we think such effeciency claims really require further
- scrutiny.
-
- The contribution of this paper is to make sure derivatives do not
- grow arbitrarily big (universially) In the example \mbox{@{text "(a
- + aa)\<^sup>*"}}, \emph{all} derivatives have a size of 17 or
+ with the ``growth problem'' of derivatives, this runtime bound
+ is unlikely to hold universally: indeed we tried out their extracted OCaml
+ code with the example \mbox{@{text "(a + aa)\<^sup>*"}} as a single
+ lexing rule, and it took us around 5 minutes to tokenise a
+ string of 40 $a$'s; this increased to approximately 19 minutes
+ when the string was 50 $a$'s long. Taking into account that
+ derivatives are not simplified in the Verbatim lexer, such numbers
+ are not surprising. Clearly our result of having finite
+ derivatives might sound rather weak in this context but we think
+ such efficiency claims really require further scrutiny. The
+ contribution of this paper is to make sure derivatives do not grow
+ arbitrarily big (universally). In the example \mbox{@{text "(a +
+ aa)\<^sup>*"}}, \emph{all} derivatives have a size of 17 or
less. The result is that lexing a string of, say, 50\,000 a's with
the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes
approximately 10 seconds with our Scala implementation of the
@@ -1663,6 +1689,23 @@
\mbox{@{text "a\<^bsup>{1000}{100}{5}\<^esup>"}}, the maximum size for
our derivatives is a moderate 14.
+ Let us also return to the example @{text
+ "a\<^bsup>{0}{4294967295}\<^esup>"} which until recently Rust
+ deemed acceptable. But this was due to a bug. It turns out that it took Rust
+ more than 11 minutes to generate an automaton for this regular
+ expression and then to determine that a string of just one(!)
+ @{text a} does \emph{not} match this regular expression. Therefore
+ it is probably a wise choice that in newer versions of Rust's
+ regular expression library such regular expressions are declared as
+ ``too big''. While this is clearly
+ a contrived example, the safety guarantees Rust wants to provide necessitate
+ this conservative approach.
+ However, with the derivatives and simplifications we have shown
+ in this paper, the example can be solved with ease:
+ it essentially only involves operations on
+ integers and our Scala implementation takes only a few seconds to
+ find out that this string, and even much larger strings, do not match.
+
Let us also compare our work to the verified Verbatim++ lexer where
the authors of the Verbatim lexer introduced a number of
improvements and optimisations, for example memoisation
@@ -1681,17 +1724,17 @@
this calculation (in the future) as fast as possible. What we can guaranty
with the presented work is that the maximum size of the derivatives
for \mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}$\,\cdot\,$@{term "STAR a"}} is never bigger than 9. This means our Scala
- implementation only needs a few seconds for this example and matching 50\,000 a's.
+ implementation again needs only a few seconds for this example when matching, say, 50\,000 a's.
%
%
%Possible ideas are
%zippers which have been used in the context of parsing of
%context-free grammars \cite{zipperparser}.
- \medskip
-
+ \\[-4mm] %\smallskip
+
\noindent
Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
- %\\[-10mm]
+ \\[-10mm]
%%\bibliographystyle{plain}