--- a/thys2/Paper/Paper.thy Mon Feb 07 01:12:36 2022 +0000
+++ b/thys2/Paper/Paper.thy Mon Feb 07 14:22:08 2022 +0000
@@ -23,6 +23,8 @@
notation (latex output)
der_syn ("_\\_" [79, 1000] 76) and
bder_syn ("_\\_" [79, 1000] 76) and
+ bders ("_\\_" [79, 1000] 76) and
+ bders_simp ("_\\\<^sub>s\<^sub>i\<^sub>m\<^sub>p _" [79, 1000] 76) and
ZERO ("\<^bold>0" 81) and
ONE ("\<^bold>1" 81) and
@@ -38,6 +40,7 @@
val.Seq ("Seq _ _" [79,79] 78) and
val.Stars ("Stars _" [79] 78) and
+ Prf ("\<turnstile> _ : _" [75,75] 75) and
Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and
flat ("|_|" [75] 74) and
@@ -83,24 +86,77 @@
Brzozowski's derivatives \cite{Brzozowski1964} is that they are neatly
expressible in any functional language, and easily definable and
reasoned about in theorem provers---the definitions just consist of
-inductive datatypes and simple recursive functions. A mechanised
-correctness proof of Brzozowski's matcher in for example HOL4 has been
-mentioned by Owens and Slind~\cite{Owens2008}. Another one in
-Isabelle/HOL is part of the work by Krauss and Nipkow
-\cite{Krauss2011}. And another one in Coq is given by Coquand and
-Siles \cite{Coquand2012}.
+inductive datatypes and simple recursive functions. Derivatives of a
+regular expression, written @{term "der c r"}, give a simple solution
+to the problem of matching a string @{term s} with a regular
+expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
+succession) all the characters of the string matches the empty string,
+then @{term r} matches @{term s} (and {\em vice versa}). We are aware
+of a mechanised correctness proof of Brzozowski's matcher in HOL4 by
+Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
+of the work by Krauss and Nipkow \cite{Krauss2011}. And another one
+in Coq is given by Coquand and Siles \cite{Coquand2012}.
+
+There are two difficulties with derivative-based matchers and also
+lexers: First, Brzozowski's original matcher only generates a yes/no
+answer for whether a regular expression matches a string or not.
+Sulzmann and Lu~\cite{Sulzmann2014} overcome this difficulty by
+cleverly extending Brzozowski's matching algorithm to POSIX
+lexing. This extended version generates additional information on
+\emph{how} a regular expression matches a string. They achieve this by
+
+
-The notion of derivatives
-\cite{Brzozowski1964}, written @{term "der c r"}, of a regular
-expression give a simple solution to the problem of matching a string
-@{term s} with a regular expression @{term r}: if the derivative of
-@{term r} w.r.t.\ (in succession) all the characters of the string
-matches the empty string, then @{term r} matches @{term s} (and {\em
-vice versa}). The derivative has the property (which may almost be
-regarded as its specification) that, for every string @{term s} and
-regular expression @{term r} and character @{term c}, one has @{term
-"cs \<in> L(r)"} if and only if \mbox{@{term "s \<in> L(der c r)"}}.
+The second problem is that Brzozowski's derivatives can
+grow to arbitrarily big sizes. For example if we start with the
+regular expression \mbox{@{text "(a + aa)\<^sup>*"}} and take
+successive derivatives according to the character $a$, we end up with
+a sequence of ever-growing derivatives like
+
+\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
+\begin{center}
+\begin{tabular}{rll}
+$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
+& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+& $\ll$ & \ldots
+\end{tabular}
+\end{center}
+
+\noindent where after around 35 steps we run out of memory on a
+typical computer (we define the precise details of the derivative
+operation later). Clearly, the notation involving $\ZERO$s and
+$\ONE$s already suggests simplification rules that can be applied to
+regular regular expressions, for example $\ZERO{}r \Rightarrow \ZERO$,
+$\ONE{}r \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r
+\Rightarrow r$. While such simple-minded reductions have been proved
+in our earlier work to preserve the correctness of Sulzmann and Lu's
+algorithm, they unfortunately do \emph{not} help with limiting the
+gowth of the derivatives shown above: yes, the growth is slowed, but the
+derivatives can still grow beyond any finite bound.
+
+
+
+Sulzmann and Lu introduce a
+bitcoded version of their lexing algorithm. They make some claims
+about the correctness and speed of this version, but do
+not provide any supporting proof arguments, not even
+``pencil-and-paper'' arguments. They wrote about their bitcoded
+``incremental parsing method'' (that is the algorithm to be studied in this
+section):
+
+\begin{quote}\it
+ ``Correctness Claim: We further claim that the incremental parsing
+ method [..] in combination with the simplification steps [..]
+ yields POSIX parse trees. We have tested this claim
+ extensively [..] but yet
+ have to work out all proof details.''
+\end{quote}
+
+
+
If a regular expression matches a string, then in general there is more
@@ -119,6 +175,14 @@
longest match.
+The derivative has the property (which may almost be
+regarded as its specification) that, for every string @{term s} and
+regular expression @{term r} and character @{term c}, one has @{term
+"cs \<in> L(r)"} if and only if \mbox{@{term "s \<in> L(der c r)"}}.
+
+
+
+
\begin{center}
\begin{tabular}{cc}
\begin{tabular}{r@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
@@ -188,6 +252,115 @@
section {* Background *}
text {*
+ In our Isabelle/HOL formalisation strings are lists of characters with
+ the empty string being represented by the empty list, written $[]$,
+ and list-cons being written as $\_\!\_\!::\!\_\!\_\,$; string
+ concatenation is $\_\!\_ \,@\, \_\!\_\,$. Often we use the usual
+ bracket notation for lists also for strings; for example a string
+ consisting of just a single character $c$ is written $[c]$.
+ Our egular expressions are defined as usual as the elements of the following inductive
+ datatype:
+
+ \begin{center}
+ @{text "r :="}
+ @{const "ZERO"} $\mid$
+ @{const "ONE"} $\mid$
+ @{term "CH c"} $\mid$
+ @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
+ @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
+ @{term "STAR r"}
+ \end{center}
+
+ \noindent where @{const ZERO} stands for the regular expression that does
+ not match any string, @{const ONE} for the regular expression that matches
+ only the empty string and @{term c} for matching a character literal. The
+ language of a regular expression, written $L$, is defined as usual
+ (see for example \cite{AusafDyckhoffUrban2016}).
+
+ Central to Brzozowski's regular expression matcher are two functions
+ called $\nullable$ and \emph{derivative}. The latter is written
+ $r\backslash c$ for the derivative of the regular expression $r$
+ w.r.t.~the character $c$. Both functions are defined by recursion over
+ regular expressions.
+
+
+ \begin{center}
+ \begin{tabular}{lcl}
+ $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
+ $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
+ $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
+ $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
+ $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
+ $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
+ \end{tabular}
+ \end{center}
+
+ \noindent
+ The derivative function takes a regular expression, say $r$ and a
+ character, say $c$, as input and returns the derivative regular
+ expression.
+
+ \begin{center}
+ \begin{tabular}{lcl}
+ $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
+ $\ONE \backslash c$ & $\dn$ & $\ZERO$\\
+ $d \backslash c$ & $\dn$ &
+ $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
+ $(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
+ $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \nullable(r_1)$\\
+ & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
+ & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
+ $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
+ \end{tabular}
+ \end{center}
+
+
+ Sulzmann and Lu presented two lexing algorithms in their paper from 2014
+ \cite{Sulzmann2014}. This first algorithm consists of two phases: first a
+ matching phase (which is Brzozowski's algorithm) and then a value
+ construction phase. The values encode \emph{how} a regular expression
+ matches a string. \emph{Values} are defined as the inductive datatype
+
+ \begin{center}
+ @{text "v :="}
+ @{const "Void"} $\mid$
+ @{term "val.Char c"} $\mid$
+ @{term "Left v"} $\mid$
+ @{term "Right v"} $\mid$
+ @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$
+ @{term "Stars vs"}
+ \end{center}
+
+ \noindent where we use @{term vs} to stand for a list of
+ values.
+
+
+
+
+ \noindent Sulzmann and Lu also define inductively an inhabitation relation
+ that associates values to regular expressions:
+
+ \begin{center}
+ \begin{tabular}{c}
+ \\[-8mm]
+ @{thm[mode=Axiom] Prf.intros(4)} \qquad
+ @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[4mm]
+ @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad
+ @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[4mm]
+ @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\\[4mm]
+ %%@ { t h m[mode=Axiom] Prf.intros(6)[of "r"]} \qquad
+ @{thm[mode=Rule] Prf.intros(6)[of "r" "vs"]}
+ \end{tabular}
+ \end{center}
+
+ \noindent Note that no values are associated with the regular expression
+ @{term ZERO}. It is routine to establish how values ``inhabiting'' a regular
+ expression correspond to the language of a regular expression, namely
+
+ \begin{proposition}
+ @{thm L_flat_Prf}
+ \end{proposition}
+
Sulzmann-Lu algorithm with inj. State that POSIX rules.
metion slg is correct.
@@ -640,6 +813,17 @@
text {*
+ Derivatives as calculated by Brzozowski’s method are usually more
+ complex regular expressions than the initial one; the result is
+ that the derivative-based matching and lexing algorithms are
+ often abysmally slow.
+
+ However, as Sulzmann and
+ Lu wrote, various optimisations are possible, such as the
+ simplifications of 0 + r,r + 0,1 · r and r · 1 to r. These
+ simplifications can speed up the algorithms considerably.
+
+
\begin{lemma}
@{thm[mode=IfThen] bnullable0(1)[of "r\<^sub>1" "r\<^sub>2"]}
\end{lemma}