thys2/Paper/Paper.thy
changeset 420 b66a4305749c
parent 418 41a2a3b63853
child 423 b7199d6c672d
--- a/thys2/Paper/Paper.thy	Mon Feb 07 01:12:36 2022 +0000
+++ b/thys2/Paper/Paper.thy	Mon Feb 07 14:22:08 2022 +0000
@@ -23,6 +23,8 @@
 notation (latex output)
   der_syn ("_\\_" [79, 1000] 76) and
   bder_syn ("_\\_" [79, 1000] 76) and
+  bders ("_\\_" [79, 1000] 76) and
+  bders_simp ("_\\\<^sub>s\<^sub>i\<^sub>m\<^sub>p _" [79, 1000] 76) and
 
   ZERO ("\<^bold>0" 81) and 
   ONE ("\<^bold>1" 81) and 
@@ -38,6 +40,7 @@
   val.Seq ("Seq _ _" [79,79] 78) and
   val.Stars ("Stars _" [79] 78) and
 
+  Prf ("\<turnstile> _ : _" [75,75] 75) and  
   Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and
 
   flat ("|_|" [75] 74) and
@@ -83,24 +86,77 @@
 Brzozowski's derivatives \cite{Brzozowski1964} is that they are neatly
 expressible in any functional language, and easily definable and
 reasoned about in theorem provers---the definitions just consist of
-inductive datatypes and simple recursive functions. A mechanised
-correctness proof of Brzozowski's matcher in for example HOL4 has been
-mentioned by Owens and Slind~\cite{Owens2008}. Another one in
-Isabelle/HOL is part of the work by Krauss and Nipkow
-\cite{Krauss2011}.  And another one in Coq is given by Coquand and
-Siles \cite{Coquand2012}.
+inductive datatypes and simple recursive functions.  Derivatives of a
+regular expression, written @{term "der c r"}, give a simple solution
+to the problem of matching a string @{term s} with a regular
+expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
+succession) all the characters of the string matches the empty string,
+then @{term r} matches @{term s} (and {\em vice versa}).  We are aware
+of a mechanised correctness proof of Brzozowski's matcher in HOL4 by
+Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
+of the work by Krauss and Nipkow \cite{Krauss2011}.  And another one
+in Coq is given by Coquand and Siles \cite{Coquand2012}.
+
+There are two difficulties with derivative-based matchers and also
+lexers: First, Brzozowski's original matcher only generates a yes/no
+answer for whether a regular expression matches a string or not.
+Sulzmann and Lu~\cite{Sulzmann2014} overcome this difficulty by
+cleverly extending Brzozowski's matching algorithm to POSIX
+lexing. This extended version generates additional information on
+\emph{how} a regular expression matches a string. They achieve this by
+
+
 
 
-The notion of derivatives
-\cite{Brzozowski1964}, written @{term "der c r"}, of a regular
-expression give a simple solution to the problem of matching a string
-@{term s} with a regular expression @{term r}: if the derivative of
-@{term r} w.r.t.\ (in succession) all the characters of the string
-matches the empty string, then @{term r} matches @{term s} (and {\em
-vice versa}). The derivative has the property (which may almost be
-regarded as its specification) that, for every string @{term s} and
-regular expression @{term r} and character @{term c}, one has @{term
-"cs \<in> L(r)"} if and only if \mbox{@{term "s \<in> L(der c r)"}}.
+The second problem is that Brzozowski's derivatives can 
+grow to arbitrarily big sizes. For example if we start with the
+regular expression \mbox{@{text "(a + aa)\<^sup>*"}} and take
+successive derivatives according to the character $a$, we end up with
+a sequence of ever-growing derivatives like 
+
+\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
+\begin{center}
+\begin{tabular}{rll}
+$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
+& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+& $\ll$ & \ldots
+\end{tabular}
+\end{center}
+ 
+\noindent where after around 35 steps we run out of memory on a
+typical computer (we define the precise details of the derivative
+operation later).  Clearly, the notation involving $\ZERO$s and
+$\ONE$s already suggests simplification rules that can be applied to
+regular regular expressions, for example $\ZERO{}r \Rightarrow \ZERO$,
+$\ONE{}r \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r
+\Rightarrow r$. While such simple-minded reductions have been proved
+in our earlier work to preserve the correctness of Sulzmann and Lu's
+algorithm, they unfortunately do \emph{not} help with limiting the
+gowth of the derivatives shown above: yes, the growth is slowed, but the
+derivatives can still grow beyond any finite bound.
+
+
+
+Sulzmann and Lu introduce a
+bitcoded version of their lexing algorithm. They make some claims
+about the correctness and speed of this version, but do
+not provide any supporting proof arguments, not even
+``pencil-and-paper'' arguments. They wrote about their bitcoded
+``incremental parsing method'' (that is the algorithm to be studied in this
+section):
+
+\begin{quote}\it
+  ``Correctness Claim: We further claim that the incremental parsing
+  method [..] in combination with the simplification steps [..]
+  yields POSIX parse trees. We have tested this claim
+  extensively [..] but yet
+  have to work out all proof details.''
+\end{quote}  
+
+
+
 
 
 If a regular expression matches a string, then in general there is more
@@ -119,6 +175,14 @@
 longest match.
 
 
+The derivative has the property (which may almost be
+regarded as its specification) that, for every string @{term s} and
+regular expression @{term r} and character @{term c}, one has @{term
+"cs \<in> L(r)"} if and only if \mbox{@{term "s \<in> L(der c r)"}}.
+
+
+
+
 \begin{center}
 \begin{tabular}{cc}
   \begin{tabular}{r@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
@@ -188,6 +252,115 @@
 section {* Background *}
 
 text {*
+  In our Isabelle/HOL formalisation strings are lists of characters with
+  the empty string being represented by the empty list, written $[]$,
+  and list-cons being written as $\_\!\_\!::\!\_\!\_\,$; string
+  concatenation is $\_\!\_ \,@\, \_\!\_\,$. Often we use the usual
+  bracket notation for lists also for strings; for example a string
+  consisting of just a single character $c$ is written $[c]$.   
+  Our egular expressions are defined as usual as the elements of the following inductive
+  datatype:
+
+  \begin{center}
+  @{text "r :="}
+  @{const "ZERO"} $\mid$
+  @{const "ONE"} $\mid$
+  @{term "CH c"} $\mid$
+  @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
+  @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
+  @{term "STAR r"} 
+  \end{center}
+
+  \noindent where @{const ZERO} stands for the regular expression that does
+  not match any string, @{const ONE} for the regular expression that matches
+  only the empty string and @{term c} for matching a character literal. The
+  language of a regular expression, written $L$, is defined as usual
+  (see for example \cite{AusafDyckhoffUrban2016}).
+
+  Central to Brzozowski's regular expression matcher are two functions
+  called $\nullable$ and \emph{derivative}. The latter is written
+  $r\backslash c$ for the derivative of the regular expression $r$
+  w.r.t.~the character $c$. Both functions are defined by recursion over
+  regular expressions.  
+
+
+	\begin{center}
+		\begin{tabular}{lcl}
+			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
+			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
+			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
+			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
+			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
+			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
+		\end{tabular}
+	\end{center}
+
+  \noindent
+  The derivative function takes a regular expression, say $r$ and a
+  character, say $c$, as input and returns the derivative regular
+  expression.
+
+	\begin{center}
+		\begin{tabular}{lcl}
+			$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
+			$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
+			$d \backslash c$     & $\dn$ & 
+			$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
+			$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
+			$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \nullable(r_1)$\\
+			&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
+			&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
+			$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
+		\end{tabular}
+	\end{center}
+
+
+  Sulzmann and Lu presented two lexing algorithms in their paper from 2014
+  \cite{Sulzmann2014}. This first algorithm consists of two phases: first a
+  matching phase (which is Brzozowski's algorithm) and then a value
+  construction phase. The values encode \emph{how} a regular expression
+  matches a string. \emph{Values} are defined as the inductive datatype
+
+  \begin{center}
+  @{text "v :="}
+  @{const "Void"} $\mid$
+  @{term "val.Char c"} $\mid$
+  @{term "Left v"} $\mid$
+  @{term "Right v"} $\mid$
+  @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ 
+  @{term "Stars vs"} 
+  \end{center}  
+
+  \noindent where we use @{term vs} to stand for a list of
+  values.
+
+
+
+
+  \noindent Sulzmann and Lu also define inductively an inhabitation relation
+  that associates values to regular expressions:
+
+  \begin{center}
+  \begin{tabular}{c}
+  \\[-8mm]
+  @{thm[mode=Axiom] Prf.intros(4)} \qquad
+  @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[4mm]
+  @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad 
+  @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[4mm]
+  @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\\[4mm]
+  %%@ { t h m[mode=Axiom] Prf.intros(6)[of "r"]} \qquad  
+  @{thm[mode=Rule] Prf.intros(6)[of "r" "vs"]}
+  \end{tabular}
+  \end{center}
+
+  \noindent Note that no values are associated with the regular expression
+  @{term ZERO}. It is routine to establish how values ``inhabiting'' a regular
+  expression correspond to the language of a regular expression, namely
+
+  \begin{proposition}
+  @{thm L_flat_Prf}
+  \end{proposition}
+
   Sulzmann-Lu algorithm with inj. State that POSIX rules.
   metion slg is correct.
 
@@ -640,6 +813,17 @@
 
 text {*
 
+     Derivatives as calculated by Brzozowski’s method are usually more
+     complex regular expressions than the initial one; the result is
+     that the derivative-based matching and lexing algorithms are
+     often abysmally slow.
+
+     However, as Sulzmann and
+     Lu wrote, various optimisations are possible, such as the
+     simplifications of 0 + r,r + 0,1 · r and r · 1 to r. These
+     simplifications can speed up the algorithms considerably.
+
+
      \begin{lemma}
      @{thm[mode=IfThen] bnullable0(1)[of "r\<^sub>1" "r\<^sub>2"]}
      \end{lemma}