diff -r 6de6bc551a8b -r b66a4305749c thys2/Paper/Paper.thy --- a/thys2/Paper/Paper.thy Mon Feb 07 01:12:36 2022 +0000 +++ b/thys2/Paper/Paper.thy Mon Feb 07 14:22:08 2022 +0000 @@ -23,6 +23,8 @@ notation (latex output) der_syn ("_\\_" [79, 1000] 76) and bder_syn ("_\\_" [79, 1000] 76) and + bders ("_\\_" [79, 1000] 76) and + bders_simp ("_\\\<^sub>s\<^sub>i\<^sub>m\<^sub>p _" [79, 1000] 76) and ZERO ("\<^bold>0" 81) and ONE ("\<^bold>1" 81) and @@ -38,6 +40,7 @@ val.Seq ("Seq _ _" [79,79] 78) and val.Stars ("Stars _" [79] 78) and + Prf ("\ _ : _" [75,75] 75) and Posix ("'(_, _') \ _" [63,75,75] 75) and flat ("|_|" [75] 74) and @@ -83,24 +86,77 @@ Brzozowski's derivatives \cite{Brzozowski1964} is that they are neatly expressible in any functional language, and easily definable and reasoned about in theorem provers---the definitions just consist of -inductive datatypes and simple recursive functions. A mechanised -correctness proof of Brzozowski's matcher in for example HOL4 has been -mentioned by Owens and Slind~\cite{Owens2008}. Another one in -Isabelle/HOL is part of the work by Krauss and Nipkow -\cite{Krauss2011}. And another one in Coq is given by Coquand and -Siles \cite{Coquand2012}. +inductive datatypes and simple recursive functions. Derivatives of a +regular expression, written @{term "der c r"}, give a simple solution +to the problem of matching a string @{term s} with a regular +expression @{term r}: if the derivative of @{term r} w.r.t.\ (in +succession) all the characters of the string matches the empty string, +then @{term r} matches @{term s} (and {\em vice versa}). We are aware +of a mechanised correctness proof of Brzozowski's matcher in HOL4 by +Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part +of the work by Krauss and Nipkow \cite{Krauss2011}. And another one +in Coq is given by Coquand and Siles \cite{Coquand2012}. + +There are two difficulties with derivative-based matchers and also +lexers: First, Brzozowski's original matcher only generates a yes/no +answer for whether a regular expression matches a string or not. +Sulzmann and Lu~\cite{Sulzmann2014} overcome this difficulty by +cleverly extending Brzozowski's matching algorithm to POSIX +lexing. This extended version generates additional information on +\emph{how} a regular expression matches a string. They achieve this by + + -The notion of derivatives -\cite{Brzozowski1964}, written @{term "der c r"}, of a regular -expression give a simple solution to the problem of matching a string -@{term s} with a regular expression @{term r}: if the derivative of -@{term r} w.r.t.\ (in succession) all the characters of the string -matches the empty string, then @{term r} matches @{term s} (and {\em -vice versa}). The derivative has the property (which may almost be -regarded as its specification) that, for every string @{term s} and -regular expression @{term r} and character @{term c}, one has @{term -"cs \ L(r)"} if and only if \mbox{@{term "s \ L(der c r)"}}. +The second problem is that Brzozowski's derivatives can +grow to arbitrarily big sizes. For example if we start with the +regular expression \mbox{@{text "(a + aa)\<^sup>*"}} and take +successive derivatives according to the character $a$, we end up with +a sequence of ever-growing derivatives like + +\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}} +\begin{center} +\begin{tabular}{rll} +$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\ +& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ +& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\ +& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ +& $\ll$ & \ldots +\end{tabular} +\end{center} + +\noindent where after around 35 steps we run out of memory on a +typical computer (we define the precise details of the derivative +operation later). Clearly, the notation involving $\ZERO$s and +$\ONE$s already suggests simplification rules that can be applied to +regular regular expressions, for example $\ZERO{}r \Rightarrow \ZERO$, +$\ONE{}r \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r +\Rightarrow r$. While such simple-minded reductions have been proved +in our earlier work to preserve the correctness of Sulzmann and Lu's +algorithm, they unfortunately do \emph{not} help with limiting the +gowth of the derivatives shown above: yes, the growth is slowed, but the +derivatives can still grow beyond any finite bound. + + + +Sulzmann and Lu introduce a +bitcoded version of their lexing algorithm. They make some claims +about the correctness and speed of this version, but do +not provide any supporting proof arguments, not even +``pencil-and-paper'' arguments. They wrote about their bitcoded +``incremental parsing method'' (that is the algorithm to be studied in this +section): + +\begin{quote}\it + ``Correctness Claim: We further claim that the incremental parsing + method [..] in combination with the simplification steps [..] + yields POSIX parse trees. We have tested this claim + extensively [..] but yet + have to work out all proof details.'' +\end{quote} + + + If a regular expression matches a string, then in general there is more @@ -119,6 +175,14 @@ longest match. +The derivative has the property (which may almost be +regarded as its specification) that, for every string @{term s} and +regular expression @{term r} and character @{term c}, one has @{term +"cs \ L(r)"} if and only if \mbox{@{term "s \ L(der c r)"}}. + + + + \begin{center} \begin{tabular}{cc} \begin{tabular}{r@ {\hspace{2mm}}c@ {\hspace{2mm}}l} @@ -188,6 +252,115 @@ section {* Background *} text {* + In our Isabelle/HOL formalisation strings are lists of characters with + the empty string being represented by the empty list, written $[]$, + and list-cons being written as $\_\!\_\!::\!\_\!\_\,$; string + concatenation is $\_\!\_ \,@\, \_\!\_\,$. Often we use the usual + bracket notation for lists also for strings; for example a string + consisting of just a single character $c$ is written $[c]$. + Our egular expressions are defined as usual as the elements of the following inductive + datatype: + + \begin{center} + @{text "r :="} + @{const "ZERO"} $\mid$ + @{const "ONE"} $\mid$ + @{term "CH c"} $\mid$ + @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$ + @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$ + @{term "STAR r"} + \end{center} + + \noindent where @{const ZERO} stands for the regular expression that does + not match any string, @{const ONE} for the regular expression that matches + only the empty string and @{term c} for matching a character literal. The + language of a regular expression, written $L$, is defined as usual + (see for example \cite{AusafDyckhoffUrban2016}). + + Central to Brzozowski's regular expression matcher are two functions + called $\nullable$ and \emph{derivative}. The latter is written + $r\backslash c$ for the derivative of the regular expression $r$ + w.r.t.~the character $c$. Both functions are defined by recursion over + regular expressions. + + + \begin{center} + \begin{tabular}{lcl} + $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ + $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ + $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ + $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ + $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ + $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ + \end{tabular} + \end{center} + + \noindent + The derivative function takes a regular expression, say $r$ and a + character, say $c$, as input and returns the derivative regular + expression. + + \begin{center} + \begin{tabular}{lcl} + $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ + $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ + $d \backslash c$ & $\dn$ & + $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ + $(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ + $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \nullable(r_1)$\\ + & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ + & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ + $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ + \end{tabular} + \end{center} + + + Sulzmann and Lu presented two lexing algorithms in their paper from 2014 + \cite{Sulzmann2014}. This first algorithm consists of two phases: first a + matching phase (which is Brzozowski's algorithm) and then a value + construction phase. The values encode \emph{how} a regular expression + matches a string. \emph{Values} are defined as the inductive datatype + + \begin{center} + @{text "v :="} + @{const "Void"} $\mid$ + @{term "val.Char c"} $\mid$ + @{term "Left v"} $\mid$ + @{term "Right v"} $\mid$ + @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ + @{term "Stars vs"} + \end{center} + + \noindent where we use @{term vs} to stand for a list of + values. + + + + + \noindent Sulzmann and Lu also define inductively an inhabitation relation + that associates values to regular expressions: + + \begin{center} + \begin{tabular}{c} + \\[-8mm] + @{thm[mode=Axiom] Prf.intros(4)} \qquad + @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[4mm] + @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad + @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[4mm] + @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\\[4mm] + %%@ { t h m[mode=Axiom] Prf.intros(6)[of "r"]} \qquad + @{thm[mode=Rule] Prf.intros(6)[of "r" "vs"]} + \end{tabular} + \end{center} + + \noindent Note that no values are associated with the regular expression + @{term ZERO}. It is routine to establish how values ``inhabiting'' a regular + expression correspond to the language of a regular expression, namely + + \begin{proposition} + @{thm L_flat_Prf} + \end{proposition} + Sulzmann-Lu algorithm with inj. State that POSIX rules. metion slg is correct. @@ -640,6 +813,17 @@ text {* + Derivatives as calculated by Brzozowski’s method are usually more + complex regular expressions than the initial one; the result is + that the derivative-based matching and lexing algorithms are + often abysmally slow. + + However, as Sulzmann and + Lu wrote, various optimisations are possible, such as the + simplifications of 0 + r,r + 0,1 · r and r · 1 to r. These + simplifications can speed up the algorithms considerably. + + \begin{lemma} @{thm[mode=IfThen] bnullable0(1)[of "r\<^sub>1" "r\<^sub>2"]} \end{lemma}