lexing: comparison thys2/Paper/Paper.thy

equal deleted inserted replaced

-:6de6bc551a8b
+:b66a4305749c
 "bder_syn r c \<equiv> bder c r"
 notation (latex output)
 der_syn ("_\\_" [79, 1000] 76) and
 bder_syn ("_\\_" [79, 1000] 76) and
+bders ("_\\_" [79, 1000] 76) and
+bders_simp ("_\\\<^sub>s\<^sub>i\<^sub>m\<^sub>p _" [79, 1000] 76) and
 ZERO ("\<^bold>0" 81) and
 ONE ("\<^bold>1" 81) and
 CH ("_" [1000] 80) and
 ALT ("_ + _" [77,77] 78) and
 val.Left ("Left _" [79] 78) and
 val.Right ("Right _" [1000] 78) and
 val.Seq ("Seq _ _" [79,79] 78) and
 val.Stars ("Stars _" [79] 78) and
+Prf ("\<turnstile> _ : _" [75,75] 75) and
 Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and
 flat ("|_|" [75] 74) and
 flats ("|_|" [72] 74) and
 injval ("inj _ _ _" [79,77,79] 76) and
 expressions have sparked quite a bit of interest in the functional
 programming and theorem prover communities.  The beauty of
 Brzozowski's derivatives \cite{Brzozowski1964} is that they are neatly
 expressible in any functional language, and easily definable and
 reasoned about in theorem provers---the definitions just consist of
-inductive datatypes and simple recursive functions. A mechanised
+inductive datatypes and simple recursive functions.  Derivatives of a
-correctness proof of Brzozowski's matcher in for example HOL4 has been
+regular expression, written @{term "der c r"}, give a simple solution
-mentioned by Owens and Slind~\cite{Owens2008}. Another one in
+to the problem of matching a string @{term s} with a regular
-Isabelle/HOL is part of the work by Krauss and Nipkow
+expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
-\cite{Krauss2011}.  And another one in Coq is given by Coquand and
+succession) all the characters of the string matches the empty string,
-Siles \cite{Coquand2012}.
+then @{term r} matches @{term s} (and {\em vice versa}).  We are aware
+of a mechanised correctness proof of Brzozowski's matcher in HOL4 by
+Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
-The notion of derivatives
+of the work by Krauss and Nipkow \cite{Krauss2011}.  And another one
-\cite{Brzozowski1964}, written @{term "der c r"}, of a regular
+in Coq is given by Coquand and Siles \cite{Coquand2012}.
-expression give a simple solution to the problem of matching a string
-@{term s} with a regular expression @{term r}: if the derivative of
+There are two difficulties with derivative-based matchers and also
-@{term r} w.r.t.\ (in succession) all the characters of the string
+lexers: First, Brzozowski's original matcher only generates a yes/no
-matches the empty string, then @{term r} matches @{term s} (and {\em
+answer for whether a regular expression matches a string or not.
-vice versa}). The derivative has the property (which may almost be
+Sulzmann and Lu~\cite{Sulzmann2014} overcome this difficulty by
-regarded as its specification) that, for every string @{term s} and
+cleverly extending Brzozowski's matching algorithm to POSIX
-regular expression @{term r} and character @{term c}, one has @{term
+lexing. This extended version generates additional information on
-"cs \<in> L(r)"} if and only if \mbox{@{term "s \<in> L(der c r)"}}.
+\emph{how} a regular expression matches a string. They achieve this by
+The second problem is that Brzozowski's derivatives can
+grow to arbitrarily big sizes. For example if we start with the
+regular expression \mbox{@{text "(a + aa)\<^sup>*"}} and take
+successive derivatives according to the character $a$, we end up with
+a sequence of ever-growing derivatives like
+\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
+\begin{center}
+\begin{tabular}{rll}
+$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
+& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+& $\ll$ & \ldots
+\end{tabular}
+\end{center}
+\noindent where after around 35 steps we run out of memory on a
+typical computer (we define the precise details of the derivative
+operation later).  Clearly, the notation involving $\ZERO$s and
+$\ONE$s already suggests simplification rules that can be applied to
+regular regular expressions, for example $\ZERO{}r \Rightarrow \ZERO$,
+$\ONE{}r \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r
+\Rightarrow r$. While such simple-minded reductions have been proved
+in our earlier work to preserve the correctness of Sulzmann and Lu's
+algorithm, they unfortunately do \emph{not} help with limiting the
+gowth of the derivatives shown above: yes, the growth is slowed, but the
+derivatives can still grow beyond any finite bound.
+Sulzmann and Lu introduce a
+bitcoded version of their lexing algorithm. They make some claims
+about the correctness and speed of this version, but do
+not provide any supporting proof arguments, not even
+``pencil-and-paper'' arguments. They wrote about their bitcoded
+``incremental parsing method'' (that is the algorithm to be studied in this
+section):
+\begin{quote}\it
+``Correctness Claim: We further claim that the incremental parsing
+method [..] in combination with the simplification steps [..]
+yields POSIX parse trees. We have tested this claim
+extensively [..] but yet
+have to work out all proof details.''
+\end{quote}
 If a regular expression matches a string, then in general there is more
 than one way of how the string is matched. There are two commonly used
 disambiguation strategies to generate a unique answer: one is called
 first case corresponds to GREEDY matching, which first matches with the
 left-most symbol and only matches the next symbol in case of a mismatch
 (this is greedy in the sense of preferring instant gratification to
 delayed repletion). The second case is POSIX matching, which prefers the
 longest match.
+The derivative has the property (which may almost be
+regarded as its specification) that, for every string @{term s} and
+regular expression @{term r} and character @{term c}, one has @{term
+"cs \<in> L(r)"} if and only if \mbox{@{term "s \<in> L(der c r)"}}.
 \begin{center}
 \begin{tabular}{cc}
 \begin{tabular}{r@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
 *}
 section {* Background *}
 text {*
+In our Isabelle/HOL formalisation strings are lists of characters with
+the empty string being represented by the empty list, written $[]$,
+and list-cons being written as $\_\!\_\!::\!\_\!\_\,$; string
+concatenation is $\_\!\_ \,@\, \_\!\_\,$. Often we use the usual
+bracket notation for lists also for strings; for example a string
+consisting of just a single character $c$ is written $[c]$.
+Our egular expressions are defined as usual as the elements of the following inductive
+datatype:
+\begin{center}
+@{text "r :="}
+@{const "ZERO"} $\mid$
+@{const "ONE"} $\mid$
+@{term "CH c"} $\mid$
+@{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
+@{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
+@{term "STAR r"}
+\end{center}
+\noindent where @{const ZERO} stands for the regular expression that does
+not match any string, @{const ONE} for the regular expression that matches
+only the empty string and @{term c} for matching a character literal. The
+language of a regular expression, written $L$, is defined as usual
+(see for example \cite{AusafDyckhoffUrban2016}).
+Central to Brzozowski's regular expression matcher are two functions
+called $\nullable$ and \emph{derivative}. The latter is written
+$r\backslash c$ for the derivative of the regular expression $r$
+w.r.t.~the character $c$. Both functions are defined by recursion over
+regular expressions.
+	\begin{center}
+		\begin{tabular}{lcl}
+			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\
+			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
+			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
+			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
+			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
+			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
+		\end{tabular}
+	\end{center}
+\noindent
+The derivative function takes a regular expression, say $r$ and a
+character, say $c$, as input and returns the derivative regular
+expression.
+	\begin{center}
+		\begin{tabular}{lcl}
+			$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
+			$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
+			$d \backslash c$     & $\dn$ &
+			$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
+			$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
+			$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \nullable(r_1)$\\
+			&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
+			&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
+			$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
+		\end{tabular}
+	\end{center}
+Sulzmann and Lu presented two lexing algorithms in their paper from 2014
+\cite{Sulzmann2014}. This first algorithm consists of two phases: first a
+matching phase (which is Brzozowski's algorithm) and then a value
+construction phase. The values encode \emph{how} a regular expression
+matches a string. \emph{Values} are defined as the inductive datatype
+\begin{center}
+@{text "v :="}
+@{const "Void"} $\mid$
+@{term "val.Char c"} $\mid$
+@{term "Left v"} $\mid$
+@{term "Right v"} $\mid$
+@{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$
+@{term "Stars vs"}
+\end{center}
+\noindent where we use @{term vs} to stand for a list of
+values.
+\noindent Sulzmann and Lu also define inductively an inhabitation relation
+that associates values to regular expressions:
+\begin{center}
+\begin{tabular}{c}
+\\[-8mm]
+@{thm[mode=Axiom] Prf.intros(4)} \qquad
+@{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[4mm]
+@{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad
+@{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[4mm]
+@{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\\[4mm]
+%%@ { t h m[mode=Axiom] Prf.intros(6)[of "r"]} \qquad
+@{thm[mode=Rule] Prf.intros(6)[of "r" "vs"]}
+\end{tabular}
+\end{center}
+\noindent Note that no values are associated with the regular expression
+@{term ZERO}. It is routine to establish how values ``inhabiting'' a regular
+expression correspond to the language of a regular expression, namely
+\begin{proposition}
+@{thm L_flat_Prf}
+\end{proposition}
 Sulzmann-Lu algorithm with inj. State that POSIX rules.
 metion slg is correct.
 \begin{figure}[t]
 section {* Simplification *}
 text {*
+Derivatives as calculated by Brzozowski’s method are usually more
+complex regular expressions than the initial one; the result is
+that the derivative-based matching and lexing algorithms are
+often abysmally slow.
+However, as Sulzmann and
+Lu wrote, various optimisations are possible, such as the
+simplifications of 0 + r,r + 0,1 · r and r · 1 to r. These
+simplifications can speed up the algorithms considerably.
 \begin{lemma}
 @{thm[mode=IfThen] bnullable0(1)[of "r\<^sub>1" "r\<^sub>2"]}
 \end{lemma}

changeset 420	b66a4305749c
parent 418	41a2a3b63853
child 423	b7199d6c672d