# HG changeset patch
# User Christian Urban
# Date 1644330581 0
# Node ID b7199d6c672d40da60809e8c8a8c4c8b1e740510
# Parent  fb23e3fd12e55242bf5ad0f07916995a646b271c
updated paper

diff -r fb23e3fd12e5 -r b7199d6c672d thys/Paper/Paper.thy
--- a/thys/Paper/Paper.thy	Tue Feb 08 01:25:26 2022 +0000
+++ b/thys/Paper/Paper.thy	Tue Feb 08 14:29:41 2022 +0000
@@ -431,9 +431,11 @@
 \end{proposition}

 In general there is more than one value associated with a regular
- expression. In case of POSIX matching the problem is to calculate the
- unique value that satisfies the (informal) POSIX rules from the
- Introduction. Graphically the POSIX value calculation algorithm by
+ expression (meaning regular expressions can typically match more
+ than one string). But even given a string from the language of the
+ regular expression, there is generally more than one way in which the
+ regular expression can match this string. POSIX lexing is about identifying
+ the unique value that satisfies the (informal) POSIX rules. Graphically the POSIX value calculation algorithm by
 Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz}
 where the path from the left to the right involving @{term derivatives}/@{const
 nullable} is the first phase of the algorithm (calculating successive

diff -r fb23e3fd12e5 -r b7199d6c672d thys2/Paper/Paper.thy
--- a/thys2/Paper/Paper.thy	Tue Feb 08 01:25:26 2022 +0000
+++ b/thys2/Paper/Paper.thy	Tue Feb 08 14:29:41 2022 +0000
@@ -18,10 +18,13 @@
 abbreviation
  "der_syn r c \ der c r"
 abbreviation
+ "ders_syn r s \ ders s r"
+abbreviation
  "bder_syn r c \ bder c r"

 notation (latex output)
  der_syn ("_\\_" [79, 1000] 76) and
+ ders_syn ("_\\_" [79, 1000] 76) and
  bder_syn ("_\\_" [79, 1000] 76) and
  bders ("_\\_" [79, 1000] 76) and
  bders_simp ("_\\\<^sub>s\<^sub>i\<^sub>m\<^sub>p _" [79, 1000] 76) and
@@ -31,7 +34,7 @@
  CH ("_" [1000] 80) and
  ALT ("_ + _" [77,77] 78) and
  SEQ ("_ \ _" [77,77] 78) and
- STAR ("_\<^sup>\" [79] 78) and
+ STAR ("_\<^sup>*" [79] 78) and

  val.Void ("Empty" 78) and
  val.Char ("Char _" [1000] 78) and
@@ -92,23 +95,31 @@
 expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
 succession) all the characters of the string matches the empty string,
 then @{term r} matches @{term s} (and {\em vice versa}). We are aware
-of a mechanised correctness proof of Brzozowski's matcher in HOL4 by
+of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
 Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
 of the work by Krauss and Nipkow \cite{Krauss2011}. And another one
 in Coq is given by Coquand and Siles \cite{Coquand2012}.
-
-There are two difficulties with derivative-based matchers and also
-lexers: First, Brzozowski's original matcher only generates a yes/no
-answer for whether a regular expression matches a string or not.
-Sulzmann and Lu~\cite{Sulzmann2014} overcome this difficulty by
-cleverly extending Brzozowski's matching algorithm to POSIX
-lexing. This extended version generates additional information on
-\emph{how} a regular expression matches a string. They achieve this by
+Also Ribeiro and Du Bois give one in Agda \cite{RibeiroAgda2017}.
-
+However, there are two difficulties with derivative-based matchers:
+First, Brzozowski's original matcher only generates a yes/no answer
+for whether a regular expression matches a string or not. This is too
+little information in the context of lexing where separate tokens must
+be identified and also classified (for example as keywords
+or identifiers).
Sulzmann and Lu~\cite{Sulzmann2014} overcome this
+difficulty by cleverly extending Brzozowski's matching
+algorithm. Their extended version generates additional information on
+\emph{how} a regular expression matches a string following the POSIX
+rules for regular expression matching. They achieve this by adding a
+second ``phase'' to Brzozowski's algorithm involving an injection
+function. In our own earlier work we provided the formal
+specification of what POSIX matching means and proved in Isabelle/HOL
+the correctness
+of Sulzmann and Lu's extended algorithm accordingly
+\cite{AusafDyckhoffUrban2016}.

-The second problem is that Brzozowski's derivatives can
+The second difficulty is that Brzozowski's derivatives can
 grow to arbitrarily big sizes. For example if we start with the
 regular expression \mbox{@{text "(a + aa)\<^sup>*"}} and take
 successive derivatives according to the character $a$, we end up with
@@ -121,31 +132,39 @@
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
-& $\ll$ & \ldots
+& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
\end{tabular}
\end{center}

\noindent where after around 35 steps we run out of memory on a
-typical computer (we define the precise details of the derivative
-operation later). Clearly, the notation involving $\ZERO$s and
-$\ONE$s already suggests simplification rules that can be applied to
-regular regular expressions, for example $\ZERO{}r \Rightarrow \ZERO$,
-$\ONE{}r \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r
-\Rightarrow r$. While such simple-minded reductions have been proved
-in our earlier work to preserve the correctness of Sulzmann and Lu's
-algorithm, they unfortunately do \emph{not} help with limiting the
-gowth of the derivatives shown above: yes, the growth is slowed, but the
-derivatives can still grow beyond any finite bound.
+typical computer (we shall define shortly the precise details of our
+regular expressions and the derivative operation). Clearly, the
+notation involving $\ZERO$s and $\ONE$s already suggests
+simplification rules that can be applied to regular
+expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
+\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
+r$. While such simple-minded simplifications have been proved in our
+earlier work to preserve the correctness of Sulzmann and Lu's
+algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
+\emph{not} help with limiting the growth of the derivatives shown
+above: the growth is slowed, but the derivatives can still grow quickly
+beyond any finite bound.

-Sulzmann and Lu introduce a
-bitcoded version of their lexing algorithm. They make some claims
-about the correctness and speed of this version, but do
-not provide any supporting proof arguments, not even
-``pencil-and-paper'' arguments. They wrote about their bitcoded
-``incremental parsing method'' (that is the algorithm to be studied in this
-section):
+Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
+\cite{Sulzmann2014} where they introduce bitcoded
+regular expressions. In this version, POSIX values are
+represented as bitsequences and such sequences are incrementally generated
+when derivatives are calculated.
The compact representation
+of bitsequences and regular expressions allows them to define a more
+``aggressive'' simplification method that keeps the size of the
+derivatives finite no matter what the length of the string is.
+They make some informal claims about the correctness and linear behaviour
+of this version, but do not provide any supporting proof arguments, not
+even ``pencil-and-paper'' arguments. They write about their bitcoded
+\emph{incremental parsing method} (that is the algorithm to be formalised
+in this paper):

\begin{quote}\it
  ``Correctness Claim: We further claim that the incremental parsing
@@ -155,33 +174,68 @@
  have to work out all proof details.''
\end{quote}

-
+\noindent{}\textbf{Contributions:} We have implemented in Isabelle/HOL
+the derivative-based lexing algorithm of Sulzmann and Lu
+\cite{Sulzmann2014} where regular expressions are annotated with
+bitsequences. We define the crucial simplification function as a
+recursive function, instead of a fix-point operation. One objective of
+the simplification function is to remove duplicates of regular
+expressions. For this Sulzmann and Lu use in their paper the standard
+@{text nub} function from Haskell's list library, but this function
+does not achieve the intended objective with bitcoded regular expressions. The
+reason is that in the bitcoded setting, each copy generally has a
+different bitcode annotation---so @{text nub} would never ``fire''.
+Inspired by Scala's library for lists, we shall instead use a @{text
+distinctBy} function that finds duplicates under an erasing function
+that deletes bitcodes.
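The following Scala sketch shows the kind of function we have in mind
(the definition is ours and only illustrative; the erasing function
would be the operation deleting bitcodes):

\begin{verbatim}
// a sketch of distinctBy: keep the first copy of every element
// whose image under the erasing function f has not been seen yet
def distinctBy[B, C](xs: List[B], f: B => C,
                     acc: Set[C] = Set()): List[B] = xs match {
  case Nil => Nil
  case x :: rest =>
    if (acc.contains(f(x))) distinctBy(rest, f, acc)
    else x :: distinctBy(rest, f, acc + f(x))
}
\end{verbatim}

\noindent With this function two bitcoded regular expressions count as
duplicates whenever they agree after their bitcodes are deleted.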
+We shall also introduce our own argument and definitions for
+establishing the correctness of the bitcoded algorithm when
+simplifications are included.\medskip

+\noindent In this paper, we shall first briefly introduce the basic notions
+of regular expressions and recall the definitions
+of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves
+as a reference point for what correctness means in our Isabelle/HOL proofs. We shall then prove
+the correctness of the bitcoded algorithm without simplification, and
+after that extend the proof to include simplification.
+*}
+
+section {* Background *}

-If a regular expression matches a string, then in general there is more
-than one way of how the string is matched. There are two commonly used
-disambiguation strategies to generate a unique answer: one is called
-GREEDY matching \cite{Frisch2004} and the other is POSIX
-matching~\cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}.
-For example consider the string @{term xy} and the regular expression
-\mbox{@{term "STAR (ALT (ALT x y) xy)"}}. Either the string can be
-matched in two `iterations' by the single letter-regular expressions
-@{term x} and @{term y}, or directly in one iteration by @{term xy}. The
-first case corresponds to GREEDY matching, which first matches with the
-left-most symbol and only matches the next symbol in case of a mismatch
-(this is greedy in the sense of preferring instant gratification to
-delayed repletion). The second case is POSIX matching, which prefers the
-longest match.

+text {*
+  In our Isabelle/HOL formalisation strings are lists of characters with
+  the empty string being represented by the empty list, written $[]$,
+  and list-cons being written as $\_\!::\!\_\,$; string
+  concatenation is $\_ \,@\, \_\,$. We often use the usual
+  bracket notation for lists also for strings; for example a string
+  consisting of just a single character $c$ is written $[c]$.
+  Our regular expressions are defined as usual as the elements of the following inductive
+  datatype:

+  \begin{center}
+  @{text "r :="}
+  @{const "ZERO"} $\mid$
+  @{const "ONE"} $\mid$
+  @{term "CH c"} $\mid$
+  @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
+  @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
+  @{term "STAR r"}
+  \end{center}

-The derivative has the property (which may almost be
-regarded as its specification) that, for every string @{term s} and
-regular expression @{term r} and character @{term c}, one has @{term
-"cs \ L(r)"} if and only if \mbox{@{term "s \ L(der c r)"}}.
+  \noindent where @{const ZERO} stands for the regular expression that does
+  not match any string, @{const ONE} for the regular expression that matches
+  only the empty string and @{term c} for matching a character literal.
+  The constructors $+$ and $\cdot$ represent alternatives and sequences, respectively.
+  The
+  \emph{language} of a regular expression, written $L$, is defined as usual
+  and we omit giving the definition here (see for example \cite{AusafDyckhoffUrban2016}).

-
-
+  Central to Brzozowski's regular expression matcher are two functions
+  called @{text nullable} and \emph{derivative}. The latter is written
+  $r\backslash c$ for the derivative of the regular expression $r$
+  w.r.t.~the character $c$. Both functions are defined by recursion over
+  regular expressions.

 \begin{center}
 \begin{tabular}{cc}
@@ -208,8 +262,182 @@
 \end{tabular}
 \end{center}

+  \noindent
+  We can extend this definition to give derivatives w.r.t.~strings:
+
+  \begin{center}
+  \begin{tabular}{cc}
+  \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+  @{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}
+  \end{tabular}
+  &
+  \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+  @{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}
+  \end{tabular}
+  \end{tabular}
+  \end{center}
+
+\noindent
+Using @{text nullable} and the derivative operation, we can
+define the following simple regular expression matcher:
+%
+\[
+@{text "match s r"} \;\dn\; @{term nullable}(r\backslash s)
+\]
+
+\noindent This is essentially Brzozowski's algorithm from 1964. Its
+main virtue is that the algorithm can be easily implemented as a
+functional program (either in a functional programming language or in
+a theorem prover). The correctness proof for @{text match} amounts to
+establishing the property
+%
+\begin{proposition}\label{matchcorr}
+@{text "match s r"} \;\;\text{if and only if}\;\; $s \in L(r)$
+\end{proposition}
+
+\noindent
+It is a fun exercise to formally prove this property in a theorem prover.
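Since the matcher consists of only a few equations, it can indeed be
rendered directly as a functional program. The following Scala sketch is
our rendering of the definitions above (names are ours):

\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CH(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CH(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CH(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// derivative w.r.t. a string: fold der over the characters
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(s: String, r: Rexp): Boolean =
  nullable(ders(s.toList, r))
\end{verbatim}

\noindent Running @{text matcher} on the example from the Introduction
also makes the growth of the derivatives directly observable.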
+The novel idea of Sulzmann and Lu is to extend this algorithm for
+lexing, where it is important to find out which part of the string
+is matched by which part of the regular expression.
+For this Sulzmann and Lu presented two lexing algorithms in their paper
+ \cite{Sulzmann2014}. The first algorithm consists of two phases: first a
+ matching phase (which is Brzozowski's algorithm) and then a value
+ construction phase. The values encode \emph{how} a regular expression
+ matches a string. \emph{Values} are defined as the inductive datatype
+
+  \begin{center}
+  @{text "v :="}
+  @{const "Void"} $\mid$
+  @{term "val.Char c"} $\mid$
+  @{term "Left v"} $\mid$
+  @{term "Right v"} $\mid$
+  @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$
+  @{term "Stars vs"}
+  \end{center}
+
+  \noindent where we use @{term vs} to stand for a list of values. The
+  string underlying a value can be calculated by a @{const flat}
+  function, written @{term "flat DUMMY"}. It traverses a value and
+  collects the characters contained in it.
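In Scala, values and the @{const flat} function can be sketched as
follows (our rendering; @{text Chr} and @{text Sequ} merely avoid
clashes with Scala's built-in @{text Char} and @{text Seq}):

\begin{verbatim}
abstract class Val
case object Void extends Val
case class Chr(c: Char) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// flat computes the string underlying a value
def flat(v: Val): String = v match {
  case Void => ""
  case Chr(c) => c.toString
  case Left(v1) => flat(v1)
  case Right(v1) => flat(v1)
  case Sequ(v1, v2) => flat(v1) + flat(v2)
  case Stars(vs) => vs.map(flat).mkString
}
\end{verbatim}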
+  Sulzmann and Lu also define inductively an
+  inhabitation relation that associates values to regular expressions:
+
+  \begin{center}
+  \begin{tabular}{c}
+  \\[-8mm]
+  @{thm[mode=Axiom] Prf.intros(4)} \qquad
+  @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[4mm]
+  @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad
+  @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[4mm]
+  @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]} \qquad
+  @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}
+  \end{tabular}
+  \end{center}
+
+  \noindent Note that no values are associated with the regular expression
+  @{term ZERO}, since it cannot match any string.
+  It is routine to establish how values ``inhabiting'' a regular
+  expression correspond to the language of a regular expression, namely
+
+  \begin{proposition}
+  @{thm L_flat_Prf}
+  \end{proposition}
+
+  In general there is more than one value inhabiting a regular
+  expression (meaning regular expressions can typically match more
+  than one string). But even when fixing a string from the language of the
+  regular expression, there is generally more than one way in which the
+  regular expression can match this string. POSIX lexing is about
+  identifying the unique value for a given regular expression and a
+  string that satisfies the informal POSIX rules (see
+  \cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}).\footnote{POSIX
+  lexing acquired its name from the fact that the corresponding
+  rules were described as part of the POSIX specification for
+  Unix-like operating systems \cite{POSIX}.} Sometimes these
+  informal rules are called \emph{maximal munch rule} and \emph{rule priority}.
+  One contribution of our earlier paper is to give a convenient
+  specification for what a POSIX value is (the inductive rules are shown in
+  Figure~\ref{POSIXrules}).

 \begin{figure}[t]
  \begin{center}
  \begin{tabular}{c}
  @{thm[mode=Axiom] Posix.intros(1)}\P\@{term "ONE"} \qquad
  @{thm[mode=Axiom] Posix.intros(2)}\P\@{term "c"}\medskip\\
  @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\P+L\\qquad
  @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\P+R\\medskip\\
  $\mprset{flushleft}
   \inferrule
   {@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
    @{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\
    @{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}
   {@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\PS\\medskip\smallskip\\
  @{thm[mode=Axiom] Posix.intros(7)}\P[]\\qquad
  $\mprset{flushleft}
   \inferrule
   {@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
    @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
    @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
    @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
   {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\P\\\\[-4mm]
  \end{tabular}
  \end{center}
  \caption{The inductive definition of POSIX values taken from our earlier paper \cite{AusafDyckhoffUrban2016}. The ternary relation, written $(s, r) \rightarrow v$, formalises the notion
  that, for a given string $s$ and regular
  expression $r$, the value $v$ is the unique value satisfying the informal POSIX constraints for
  regular expression matching.}\label{POSIXrules}
  \end{figure}

  The clever idea by Sulzmann and Lu \cite{Sulzmann2014} in their first algorithm is to define
  an injection function on values that mirrors (but inverts) the
  construction of the derivative on regular expressions. Essentially it
  injects back a character into a value.
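As a preview of the formal definitions given next, here is our Scala
sketch of the two functions involved and of the lexer obtained by
composing them with the derivative operation (it builds on the earlier
Scala fragments; both functions are partial by design, since they are
only called on nullable regular expressions, respectively on values
inhabiting derivatives):

\begin{verbatim}
// mkeps: how a nullable regular expression matches the empty string
def mkeps(r: Rexp): Val = r match {
  case ONE => Void
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}

// inj: turn a value for r\c into a value for r by injecting c back
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CH(d), Void) => Chr(d)
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
}

// the two phases combined
def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lexer(der(c, r), cs) match {
    case None => None
    case Some(v) => Some(inj(r, c, v))
  }
}
\end{verbatim}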
+ For this they define two functions called @{text mkeps} and @{text inj}: + + \begin{center} + \begin{tabular}{l} + \begin{tabular}{lcl} + @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\ + @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\ + @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\ + @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\ + \end{tabular}\smallskip\\ + + \begin{tabular}{lcl} + @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\ + @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & + @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\ + @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & + @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ + @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ + & @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\ + @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ + & @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\ + @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ + & @{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ + @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ + & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]} + \end{tabular} + \end{tabular} + \end{center} + + \noindent + The function @{text mkeps} is called when the last derivative is nullable, that is + the string to be matched is in the language of the regular expression. It generates + a value for how the last derivative can match the empty string. The injection function + then calculates the corresponding value for each intermediate derivative until + a value for the original regular expression is generated. + Graphically the algorithm by + Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz} + where the path from the left to the right involving @{term derivatives}/@{const + nullable} is the first phase of the algorithm (calculating successive + \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to + left, the second phase. This picture shows the steps required when a + regular expression, say @{text "r\<^sub>1"}, matches the string @{term + "[a,b,c]"}. The lexing algorithm by Sulzmann and Lu can be defined as: + + \begin{figure}[t] \begin{center} \begin{tikzpicture}[scale=2,node distance=1.3cm, every node/.style={minimum size=6mm}] @@ -234,7 +462,7 @@ \end{center} \mbox{}\\[-13mm] -\caption{The two phases of the algorithm by Sulzmann \& Lu \cite{Sulzmann2014}, +\caption{The two phases of the first algorithm by Sulzmann \& Lu \cite{Sulzmann2014}, matching the string @{term "[a,b,c]"}. The first phase (the arrows from left to right) is \Brz's matcher building successive derivatives. If the last regular expression is @{term nullable}, then the functions of the @@ -242,183 +470,50 @@ @{term mkeps} calculates a value @{term "v\<^sub>4"} witnessing how the empty string has been recognised by @{term "r\<^sub>4"}. After that the function @{term inj} ``injects back'' the characters of the string into -the values. +the values. 
The value @{term "v\<^sub>1"} is the result of the algorithm representing +the POSIX value for this string and +regular expression. \label{Sulz}} \end{figure} -*} - -section {* Background *} - -text {* - In our Isabelle/HOL formalisation strings are lists of characters with - the empty string being represented by the empty list, written $[]$, - and list-cons being written as $\_\!\_\!::\!\_\!\_\,$; string - concatenation is $\_\!\_ \,@\, \_\!\_\,$. Often we use the usual - bracket notation for lists also for strings; for example a string - consisting of just a single character $c$ is written $[c]$. - Our egular expressions are defined as usual as the elements of the following inductive - datatype: \begin{center} - @{text "r :="} - @{const "ZERO"} $\mid$ - @{const "ONE"} $\mid$ - @{term "CH c"} $\mid$ - @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$ - @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$ - @{term "STAR r"} - \end{center} - - \noindent where @{const ZERO} stands for the regular expression that does - not match any string, @{const ONE} for the regular expression that matches - only the empty string and @{term c} for matching a character literal. The - language of a regular expression, written $L$, is defined as usual - (see for example \cite{AusafDyckhoffUrban2016}). - - Central to Brzozowski's regular expression matcher are two functions - called $\nullable$ and \emph{derivative}. The latter is written - $r\backslash c$ for the derivative of the regular expression $r$ - w.r.t.~the character $c$. Both functions are defined by recursion over - regular expressions. - - - \begin{center} - \begin{tabular}{lcl} - $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ - $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ - $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ - $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ - $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ - $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ - \end{tabular} - \end{center} - - \noindent - The derivative function takes a regular expression, say $r$ and a - character, say $c$, as input and returns the derivative regular - expression. - - \begin{center} - \begin{tabular}{lcl} - $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ - $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ - $d \backslash c$ & $\dn$ & - $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ - $(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ - $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \nullable(r_1)$\\ - & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ - & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ - $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ - \end{tabular} - \end{center} - - - Sulzmann and Lu presented two lexing algorithms in their paper from 2014 - \cite{Sulzmann2014}. This first algorithm consists of two phases: first a - matching phase (which is Brzozowski's algorithm) and then a value - construction phase. The values encode \emph{how} a regular expression - matches a string. \emph{Values} are defined as the inductive datatype - - \begin{center} - @{text "v :="} - @{const "Void"} $\mid$ - @{term "val.Char c"} $\mid$ - @{term "Left v"} $\mid$ - @{term "Right v"} $\mid$ - @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ - @{term "Stars vs"} - \end{center} - - \noindent where we use @{term vs} to stand for a list of - values. 
- - - - - \noindent Sulzmann and Lu also define inductively an inhabitation relation - that associates values to regular expressions: - - \begin{center} - \begin{tabular}{c} - \\[-8mm] - @{thm[mode=Axiom] Prf.intros(4)} \qquad - @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[4mm] - @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad - @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[4mm] - @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\\[4mm] - %%@ { t h m[mode=Axiom] Prf.intros(6)[of "r"]} \qquad - @{thm[mode=Rule] Prf.intros(6)[of "r" "vs"]} + \begin{tabular}{lcl} + @{thm (lhs) lexer.simps(1)} & $\dn$ & @{thm (rhs) lexer.simps(1)}\\ + @{thm (lhs) lexer.simps(2)} & $\dn$ & @{text "case"} @{term "lexer (der c r) s"} @{text of}\\ + & & \phantom{$|$} @{term "None"} @{text "\"} @{term None}\\ + & & $|$ @{term "Some v"} @{text "\"} @{term "Some (injval r c v)"} \end{tabular} \end{center} - \noindent Note that no values are associated with the regular expression - @{term ZERO}. It is routine to establish how values ``inhabiting'' a regular - expression correspond to the language of a regular expression, namely + +We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that the algorithm by Sulzmann and Lu +is correct. The cenral property we established relates the derivative operation to the injection function. - \begin{proposition} - @{thm L_flat_Prf} - \end{proposition} + \begin{proposition}\label{Posix2} + \textit{If} $(s,\; r\backslash c) \rightarrow v$ \textit{then} $(c :: s,\; r) \rightarrow$ \textit{inj} $r\; c\; v$. +\end{proposition} - Sulzmann-Lu algorithm with inj. State that POSIX rules. - metion slg is correct. + \noindent + With this in place we were able to prove: - \begin{figure}[t] - \begin{center} - \begin{tabular}{c} - @{thm[mode=Axiom] Posix.intros(1)}\P\@{term "ONE"} \qquad - @{thm[mode=Axiom] Posix.intros(2)}\P\@{term "c"}\medskip\\ - @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\P+L\\qquad - @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\P+R\\medskip\\ - $\mprset{flushleft} - \inferrule - {@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad - @{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\ - @{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}} - {@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\PS\\\ - @{thm[mode=Axiom] Posix.intros(7)}\P[]\\medskip\\ - $\mprset{flushleft} - \inferrule - {@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad - @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad - @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\ - @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}} - {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\P\\ + \begin{proposition}\mbox{}\smallskip\\\label{lexercorrect} + \begin{tabular}{ll} + (1) & @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}\\ + (2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\ \end{tabular} - \end{center} - \caption{Our inductive definition of POSIX values.}\label{POSIXrules} - \end{figure} - + \end{proposition} - \begin{center} - 
\begin{tabular}{lcl}
- @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\
- @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
- @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
- @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
- \end{tabular}
- \end{center}
-
- \begin{center}
- \begin{tabular}{l@ {\hspace{5mm}}lcl}
- \textit{(1)} & @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
- \textit{(2)} & @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ &
-      @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
- \textit{(3)} & @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ &
-      @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
- \textit{(4)} & @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$
-      & @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
- \textit{(5)} & @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$
-      & @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
- \textit{(6)} & @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$
-      & @{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
- \textit{(7)} & @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$
-      & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\
- \end{tabular}
- \end{center}
-
+ \noindent
+ In fact we have shown that in the success case the generated POSIX value $v$ is
+ unique and in the failure case that there is no POSIX value $v$ that satisfies
+ $(s, r) \rightarrow v$. While the algorithm is correct, it is excruciatingly
+ slow in examples where the derivatives grow arbitrarily (see example from the
+ Introduction). However it can be used as a convenient reference point for the correctness
+ proof of the second algorithm by Sulzmann and Lu, which we shall describe next.
+
 *}

 section {* Bitcoded Regular Expressions and Derivatives *}

@@ -426,7 +521,7 @@
 text {*

   In the second part of their paper \cite{Sulzmann2014},
-  Sulzmann and Lu describe another algorithm that generates POSIX
+  Sulzmann and Lu describe another algorithm that also generates POSIX
   values but dispenses with the second phase where characters are
   injected ``back'' into values. For this they annotate bitcodes to
   regular expressions, which we define in Isabelle/HOL as the datatype

@@ -579,10 +674,10 @@
   the functions \textit{bnullable} and \textit{bmkeps}, which are the
   ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on
   bitcoded regular expressions, instead of regular expressions.
-
+  %
  \begin{center}
  \begin{tabular}{@ {}c@ {}c@ {}}
-  \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+  \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{false}$\\
  $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{true}$\\
  $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{false}$\\
@@ -594,7 +689,7 @@
  $\textit{true}$
  \end{tabular}
  &
-  \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+  \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
  $\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$\\
  $\textit{bmkeps}\,(\textit{ALTs}\,bs\,r\!::\!\rs)$ & $\dn$ &
    $\textit{if}\;\textit{bnullable}\,r$\\
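Both functions are again short functional programs. In the following
Scala sketch the datatype of bitcoded regular expressions and the
concrete bit convention (@{text Z} and @{text S}) are our own choices
for illustration; the paper's bitcodes may differ in such details:

\begin{verbatim}
abstract class Bit
case object Z extends Bit
case object S extends Bit

abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALTS(bs: List[Bit], rs: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp

def bnullable(r: ARexp): Boolean = r match {
  case AZERO => false
  case AONE(_) => true
  case ACHAR(_, _) => false
  case AALTS(_, rs) => rs.exists(bnullable)
  case ASEQ(_, r1, r2) => bnullable(r1) && bnullable(r2)
  case ASTAR(_, _) => true
}

// bmkeps collects the bits recording how a (bnullable) regular
// expression matches the empty string; partial by design
def bmkeps(r: ARexp): List[Bit] = r match {
  case AONE(bs) => bs
  case AALTS(bs, r1 :: rs) =>
    if (bnullable(r1)) bs ++ bmkeps(r1)
    else bmkeps(AALTS(bs, rs))
  case ASEQ(bs, r1, r2) => bs ++ bmkeps(r1) ++ bmkeps(r2)
  case ASTAR(bs, _) => bs ++ List(S)
}
\end{verbatim}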
@@ -711,6 +806,42 @@
 \end{proof}

 \noindent
+The only problem left for the correctness proof is that the bitcoded algorithm
+has only a ``forward phase'' where POSIX values are generated incrementally.
+We can achieve the same effect with @{text lexer} by stacking up injection
+functions during the forward phase. An auxiliary function, called $\textit{flex}$,
+allows us to recast the rules of $\lexer$ (with its two phases) in terms of a single
+phase.
+
+\begin{center}
+\begin{tabular}{lcl}
+  $\textit{flex}\;r\,f\,[]$ & $\dn$ & $f$\\
+  $\textit{flex}\;r\,f\,(c\!::\!s)$ & $\dn$ &
+  $\textit{flex}\,(r\backslash c)\,(\lambda v.\,f\,(\inj\,r\,c\,v))\,s$\\
+\end{tabular}
+\end{center}
+
+\noindent
+The point of this function is that when
+reaching the end of the string, we just need to apply the stacked
+injection functions to the value generated by $\mkeps$.
+Using this function we can recast the success case in @{text lexer}
+as follows:
+
+\begin{proposition}\label{flex}
+If @{text "lexer r s = Some v"} \;then\; @{text "v = "}$\,\textit{flex}\,r\,id\,s\,
+  (\mkeps (r\backslash s))$.
+\end{proposition}
+
+\noindent
+Note we did not redefine \textit{lexer}; we just established that the
+value generated by \textit{lexer} can also be obtained by a different
+method. While this different method is not efficient (we essentially
+need to traverse the string $s$ twice, once for building the
+derivative $r\backslash s$ and another time for stacking up injection
+functions using \textit{flex}), it helped us with proving that the
+incremental building up of values, as done by the bitcoded algorithm,
+is correct.
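In Scala, @{text flex} can be sketched on top of the earlier fragments
as follows (our rendering):

\begin{verbatim}
// flex stacks up the injection functions along the string; at the
// end of the string the stacked function is applied to the value
// produced by mkeps (der, inj, mkeps as sketched earlier)
def flex(r: Rexp, f: Val => Val, s: List[Char]): Val => Val = s match {
  case Nil => f
  case c :: cs => flex(der(c, r), (v: Val) => f(inj(r, c, v)), cs)
}

// success case of the lexer, recast in a single forward phase:
//   lexer(r, s) == Some(flex(r, identity, s)(mkeps(ders(s, r))))
\end{verbatim}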
 This brings us to our main lemma in this section: if we build a derivative,
 say $r\backslash s$ and have a value, say $v$, inhabited by this derivative,
 then we can produce the result $\lexer$ generates
@@ -737,7 +868,7 @@
   The case for $[]$ is routine using Lemmas~\ref{codedecode}
   and~\ref{retrievecode}. In the case $s\,@\,[c]$, we can infer from
   the assumption that $\vdash v : (r\backslash s)\backslash c$
-  holds. Hence by Lemma~\ref{Posix2} we know that
+  holds. Hence by Prop.~\ref{Posix2} we know that
   (*) $\vdash \inj\,(r\backslash s)\,c\,v : r\backslash s$ holds too.
   By definition of $\textit{flex}$ we can unfold the left-hand side to be
@@ -748,7 +879,7 @@

   \noindent
   By induction hypothesis and (*) we can rewrite the right-hand side to
-
+  %
   \[
   \textit{decode}\,(\textit{retrieve}\,(r^\uparrow\backslash s)\;
       (\inj\,(r\backslash s)\,c\,\,v))\,r
   \]
@@ -771,7 +902,7 @@
 \end{theorem}

 \begin{proof}
-  We can first expand both sides using Lemma~\ref{flex} and the
+  We can first expand both sides using Prop.~\ref{flex} and the
   definition of \textit{blexer}. This gives us two
   \textit{if}-statements, which we need to show to be equal.
   By Lemma~\ref{bnullable}\textit{(2)} we know the \textit{if}-tests coincide:
@@ -802,7 +933,7 @@
 \noindent
 This establishes that the bitcoded algorithm by Sulzmann
-and Lu without simplification produces correct results. This was
+and Lu \emph{without} simplification produces correct results. This was
 only conjectured in their paper \cite{Sulzmann2014}. The next
 step is to add simplifications.

@@ -816,12 +947,13 @@
   Derivatives as calculated by Brzozowski's method are usually more
   complex regular expressions than the initial one; the result is that
   the derivative-based matching and lexing algorithms are
-  often abysmally slow.
-
-  However, as Sulzmann and
-  Lu wrote, various optimisations are possible, such as the
-  simplifications of 0 + r, r + 0, 1 · r and r · 1 to r. These
-  simplifications can speed up the algorithms considerably.
+  often abysmally slow. However, as Sulzmann and Lu wrote, various
+  optimisations are possible, such as the simplifications
+  $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r \Rightarrow r$,
+  $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow r$. While these
+  simplifications can speed up the algorithms considerably in many
+  cases, they do not solve fundamentally the ``growth problem'' with
+  derivatives.
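As an illustration, these local simplifications can be implemented by
the following Scala sketch (ours; for symmetry it also applies the
mirrored rules for $r\,\ZERO$, $r\,\ONE$ and $r + \ZERO$):

\begin{verbatim}
// bottom-up application of the simple-minded simplification rules
def simp(r: Rexp): Rexp = r match {
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _) => ZERO
    case (_, ZERO) => ZERO
    case (ONE, s2) => s2
    case (s1, ONE) => s1
    case (s1, s2)  => SEQ(s1, s2)
  }
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, s2) => s2
    case (s1, ZERO) => s1
    case (s1, s2)   => if (s1 == s2) s1 else ALT(s1, s2)
  }
  case _ => r
}
\end{verbatim}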
+  To see that the growth problem persists, let us return to the example

 \begin{lemma}
@@ -879,7 +1011,7 @@
  @{thm[mode=Rule] ss3[of "r\<^sub>1" "r\<^sub>2"]}\\
  @{thm[mode=Axiom] ss4}\qquad
  @{thm[mode=Axiom] ss5[of "bs" "rs\<^sub>1" "rs\<^sub>2"]}\\
- @{thm[mode=Rule] ss6[of "r\<^sub>1" "r\<^sub>2" "rs\<^sub>1" "rs\<^sub>2" "rs\<^sub>3"]}\\
+ @{thm[mode=Rule] ss6[of "r\<^sub>2" "r\<^sub>1" "rs\<^sub>1" "rs\<^sub>2" "rs\<^sub>3"]}\\
  \end{tabular}
  \end{center}
  \caption{The rewrite rules for simplifying bitcoded regular expressions.}\label{SimpRewrites}

diff -r fb23e3fd12e5 -r b7199d6c672d thys2/Paper/document/root.bib
--- a/thys2/Paper/document/root.bib	Tue Feb 08 01:25:26 2022 +0000
+++ b/thys2/Paper/document/root.bib	Tue Feb 08 14:29:41 2022 +0000
@@ -16,6 +16,7 @@
 }


+
 @InProceedings{AusafDyckhoffUrban2016,
  author = {F.~Ausaf and R.~Dyckhoff and C.~Urban},
  title = {{POSIX} {L}exing with {D}erivatives of {R}egular {E}xpressions ({P}roof {P}earl)},
@@ -338,3 +339,15 @@
  year = {2013}
 }

+@inproceedings{RibeiroAgda2017,
+author = {R.~Ribeiro and A.~Du Bois},
+title = {{C}ertified {B}it-{C}oded {R}egular {E}xpression {P}arsing},
+year = {2017},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+booktitle = {Proc.~of the 21st Brazilian Symposium on Programming Languages},
+articleno = {4},
+numpages = {8}
+}
+
+
diff -r fb23e3fd12e5 -r b7199d6c672d thys2/Paper/document/root.tex
--- a/thys2/Paper/document/root.tex	Tue Feb 08 01:25:26 2022 +0000
+++ b/thys2/Paper/document/root.tex	Tue Feb 08 14:29:41 2022 +0000
@@ -63,7 +63,7 @@
 \author{Chengsong Tan}{King's College London}{chengsong.tan@kcl.ac.uk}{}{}
 \author{Christian Urban}{King's College London}{christian.urban@kcl.ac.uk}{}{}
 \authorrunning{C.~Tan and C.~Urban}
-\keywords{POSIX matching, Derivatives of Regular Expressions, Isabelle/HOL}
+\keywords{POSIX matching and lexing, Derivatives of Regular Expressions, Isabelle/HOL}
 \category{}
 \ccsdesc[100]{Design and analysis of algorithms}
 \ccsdesc[100]{Formal languages and automata theory}
@@ -75,15 +75,17 @@
 \maketitle

 \begin{abstract}
-  Sulzmann and Lu described a lexing algorithm that calculates
+  Sulzmann and Lu describe a lexing algorithm that calculates
   Brzozowski derivatives using bitcodes annotated to regular
   expressions. Their algorithm generates POSIX values which encode
   the information of \emph{how} a regular expression matches a
   string---that is, which part of the string is matched by which part
-  of the regular expression. The purpose of the bitcodes is to generate POSIX values incrementally while
+  of the regular expression. This information is needed in the
+  context of lexing in order to extract and to classify tokens.
+  The purpose of the bitcodes is to generate POSIX values incrementally while
   derivatives are calculated. They also help with designing
   an ``aggressive'' simplification function that keeps the size of
-  derivatives finite. Without simplification the size derivatives can grow
+  derivatives finite. Without simplification the size of some derivatives can grow
   arbitrarily big, resulting in an extremely slow lexing algorithm. In this
   paper we describe a variant of Sulzmann and Lu's algorithm: Our
   algorithm is a recursive functional program, whereas Sulzmann

diff -r fb23e3fd12e5 -r b7199d6c672d thys2/paper.pdf
Binary file thys2/paper.pdf has changed