author Christian Urban <christian.urban@kcl.ac.uk>
Tue, 08 Feb 2022 14:29:41 +0000
changeset 423 b7199d6c672d
parent 422 fb23e3fd12e5
child 424 2416fdec6396
updated paper
thys/Paper/Paper.thy
thys2/Paper/Paper.thy
thys2/Paper/document/root.bib
thys2/Paper/document/root.tex
thys2/paper.pdf
--- a/thys/Paper/Paper.thy	Tue Feb 08 01:25:26 2022 +0000
+++ b/thys/Paper/Paper.thy	Tue Feb 08 14:29:41 2022 +0000
@@ -431,9 +431,11 @@
   \end{proposition}
 
   In general there is more than one value associated with a regular
-  expression. In case of POSIX matching the problem is to calculate the
-  unique value that satisfies the (informal) POSIX rules from the
-  Introduction. Graphically the POSIX value calculation algorithm by
+  expression (meaning regular expressions can typically match more
+  than one string). But even given a string from the language of the
+  regular expression, there is generally more than one way in which the
+  regular expression can match this string. POSIX lexing is about identifying
+  the unique value that satisfies the (informal) POSIX rules from the
+  Introduction. Graphically the POSIX value calculation algorithm by
   Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz}
   where the path from the left to the right involving @{term derivatives}/@{const
   nullable} is the first phase of the algorithm (calculating successive
--- a/thys2/Paper/Paper.thy	Tue Feb 08 01:25:26 2022 +0000
+++ b/thys2/Paper/Paper.thy	Tue Feb 08 14:29:41 2022 +0000
@@ -18,10 +18,13 @@
 abbreviation 
   "der_syn r c \<equiv> der c r"
 abbreviation 
+ "ders_syn r s \<equiv> ders s r"  
+abbreviation 
   "bder_syn r c \<equiv> bder c r"  
 
 notation (latex output)
   der_syn ("_\\_" [79, 1000] 76) and
+  ders_syn ("_\\_" [79, 1000] 76) and
   bder_syn ("_\\_" [79, 1000] 76) and
   bders ("_\\_" [79, 1000] 76) and
   bders_simp ("_\\\<^sub>s\<^sub>i\<^sub>m\<^sub>p _" [79, 1000] 76) and
@@ -31,7 +34,7 @@
   CH ("_" [1000] 80) and
   ALT ("_ + _" [77,77] 78) and
   SEQ ("_ \<cdot> _" [77,77] 78) and
-  STAR ("_\<^sup>\<star>" [79] 78) and
+  STAR ("_\<^sup>*" [79] 78) and
 
   val.Void ("Empty" 78) and
   val.Char ("Char _" [1000] 78) and
@@ -92,23 +95,31 @@
 expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
 succession) all the characters of the string matches the empty string,
 then @{term r} matches @{term s} (and {\em vice versa}).  We are aware
-of a mechanised correctness proof of Brzozowski's matcher in HOL4 by
+of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
 Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
 of the work by Krauss and Nipkow \cite{Krauss2011}.  And another one
 in Coq is given by Coquand and Siles \cite{Coquand2012}.
-
-There are two difficulties with derivative-based matchers and also
-lexers: First, Brzozowski's original matcher only generates a yes/no
-answer for whether a regular expression matches a string or not.
-Sulzmann and Lu~\cite{Sulzmann2014} overcome this difficulty by
-cleverly extending Brzozowski's matching algorithm to POSIX
-lexing. This extended version generates additional information on
-\emph{how} a regular expression matches a string. They achieve this by
+Ribeiro and Du Bois also give one in Agda \cite{RibeiroAgda2017}.
 
 
-
+However, there are two difficulties with derivative-based matchers:
+First, Brzozowski's original matcher only generates a yes/no answer
+for whether a regular expression matches a string or not.  This is too
+little information in the context of lexing where separate tokens must
+be identified and also classified (for example as keywords
+or identifiers).  Sulzmann and Lu~\cite{Sulzmann2014} overcome this
+difficulty by cleverly extending Brzozowski's matching
+algorithm. Their extended version generates additional information on
+\emph{how} a regular expression matches a string following the POSIX
+rules for regular expression matching. They achieve this by adding a
+second ``phase'' to Brzozowski's algorithm involving an injection
+function.  In our own earlier work we provided the formal
+specification of what POSIX matching means and proved in Isabelle/HOL
+the correctness
+of Sulzmann and Lu's extended algorithm accordingly
+\cite{AusafDyckhoffUrban2016}.
 
-The second problem is that Brzozowski's derivatives can 
+The second difficulty is that Brzozowski's derivatives can 
 grow to arbitrarily big sizes. For example if we start with the
 regular expression \mbox{@{text "(a + aa)\<^sup>*"}} and take
 successive derivatives according to the character $a$, we end up with
@@ -121,31 +132,39 @@
 & $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
 & $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
-& $\ll$ & \ldots
+& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
 \end{tabular}
 \end{center}
  
 \noindent where after around 35 steps we run out of memory on a
-typical computer (we define the precise details of the derivative
-operation later).  Clearly, the notation involving $\ZERO$s and
-$\ONE$s already suggests simplification rules that can be applied to
-regular regular expressions, for example $\ZERO{}r \Rightarrow \ZERO$,
-$\ONE{}r \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r
-\Rightarrow r$. While such simple-minded reductions have been proved
-in our earlier work to preserve the correctness of Sulzmann and Lu's
-algorithm, they unfortunately do \emph{not} help with limiting the
-gowth of the derivatives shown above: yes, the growth is slowed, but the
-derivatives can still grow beyond any finite bound.
+typical computer (we shall define shortly the precise details of our
+regular expressions and the derivative operation).  Clearly, the
+notation involving $\ZERO$s and $\ONE$s already suggests
+simplification rules that can be applied to regular
+expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
+\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
+r$. While such simple-minded simplifications have been proved in our
+earlier work to preserve the correctness of Sulzmann and Lu's
+algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
+\emph{not} help with limiting the growth of the derivatives shown
+above: the growth is slowed, but the derivatives can still grow quickly
+beyond any finite bound.
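
To make the growth concrete, the following is a minimal sketch in Haskell (our choice of illustration language; the paper's own definitions are in Isabelle/HOL and are only given further below) that computes the sizes of successive derivatives of $(a + aa)^*$. All names in this sketch are ours.

```haskell
-- A sketch of Brzozowski derivatives and the size blow-up for (a + aa)*.
-- The definitions of nullable and der anticipate the ones in the paper.
data Rexp = ZERO | ONE | CH Char
          | ALT Rexp Rexp | SEQ Rexp Rexp | STAR Rexp
          deriving (Eq, Show)

nullable :: Rexp -> Bool
nullable ZERO        = False
nullable ONE         = True
nullable (CH _)      = False
nullable (ALT r1 r2) = nullable r1 || nullable r2
nullable (SEQ r1 r2) = nullable r1 && nullable r2
nullable (STAR _)    = True

der :: Char -> Rexp -> Rexp
der _ ZERO        = ZERO
der _ ONE         = ZERO
der c (CH d)      = if c == d then ONE else ZERO
der c (ALT r1 r2) = ALT (der c r1) (der c r2)
der c (SEQ r1 r2) = if nullable r1
                    then ALT (SEQ (der c r1) r2) (der c r2)
                    else SEQ (der c r1) r2
der c (STAR r)    = SEQ (der c r) (STAR r)

-- size counts the constructors of a regular expression
size :: Rexp -> Int
size (ALT r1 r2) = 1 + size r1 + size r2
size (SEQ r1 r2) = 1 + size r1 + size r2
size (STAR r)    = 1 + size r
size _           = 1

-- the simple-minded simplifications mentioned above
simp :: Rexp -> Rexp
simp (ALT r1 r2) = case (simp r1, simp r2) of
                     (ZERO, s2)           -> s2          -- 0 + r => r
                     (s1, s2) | s1 == s2  -> s1          -- r + r => r
                              | otherwise -> ALT s1 s2
simp (SEQ r1 r2) = case (simp r1, simp r2) of
                     (ZERO, _)  -> ZERO                  -- 0 . r => 0
                     (ONE, s2)  -> s2                    -- 1 . r => r
                     (s1, s2)   -> SEQ s1 s2
simp (STAR r)    = STAR (simp r)
simp r           = r

-- sizes of successive derivatives of (a + aa)* w.r.t. 'a', without and
-- with simplification; as stated above, simp only slows the growth
growth, growthSimp :: [Int]
growth     = map size (take 10 (iterate (der 'a') r0))
growthSimp = map size (take 10 (iterate (simp . der 'a') r0))

r0 :: Rexp
r0 = STAR (ALT (CH 'a') (SEQ (CH 'a') (CH 'a')))
```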
 
 
 
-Sulzmann and Lu introduce a
-bitcoded version of their lexing algorithm. They make some claims
-about the correctness and speed of this version, but do
-not provide any supporting proof arguments, not even
-``pencil-and-paper'' arguments. They wrote about their bitcoded
-``incremental parsing method'' (that is the algorithm to be studied in this
-section):
+Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
+\cite{Sulzmann2014} where they introduce bitcoded
+regular expressions. In this version, POSIX values are
+represented as bitsequences and such sequences are incrementally generated
+when derivatives are calculated. The compact representation
+of bitsequences and regular expressions allows them to define a more
+``aggressive'' simplification method that keeps the size of the
+derivatives finite no matter what the length of the string is.
+They make some informal claims about the correctness and linear behaviour
+of this version, but do not provide any supporting proof arguments, not
+even ``pencil-and-paper'' arguments. They write about their bitcoded
+\emph{incremental parsing method} (that is the algorithm to be formalised
+in this paper):
 
 \begin{quote}\it
   ``Correctness Claim: We further claim that the incremental parsing
@@ -155,33 +174,68 @@
   have to work out all proof details.''
 \end{quote}  
 
-
+\noindent{}\textbf{Contributions:} We have implemented in Isabelle/HOL
+the derivative-based lexing algorithm of Sulzmann and Lu
+\cite{Sulzmann2014} where regular expressions are annotated with
+bitsequences. We define the crucial simplification function as a
+recursive function, instead of a fix-point operation. One objective of
+the simplification function is to remove duplicates of regular
+expressions.  For this Sulzmann and Lu use in their paper the standard
+@{text nub} function from Haskell's list library, but this function
+does not achieve the intended objective with bitcoded regular expressions.  The
+reason is that in the bitcoded setting, each copy generally has a
+different bitcode annotation---so @{text nub} would never ``fire''.
+Inspired by Scala's library for lists, we shall instead use a @{text
+distinctBy} function that finds duplicates under an erasing function
+that deletes bitcodes.
+We shall also introduce our own argument and definitions for
+establishing the correctness of the bitcoded algorithm when 
+simplifications are included.\medskip
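
As an illustration of the @{text distinctBy} idea, here is a minimal Haskell sketch (our rendering, not the formalised Isabelle definition): duplicates are detected under a supplied projection, for instance a function erasing bitcodes.

```haskell
-- Sketch of distinctBy: keep the first occurrence of each element,
-- where two elements count as equal if they have the same image under
-- the projection f (e.g. an erase function deleting the bitcode
-- annotations from annotated regular expressions).
distinctBy :: Eq b => [a] -> (a -> b) -> [a]
distinctBy xs f = go xs []
  where
    go []     _    = []
    go (y:ys) seen
      | f y `elem` seen = go ys seen              -- duplicate under f
      | otherwise       = y : go ys (f y : seen)  -- first occurrence
```

With plain @{text nub}, two copies of the same regular expression carrying different bitcodes would count as distinct; @{text "distinctBy rs erase"} removes them.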
 
+\noindent In this paper, we shall first briefly introduce the basic notions
+of regular expressions and describe the basic definitions
+of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves
+as a reference point for what correctness means in our Isabelle/HOL proofs. We shall then prove
+the correctness of the bitcoded algorithm without simplification, and
+after that extend the proof to include simplification. 
 
+*}
+
+section {* Background *}
 
-If a regular expression matches a string, then in general there is more
-than one way of how the string is matched. There are two commonly used
-disambiguation strategies to generate a unique answer: one is called
-GREEDY matching \cite{Frisch2004} and the other is POSIX
-matching~\cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}.
-For example consider the string @{term xy} and the regular expression
-\mbox{@{term "STAR (ALT (ALT x y) xy)"}}. Either the string can be
-matched in two `iterations' by the single letter-regular expressions
-@{term x} and @{term y}, or directly in one iteration by @{term xy}. The
-first case corresponds to GREEDY matching, which first matches with the
-left-most symbol and only matches the next symbol in case of a mismatch
-(this is greedy in the sense of preferring instant gratification to
-delayed repletion). The second case is POSIX matching, which prefers the
-longest match.
+text {*
+  In our Isabelle/HOL formalisation strings are lists of characters with
+  the empty string being represented by the empty list, written $[]$,
+  and list-cons being written as $\_\!::\!\_\,$; string
+  concatenation is $\_ \,@\, \_\,$. We often use the usual
+  bracket notation for lists also for strings; for example a string
+  consisting of just a single character $c$ is written $[c]$.   
+  Our regular expressions are defined as usual as the elements of the following inductive
+  datatype:
 
+  \begin{center}
+  @{text "r :="}
+  @{const "ZERO"} $\mid$
+  @{const "ONE"} $\mid$
+  @{term "CH c"} $\mid$
+  @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
+  @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
+  @{term "STAR r"} 
+  \end{center}
 
-The derivative has the property (which may almost be
-regarded as its specification) that, for every string @{term s} and
-regular expression @{term r} and character @{term c}, one has @{term
-"cs \<in> L(r)"} if and only if \mbox{@{term "s \<in> L(der c r)"}}.
+  \noindent where @{const ZERO} stands for the regular expression that does
+  not match any string, @{const ONE} for the regular expression that matches
+  only the empty string and @{term c} for matching a character literal.
+  The constructors $+$ and $\cdot$ represent alternatives and sequences, respectively.
+  The
+  \emph{language} of a regular expression, written $L$, is defined as usual
+  and we omit giving the definition here (see for example \cite{AusafDyckhoffUrban2016}).
 
-
-
+  Central to Brzozowski's regular expression matcher are two functions
+  called @{text nullable} and \emph{derivative}. The latter is written
+  $r\backslash c$ for the derivative of the regular expression $r$
+  w.r.t.~the character $c$. Both functions are defined by recursion over
+  regular expressions.  
 
 \begin{center}
 \begin{tabular}{cc}
@@ -208,8 +262,182 @@
 \end{tabular}  
 \end{center}
 
+  \noindent
+  We can extend this definition to give derivatives w.r.t.~strings:
+
+  \begin{center}
+  \begin{tabular}{cc}
+  \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+  @{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}
+  \end{tabular}
+  &
+  \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+  @{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}
+  \end{tabular}
+  \end{tabular}
+  \end{center}
+
+\noindent
+Using @{text nullable} and the derivative operation, we can
+define the following simple regular expression matcher:
+%
+\[
+@{text "match s r"} \;\dn\; @{term nullable}(r\backslash s)
+\]
+
+\noindent This is essentially Brzozowski's algorithm from 1964. Its
+main virtue is that the algorithm can be easily implemented as a
+functional program (either in a functional programming language or in
+a theorem prover). The correctness proof for @{text match} amounts to
+establishing the property
+%
+\begin{proposition}\label{matchcorr} 
+@{text "match s r"} \;\;\text{if and only if}\;\; $s \in L(r)$
+\end{proposition}
+
+\noindent
+It is a fun exercise to formally prove this property in a theorem prover.
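
Continuing the Haskell sketch from the Introduction (reusing @{text Rexp}, @{text nullable} and @{text der} as defined there), string derivatives and the matcher are a handful of lines:

```haskell
-- string derivatives: fold the character derivative over the string
ders :: String -> Rexp -> Rexp
ders []     r = r
ders (c:cs) r = ders cs (der c r)

-- Brzozowski's matcher: derive, then test for nullability
match :: String -> Rexp -> Bool
match s r = nullable (ders s r)

-- for example, match "aaa" r0 evaluates to True for r0 = (a + aa)*
```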
+
+The novel idea of Sulzmann and Lu is to extend this algorithm for 
+lexing, where it is important to find out which part of the string
+is matched by which part of the regular expression.
+For this Sulzmann and Lu presented two lexing algorithms in their paper
+  \cite{Sulzmann2014}. The first algorithm consists of two phases: first a
+  matching phase (which is Brzozowski's algorithm) and then a value
+  construction phase. The values encode \emph{how} a regular expression
+  matches a string. \emph{Values} are defined as the inductive datatype
+
+  \begin{center}
+  @{text "v :="}
+  @{const "Void"} $\mid$
+  @{term "val.Char c"} $\mid$
+  @{term "Left v"} $\mid$
+  @{term "Right v"} $\mid$
+  @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ 
+  @{term "Stars vs"} 
+  \end{center}  
+
+  \noindent where we use @{term vs} to stand for a list of values. The
+  string underlying a value can be calculated by a @{const flat}
+  function, written @{term "flat DUMMY"}. It traverses a value and
+  collects the characters contained in it. Sulzmann and Lu also define inductively an
+  inhabitation relation that associates values to regular expressions:
+
+  \begin{center}
+  \begin{tabular}{c}
+  \\[-8mm]
+  @{thm[mode=Axiom] Prf.intros(4)} \qquad
+  @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[4mm]
+  @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad 
+  @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[4mm]
+  @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]} \qquad
+  @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}
+  \end{tabular}
+  \end{center}
+
+  \noindent Note that no values are associated with the regular expression
+  @{term ZERO}, since it cannot match any string.
+  It is routine to establish how values ``inhabiting'' a regular
+  expression correspond to the language of a regular expression, namely
+
+  \begin{proposition}
+  @{thm L_flat_Prf}
+  \end{proposition}
+
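
For concreteness, values and the @{const flat} function can be sketched in Haskell as follows (constructor names are ours; @{text Left} and @{text Right} are primed to avoid a clash with the Haskell Prelude):

```haskell
-- Values record how a regular expression matched a string.
data Val = Void | Chr Char | Left' Val | Right' Val
         | Sq Val Val | Stars [Val]
         deriving (Eq, Show)

-- flat recovers the string underlying a value
flat :: Val -> String
flat Void       = []
flat (Chr c)    = [c]
flat (Left' v)  = flat v
flat (Right' v) = flat v
flat (Sq v1 v2) = flat v1 ++ flat v2
flat (Stars vs) = concatMap flat vs
```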
+  In general there is more than one value inhabited by a regular
+  expression (meaning regular expressions can typically match more
+  than one string). But even when fixing a string from the language of the
+  regular expression, there is generally more than one way in which the
+  regular expression can match this string. POSIX lexing is about
+  identifying the unique value for a given regular expression and a
+  string that satisfies the informal POSIX rules (see
+  \cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}).\footnote{POSIX
+	lexing acquired its name from the fact that the corresponding
+	rules were described as part of the POSIX specification for
+	Unix-like operating systems \cite{POSIX}.} Sometimes these
+  informal rules are called \emph{maximal munch rule} and \emph{rule priority}.
+  One contribution of our earlier paper is to give a convenient
+ specification for what a POSIX value is (the inductive rules are shown in
+  Figure~\ref{POSIXrules}).
 
 \begin{figure}[t]
+  \begin{center}
+  \begin{tabular}{c}
+  @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \qquad
+  @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\medskip\\
+  @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\qquad
+  @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\
+  $\mprset{flushleft}
+   \inferrule
+   {@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
+    @{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\
+    @{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}
+   {@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\<open>PS\<close>\medskip\smallskip\\
+  @{thm[mode=Axiom] Posix.intros(7)}\<open>P[]\<close>\qquad
+  $\mprset{flushleft}
+   \inferrule
+   {@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
+    @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
+    @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
+    @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
+   {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\\[-4mm]
+  \end{tabular}
+  \end{center}
+  \caption{The inductive definition of POSIX values taken from our earlier paper \cite{AusafDyckhoffUrban2016}. The ternary relation, written $(s, r) \rightarrow v$, formalises the notion
+  that, for a given string $s$ and regular
+  expression $r$, the value $v$ is the unique one that satisfies the informal POSIX constraints for
+  regular expression matching.}\label{POSIXrules}
+  \end{figure}
+
+  The clever idea by Sulzmann and Lu \cite{Sulzmann2014} in their first algorithm is to define
+  an injection function on values that mirrors (but inverts) the
+  construction of the derivative on regular expressions. Essentially it
+  injects back a character into a value.
+  For this they define two functions called @{text mkeps} and @{text inj}:
+ 
+  \begin{center}
+  \begin{tabular}{l}
+  \begin{tabular}{lcl}
+  @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\
+  @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
+  @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
+  @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
+  \end{tabular}\smallskip\\
+
+  \begin{tabular}{lcl}
+  @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
+  @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & 
+      @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
+  @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & 
+      @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
+  @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ 
+      & @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
+  @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ 
+      & @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
+  @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ 
+      & @{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
+  @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ 
+      & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}
+  \end{tabular}
+  \end{tabular}
+  \end{center}
+
+  \noindent
+  The function @{text mkeps} is called when the last derivative is nullable, that is
+  the string to be matched is in the language of the regular expression. It generates
+  a value for how the last derivative can match the empty string. The injection function
+  then calculates the corresponding value for each intermediate derivative until
+  a value for the original regular expression is generated.
+  Graphically the algorithm by
+  Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz}
+  where the path from the left to the right involving @{term derivatives}/@{const
+  nullable} is the first phase of the algorithm (calculating successive
+  \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
+  left, the second phase. This picture shows the steps required when a
+  regular expression, say @{text "r\<^sub>1"}, matches the string @{term
+  "[a,b,c]"}. The lexing algorithm by Sulzmann and Lu can be defined as:
+
+  \begin{figure}[t]
 \begin{center}
 \begin{tikzpicture}[scale=2,node distance=1.3cm,
                     every node/.style={minimum size=6mm}]
@@ -234,7 +462,7 @@
 \end{center}
 \mbox{}\\[-13mm]
 
-\caption{The two phases of the algorithm by Sulzmann \& Lu \cite{Sulzmann2014},
+\caption{The two phases of the first algorithm by Sulzmann \& Lu \cite{Sulzmann2014},
 matching the string @{term "[a,b,c]"}. The first phase (the arrows from 
 left to right) is \Brz's matcher building successive derivatives. If the 
 last regular expression is @{term nullable}, then the functions of the 
@@ -242,183 +470,50 @@
 @{term mkeps} calculates a value @{term "v\<^sub>4"} witnessing
 how the empty string has been recognised by @{term "r\<^sub>4"}. After
 that the function @{term inj} ``injects back'' the characters of the string into
-the values.
+the values. The value @{term "v\<^sub>1"} is the result of the algorithm representing
+the POSIX value for this string and
+regular expression.
 \label{Sulz}}
 \end{figure} 
 
 
-*}
-
-section {* Background *}
-
-text {*
-  In our Isabelle/HOL formalisation strings are lists of characters with
-  the empty string being represented by the empty list, written $[]$,
-  and list-cons being written as $\_\!\_\!::\!\_\!\_\,$; string
-  concatenation is $\_\!\_ \,@\, \_\!\_\,$. Often we use the usual
-  bracket notation for lists also for strings; for example a string
-  consisting of just a single character $c$ is written $[c]$.   
-  Our egular expressions are defined as usual as the elements of the following inductive
-  datatype:
 
   \begin{center}
-  @{text "r :="}
-  @{const "ZERO"} $\mid$
-  @{const "ONE"} $\mid$
-  @{term "CH c"} $\mid$
-  @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
-  @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
-  @{term "STAR r"} 
-  \end{center}
-
-  \noindent where @{const ZERO} stands for the regular expression that does
-  not match any string, @{const ONE} for the regular expression that matches
-  only the empty string and @{term c} for matching a character literal. The
-  language of a regular expression, written $L$, is defined as usual
-  (see for example \cite{AusafDyckhoffUrban2016}).
-
-  Central to Brzozowski's regular expression matcher are two functions
-  called $\nullable$ and \emph{derivative}. The latter is written
-  $r\backslash c$ for the derivative of the regular expression $r$
-  w.r.t.~the character $c$. Both functions are defined by recursion over
-  regular expressions.  
-
-
-	\begin{center}
-		\begin{tabular}{lcl}
-			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
-			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
-			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
-			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
-			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
-			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
-		\end{tabular}
-	\end{center}
-
-  \noindent
-  The derivative function takes a regular expression, say $r$ and a
-  character, say $c$, as input and returns the derivative regular
-  expression.
-
-	\begin{center}
-		\begin{tabular}{lcl}
-			$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
-			$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
-			$d \backslash c$     & $\dn$ & 
-			$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
-			$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
-			$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \nullable(r_1)$\\
-			&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
-			&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
-			$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
-		\end{tabular}
-	\end{center}
-
-
-  Sulzmann and Lu presented two lexing algorithms in their paper from 2014
-  \cite{Sulzmann2014}. This first algorithm consists of two phases: first a
-  matching phase (which is Brzozowski's algorithm) and then a value
-  construction phase. The values encode \emph{how} a regular expression
-  matches a string. \emph{Values} are defined as the inductive datatype
-
-  \begin{center}
-  @{text "v :="}
-  @{const "Void"} $\mid$
-  @{term "val.Char c"} $\mid$
-  @{term "Left v"} $\mid$
-  @{term "Right v"} $\mid$
-  @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ 
-  @{term "Stars vs"} 
-  \end{center}  
-
-  \noindent where we use @{term vs} to stand for a list of
-  values.
-
-
-
-
-  \noindent Sulzmann and Lu also define inductively an inhabitation relation
-  that associates values to regular expressions:
-
-  \begin{center}
-  \begin{tabular}{c}
-  \\[-8mm]
-  @{thm[mode=Axiom] Prf.intros(4)} \qquad
-  @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[4mm]
-  @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad 
-  @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[4mm]
-  @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\\[4mm]
-  %%@ { t h m[mode=Axiom] Prf.intros(6)[of "r"]} \qquad  
-  @{thm[mode=Rule] Prf.intros(6)[of "r" "vs"]}
+  \begin{tabular}{lcl}
+  @{thm (lhs) lexer.simps(1)} & $\dn$ & @{thm (rhs) lexer.simps(1)}\\
+  @{thm (lhs) lexer.simps(2)} & $\dn$ & @{text "case"} @{term "lexer (der c r) s"} @{text of}\\
+                     & & \phantom{$|$} @{term "None"}  @{text "\<Rightarrow>"} @{term None}\\
+                     & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}                          
   \end{tabular}
   \end{center}
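
Assembled as a Haskell sketch (reusing @{text Rexp}, @{text Val}, @{text nullable} and @{text der} from the earlier sketches; as in the paper, @{text mkeps} and @{text inj} are partial functions that are only ever called on the relevant cases):

```haskell
-- mkeps: a value for how a nullable regular expression matches []
mkeps :: Rexp -> Val
mkeps ONE         = Void
mkeps (ALT r1 r2) = if nullable r1 then Left'  (mkeps r1)
                                   else Right' (mkeps r2)
mkeps (SEQ r1 r2) = Sq (mkeps r1) (mkeps r2)
mkeps (STAR _)    = Stars []

-- inj: inject the character c back into a value for r\c,
-- obtaining a value for r
inj :: Rexp -> Char -> Val -> Val
inj (CH d)      _ Void               = Chr d
inj (ALT r1 _)  c (Left' v1)         = Left'  (inj r1 c v1)
inj (ALT _ r2)  c (Right' v2)        = Right' (inj r2 c v2)
inj (SEQ r1 _)  c (Sq v1 v2)         = Sq (inj r1 c v1) v2
inj (SEQ r1 _)  c (Left' (Sq v1 v2)) = Sq (inj r1 c v1) v2
inj (SEQ r1 r2) c (Right' v2)        = Sq (mkeps r1) (inj r2 c v2)
inj (STAR r)    c (Sq v (Stars vs))  = Stars (inj r c v : vs)

-- the lexer: Brzozowski's phase forwards, injections backwards
lexer :: Rexp -> String -> Maybe Val
lexer r []    = if nullable r then Just (mkeps r) else Nothing
lexer r (c:s) = case lexer (der c r) s of
                  Nothing -> Nothing
                  Just v  -> Just (inj r c v)

-- e.g. lexer r0 "aa" with r0 = (a + aa)* yields
--   Just (Stars [Right' (Sq (Chr 'a') (Chr 'a'))])
-- i.e. one iteration matching "aa" (longest match), not two matching "a"
```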
 
-  \noindent Note that no values are associated with the regular expression
-  @{term ZERO}. It is routine to establish how values ``inhabiting'' a regular
-  expression correspond to the language of a regular expression, namely
+
+We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that the algorithm by Sulzmann and Lu
+is correct. The central property we established relates the derivative operation to the injection function.
 
-  \begin{proposition}
-  @{thm L_flat_Prf}
-  \end{proposition}
+  \begin{proposition}\label{Posix2}
+	\textit{If} $(s,\; r\backslash c) \rightarrow v$ \textit{then} $(c :: s,\; r) \rightarrow$ \textit{inj} $r\; c\; v$. 
+\end{proposition}
 
-  Sulzmann-Lu algorithm with inj. State that POSIX rules.
-  metion slg is correct.
+  \noindent
+  With this in place we were able to prove:
 
 
-  \begin{figure}[t]
-  \begin{center}
-  \begin{tabular}{c}
-  @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \qquad
-  @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\medskip\\
-  @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\qquad
-  @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\
-  $\mprset{flushleft}
-   \inferrule
-   {@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
-    @{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\
-    @{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}
-   {@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\<open>PS\<close>\\
-  @{thm[mode=Axiom] Posix.intros(7)}\<open>P[]\<close>\medskip\\
-  $\mprset{flushleft}
-   \inferrule
-   {@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
-    @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
-    @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
-    @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
-   {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>
+  \begin{proposition}\mbox{}\smallskip\\\label{lexercorrect}
+  \begin{tabular}{ll}
+  (1) & @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}\\
+  (2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\
   \end{tabular}
-  \end{center}
-  \caption{Our inductive definition of POSIX values.}\label{POSIXrules}
-  \end{figure}
-
+  \end{proposition}
 
-  \begin{center}
-  \begin{tabular}{lcl}
-  @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\
-  @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
-  @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
-  @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
-  \end{tabular}
-  \end{center}
-
-  \begin{center}
-  \begin{tabular}{l@ {\hspace{5mm}}lcl}
-  \textit{(1)} & @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
-  \textit{(2)} & @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & 
-      @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
-  \textit{(3)} & @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & 
-      @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
-  \textit{(4)} & @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ 
-      & @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
-  \textit{(5)} & @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ 
-      & @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
-  \textit{(6)} & @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ 
-      & @{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
-  \textit{(7)} & @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ 
-      & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\
-  \end{tabular}
-  \end{center}
-
+  \noindent
+  In fact we have shown that in the success case the generated POSIX value $v$ is
+  unique and in the failure case that there is no POSIX value $v$ that satisfies
+  $(s, r) \rightarrow v$. While the algorithm is correct, it is excruciatingly
+  slow in examples where the derivatives grow arbitrarily (see the example from the
+  Introduction). However, it can be used as a convenient reference point for the correctness
+  proof of the second algorithm by Sulzmann and Lu, which we shall describe next.
+  
 *}
 
 section {* Bitcoded Regular Expressions and Derivatives *}
@@ -426,7 +521,7 @@
 text {*
 
   In the second part of their paper \cite{Sulzmann2014},
-  Sulzmann and Lu describe another algorithm that generates POSIX
+  Sulzmann and Lu describe another algorithm that also generates POSIX
  values but dispenses with the second phase where characters are
   injected ``back'' into values. For this they annotate bitcodes to
   regular expressions, which we define in Isabelle/HOL as the datatype
@@ -579,10 +674,10 @@
   the functions \textit{bnullable} and \textit{bmkeps}, which are the
   ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on
   bitcoded regular expressions, instead of regular expressions.
-
+  %
   \begin{center}
   \begin{tabular}{@ {}c@ {}c@ {}}
-  \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+  \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
   $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{false}$\\
   $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{true}$\\
   $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{false}$\\
@@ -594,7 +689,7 @@
      $\textit{true}$
   \end{tabular}
   &
-  \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+  \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
   $\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$\\
   $\textit{bmkeps}\,(\textit{ALTs}\,bs\,r\!::\!\rs)$ & $\dn$ &
      $\textit{if}\;\textit{bnullable}\,r$\\
@@ -711,6 +806,42 @@
 \end{proof}
 
 \noindent
+The only problem left for the correctness proof is that the bitcoded algorithm
+has only a ``forward phase'' where POSIX values are generated incrementally.
+We can achieve the same effect with @{text lexer} by stacking up injection
+functions during the forward phase. An auxiliary function, called $\textit{flex}$,
+allows us to recast the rules of $\lexer$ (with its two phases) in terms of a single
+phase.
+
+\begin{center}
+\begin{tabular}{lcl}
+  $\textit{flex}\;r\,f\,[]$ & $\dn$ & $f$\\
+  $\textit{flex}\;r\,f\,(c\!::\!s)$ & $\dn$ &
+  $\textit{flex}\,(r\backslash c)\,(\lambda v.\,f\,(\inj\,r\,c\,v))\,s$\\
+\end{tabular}    
+\end{center}    
+
+\noindent
+The point of this function is that when
+reaching the end of the string, we just need to apply the stacked
+injection functions to the value generated by $\mkeps$.
+Using this function we can recast the success case in @{text lexer} 
+as follows:
+
+\begin{proposition}\label{flex}
+If @{text "lexer r s = Some v"} \;then\; @{text "v = "}$\,\textit{flex}\,r\,id\,s\,
+      (\mkeps (r\backslash s))$.
+\end{proposition}
+
+\noindent
+Note that we did not redefine \textit{lexer}; we just established that the
+value generated by \textit{lexer} can also be obtained by a different
+method. While this different method is not efficient (we essentially
+need to traverse the string $s$ twice, once for building the
+derivative $r\backslash s$ and another time for stacking up injection
+functions using \textit{flex}), it helped us in proving
+that values can be built up incrementally.
+
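
A Haskell sketch of $\textit{flex}$ (reusing the definitions from the earlier sketches) makes the single-phase reading explicit:

```haskell
-- flex stacks up injection functions while moving forwards through the
-- string; at the end the stack is applied to a value for the last
-- derivative
flex :: Rexp -> (Val -> Val) -> String -> (Val -> Val)
flex _ f []    = f
flex r f (c:s) = flex (der c r) (f . inj r c) s

-- the proposition above read as a definition: whenever lexer succeeds,
-- lexerFlex computes the same value
lexerFlex :: Rexp -> String -> Maybe Val
lexerFlex r s
  | nullable r' = Just (flex r id s (mkeps r'))
  | otherwise   = Nothing
  where r' = ders s r
```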
 This brings us to our main lemma in this section: if we build a
 derivative, say $r\backslash s$ and have a value, say $v$, inhabited
 by this derivative, then we can produce the result $\lexer$ generates
@@ -737,7 +868,7 @@
   The case for $[]$ is routine using Lemmas~\ref{codedecode}
   and~\ref{retrievecode}. In the case $s\,@\,[c]$, we can infer from
   the assumption that $\vdash v : (r\backslash s)\backslash c$
-  holds. Hence by Lemma~\ref{Posix2} we know that 
+  holds. Hence by Prop.~\ref{Posix2} we know that 
   (*) $\vdash \inj\,(r\backslash s)\,c\,v : r\backslash s$ holds too.
   By definition of $\textit{flex}$ we can unfold the left-hand side
   to be
@@ -748,7 +879,7 @@
 
   \noindent
   By induction hypothesis and (*) we can rewrite the right-hand side to
-
+  %
   \[
     \textit{decode}\,(\textit{retrieve}\,(r^\uparrow\backslash s)\;
     (\inj\,(r\backslash s)\,c\,\,v))\,r
@@ -771,7 +902,7 @@
 \end{theorem}  
 
 \begin{proof}
-  We can first expand both sides using Lemma~\ref{flex} and the
+  We can first expand both sides using Prop.~\ref{flex} and the
   definition of \textit{blexer}. This gives us two
   \textit{if}-statements, which we need to show to be equal. By 
   Lemma~\ref{bnullable}\textit{(2)} we know the \textit{if}-tests coincide:
@@ -802,7 +933,7 @@
 
 \noindent
 This establishes that the bitcoded algorithm by Sulzmann
-and Lu without simplification produces correct results. This was
+and Lu \emph{without} simplification produces correct results. This was
 only conjectured in their paper \cite{Sulzmann2014}. The next step
 is to add simplifications.
 
@@ -816,12 +947,13 @@
      Derivatives as calculated by Brzozowski’s method are usually more
      complex regular expressions than the initial one; the result is
      that the derivative-based matching and lexing algorithms are
-     often abysmally slow.
-
-     However, as Sulzmann and
-     Lu wrote, various optimisations are possible, such as the
-     simplifications of 0 + r,r + 0,1 · r and r · 1 to r. These
-     simplifications can speed up the algorithms considerably.
+     often abysmally slow. However, as Sulzmann and Lu wrote, various
+     optimisations are possible, such as the simplifications
+     $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r \Rightarrow r$,
+     $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow r$. While these
+     simplifications can speed up the algorithms considerably in many
+     cases, they do not fundamentally solve the ``growth problem'' with
+     derivatives. To see this let us return to the example 
 
 
      \begin{lemma}
@@ -879,7 +1011,7 @@
   @{thm[mode=Rule] ss3[of "r\<^sub>1" "r\<^sub>2"]}\\
   @{thm[mode=Axiom] ss4}\qquad
   @{thm[mode=Axiom] ss5[of "bs" "rs\<^sub>1" "rs\<^sub>2"]}\\
-  @{thm[mode=Rule] ss6[of "r\<^sub>1" "r\<^sub>2" "rs\<^sub>1" "rs\<^sub>2" "rs\<^sub>3"]}\\
+  @{thm[mode=Rule] ss6[of "r\<^sub>2" "r\<^sub>1" "rs\<^sub>1" "rs\<^sub>2" "rs\<^sub>3"]}\\
   \end{tabular}
   \end{center}
  \caption{The rewrite rules for simplifying bitcoded regular expressions.}\label{SimpRewrites}
--- a/thys2/Paper/document/root.bib	Tue Feb 08 01:25:26 2022 +0000
+++ b/thys2/Paper/document/root.bib	Tue Feb 08 14:29:41 2022 +0000
@@ -16,6 +16,7 @@
 }
 
 
+
 @InProceedings{AusafDyckhoffUrban2016,
   author =   {F.~Ausaf and R.~Dyckhoff and C.~Urban},
   title = 	 {{POSIX} {L}exing with {D}erivatives of {R}egular {E}xpressions ({P}roof {P}earl)},
@@ -338,3 +339,15 @@
   year =         {2013}
 }
 
+@inproceedings{RibeiroAgda2017,
+author = {R.~Ribeiro and A.~Du Bois},
+title = {{C}ertified {B}it-{C}oded {R}egular {E}xpression {P}arsing},
+year = {2017},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+booktitle = {Proc.~of the 21st Brazilian Symposium on Programming Languages},
+articleno = {4},
+numpages = {8}
+}
+
+
--- a/thys2/Paper/document/root.tex	Tue Feb 08 01:25:26 2022 +0000
+++ b/thys2/Paper/document/root.tex	Tue Feb 08 14:29:41 2022 +0000
@@ -63,7 +63,7 @@
 \author{Chengsong Tan}{King's College London}{chengsong.tan@kcl.ac.uk}{}{}
 \author{Christian Urban}{King's College London}{christian.urban@kcl.ac.uk}{}{}
 \authorrunning{C.~Tan and C.~Urban}
-\keywords{POSIX matching, Derivatives of Regular Expressions, Isabelle/HOL}
+\keywords{POSIX matching and lexing, Derivatives of Regular Expressions, Isabelle/HOL}
 \category{}
 \ccsdesc[100]{Design and analysis of algorithms}
 \ccsdesc[100]{Formal languages and automata theory}
@@ -75,15 +75,17 @@
 \maketitle
 
 \begin{abstract}
-  Sulzmann and Lu described a lexing algorithm that calculates
+  Sulzmann and Lu describe a lexing algorithm that calculates
   Brzozowski derivatives using bitcodes annotated to regular
   expressions.  Their algorithm generates POSIX values which encode
   the information of \emph{how} a regular expression matches a
   string---that is, which part of the string is matched by which part
-  of the regular expression.  The purpose of the bitcodes is to generate POSIX values incrementally while
+  of the regular expression. This information is needed in the
+  context of lexing in order to extract and to classify tokens.
+  The purpose of the bitcodes is to generate POSIX values incrementally while
   derivatives are calculated. They also help with designing
   an ``aggressive'' simplification function that keeps the size of
-  derivatives finite. Without simplification the size derivatives can grow
+  derivatives finite. Without simplification the size of some derivatives can grow
   arbitrarily big resulting in an extremely slow lexing algorithm.  In this
   paper we describe a variant of Sulzmann and Lu's algorithm: Our
   algorithm is a recursive functional program, whereas Sulzmann
Binary file thys2/paper.pdf has changed