thys2/Paper/Paper.thy
changeset 423 b7199d6c672d
parent 420 b66a4305749c
child 424 2416fdec6396
equal deleted inserted replaced
422:fb23e3fd12e5 423:b7199d6c672d
    16 
    16 
    17 
    17 
    18 abbreviation 
    18 abbreviation 
    19   "der_syn r c \<equiv> der c r"
    19   "der_syn r c \<equiv> der c r"
    20 abbreviation 
    20 abbreviation 
       
    21  "ders_syn r s \<equiv> ders s r"  
       
    22 abbreviation 
    21   "bder_syn r c \<equiv> bder c r"  
    23   "bder_syn r c \<equiv> bder c r"  
    22 
    24 
    23 notation (latex output)
    25 notation (latex output)
    24   der_syn ("_\\_" [79, 1000] 76) and
    26   der_syn ("_\\_" [79, 1000] 76) and
       
    27   ders_syn ("_\\_" [79, 1000] 76) and
    25   bder_syn ("_\\_" [79, 1000] 76) and
    28   bder_syn ("_\\_" [79, 1000] 76) and
    26   bders ("_\\_" [79, 1000] 76) and
    29   bders ("_\\_" [79, 1000] 76) and
    27   bders_simp ("_\\\<^sub>s\<^sub>i\<^sub>m\<^sub>p _" [79, 1000] 76) and
    30   bders_simp ("_\\\<^sub>s\<^sub>i\<^sub>m\<^sub>p _" [79, 1000] 76) and
    28 
    31 
    29   ZERO ("\<^bold>0" 81) and 
    32   ZERO ("\<^bold>0" 81) and 
    30   ONE ("\<^bold>1" 81) and 
    33   ONE ("\<^bold>1" 81) and 
    31   CH ("_" [1000] 80) and
    34   CH ("_" [1000] 80) and
    32   ALT ("_ + _" [77,77] 78) and
    35   ALT ("_ + _" [77,77] 78) and
    33   SEQ ("_ \<cdot> _" [77,77] 78) and
    36   SEQ ("_ \<cdot> _" [77,77] 78) and
    34   STAR ("_\<^sup>\<star>" [79] 78) and
    37   STAR ("_\<^sup>*" [79] 78) and
    35 
    38 
    36   val.Void ("Empty" 78) and
    39   val.Void ("Empty" 78) and
    37   val.Char ("Char _" [1000] 78) and
    40   val.Char ("Char _" [1000] 78) and
    38   val.Left ("Left _" [79] 78) and
    41   val.Left ("Left _" [79] 78) and
    39   val.Right ("Right _" [1000] 78) and
    42   val.Right ("Right _" [1000] 78) and
    90 regular expression, written @{term "der c r"}, give a simple solution
    93 regular expression, written @{term "der c r"}, give a simple solution
    91 to the problem of matching a string @{term s} with a regular
    94 to the problem of matching a string @{term s} with a regular
    92 expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
    95 expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
    93 succession) all the characters of the string matches the empty string,
    96 succession) all the characters of the string matches the empty string,
    94 then @{term r} matches @{term s} (and {\em vice versa}).  We are aware
    97 then @{term r} matches @{term s} (and {\em vice versa}).  We are aware
    95 of a mechanised correctness proof of Brzozowski's matcher in HOL4 by
    98 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
    96 Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
    99 Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part
    97 of the work by Krauss and Nipkow \cite{Krauss2011}.  And another one
   100 of the work by Krauss and Nipkow \cite{Krauss2011}.  And another one
    98 in Coq is given by Coquand and Siles \cite{Coquand2012}.
   101 in Coq is given by Coquand and Siles \cite{Coquand2012}.
    99 
   102 Also Ribeiro and Du Bois give one in Agda \cite{RibeiroAgda2017}.
   100 There are two difficulties with derivative-based matchers and also
   103 
   101 lexers: First, Brzozowski's original matcher only generates a yes/no
   104 
   102 answer for whether a regular expression matches a string or not.
   105 However, there are two difficulties with derivative-based matchers:
   103 Sulzmann and Lu~\cite{Sulzmann2014} overcome this difficulty by
   106 First, Brzozowski's original matcher only generates a yes/no answer
   104 cleverly extending Brzozowski's matching algorithm to POSIX
   107 for whether a regular expression matches a string or not.  This is too
   105 lexing. This extended version generates additional information on
   108 little information in the context of lexing where separate tokens must
   106 \emph{how} a regular expression matches a string. They achieve this by
   109 be identified and also classified (for example as keywords
   107 
   110 or identifiers).  Sulzmann and Lu~\cite{Sulzmann2014} overcome this
   108 
   111 difficulty by cleverly extending Brzozowski's matching
   109 
   112 algorithm. Their extended version generates additional information on
   110 
   113 \emph{how} a regular expression matches a string following the POSIX
   111 The second problem is that Brzozowski's derivatives can 
   114 rules for regular expression matching. They achieve this by adding a
       
   115 second ``phase'' to Brzozowski's algorithm involving an injection
       
   116 function.  In our own earlier work we provided the formal
       
   117 specification of what POSIX matching means and proved in Isabelle/HOL
       
   118 the correctness
       
   119 of Sulzmann and Lu's extended algorithm accordingly
       
   120 \cite{AusafDyckhoffUrban2016}.
       
   121 
       
   122 The second difficulty is that Brzozowski's derivatives can 
   112 grow to arbitrarily big sizes. For example if we start with the
   123 grow to arbitrarily big sizes. For example if we start with the
   113 regular expression \mbox{@{text "(a + aa)\<^sup>*"}} and take
   124 regular expression \mbox{@{text "(a + aa)\<^sup>*"}} and take
   114 successive derivatives according to the character $a$, we end up with
   125 successive derivatives according to the character $a$, we end up with
   115 a sequence of ever-growing derivatives like 
   126 a sequence of ever-growing derivatives like 
   116 
   127 
   119 \begin{tabular}{rll}
   130 \begin{tabular}{rll}
   120 $(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
   131 $(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
   121 & $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
   132 & $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
   122 & $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
   133 & $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
   123 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
   134 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
   124 & $\ll$ & \ldots
   135 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
   125 \end{tabular}
   136 \end{tabular}
   126 \end{center}
   137 \end{center}
   127  
   138  
   128 \noindent where after around 35 steps we run out of memory on a
   139 \noindent where after around 35 steps we run out of memory on a
   129 typical computer (we define the precise details of the derivative
   140 typical computer (we shall define shortly the precise details of our
   130 operation later).  Clearly, the notation involving $\ZERO$s and
   141 regular expressions and the derivative operation).  Clearly, the
   131 $\ONE$s already suggests simplification rules that can be applied to
   142 notation involving $\ZERO$s and $\ONE$s already suggests
   132 regular regular expressions, for example $\ZERO{}r \Rightarrow \ZERO$,
   143 simplification rules that can be applied to regular regular
   133 $\ONE{}r \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r
   144 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
   134 \Rightarrow r$. While such simple-minded reductions have been proved
   145 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
   135 in our earlier work to preserve the correctness of Sulzmann and Lu's
   146 r$. While such simple-minded simplifications have been proved in our
   136 algorithm, they unfortunately do \emph{not} help with limiting the
   147 earlier work to preserve the correctness of Sulzmann and Lu's
   137 gowth of the derivatives shown above: yes, the growth is slowed, but the
   148 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
   138 derivatives can still grow beyond any finite bound.
   149 \emph{not} help with limiting the growth of the derivatives shown
   139 
   150 above: the growth is slowed, but the derivatives can still grow quickly
   140 
   151 beyond any finite bound.
   141 
   152 
   142 Sulzmann and Lu introduce a
   153 
   143 bitcoded version of their lexing algorithm. They make some claims
   154 
   144 about the correctness and speed of this version, but do
   155 Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
   145 not provide any supporting proof arguments, not even
   156 \cite{Sulzmann2014} where they introduce bitcoded
   146 ``pencil-and-paper'' arguments. They wrote about their bitcoded
   157 regular expressions. In this version, POSIX values are
   147 ``incremental parsing method'' (that is the algorithm to be studied in this
   158 represented as bitsequences and such sequences are incrementally generated
   148 section):
   159 when derivatives are calculated. The compact representation
       
   160 of bitsequences and regular expressions allows them to define a more
       
   161 ``aggressive'' simplification method that keeps the size of the
       
   162 derivatives finite no matter what the length of the string is.
       
   163 They make some informal claims about the correctness and linear behaviour
       
   164 of this version, but do not provide any supporting proof arguments, not
       
   165 even ``pencil-and-paper'' arguments. They write about their bitcoded
       
   166 \emph{incremental parsing method} (that is the algorithm to be formalised
       
   167 in this paper):
   149 
   168 
   150 \begin{quote}\it
   169 \begin{quote}\it
   151   ``Correctness Claim: We further claim that the incremental parsing
   170   ``Correctness Claim: We further claim that the incremental parsing
   152   method [..] in combination with the simplification steps [..]
   171   method [..] in combination with the simplification steps [..]
   153   yields POSIX parse trees. We have tested this claim
   172   yields POSIX parse trees. We have tested this claim
   154   extensively [..] but yet
   173   extensively [..] but yet
   155   have to work out all proof details.''
   174   have to work out all proof details.''
   156 \end{quote}  
   175 \end{quote}  
   157 
   176 
   158 
   177 \noindent{}\textbf{Contributions:} We have implemented in Isabelle/HOL
   159 
   178 the derivative-based lexing algorithm of Sulzmann and Lu
   160 
   179 \cite{Sulzmann2014} where regular expressions are annotated with
   161 
   180 bitsequences. We define the crucial simplification function as a
   162 If a regular expression matches a string, then in general there is more
   181 recursive function, instead of a fix-point operation. One objective of
   163 than one way of how the string is matched. There are two commonly used
   182 the simplification function is to remove duplicates of regular
   164 disambiguation strategies to generate a unique answer: one is called
   183 expressions.  For this Sulzmann and Lu use in their paper the standard
   165 GREEDY matching \cite{Frisch2004} and the other is POSIX
   184 @{text nub} function from Haskell's list library, but this function
   166 matching~\cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}.
   185 does not achieve the intended objective with bitcoded regular expressions.  The
   167 For example consider the string @{term xy} and the regular expression
   186 reason is that in the bitcoded setting, each copy generally has a
   168 \mbox{@{term "STAR (ALT (ALT x y) xy)"}}. Either the string can be
   187 different bitcode annotation---so @{text nub} would never ``fire''.
   169 matched in two `iterations' by the single letter-regular expressions
   188 Inspired by Scala's library for lists, we shall instead use a @{text
   170 @{term x} and @{term y}, or directly in one iteration by @{term xy}. The
   189 distinctBy} function that finds duplicates under an erasing function
   171 first case corresponds to GREEDY matching, which first matches with the
   190 that deletes bitcodes.
   172 left-most symbol and only matches the next symbol in case of a mismatch
   191 We shall also introduce our own argument and definitions for
   173 (this is greedy in the sense of preferring instant gratification to
   192 establishing the correctness of the bitcoded algorithm when 
   174 delayed repletion). The second case is POSIX matching, which prefers the
   193 simplifications are included.\medskip
   175 longest match.
   194 
   176 
   195 \noindent In this paper, we shall first briefly introduce the basic notions
   177 
   196 of regular expressions and describe the basic definitions
   178 The derivative has the property (which may almost be
   197 of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves
   179 regarded as its specification) that, for every string @{term s} and
   198 as a reference point for what correctness means in our Isabelle/HOL proofs. We shall then prove
   180 regular expression @{term r} and character @{term c}, one has @{term
   199 the correctness for the bitcoded algorithm without simplification, and
   181 "cs \<in> L(r)"} if and only if \mbox{@{term "s \<in> L(der c r)"}}.
   200 after that extend the proof to include simplification. 
   182 
   201 
   183 
   202 *}
   184 
   203 
       
   204 section {* Background *}
       
   205 
       
   206 text {*
       
   207   In our Isabelle/HOL formalisation strings are lists of characters with
       
   208   the empty string being represented by the empty list, written $[]$,
       
   209   and list-cons being written as $\_\!::\!\_\,$; string
       
   210   concatenation is $\_ \,@\, \_\,$. We often use the usual
       
   211   bracket notation for lists also for strings; for example a string
       
   212   consisting of just a single character $c$ is written $[c]$.   
       
   213   Our regular expressions are defined as usual as the elements of the following inductive
       
   214   datatype:
       
   215 
       
   216   \begin{center}
       
   217   @{text "r :="}
       
   218   @{const "ZERO"} $\mid$
       
   219   @{const "ONE"} $\mid$
       
   220   @{term "CH c"} $\mid$
       
   221   @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
       
   222   @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
       
   223   @{term "STAR r"} 
       
   224   \end{center}
       
   225 
       
   226   \noindent where @{const ZERO} stands for the regular expression that does
       
   227   not match any string, @{const ONE} for the regular expression that matches
       
   228   only the empty string and @{term c} for matching a character literal.
       
   229   The constructors $+$ and $\cdot$ represent alternatives and sequences, respectively.
       
   230   The
       
   231   \emph{language} of a regular expression, written $L$, is defined as usual
       
   232   and we omit giving the definition here (see for example \cite{AusafDyckhoffUrban2016}).
       
   233 
       
   234   Central to Brzozowski's regular expression matcher are two functions
       
   235   called @{text nullable} and \emph{derivative}. The latter is written
       
   236   $r\backslash c$ for the derivative of the regular expression $r$
       
   237   w.r.t.~the character $c$. Both functions are defined by recursion over
       
   238   regular expressions.  
   185 
   239 
   186 \begin{center}
   240 \begin{center}
   187 \begin{tabular}{cc}
   241 \begin{tabular}{cc}
   188   \begin{tabular}{r@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
   242   \begin{tabular}{r@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
   189   @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
   243   @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
   206   @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\
   260   @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\
   207   \end{tabular}  
   261   \end{tabular}  
   208 \end{tabular}  
   262 \end{tabular}  
   209 \end{center}
   263 \end{center}
   210 
   264 
       
   265   \noindent
       
   266   We can extend this definition to give derivatives w.r.t.~strings:
       
   267 
       
   268   \begin{center}
       
   269   \begin{tabular}{cc}
       
   270   \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
       
   271   @{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)}
       
   272   \end{tabular}
       
   273   &
       
   274   \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
       
   275   @{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)}
       
   276   \end{tabular}
       
   277   \end{tabular}
       
   278   \end{center}
       
   279 
       
   280 \noindent
       
   281 Using @{text nullable} and the derivative operation, we can
       
   282 define the following simple regular expression matcher:
       
   283 %
       
   284 \[
       
   285 @{text "match s r"} \;\dn\; @{term nullable}(r\backslash s)
       
   286 \]
       
   287 
       
   288 \noindent This is essentially Brzozowski's algorithm from 1964. Its
       
   289 main virtue is that the algorithm can be easily implemented as a
       
   290 functional program (either in a functional programming language or in
       
   291 a theorem prover). The correctness proof for @{text match} amounts to
       
   292 establishing the property
       
   293 %
       
   294 \begin{proposition}\label{matchcorr} 
       
   295 @{text "match s r"} \;\;\text{if and only if}\;\; $s \in L(r)$
       
   296 \end{proposition}
       
   297 
       
   298 \noindent
       
   299 It is a fun exercise to formaly prove this property in a theorem prover.
       
   300 
       
   301 The novel idea of Sulzmann and Lu is to extend this algorithm for 
       
   302 lexing, where it is important to find out which part of the string
       
   303 is matched by which part of the regular expression.
       
   304 For this Sulzmann and Lu presented two lexing algorithms in their paper
       
   305   \cite{Sulzmann2014}. This first algorithm consists of two phases: first a
       
   306   matching phase (which is Brzozowski's algorithm) and then a value
       
   307   construction phase. The values encode \emph{how} a regular expression
       
   308   matches a string. \emph{Values} are defined as the inductive datatype
       
   309 
       
   310   \begin{center}
       
   311   @{text "v :="}
       
   312   @{const "Void"} $\mid$
       
   313   @{term "val.Char c"} $\mid$
       
   314   @{term "Left v"} $\mid$
       
   315   @{term "Right v"} $\mid$
       
   316   @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ 
       
   317   @{term "Stars vs"} 
       
   318   \end{center}  
       
   319 
       
   320   \noindent where we use @{term vs} to stand for a list of values. The
       
   321   string underlying a value can be calculated by a @{const flat}
       
   322   function, written @{term "flat DUMMY"}. It traverses a value and
       
   323   collects the characters contained in it. Sulzmann and Lu also define inductively an
       
   324   inhabitation relation that associates values to regular expressions:
       
   325 
       
   326   \begin{center}
       
   327   \begin{tabular}{c}
       
   328   \\[-8mm]
       
   329   @{thm[mode=Axiom] Prf.intros(4)} \qquad
       
   330   @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[4mm]
       
   331   @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad 
       
   332   @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[4mm]
       
   333   @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]} \qquad
       
   334   @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}
       
   335   \end{tabular}
       
   336   \end{center}
       
   337 
       
   338   \noindent Note that no values are associated with the regular expression
       
   339   @{term ZERO}, since it cannot match any string.
       
   340   It is routine to establish how values ``inhabiting'' a regular
       
   341   expression correspond to the language of a regular expression, namely
       
   342 
       
   343   \begin{proposition}
       
   344   @{thm L_flat_Prf}
       
   345   \end{proposition}
       
   346 
       
   347   In general there is more than one value inhabited by a regular
       
   348   expression (meaning regular expressions can typically match more
       
   349   than one string). But even when fixing a string from the language of the
       
   350   regular expression, there are generally more than one way of how the
       
   351   regular expression can match this string. POSIX lexing is about
       
   352   identifying the unique value for a given regular expression and a
       
   353   string that satisfies the informal POSIX rules (see
       
   354   \cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}).\footnote{POSIX
       
   355 	lexing acquired its name from the fact that the corresponding
       
   356 	rules were described as part of the POSIX specification for
       
   357 	Unix-like operating systems \cite{POSIX}.} Sometimes these
       
   358   informal rules are called \emph{maximal much rule} and \emph{rule priority}.
       
   359   One contribution of our earlier paper is to give a convenient
       
   360  specification for what a POSIX value is (the inductive rules are shown in
       
   361   Figure~\ref{POSIXrules}).
   211 
   362 
   212 \begin{figure}[t]
   363 \begin{figure}[t]
       
   364   \begin{center}
       
   365   \begin{tabular}{c}
       
   366   @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \qquad
       
   367   @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\medskip\\
       
   368   @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\qquad
       
   369   @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\
       
   370   $\mprset{flushleft}
       
   371    \inferrule
       
   372    {@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
       
   373     @{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\
       
   374     @{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}
       
   375    {@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\<open>PS\<close>\medskip\smallskip\\
       
   376   @{thm[mode=Axiom] Posix.intros(7)}\<open>P[]\<close>\qquad
       
   377   $\mprset{flushleft}
       
   378    \inferrule
       
   379    {@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
       
   380     @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
       
   381     @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
       
   382     @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
       
   383    {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\\[-4mm]
       
   384   \end{tabular}
       
   385   \end{center}
       
   386   \caption{The inductive definition of POSIX values taken from our earlier paper \cite{AusafDyckhoffUrban2016}. The ternary relation, written $(s, r) \rightarrow v$, formalises the notion
       
   387   of given a string $s$ and a regular
       
   388   expression $r$ what is the unique value $v$ that satisfies the informal POSIX constraints for
       
   389   regular expression matching.}\label{POSIXrules}
       
   390   \end{figure}
       
   391 
       
   392   The clever idea by Sulzmann and Lu \cite{Sulzmann2014} in their first algorithm is to define
       
   393   an injection function on values that mirrors (but inverts) the
       
   394   construction of the derivative on regular expressions. Essentially it
       
   395   injects back a character into a value.
       
   396   For this they define two functions called @{text mkeps} and @{text inj}:
       
   397  
       
   398   \begin{center}
       
   399   \begin{tabular}{l}
       
   400   \begin{tabular}{lcl}
       
   401   @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\
       
   402   @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
       
   403   @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
       
   404   @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
       
   405   \end{tabular}\smallskip\\
       
   406 
       
   407   \begin{tabular}{lcl}
       
   408   @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
       
   409   @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & 
       
   410       @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
       
   411   @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & 
       
   412       @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
       
   413   @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ 
       
   414       & @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
       
   415   @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ 
       
   416       & @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
       
   417   @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ 
       
   418       & @{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
       
   419   @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ 
       
   420       & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}
       
   421   \end{tabular}
       
   422   \end{tabular}
       
   423   \end{center}
       
   424 
       
   425   \noindent
       
   426   The function @{text mkeps} is called when the last derivative is nullable, that is
       
   427   the string to be matched is in the language of the regular expression. It generates
       
   428   a value for how the last derivative can match the empty string. The injection function
       
   429   then calculates the corresponding value for each intermediate derivative until
       
   430   a value for the original regular expression is generated.
       
   431   Graphically the algorithm by
       
   432   Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz}
       
   433   where the path from the left to the right involving @{term derivatives}/@{const
       
   434   nullable} is the first phase of the algorithm (calculating successive
       
   435   \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
       
   436   left, the second phase. This picture shows the steps required when a
       
   437   regular expression, say @{text "r\<^sub>1"}, matches the string @{term
       
   438   "[a,b,c]"}. The lexing algorithm by Sulzmann and Lu can be defined as:
       
   439 
       
   440   \begin{figure}[t]
   213 \begin{center}
   441 \begin{center}
   214 \begin{tikzpicture}[scale=2,node distance=1.3cm,
   442 \begin{tikzpicture}[scale=2,node distance=1.3cm,
   215                     every node/.style={minimum size=6mm}]
   443                     every node/.style={minimum size=6mm}]
   216 \node (r1)  {@{term "r\<^sub>1"}};
   444 \node (r1)  {@{term "r\<^sub>1"}};
   217 \node (r2) [right=of r1]{@{term "r\<^sub>2"}};
   445 \node (r2) [right=of r1]{@{term "r\<^sub>2"}};
   232 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{@{term "mkeps"}}};
   460 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{@{term "mkeps"}}};
   233 \end{tikzpicture}
   461 \end{tikzpicture}
   234 \end{center}
   462 \end{center}
   235 \mbox{}\\[-13mm]
   463 \mbox{}\\[-13mm]
   236 
   464 
   237 \caption{The two phases of the algorithm by Sulzmann \& Lu \cite{Sulzmann2014},
   465 \caption{The two phases of the first algorithm by Sulzmann \& Lu \cite{Sulzmann2014},
   238 matching the string @{term "[a,b,c]"}. The first phase (the arrows from 
   466 matching the string @{term "[a,b,c]"}. The first phase (the arrows from 
   239 left to right) is \Brz's matcher building successive derivatives. If the 
   467 left to right) is \Brz's matcher building successive derivatives. If the 
   240 last regular expression is @{term nullable}, then the functions of the 
   468 last regular expression is @{term nullable}, then the functions of the 
   241 second phase are called (the top-down and right-to-left arrows): first 
   469 second phase are called (the top-down and right-to-left arrows): first 
   242 @{term mkeps} calculates a value @{term "v\<^sub>4"} witnessing
   470 @{term mkeps} calculates a value @{term "v\<^sub>4"} witnessing
   243 how the empty string has been recognised by @{term "r\<^sub>4"}. After
   471 how the empty string has been recognised by @{term "r\<^sub>4"}. After
   244 that the function @{term inj} ``injects back'' the characters of the string into
   472 that the function @{term inj} ``injects back'' the characters of the string into
   245 the values.
   473 the values. The value @{term "v\<^sub>1"} is the result of the algorithm representing
       
   474 the POSIX value for this string and
       
   475 regular expression.
   246 \label{Sulz}}
   476 \label{Sulz}}
   247 \end{figure} 
   477 \end{figure} 
   248 
   478 
   249 
   479 
       
   480 
       
   481   \begin{center}
       
   482   \begin{tabular}{lcl}
       
   483   @{thm (lhs) lexer.simps(1)} & $\dn$ & @{thm (rhs) lexer.simps(1)}\\
       
   484   @{thm (lhs) lexer.simps(2)} & $\dn$ & @{text "case"} @{term "lexer (der c r) s"} @{text of}\\
       
   485                      & & \phantom{$|$} @{term "None"}  @{text "\<Rightarrow>"} @{term None}\\
       
   486                      & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}                          
       
   487   \end{tabular}
       
   488   \end{center}
       
   489 
       
   490 
       
   491 We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that the algorithm by Sulzmann and Lu
       
   492 is correct. The cenral property we established relates the derivative operation to the injection function.
       
   493 
       
   494   \begin{proposition}\label{Posix2}
       
   495 	\textit{If} $(s,\; r\backslash c) \rightarrow v$ \textit{then} $(c :: s,\; r) \rightarrow$ \textit{inj} $r\; c\; v$. 
       
   496 \end{proposition}
       
   497 
       
   498   \noindent
       
   499   With this in place we were able to prove:
       
   500 
       
   501 
       
   502   \begin{proposition}\mbox{}\smallskip\\\label{lexercorrect}
       
   503   \begin{tabular}{ll}
       
   504   (1) & @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}\\
       
   505   (2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\
       
   506   \end{tabular}
       
   507   \end{proposition}
       
   508 
       
   509   \noindent
       
   510   In fact we have shown that in the success case the generated POSIX value $v$ is
       
   511   unique and in the failure case that there is no POSIX value $v$ that satisfies
       
   512   $(s, r) \rightarrow v$. While the algorithm is correct, it is excrutiatingly
       
   513   slow in examples where the derivatives grow arbitrarily (see example from the
       
   514   Introduction). However it can be used as a conveninet reference point for the correctness
       
   515   proof of the second algorithm by Sulzmann and Lu, which we shall describe next.
       
   516   
   250 *}
   517 *}
   251 
   518 
   252 section {* Background *}
   519 section {* Bitcoded Regular Expressions and Derivatives *}
   253 
   520 
   254 text {*
   521 text {*
   255   In our Isabelle/HOL formalisation strings are lists of characters with
       
   256   the empty string being represented by the empty list, written $[]$,
       
   257   and list-cons being written as $\_\!\_\!::\!\_\!\_\,$; string
       
   258   concatenation is $\_\!\_ \,@\, \_\!\_\,$. Often we use the usual
       
   259   bracket notation for lists also for strings; for example a string
       
   260   consisting of just a single character $c$ is written $[c]$.   
       
   261   Our egular expressions are defined as usual as the elements of the following inductive
       
   262   datatype:
       
   263 
       
   264   \begin{center}
       
   265   @{text "r :="}
       
   266   @{const "ZERO"} $\mid$
       
   267   @{const "ONE"} $\mid$
       
   268   @{term "CH c"} $\mid$
       
   269   @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$
       
   270   @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
       
   271   @{term "STAR r"} 
       
   272   \end{center}
       
   273 
       
   274   \noindent where @{const ZERO} stands for the regular expression that does
       
   275   not match any string, @{const ONE} for the regular expression that matches
       
   276   only the empty string and @{term c} for matching a character literal. The
       
   277   language of a regular expression, written $L$, is defined as usual
       
   278   (see for example \cite{AusafDyckhoffUrban2016}).
       
   279 
       
   280   Central to Brzozowski's regular expression matcher are two functions
       
   281   called $\nullable$ and \emph{derivative}. The latter is written
       
   282   $r\backslash c$ for the derivative of the regular expression $r$
       
   283   w.r.t.~the character $c$. Both functions are defined by recursion over
       
   284   regular expressions.  
       
   285 
       
   286 
       
   287 	\begin{center}
       
   288 		\begin{tabular}{lcl}
       
   289 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
       
   290 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
       
   291 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
       
   292 			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
       
   293 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
       
   294 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
       
   295 		\end{tabular}
       
   296 	\end{center}
       
   297 
       
   298   \noindent
       
   299   The derivative function takes a regular expression, say $r$ and a
       
   300   character, say $c$, as input and returns the derivative regular
       
   301   expression.
       
   302 
       
   303 	\begin{center}
       
   304 		\begin{tabular}{lcl}
       
   305 			$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
       
   306 			$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
       
   307 			$d \backslash c$     & $\dn$ & 
       
   308 			$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
       
   309 			$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
       
   310 			$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \nullable(r_1)$\\
       
   311 			&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
       
   312 			&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
       
   313 			$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
       
   314 		\end{tabular}
       
   315 	\end{center}
       
   316 
       
   317 
       
   318   Sulzmann and Lu presented two lexing algorithms in their paper from 2014
       
   319   \cite{Sulzmann2014}. This first algorithm consists of two phases: first a
       
   320   matching phase (which is Brzozowski's algorithm) and then a value
       
   321   construction phase. The values encode \emph{how} a regular expression
       
   322   matches a string. \emph{Values} are defined as the inductive datatype
       
   323 
       
   324   \begin{center}
       
   325   @{text "v :="}
       
   326   @{const "Void"} $\mid$
       
   327   @{term "val.Char c"} $\mid$
       
   328   @{term "Left v"} $\mid$
       
   329   @{term "Right v"} $\mid$
       
   330   @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ 
       
   331   @{term "Stars vs"} 
       
   332   \end{center}  
       
   333 
       
   334   \noindent where we use @{term vs} to stand for a list of
       
   335   values.
       
   336 
       
   337 
       
   338 
       
   339 
       
   340   \noindent Sulzmann and Lu also define inductively an inhabitation relation
       
   341   that associates values to regular expressions:
       
   342 
       
   343   \begin{center}
       
   344   \begin{tabular}{c}
       
   345   \\[-8mm]
       
   346   @{thm[mode=Axiom] Prf.intros(4)} \qquad
       
   347   @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[4mm]
       
   348   @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad 
       
   349   @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[4mm]
       
   350   @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\\[4mm]
       
   351   %%@ { t h m[mode=Axiom] Prf.intros(6)[of "r"]} \qquad  
       
   352   @{thm[mode=Rule] Prf.intros(6)[of "r" "vs"]}
       
   353   \end{tabular}
       
   354   \end{center}
       
   355 
       
   356   \noindent Note that no values are associated with the regular expression
       
   357   @{term ZERO}. It is routine to establish how values ``inhabiting'' a regular
       
   358   expression correspond to the language of a regular expression, namely
       
   359 
       
   360   \begin{proposition}
       
   361   @{thm L_flat_Prf}
       
   362   \end{proposition}
       
   363 
       
   364   Sulzmann-Lu algorithm with inj. State that POSIX rules.
       
   365   metion slg is correct.
       
   366 
       
   367 
       
   368   \begin{figure}[t]
       
   369   \begin{center}
       
   370   \begin{tabular}{c}
       
   371   @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \qquad
       
   372   @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\medskip\\
       
   373   @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\qquad
       
   374   @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\
       
   375   $\mprset{flushleft}
       
   376    \inferrule
       
   377    {@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
       
   378     @{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\
       
   379     @{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}
       
   380    {@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\<open>PS\<close>\\
       
   381   @{thm[mode=Axiom] Posix.intros(7)}\<open>P[]\<close>\medskip\\
       
   382   $\mprset{flushleft}
       
   383    \inferrule
       
   384    {@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
       
   385     @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
       
   386     @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
       
   387     @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
       
   388    {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>
       
   389   \end{tabular}
       
   390   \end{center}
       
   391   \caption{Our inductive definition of POSIX values.}\label{POSIXrules}
       
   392   \end{figure}
       
   393 
       
   394 
       
   395   \begin{center}
       
   396   \begin{tabular}{lcl}
       
   397   @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\
       
   398   @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
       
   399   @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
       
   400   @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
       
   401   \end{tabular}
       
   402   \end{center}
       
   403 
       
   404   \begin{center}
       
   405   \begin{tabular}{l@ {\hspace{5mm}}lcl}
       
   406   \textit{(1)} & @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
       
   407   \textit{(2)} & @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & 
       
   408       @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
       
   409   \textit{(3)} & @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & 
       
   410       @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
       
   411   \textit{(4)} & @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ 
       
   412       & @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
       
   413   \textit{(5)} & @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ 
       
   414       & @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
       
   415   \textit{(6)} & @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ 
       
   416       & @{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
       
   417   \textit{(7)} & @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ 
       
   418       & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\
       
   419   \end{tabular}
       
   420   \end{center}
       
   421 
       
   422 *}
       
   423 
       
   424 section {* Bitcoded Regular Expressions and Derivatives *}
       
   425 
       
   426 text {*
       
   427 
   522 
   428   In the second part of their paper \cite{Sulzmann2014},
   523   In the second part of their paper \cite{Sulzmann2014},
   429   Sulzmann and Lu describe another algorithm that generates POSIX
   524   Sulzmann and Lu describe another algorithm that also generates POSIX
   430   values but dispences with the second phase where characters are
   525   values but dispences with the second phase where characters are
   431   injected ``back'' into values. For this they annotate bitcodes to
   526   injected ``back'' into values. For this they annotate bitcodes to
   432   regular expressions, which we define in Isabelle/HOL as the datatype
   527   regular expressions, which we define in Isabelle/HOL as the datatype
   433 
   528 
   434   \begin{center}
   529   \begin{center}
   577   expression by just erasing the annotated bitsequences. We omit the
   672   expression by just erasing the annotated bitsequences. We omit the
   578   straightforward definition. For defining the algorithm, we also need
   673   straightforward definition. For defining the algorithm, we also need
   579   the functions \textit{bnullable} and \textit{bmkeps}, which are the
   674   the functions \textit{bnullable} and \textit{bmkeps}, which are the
   580   ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on
   675   ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on
   581   bitcoded regular expressions, instead of regular expressions.
   676   bitcoded regular expressions, instead of regular expressions.
   582 
   677   %
   583   \begin{center}
   678   \begin{center}
   584   \begin{tabular}{@ {}c@ {}c@ {}}
   679   \begin{tabular}{@ {}c@ {}c@ {}}
   585   \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
   680   \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
   586   $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{false}$\\
   681   $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{false}$\\
   587   $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{true}$\\
   682   $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{true}$\\
   588   $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{false}$\\
   683   $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{false}$\\
   589   $\textit{bnullable}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &
   684   $\textit{bnullable}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &
   590      $\exists\, r \in \rs. \,\textit{bnullable}\,r$\\
   685      $\exists\, r \in \rs. \,\textit{bnullable}\,r$\\
   592      $\textit{bnullable}\,r_1\wedge \textit{bnullable}\,r_2$\\
   687      $\textit{bnullable}\,r_1\wedge \textit{bnullable}\,r_2$\\
   593   $\textit{bnullable}\,(\textit{STAR}\,bs\,r)$ & $\dn$ &
   688   $\textit{bnullable}\,(\textit{STAR}\,bs\,r)$ & $\dn$ &
   594      $\textit{true}$
   689      $\textit{true}$
   595   \end{tabular}
   690   \end{tabular}
   596   &
   691   &
   597   \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
   692   \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
   598   $\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$\\
   693   $\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$\\
   599   $\textit{bmkeps}\,(\textit{ALTs}\,bs\,r\!::\!\rs)$ & $\dn$ &
   694   $\textit{bmkeps}\,(\textit{ALTs}\,bs\,r\!::\!\rs)$ & $\dn$ &
   600      $\textit{if}\;\textit{bnullable}\,r$\\
   695      $\textit{if}\;\textit{bnullable}\,r$\\
   601   & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,r$\\
   696   & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,r$\\
   602   & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,\rs$\\
   697   & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,\rs$\\
   709   All properties are by induction on annotated regular expressions. There are no
   804   All properties are by induction on annotated regular expressions. There are no
   710   interesting cases.
   805   interesting cases.
   711 \end{proof}
   806 \end{proof}
   712 
   807 
   713 \noindent
   808 \noindent
       
   809 The only problem left for the correctness proof is that the bitcoded algorithm
       
   810 has only a ``forward phase'' where POSIX values are generated incrementally.
       
   811 We can achive the same effect with @{text lexer} by stacking up injection
       
   812 functions in the during forward phase. An auxiliary function, called $\textit{flex}$,
       
   813 allows us to recast the rules of $\lexer$ (with its two phases) in terms of a single
       
   814 phase.
       
   815 
       
   816 \begin{center}
       
   817 \begin{tabular}{lcl}
       
   818   $\textit{flex}\;r\,f\,[]$ & $\dn$ & $f$\\
       
   819   $\textit{flex}\;r\,f\,(c\!::\!s)$ & $\dn$ &
       
   820   $\textit{flex}\,(r\backslash c)\,(\lambda v.\,f\,(\inj\,r\,c\,v))\,s$\\
       
   821 \end{tabular}    
       
   822 \end{center}    
       
   823 
       
   824 \noindent
       
   825 The point of this function is that when
       
   826 reaching the end of the string, we just need to apply the stacked
       
   827 injection functions to the value generated by $\mkeps$.
       
   828 Using this function we can recast the success case in @{text lexer} 
       
   829 as follows:
       
   830 
       
   831 \begin{proposition}\label{flex}
       
   832 If @{text "lexer r s = Some v"} \;then\; @{text "v = "}$\,\textit{flex}\,r\,id\,s\,
       
   833       (\mkeps (r\backslash s))$.
       
   834 \end{proposition}
       
   835 
       
   836 \noindent
       
   837 Note we did not redefine \textit{lexer}, we just established that the
       
   838 value generated by \textit{lexer} can also be obtained by a different
       
   839 method. While this different method is not efficient (we essentially
       
   840 need to traverse the string $s$ twice, once for building the
       
   841 derivative $r\backslash s$ and another time for stacking up injection
       
   842 functions using \textit{flex}), it helped us with proving
       
   843 that incrementally building up values.
       
   844 
   714 This brings us to our main lemma in this section: if we build a
   845 This brings us to our main lemma in this section: if we build a
   715 derivative, say $r\backslash s$ and have a value, say $v$, inhabited
   846 derivative, say $r\backslash s$ and have a value, say $v$, inhabited
   716 by this derivative, then we can produce the result $\lexer$ generates
   847 by this derivative, then we can produce the result $\lexer$ generates
   717 by applying this value to the stacked-up injection functions
   848 by applying this value to the stacked-up injection functions
   718 $\textit{flex}$ assembles. The lemma establishes that this is the same
   849 $\textit{flex}$ assembles. The lemma establishes that this is the same
   735     for this way of performing the induction.}
   866     for this way of performing the induction.}
   736 
   867 
   737   The case for $[]$ is routine using Lemmas~\ref{codedecode}
   868   The case for $[]$ is routine using Lemmas~\ref{codedecode}
   738   and~\ref{retrievecode}. In the case $s\,@\,[c]$, we can infer from
   869   and~\ref{retrievecode}. In the case $s\,@\,[c]$, we can infer from
   739   the assumption that $\vdash v : (r\backslash s)\backslash c$
   870   the assumption that $\vdash v : (r\backslash s)\backslash c$
   740   holds. Hence by Lemma~\ref{Posix2} we know that 
   871   holds. Hence by Prop.~\ref{Posix2} we know that 
   741   (*) $\vdash \inj\,(r\backslash s)\,c\,v : r\backslash s$ holds too.
   872   (*) $\vdash \inj\,(r\backslash s)\,c\,v : r\backslash s$ holds too.
   742   By definition of $\textit{flex}$ we can unfold the left-hand side
   873   By definition of $\textit{flex}$ we can unfold the left-hand side
   743   to be
   874   to be
   744   \[
   875   \[
   745     \textit{Some}\,(\textit{flex}\;r\,\textit{id}\,(s\,@\,[c])\,v) =
   876     \textit{Some}\,(\textit{flex}\;r\,\textit{id}\,(s\,@\,[c])\,v) =
   746     \textit{Some}\,(\textit{flex}\;r\,\textit{id}\,s\,(\inj\,(r\backslash s)\,c\,v))  
   877     \textit{Some}\,(\textit{flex}\;r\,\textit{id}\,s\,(\inj\,(r\backslash s)\,c\,v))  
   747   \]  
   878   \]  
   748 
   879 
   749   \noindent
   880   \noindent
   750   By induction hypothesis and (*) we can rewrite the right-hand side to
   881   By induction hypothesis and (*) we can rewrite the right-hand side to
   751 
   882   %
   752   \[
   883   \[
   753     \textit{decode}\,(\textit{retrieve}\,(r^\uparrow\backslash s)\;
   884     \textit{decode}\,(\textit{retrieve}\,(r^\uparrow\backslash s)\;
   754     (\inj\,(r\backslash s)\,c\,\,v))\,r
   885     (\inj\,(r\backslash s)\,c\,\,v))\,r
   755   \]
   886   \]
   756 
   887 
   769 \begin{theorem}
   900 \begin{theorem}
   770 $\textit{lexer}\,r\,s = \textit{blexer}\,r\,s$
   901 $\textit{lexer}\,r\,s = \textit{blexer}\,r\,s$
   771 \end{theorem}  
   902 \end{theorem}  
   772 
   903 
   773 \begin{proof}
   904 \begin{proof}
   774   We can first expand both sides using Lemma~\ref{flex} and the
   905   We can first expand both sides using Prop.~\ref{flex} and the
   775   definition of \textit{blexer}. This gives us two
   906   definition of \textit{blexer}. This gives us two
   776   \textit{if}-statements, which we need to show to be equal. By 
   907   \textit{if}-statements, which we need to show to be equal. By 
   777   Lemma~\ref{bnullable}\textit{(2)} we know the \textit{if}-tests coincide:
   908   Lemma~\ref{bnullable}\textit{(2)} we know the \textit{if}-tests coincide:
   778   \[
   909   \[
   779     \textit{bnullable}(r^\uparrow\backslash s) \;\textit{iff}\;
   910     \textit{bnullable}(r^\uparrow\backslash s) \;\textit{iff}\;
   800   \textit{None}. Therefore we can conclude the proof.
   931   \textit{None}. Therefore we can conclude the proof.
   801 \end{proof}  
   932 \end{proof}  
   802 
   933 
   803 \noindent
   934 \noindent
   804 This establishes that the bitcoded algorithm by Sulzmann
   935 This establishes that the bitcoded algorithm by Sulzmann
   805 and Lu without simplification produces correct results. This was
   936 and Lu \emph{without} simplification produces correct results. This was
   806 only conjectured in their paper \cite{Sulzmann2014}. The next step
   937 only conjectured in their paper \cite{Sulzmann2014}. The next step
   807 is to add simplifications.
   938 is to add simplifications.
   808 
   939 
   809 *}
   940 *}
   810 
   941 
   814 text {*
   945 text {*
   815 
   946 
   816      Derivatives as calculated by Brzozowski’s method are usually more
   947      Derivatives as calculated by Brzozowski’s method are usually more
   817      complex regular expressions than the initial one; the result is
   948      complex regular expressions than the initial one; the result is
   818      that the derivative-based matching and lexing algorithms are
   949      that the derivative-based matching and lexing algorithms are
   819      often abysmally slow.
   950      often abysmally slow. However, as Sulzmann and Lu wrote, various
   820 
   951      optimisations are possible, such as the simplifications
   821      However, as Sulzmann and
   952      $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r \Rightarrow r$,
   822      Lu wrote, various optimisations are possible, such as the
   953      $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow r$. While these
   823      simplifications of 0 + r,r + 0,1 · r and r · 1 to r. These
   954      simplifications can speed up the algorithms considerably in many
   824      simplifications can speed up the algorithms considerably.
   955      cases, they do not solve fundamentally the ``growth problem'' with
       
   956      derivatives. To see this let us return to the example 
   825 
   957 
   826 
   958 
   827      \begin{lemma}
   959      \begin{lemma}
   828      @{thm[mode=IfThen] bnullable0(1)[of "r\<^sub>1" "r\<^sub>2"]}
   960      @{thm[mode=IfThen] bnullable0(1)[of "r\<^sub>1" "r\<^sub>2"]}
   829      \end{lemma}
   961      \end{lemma}
   877   %@ { t hm[mode=Axiom] ss1}\qquad
  1009   %@ { t hm[mode=Axiom] ss1}\qquad
   878   @{thm[mode=Rule] ss2[of "rs\<^sub>1" "rs\<^sub>2"]}\qquad
  1010   @{thm[mode=Rule] ss2[of "rs\<^sub>1" "rs\<^sub>2"]}\qquad
   879   @{thm[mode=Rule] ss3[of "r\<^sub>1" "r\<^sub>2"]}\\
  1011   @{thm[mode=Rule] ss3[of "r\<^sub>1" "r\<^sub>2"]}\\
   880   @{thm[mode=Axiom] ss4}\qquad
  1012   @{thm[mode=Axiom] ss4}\qquad
   881   @{thm[mode=Axiom] ss5[of "bs" "rs\<^sub>1" "rs\<^sub>2"]}\\
  1013   @{thm[mode=Axiom] ss5[of "bs" "rs\<^sub>1" "rs\<^sub>2"]}\\
   882   @{thm[mode=Rule] ss6[of "r\<^sub>1" "r\<^sub>2" "rs\<^sub>1" "rs\<^sub>2" "rs\<^sub>3"]}\\
  1014   @{thm[mode=Rule] ss6[of "r\<^sub>2" "r\<^sub>1" "rs\<^sub>1" "rs\<^sub>2" "rs\<^sub>3"]}\\
   883   \end{tabular}
  1015   \end{tabular}
   884   \end{center}
  1016   \end{center}
   885   \caption{???}\label{SimpRewrites}
  1017   \caption{???}\label{SimpRewrites}
   886   \end{figure}
  1018   \end{figure}
   887 *}
  1019 *}