thys3/Paper.thy
changeset 647 70c10dc41606
parent 644 9f984ff20020
equal deleted inserted replaced
646:56057198e4f5 647:70c10dc41606
    95 
    95 
    96 In the last fifteen or so years, Brzozowski's derivatives of regular
    96 In the last fifteen or so years, Brzozowski's derivatives of regular
    97 expressions have sparked quite a bit of interest in the functional
    97 expressions have sparked quite a bit of interest in the functional
    98 programming and theorem prover communities.
    98 programming and theorem prover communities.
    99 Derivatives of a
    99 Derivatives of a
   100 regular expression, written @{term "der c r"}, give a simple solution
   100 regular expressions, written @{term "der c r"}, give a simple solution
   101 to the problem of matching a string @{term s} with a regular
   101 to the problem of matching a string @{term s} with a regular
   102 expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
   102 expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
   103 succession) all the characters of the string matches the empty string,
   103 succession) all the characters of the string matches the empty string,
   104 then @{term r} matches @{term s} (and {\em vice versa}).  
   104 then @{term r} matches @{term s} (and {\em vice versa}).  
   105 The beauty of
   105 The beauty of
   214 \noindent In this paper, we shall first briefly introduce the basic notions
   214 \noindent In this paper, we shall first briefly introduce the basic notions
   215 of regular expressions and describe the definition
   215 of regular expressions and describe the definition
   216 of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves
   216 of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves
   217 as a reference point for what correctness means in our Isabelle/HOL proofs. We shall then prove
   217 as a reference point for what correctness means in our Isabelle/HOL proofs. We shall then prove
   218 the correctness for the bitcoded algorithm without simplification, and
   218 the correctness for the bitcoded algorithm without simplification, and
   219 after that extend the proof to include simplification.\mbox{}\\[-6mm]
   219 after that extend the proof to include simplification.
       
   220 Our Isabelle code including the results from Sec.~5 is available
       
   221 from \textcolor{darkblue}{\url{https://github.com/urbanchr/posix}}.
       
   222 \mbox{}\\[-6mm]
   220 
   223 
   221 *}
   224 *}
   222 
   225 
   223 section {* Background *}
   226 section {* Background *}
   224 
   227 
   254   \cite{AusafDyckhoffUrban2016}).
   257   \cite{AusafDyckhoffUrban2016}).
   255   
   258   
   256   In our work here we also add to the usual ``basic'' regular
   259   In our work here we also add to the usual ``basic'' regular
   257   expressions the \emph{bounded} regular expression @{term "NTIMES r
   260   expressions the \emph{bounded} regular expression @{term "NTIMES r
   258   n"} where the @{term n} specifies that @{term r} should match
   261   n"} where the @{term n} specifies that @{term r} should match
   259   exactly @{term n}-times. Again for brevity we omit the other bounded
   262   exactly @{term n}-times (it is not included in Sulzmann and Lu's original work). For brevity we omit the other bounded
   260   regular expressions @{text "r"}$^{\{..n\}}$, @{text "r"}$^{\{n..\}}$
   263   regular expressions @{text "r"}$^{\{..\textit{n}\}}$,
   261   and @{text "r"}$^{\{n..m\}}$ which specify intervals for how many
   264   @{text "r"}$^{\{\textit{n}..\}}$
       
   265   and @{text "r"}$^{\{\textit{n}..\textit{m}\}}$ which specify intervals for how many
   262   times @{text r} should match. The results presented in this paper
   266   times @{text r} should match. The results presented in this paper
   263   extend straightforwardly to them too. The importance of the bounded
   267   extend straightforwardly to them too. The importance of the bounded
   264   regular expressions is that they are often used in practical
   268   regular expressions is that they are often used in practical
   265   applications, such as Snort (a system for detecting network
   269   applications, such as Snort (a system for detecting network
   266   intrusions) and also in XML Schema definitions. According to Bj\"{o}rklund et
   270   intrusions) and also in XML Schema definitions. According to Bj\"{o}rklund et
   267   al~\cite{BjorklundMartensTimm2015}, bounded regular expressions 
   271   al~\cite{BjorklundMartensTimm2015}, bounded regular expressions 
   268   occur frequently in the latter and can have counters of up to
   272   occur frequently in the latter and can have counters of up to
   269   ten million.  The problem is that tools based on the classic notion
   273   ten million.  The problem is that tools based on the classic notion
   270   of automata need to expand @{text "r"}$^{\{n\}}$ into @{text n}
   274   of automata need to expand @{text "r"}$^{\{\textit{n}\}}$ into @{text n}
   271   connected copies of the automaton for @{text r}. This leads to very
   275   connected copies of the automaton for @{text r}. This leads to very
   272   inefficient matching algorithms or algorithms that consume large
   276   inefficient matching algorithms or algorithms that consume large
   273   amounts of memory.  A classic example is the regular expression
   277   amounts of memory.  A classic example is the regular expression
   274   \mbox{@{term "SEQ (SEQ (STAR (ALT a b)) a) (NTIMES (ALT a b) n)"}}
   278   \mbox{@{term "SEQ (SEQ (STAR (ALT a b)) a) (NTIMES (ALT a b) n)"}}
   275   where the minimal DFA requires at least $2^{n + 1}$ states (see
   279   where the minimal DFA requires at least $2^{n + 1}$ states (see
   276   \cite{CountingSet2020}). Therefore regular expression matching
   280   \cite{CountingSet2020}). Therefore regular expression matching
   277   libraries that rely on the classic notion of DFAs often impose 
   281   libraries that rely on the classic notion of DFAs often impose 
   278   adhoc limits for bounded regular expressions: For example in the
   282   adhoc limits for bounded regular expressions: For example in the
   279   regular expression matching library in the Go language and also in Google's RE2 library the regular expression
   283   regular expression matching library in the Go language and also in Google's RE2 library the regular expression
   280   @{term "NTIMES a 1001"} is not permitted, because no counter can be
   284   @{term "NTIMES a 1001"} is not permitted, because no counter can be
   281   above 1000; and in the built-in regular expression library in Rust
   285   above 1000; and in the regular expression library in Rust
   282   expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error
   286   expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error
   283   message for being too big.  These problems can of course be solved in matching
   287   message for being too big. Up until recently,\footnote{up until version 1.5.4 of the regex
       
   288   library in Rust; see also CVE-2022-24713.} Rust
       
   289   however happily generated automata for regular expressions such as 
       
   290   @{text "a\<^bsup>{0}{4294967295}\<^esup>"}. This was due to a bug
       
   291   in the algorithm that decides when a regular expression is acceptable
       
   292   or too big according to Rust's classification (it did not account for the fact that @{text "a\<^bsup>{0}\<^esup>"} and similar examples can match the empty string). We shall come back to
       
   293   this example later in the paper. 
       
   294   These problems can of course be solved in matching
   284   algorithms where automata go beyond the classic notion and for
   295   algorithms where automata go beyond the classic notion and for
   285   instance include explicit counters (see~\cite{CountingSet2020}).
   296   instance include explicit counters (e.g.~\cite{CountingSet2020}).
   286   The point here is that Brzozowski derivatives and the algorithms by
   297   The point here is that Brzozowski derivatives and the algorithms by
   287   Sulzmann and Lu can be straightforwardly extended to deal with
   298   Sulzmann and Lu can be straightforwardly extended to deal with
   288   bounded regular expressions and moreover the resulting code
   299   bounded regular expressions and moreover the resulting code
   289   still consists of only simple recursive functions and inductive
   300   still consists of only simple recursive functions and inductive
   290   datatypes. Finally, bounded regular expressions 
   301   datatypes. Finally, bounded regular expressions 
   291   do not destroy our finite boundedness property, which we shall
   302   do not destroy our finite boundedness property, which we shall
   292   prove later on.%, because during the lexing process counters will only be
   303   prove later on.
       
   304 
       
   305   %, because during the lexing process counters will only be
   293   %decremented.
   306   %decremented.
   294   
   307   
   295 
   308 
   296   Central to Brzozowski's regular expression matcher are two functions
   309   Central to Brzozowski's regular expression matcher are two functions
   297   called @{text nullable} and \emph{derivative}. The latter is written
   310   called @{text nullable} and \emph{derivative}. The latter is written
   383   inhabitation relation that associates values to regular expressions. Our
   396   inhabitation relation that associates values to regular expressions. Our
   384   version of this relation is defined by the following six rules:
   397   version of this relation is defined by the following six rules:
   385   %
   398   %
   386   \begin{center}
   399   \begin{center}
   387   \begin{tabular}{@ {}l@ {}}
   400   \begin{tabular}{@ {}l@ {}}
   388   @{thm[mode=Axiom] Prf.intros(4)}\quad
   401   @{thm[mode=Axiom] Prf.intros(4)}\qquad
   389   @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\quad
   402   @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\qquad
   390   @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>2" "r\<^sub>1"]}\quad
   403   @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>2" "r\<^sub>1"]}\qquad
   391   @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\
   404   @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\
   392   @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\qquad
   405   @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\qquad
   393   @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\qquad
   406   @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\qquad
   394   $\mprset{flushleft}\inferrule{
   407   $\mprset{flushleft}\inferrule{
   395   @{thm (prem 1) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}\\\\
   408   @{thm (prem 1) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}\\\\
   396   @{thm (prem 2) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}\\\\
   409   @{thm (prem 2) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}\quad
   397   @{thm (prem 3) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}
   410   @{thm (prem 3) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}
   398   }
   411   }
   399   {@{thm (concl) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}}
   412   {@{thm (concl) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}}
   400   $
   413   $
   401   \end{tabular}
   414   \end{tabular}
   435   Fig~\ref{POSIXrules}).
   448   Fig~\ref{POSIXrules}).
   436 
   449 
   437 \begin{figure}[t]
   450 \begin{figure}[t]
   438   \begin{center}\small%
   451   \begin{center}\small%
   439   \begin{tabular}{@ {\hspace{-2mm}}c@ {}}
   452   \begin{tabular}{@ {\hspace{-2mm}}c@ {}}
   440   \\[-4.5mm]
   453   \\[-8.5mm]
   441   @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad
   454   @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad
   442   @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad
   455   @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad
   443   @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad
   456   @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad
   444   @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\
   457   @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\
   445   $\mprset{flushleft}
   458   $\mprset{flushleft}
   446    \inferrule
   459    \inferrule
   447    {@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
   460    {@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
   448     @{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\
   461     @{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
   449     @{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}
   462     @{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}
   450    {@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\<open>PS\<close>\medskip\smallskip\\
   463    {@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\<open>PS\<close>\medskip\\
   451   @{thm[mode=Axiom] Posix.intros(7)}\<open>P[]\<close>\qquad
   464   @{thm[mode=Axiom] Posix.intros(7)}\<open>P[]\<close>\qquad
   452   $\mprset{flushleft}
   465   $\mprset{flushleft}
   453    \inferrule
   466    \inferrule
   454    {@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
   467    {@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
   455     @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
   468     @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
   456     @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
   469     @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
   457     @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
   470     @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
   458    {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\medskip\smallskip\\
   471    {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\medskip\\
   459   \mprset{sep=4mm} 
   472   \mprset{sep=4mm} 
   460   @{thm[mode=Rule] Posix.intros(9)}\<open>Pn[]\<close>\quad 
   473   @{thm[mode=Rule] Posix.intros(9)}\<open>Pn[]\<close>\quad\; 
   461   $\mprset{flushleft}
   474   $\mprset{flushleft}
   462    \inferrule
   475    \inferrule
   463    {@{thm (prem 1) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
   476    {@{thm (prem 1) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
   464     @{thm (prem 2) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
   477     @{thm (prem 2) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
   465     @{thm (prem 3) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \\\\
   478     @{thm (prem 3) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \\\\
   485   \begin{tabular}{@ {}lcl@ {}}
   498   \begin{tabular}{@ {}lcl@ {}}
   486   @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\
   499   @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\
   487   @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
   500   @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
   488   @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
   501   @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
   489   @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
   502   @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
   490   @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\
   503   @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}
   491   \end{tabular}
   504   \end{tabular}
   492   \medskip\\
   505   %\end{tabular}
   493 
   506   %\end{center}
   494   %!\begin{tabular}{@ {}cc@ {}}
   507   \smallskip\\
       
   508 
       
   509   %\begin{center}
       
   510   %\begin{tabular}{@ {}cc@ {}}
   495   \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}}
   511   \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}}
   496   @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
   512   @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)[of "c"]}\\
   497   @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & 
   513   @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & 
   498       @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
   514       @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
   499   @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & 
   515   @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & 
   500       @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
   516       @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
   501   %!
   517   %!
   523   The function @{text mkeps} is run when the last derivative is nullable, that is
   539   The function @{text mkeps} is run when the last derivative is nullable, that is
   524   the string to be matched is in the language of the regular expression. It generates
   540   the string to be matched is in the language of the regular expression. It generates
   525   a value for how the last derivative can match the empty string. In case
   541   a value for how the last derivative can match the empty string. In case
   526   of @{term "NTIMES r n"} we use the function @{term replicate} in order to generate
   542   of @{term "NTIMES r n"} we use the function @{term replicate} in order to generate
   527   a list of exactly @{term n} copies, which is the length of the list we expect in this
   543   a list of exactly @{term n} copies, which is the length of the list we expect in this
   528   case.  The injection function
   544   case.  The injection function\footnote{While the character argument @{text c} is not
       
   545   strictly necessary in the @{text inj}-function for the fragment of regular expressions we
       
   546   use in this paper, it is necessary for extended regular expressions. For example for the range regular expression of the form @{text "[a-z]"}.
       
   547   We therefore keep this argument from the original formulation of @{text inj} by Sulzmann and Lu.}
   529   then calculates the corresponding value for each intermediate derivative until
   548   then calculates the corresponding value for each intermediate derivative until
   530   a value for the original regular expression is generated.
   549   a value for the original regular expression is generated.
   531   Graphically the algorithm by
   550   Graphically the algorithm by
   532   Sulzmann and Lu can be illustrated by the following picture %in Figure~\ref{Sulz}
   551   Sulzmann and Lu can be illustrated by the following picture %in Figure~\ref{Sulz}
   533   where the path from the left to the right involving @{term derivatives}/@{const
   552   where the path from the left to the right involving @{term derivatives}/@{const
   534   nullable} is the first phase of the algorithm (calculating successive
   553   nullable} is the first phase of the algorithm (calculating successive
   535   \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
   554   \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
   536   left, the second phase.
   555   left, the second phase.
   537 %
   556 %
   538 \begin{center}
   557 \begin{center}
   539 \begin{tikzpicture}[scale=0.99,node distance=9mm,
   558 \begin{tikzpicture}[scale=0.85,node distance=8mm,
   540                     every node/.style={minimum size=6mm}]
   559                     every node/.style={minimum size=6mm}]
   541 \node (r1)  {@{term "r\<^sub>1"}};
   560 \node (r1)  {@{term "r\<^sub>1"}};
   542 \node (r2) [right=of r1]{@{term "r\<^sub>2"}};
   561 \node (r2) [right=of r1]{@{term "r\<^sub>2"}};
   543 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {@{term "der a DUMMY"}};
   562 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {@{term "der a DUMMY"}};
   544 \node (r3) [right=of r2]{@{term "r\<^sub>3"}};
   563 \node (r3) [right=of r2]{@{term "r\<^sub>3"}};
   552 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {\<open>inj r\<^sub>3 c\<close>};
   571 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {\<open>inj r\<^sub>3 c\<close>};
   553 \node (v2) [left=of v3]{@{term "v\<^sub>2"}};
   572 \node (v2) [left=of v3]{@{term "v\<^sub>2"}};
   554 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {\<open>inj r\<^sub>2 b\<close>};
   573 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {\<open>inj r\<^sub>2 b\<close>};
   555 \node (v1) [left=of v2] {@{term "v\<^sub>1"}};
   574 \node (v1) [left=of v2] {@{term "v\<^sub>1"}};
   556 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {\<open>inj r\<^sub>1 a\<close>};
   575 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {\<open>inj r\<^sub>1 a\<close>};
   557 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{@{term "mkeps"}}};
   576 \draw (r4) node[anchor=north west] {\;\raisebox{-5mm}{@{term "mkeps"}}};
   558 \end{tikzpicture}
   577 \end{tikzpicture}
   559 \end{center}
   578 \end{center}
   560 %
   579 %
   561 \noindent
   580 \noindent
   562   The picture shows the steps required when a
   581   The picture shows the steps required when a
   606 
   625 
   607   \begin{center}
   626   \begin{center}
   608   \begin{tabular}{lcl}
   627   \begin{tabular}{lcl}
   609   @{thm (lhs) lexer.simps(1)} & $\dn$ & @{thm (rhs) lexer.simps(1)}\\
   628   @{thm (lhs) lexer.simps(1)} & $\dn$ & @{thm (rhs) lexer.simps(1)}\\
   610   @{thm (lhs) lexer.simps(2)} & $\dn$ & @{text "case"} @{term "lexer (der c r) s"} @{text of}
   629   @{thm (lhs) lexer.simps(2)} & $\dn$ & @{text "case"} @{term "lexer (der c r) s"} @{text of}
   611                      @{term "None"}  @{text "\<Rightarrow>"} @{term None}\\
   630                      @{term "None"}  @{text "\<Rightarrow>"} @{term None}
   612                      & & \hspace{27mm}$|\;$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}                          
   631                      %%& & \hspace{27mm}
       
   632 		     $|\;$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}                          
   613   \end{tabular}
   633   \end{tabular}
   614   \end{center}
   634   \end{center}
   615 
   635 
   616 
   636 
   617 We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that
   637 We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that
   625 
   645 
   626   \noindent
   646   \noindent
   627   With this in place we were able to prove:
   647   With this in place we were able to prove:
   628 
   648 
   629   \begin{proposition}\mbox{}\label{lexercorrect}
   649   \begin{proposition}\mbox{}\label{lexercorrect}
   630   \textrm{(1)} @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}.\\
   650   \textrm{(1)}\; @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}.\\
   631   \mbox{\hspace{23.5mm}}\textrm{(2)}\; @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}.
   651   \mbox{\hspace{29.5mm}}\textrm{(2)}\; @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}.
   632   %
   652   %
   633   % \smallskip\\
   653   % \smallskip\\
   634   %\begin{tabular}{ll}
   654   %\begin{tabular}{ll}
   635   %(1) & @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}\\
   655   %(1) & @{thm (lhs) lexer_correct_None} \;if and only if\; @{thm (rhs) lexer_correct_None}\\
   636   %(2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\
   656   %(2) & @{thm (lhs) lexer_correct_Some} \;if and only if\; @{thm (rhs) lexer_correct_Some}\\
   637   %\end{tabular}
   657   %\end{tabular}
   638   \end{proposition}
   658   \end{proposition}
   639 
   659 
   640   \noindent
   660   \noindent
   641   In fact we have shown that, in the success case, the generated POSIX value $v$ is
   661   In fact we have shown that, in the success case, the generated POSIX value $v$ is
   655   Sulzmann and Lu describe another algorithm that also generates POSIX
   675   Sulzmann and Lu describe another algorithm that also generates POSIX
   656   values but dispenses with the second phase where characters are
   676   values but dispenses with the second phase where characters are
   657   injected ``back'' into values. For this they annotate bitcodes to
   677   injected ``back'' into values. For this they annotate bitcodes to
   658   regular expressions, which we define in Isabelle/HOL as the datatype\medskip
   678   regular expressions, which we define in Isabelle/HOL as the datatype\medskip
   659 
   679 
   660   %!\begin{center}
   680   \begin{center}
   661   \noindent
   681   %\noindent
   662   \begin{minipage}{1.01\textwidth}
   682   %\begin{minipage}{0.9\textwidth}
   663   \,@{term breg} $\,::=\,$ @{term "AZERO"}
   683   \,@{term breg} $\,::=\,$ @{term "AZERO"}
   664                $\,\mid$ @{term "AONE bs"}
   684                $\,\mid$ @{term "AONE bs"}
   665                $\,\mid$ @{term "ACHAR bs c"}
   685                $\,\mid$ @{term "ACHAR bs c"}
   666                $\,\mid$ @{term "AALTs bs rs"}
   686                $\,\mid$ @{term "AALTs bs rs"}
   667                $\,\mid$ @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}
   687                $\,\mid$ @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}
   668                $\,\mid$ @{term "ASTAR bs r"}
   688                $\,\mid$ @{term "ASTAR bs r"}
   669 	       $\,\mid$ \mbox{@{term "ANTIMES bs r n"}}\hfill\mbox{}
   689 	       $\,\mid$ \mbox{@{term "ANTIMES bs r n"}}%\hfill\mbox{}
   670   \end{minipage}\medskip	       
   690   %\end{minipage}\medskip	       
   671   %!\end{center}
   691   \end{center}
   672 
   692 
   673   \noindent where @{text bs} stands for bitsequences; @{text r},
   693   \noindent where @{text bs} stands for bitsequences; @{text r},
   674   @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular
   694   @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular
   675   expressions; and @{text rs} for lists of bitcoded regular
   695   expressions; and @{text rs} for lists of bitcoded regular
   676   expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"}
   696   expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"}
   700   \end{tabular}
   720   \end{tabular}
   701   \end{tabular}
   721   \end{tabular}
   702   \end{center}
   722   \end{center}
   703    
   723    
   704   \noindent
   724   \noindent
   705   As can be seen, this coding is ``lossy'' in the sense that we do not
   725   As can be seen, this coding is ``lossy'' in the sense that it does not
   706   record explicitly character values and also not sequence values (for
   726   record explicitly character values and also not sequence values (for
   707   them we just append two bitsequences). However, the
   727   them it just appends two bitsequences). However, the
   708   different alternatives for @{text Left}, respectively @{text Right}, are recorded as @{text Z} and
   728   different alternatives for @{text Left}, respectively @{text Right}, are recorded as @{text Z} and
   709   @{text S} followed by some bitsequence. Similarly, we use @{text Z} to indicate
   729   @{text S} followed by some bitsequence. Similarly, we use @{text Z} to indicate
   710   if there is still a value coming in the list of @{text Stars}, whereas @{text S}
   730   if there is still a value coming in the list of @{text Stars}, whereas @{text S}
   711   indicates the end of the list. The lossiness makes the process of
   731   indicates the end of the list. The lossiness makes the process of
   712   decoding a bit more involved, but the point is that if we have a
   732   decoding a bit more involved, but the point is that if we have a
   856   the functions \textit{bnullable} and \textit{bmkeps}(\textit{s}), which are the
   876   the functions \textit{bnullable} and \textit{bmkeps}(\textit{s}), which are the
   857   ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on
   877   ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on
   858   bitcoded regular expressions.
   878   bitcoded regular expressions.
   859   %
   879   %
   860   \begin{center}
   880   \begin{center}
   861   \begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{6mm}}c@ {}}
   881   \begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{10mm}}c@ {}}
   862   \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l}
   882   \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l}
   863   $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{False}$\\
   883   $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{False}$\\
   864   $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{True}$\\
   884   $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{True}$\\
   865   $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{False}$\\
   885   $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{False}$\\
   866   $\textit{bnullable}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &\\
   886   $\textit{bnullable}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &\\
   931   can define Sulzmann and Lu's bitcoded lexer, which we call \textit{blexer}:
   951   can define Sulzmann and Lu's bitcoded lexer, which we call \textit{blexer}:
   932   %
   952   %
   933   \begin{center}
   953   \begin{center}
   934 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
   954 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
   935   $\textit{blexer}\;r\,s$ & $\dn$ &
   955   $\textit{blexer}\;r\,s$ & $\dn$ &
   936       $\textit{let}\;r_{der} = (r^\uparrow)\backslash s\;\textit{in}$\\                
   956       $\textit{let}\;r_{der} = (r^\uparrow)\backslash s\;\textit{in}$\\               
   937   & & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\
   957   & & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$
   938   & & $\;\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$\\
   958       $\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$
   939   & & $\;\;\;\textit{else}\;\textit{None}$
   959       $\;\textit{else}\;\textit{None}$
   940 \end{tabular}
   960 \end{tabular}
   941 \end{center}
   961 \end{center}
   942 
   962 
   943   \noindent
   963   \noindent
   944 This bitcoded lexer first internalises the regular expression $r$ and then
   964 This bitcoded lexer first internalises the regular expression $r$ and then
   989 is the same as if we first erase the bitcoded regular expression and
  1009 is the same as if we first erase the bitcoded regular expression and
   990 then perform the ``standard'' derivative operation.
  1010 then perform the ``standard'' derivative operation.
   991 
  1011 
   992 \begin{lemma}\label{bnullable}%\mbox{}\smallskip\\
  1012 \begin{lemma}\label{bnullable}%\mbox{}\smallskip\\
   993 \textit{(1)} $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\    
  1013 \textit{(1)} $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\    
   994 \mbox{\hspace{17mm}}\textit{(2)} $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
  1014 \mbox{\hspace{21.5mm}}\textit{(2)} $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
   995 \mbox{\hspace{17mm}}\textit{(3)} $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$
  1015 \mbox{\hspace{21.5mm}}\textit{(3)} $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$
   996 %\begin{tabular}{ll}
  1016 %\begin{tabular}{ll}
   997 %\textit{(1)} & $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\    
  1017 %\textit{(1)} & $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\    
   998 %\textit{(2)} & $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
  1018 %\textit{(2)} & $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
   999 %\textit{(3)} & $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$.
  1019 %\textit{(3)} & $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ \;provided\; $\textit{nullable}(r^\downarrow)$.
  1000 %\end{tabular}  
  1020 %\end{tabular}  
  1001 \end{lemma}
  1021 \end{lemma}
  1002 
  1022 
  1003 %\begin{proof}
  1023 %\begin{proof}
  1004 %  All properties are by induction on annotated regular expressions.
  1024 %  All properties are by induction on annotated regular expressions.
  1028 Using this function we can recast the success case in @{text lexer} 
  1048 Using this function we can recast the success case in @{text lexer} 
  1029 as follows:
  1049 as follows:
  1030 
  1050 
  1031 \begin{lemma}\label{flex}
  1051 \begin{lemma}\label{flex}
  1032 If @{text "lexer r s = Some v"} \;then\; @{text "v = "}$\,\textit{flex}\,r\,id\,s\,
  1052 If @{text "lexer r s = Some v"} \;then\; @{text "v = "}$\,\textit{flex}\,r\,id\,s\,
  1033       (\mkeps (r\backslash s))$.
  1053       (\textit{mkeps}\,(r\backslash s))$.
  1034 \end{lemma}
  1054 \end{lemma}
  1035 
  1055 
  1036 \noindent
  1056 \noindent
  1037 Note we did not redefine \textit{lexer}, we just established that the
  1057 Note we did not redefine \textit{lexer}, we just established that the
  1038 value generated by \textit{lexer} can also be obtained by a different
  1058 value generated by \textit{lexer} can also be obtained by a different
  1109      cannot make any further simplifications. This is a problem because
  1129      cannot make any further simplifications. This is a problem because
  1110      the outermost alternatives contains two copies of the same
  1130      the outermost alternatives contains two copies of the same
  1111      regular expression (underlined with $r$). These copies will
  1131      regular expression (underlined with $r$). These copies will
  1112      spawn new copies in later derivative steps and they in turn even more copies. This
  1132      spawn new copies in later derivative steps and they in turn even more copies. This
  1113      destroys any hope of taming the size of the derivatives.  But the
  1133      destroys any hope of taming the size of the derivatives.  But the
  1114      second copy of $r$ in \eqref{derivex} will never contribute to a
  1134      second copy of $r$ in~\eqref{derivex} will never contribute to a
  1115      value, because POSIX lexing will always prefer matching a string
  1135      value, because POSIX lexing will always prefer matching a string
  1116      with the first copy. So it could be safely removed without affecting the correctness of the algorithm.
  1136      with the first copy. So it could be safely removed without affecting the correctness of the algorithm.
  1117      The issue with the simple-minded
  1137      The issue with the simple-minded
  1118      simplification rules above is that the rule $r + r \Rightarrow r$
  1138      simplification rules above is that the rule $r + r \Rightarrow r$
  1119      will never be applicable because as can be seen in this example the
  1139      will never be applicable because as can be seen in this example the
  1128      expressions is that they can be easily modified such that simplification does not
  1148      expressions is that they can be easily modified such that simplification does not
  1129      interfere with the value constructions. For example we can ``flatten'', or
  1149      interfere with the value constructions. For example we can ``flatten'', or
  1130      de-nest, or spill out, @{text ALTs} as follows
  1150      de-nest, or spill out, @{text ALTs} as follows
  1131      %
  1151      %
  1132      \[
  1152      \[
  1133      @{term "ALTs bs\<^sub>1 (((ALTs bs\<^sub>2 rs\<^sub>2)) # rs\<^sub>1)"}
  1153      @{text "ALTs bs\<^sub>1 ((ALTs bs\<^sub>2 rs\<^sub>2) :: rs\<^sub>1)"}
  1134      \quad\xrightarrow{bsimp}\quad
  1154      \quad\xrightarrow{bsimp}\quad
  1135      @{text "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"}
  1155      @{text "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"}
  1136      \]
  1156      \]
  1137 
  1157 
  1138      \noindent
  1158      \noindent
  1169 
  1189 
  1170      \begin{center}
  1190      \begin{center}
  1171      \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
  1191      \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
  1172      @{thm (lhs) distinctWith.simps(1)} & $\dn$ & @{thm (rhs) distinctWith.simps(1)}\\
  1192      @{thm (lhs) distinctWith.simps(1)} & $\dn$ & @{thm (rhs) distinctWith.simps(1)}\\
  1173      @{thm (lhs) distinctWith.simps(2)} & $\dn$ &
  1193      @{thm (lhs) distinctWith.simps(2)} & $\dn$ &
  1174      @{text "if (\<exists> y \<in> acc. eq x y)"} \\
  1194      @{text "if (\<exists> y \<in> acc. eq x y)"}
  1175      & & @{text "then distinctWith xs eq acc"}\\
  1195      @{text "then distinctWith xs eq acc"}\\
  1176      & & @{text "else x :: distinctWith xs eq ({x} \<union> acc)"}
  1196      & & @{text "else x :: distinctWith xs eq ({x} \<union> acc)"}
  1177      \end{tabular}
  1197      \end{tabular}
  1178      \end{center}
  1198      \end{center}
  1179 
  1199 
  1180      \noindent where we scan the list from left to right (because we
  1200      \noindent where we scan the list from left to right (because we
  1188      @{term distinctWith} where @{text eq} is an equivalence that deletes bitsequences from bitcoded regular expressions
  1208      @{term distinctWith} where @{text eq} is an equivalence that deletes bitsequences from bitcoded regular expressions
  1189      before comparing the components. One way to define this in Isabelle/HOL is by the following recursive function from
  1209      before comparing the components. One way to define this in Isabelle/HOL is by the following recursive function from
  1190      bitcoded regular expressions to @{text bool}:
  1210      bitcoded regular expressions to @{text bool}:
  1191      %
  1211      %
  1192      \begin{center}
  1212      \begin{center}
  1193      \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}}
  1213      \begin{tabular}{@ {}cc@ {}}
       
  1214      \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {\hspace{1mm}}}
  1194      @{thm (lhs) eq1.simps(1)} & $\dn$ & @{thm (rhs) eq1.simps(1)}\\
  1215      @{thm (lhs) eq1.simps(1)} & $\dn$ & @{thm (rhs) eq1.simps(1)}\\
  1195      @{thm (lhs) eq1.simps(2)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(2)[of DUMMY DUMMY]}\\
  1216      @{thm (lhs) eq1.simps(2)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(2)[of DUMMY DUMMY]}\\
       
  1217      @{thm (lhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]}\\
       
  1218       @{thm (lhs) eq1.simps(5)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(5)[of DUMMY DUMMY]}\\
       
  1219      \mbox{}
       
  1220      \end{tabular}
       
  1221      &
       
  1222      \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}}
  1196      @{thm (lhs) eq1.simps(3)[of DUMMY "c" DUMMY "d"]} & $\dn$ & @{thm (rhs) eq1.simps(3)[of DUMMY "c" DUMMY "d"]}\\
  1223      @{thm (lhs) eq1.simps(3)[of DUMMY "c" DUMMY "d"]} & $\dn$ & @{thm (rhs) eq1.simps(3)[of DUMMY "c" DUMMY "d"]}\\
  1197      @{thm (lhs) eq1.simps(4)[of DUMMY "r\<^sub>1\<^sub>1" "r\<^sub>1\<^sub>2" DUMMY "r\<^sub>2\<^sub>1" "r\<^sub>2\<^sub>2"]} & $\dn$ &
  1224      @{thm (lhs) eq1.simps(4)[of DUMMY "r\<^sub>1\<^sub>1" "r\<^sub>1\<^sub>2" DUMMY "r\<^sub>2\<^sub>1" "r\<^sub>2\<^sub>2"]} & $\dn$ &
  1198      @{thm (rhs) eq1.simps(4)[of DUMMY "r\<^sub>1\<^sub>1" "r\<^sub>1\<^sub>2" DUMMY "r\<^sub>2\<^sub>1" "r\<^sub>2\<^sub>2"]}\\
  1225      @{thm (rhs) eq1.simps(4)[of DUMMY "r\<^sub>1\<^sub>1" "r\<^sub>1\<^sub>2" DUMMY "r\<^sub>2\<^sub>1" "r\<^sub>2\<^sub>2"]}\\
  1199      @{thm (lhs) eq1.simps(5)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(5)[of DUMMY DUMMY]}\\
       
  1200      @{thm (lhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]} & $\dn$ &
       
  1201      @{thm (rhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]}\\
       
  1202      @{thm (lhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]}\\
       
  1203      @{thm (lhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]}\\
  1226      @{thm (lhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]}\\
       
  1227      @{thm (lhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]} & $\dn$ & \\
       
  1228      \multicolumn{3}{r}{@{thm (rhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]}}
       
  1229      \end{tabular}
  1204      \end{tabular}
  1230      \end{tabular}
  1205      \end{center}
  1231      \end{center}
  1206 
  1232 
  1207      \noindent
  1233      \noindent
  1208      where all other cases are set to @{text False}.
  1234      where all other cases are set to @{text False}.
  1210      but is needed in order to make the removal of unnecessary copies
  1236      but is needed in order to make the removal of unnecessary copies
  1211      to work properly.
  1237      to work properly.
  1212 
  1238 
  1213      Our simplification function depends on three more helper functions, one is called
  1239      Our simplification function depends on three more helper functions, one is called
  1214      @{text flts} and analyses lists of regular expressions coming from alternatives.
  1240      @{text flts} and analyses lists of regular expressions coming from alternatives.
  1215      It is defined as follows:
  1241      It is defined by four clauses as follows:
  1216 
  1242 
  1217      \begin{center}
  1243      \begin{center}
  1218      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  1244      \begin{tabular}{c}
  1219      \multicolumn{3}{@ {}c}{@{thm (lhs) flts.simps(1)} $\dn$ @{thm (rhs) flts.simps(1)} \qquad\qquad\qquad\qquad
  1245      @{thm (lhs) flts.simps(1)} $\dn$ @{thm (rhs) flts.simps(1)} \qquad
  1220      @{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)}}\smallskip\\
  1246      @{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)} \qquad
  1221      @{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\
  1247      @{text "flts ((ALTs bs' rs') :: rs"}
  1222      \multicolumn{3}{@ {}c}{@{text "flts (r :: rs)"} $\dn$ @{text "r :: flts rs"}
  1248      %@{ thm (lhs) flts.simps(3)[of "bs'" "rs'"]}
  1223      \hspace{5mm}(otherwise)}
  1249      $\dn$ @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\smallskip\\
       
  1250      @{text "flts (r :: rs)"} $\dn$ @{text "r :: flts rs"}\hspace{5mm}(otherwise)
  1224      \end{tabular}
  1251      \end{tabular}
  1225      \end{center}
  1252      \end{center}
  1226 
  1253 
  1227      \noindent
  1254      \noindent
  1228      The second clause of @{text flts} removes all instances of @{text ZERO} in alternatives and
  1255      The second clause of @{text flts} removes all instances of @{text ZERO} in alternatives and
  1229      the third ``de-nests'' alternatives (but retaining the
  1256      the third ``de-nests'' alternatives (but retains the
  1230      bitsequence @{text "bs'"} accumulated in the inner alternative). There are
  1257      bitsequence @{text "bs'"} accumulated in the inner alternative). There are
  1231      some corner cases to be considered when the resulting list inside an alternative is
  1258      some corner cases to be considered when the resulting list inside an alternative is
  1232      empty or a singleton list. We take care of those cases in the
  1259      empty or a singleton list. We take care of those cases in the
  1233      @{text "bsimpALTs"} function; similarly we define a helper function that simplifies
  1260      @{text "bsimpALTs"} function; similarly we define a helper function that simplifies
  1234      sequences according to the usual rules about @{text ZERO}s and @{text ONE}s:
  1261      sequences according to the usual rules about @{text ZERO}s and @{text ONE}s:
  1264      \end{tabular}
  1291      \end{tabular}
  1265      \end{center}
  1292      \end{center}
  1266 
  1293 
  1267      \noindent
  1294      \noindent
  1268      We believe our recursive function @{term bsimp} simplifies bitcoded regular
  1295      We believe our recursive function @{term bsimp} simplifies bitcoded regular
  1269      expressions as intended by Sulzmann and Lu. There is no point in applying the
  1296      expressions as intended by Sulzmann and Lu with the small addition of removing ``useless'' @{text ONE}s
       
  1297      in sequence regular
       
  1298      expressions. There is no point in applying the
  1270      @{text bsimp} function repeatedly (like the simplification in their paper which needs to be
  1299      @{text bsimp} function repeatedly (like the simplification in their paper which needs to be
  1271      applied until a fixpoint is reached) because we can show that @{term bsimp} is idempotent,
  1300      applied until a fixpoint is reached) because we can show that @{term bsimp} is idempotent,
  1272      that is
  1301      that is
  1273 
  1302 
  1274      \begin{proposition}
  1303      \begin{proposition}
  1302 
  1331 
  1303      \begin{center}
  1332      \begin{center}
  1304 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
  1333 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
  1305   $\textit{blexer}^+\;r\,s$ & $\dn$ &
  1334   $\textit{blexer}^+\;r\,s$ & $\dn$ &
  1306       $\textit{let}\;r_{der} = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\                
  1335       $\textit{let}\;r_{der} = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\                
  1307       & & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\
  1336       & & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$
  1308       & & $\;\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$ $\;\;\textit{else}\;\textit{None}$
  1337       $\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$ $\;\textit{else}\;\textit{None}$
  1309 \end{tabular}
  1338 \end{tabular}
  1310 \end{center}
  1339 \end{center}
  1311 
  1340 
  1312        \noindent
  1341        \noindent
  1313        Note that in $\textit{blexer}^+$ the derivative $r_{der}$ is calculated
  1342        Note that in $\textit{blexer}^+$ the derivative $r_{der}$ is calculated
  1341      expression anymore. So unless one can somehow
  1370      expression anymore. So unless one can somehow
  1342      synchronise the change in the simplified regular expressions with
  1371      synchronise the change in the simplified regular expressions with
  1343      the original POSIX value, there is no hope of appealing to @{text retrieve} in the
  1372      the original POSIX value, there is no hope of appealing to @{text retrieve} in the
  1344      correctness argument for @{term blexer_simp}.
  1373      correctness argument for @{term blexer_simp}.
  1345 
  1374 
  1346      We found it more helpful to introduce the rewriting systems shown in
  1375      For our proof we found it more helpful to introduce the rewriting systems shown in
  1347      Fig~\ref{SimpRewrites}. The idea is to generate 
  1376      Fig~\ref{SimpRewrites}. The idea is to generate 
  1348      simplified regular expressions in small steps (unlike the @{text bsimp}-function which
  1377      simplified regular expressions in small steps (unlike the @{text bsimp}-function which
  1349      does the same in a big step), and show that each of
  1378      does the same in a big step), and show that each of
  1350      the small steps preserves the bitcodes that lead to the POSIX value.
  1379      the small steps preserves the bitcodes that lead to the POSIX value.
  1351      The rewrite system is organised such that $\leadsto$ is for bitcoded regular
  1380      The rewrite system is organised such that $\leadsto$ is for bitcoded regular
  1433      we shall show next.
  1462      we shall show next.
  1434 
  1463 
  1435    \begin{figure}[t]
  1464    \begin{figure}[t]
  1436   \begin{center}
  1465   \begin{center}
  1437   \begin{tabular}{@ {\hspace{-8mm}}c@ {}}
  1466   \begin{tabular}{@ {\hspace{-8mm}}c@ {}}
  1438   \\[-7mm]
  1467   \\[-8mm]
  1439   @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}$S\ZERO{}_l$\quad
  1468   @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}$S\ZERO{}_l$\quad
  1440   @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}$S\ZERO{}_r$\quad
  1469   @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}$S\ZERO{}_r$\quad
  1441   @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}$S\ONE$\\
  1470   @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}$S\ONE$\\
  1442   @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}SL\qquad
  1471   @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}SL\qquad
  1443   @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}SR\\
  1472   @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}SR\\
  1509 where in (1) the set $\textit{Suf}(@{text "r"}_1, s)$ are all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$).
  1538 where in (1) the set $\textit{Suf}(@{text "r"}_1, s)$ are all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$).
  1510 In (3) we know that  $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is 
  1539 In (3) we know that  $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is 
  1511 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
  1540 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
  1512 than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands
  1541 than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands
  1513 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
  1542 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
  1514 We reason similarly for @{text STAR}.\smallskip
  1543 We reason similarly for @{text STAR} and @{text NT}.\smallskip
  1515 
  1544 
  1516 
  1545 
  1517 Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
  1546 Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
  1518 far from the actual bound we can expect. We can do better than this, but this does not improve
  1547 far from the actual bound we can expect. We can do better than this, but this does not improve
  1519 the finiteness property we are proving. If we are interested in a polynomial bound,
  1548 the finiteness property we are proving. If we are interested in a polynomial bound,
  1534 
  1563 
  1535 \noindent
  1564 \noindent
  1536 Essentially it matches the string with the longer @{text "Right"}-alternative in the
  1565 Essentially it matches the string with the longer @{text "Right"}-alternative in the
  1537 first sequence (and then the `rest' with the empty regular expression @{const ONE} from the second sequence). 
  1566 first sequence (and then the `rest' with the empty regular expression @{const ONE} from the second sequence). 
  1538 If we add the simplification above, then we obtain the following value
  1567 If we add the simplification above, then we obtain the following value
  1539 %
       
  1540 \[
       
  1541 @{term "Seq (Left (Char a)) (Left (Char b))"}
  1568 @{term "Seq (Left (Char a)) (Left (Char b))"}
  1542 \]
       
  1543 
       
  1544 \noindent
       
  1545 where the @{text "Left"}-alternatives get priority. However, this violates
  1569 where the @{text "Left"}-alternatives get priority. However, this violates
  1546 the POSIX rules and we have not been able to
  1570 the POSIX rules and we have not been able to
  1547 reconcile this problem. Therefore we leave better bounds for future work.%!\\[-6.5mm]
  1571 reconcile this problem. Therefore we leave better bounds for future work.%!\\[-6.5mm]
  1548 
  1572 
  1549 Note that while Antimirov was able to give a bound on the \emph{size}
  1573 Note also that while Antimirov was able to give a bound on the \emph{size}
  1550 of his partial derivatives~\cite{Antimirov95}, Brzozowski gave a bound
  1574 of his partial derivatives~\cite{Antimirov95}, Brzozowski gave a bound
  1551 on the \emph{number} of derivatives, provided they are quotient via
  1575 on the \emph{number} of derivatives, provided they are quotient via
  1552 ACI rules \cite{Brzozowski1964}. Brozozowski's result is crucial when one
  1576 ACI rules \cite{Brzozowski1964}. Brzozowski's result is crucial when one
  1553 uses derivatives for obtaining an automaton (it essentially bounds
  1577 uses his derivatives for obtaining a DFA (it essentially bounds
  1554 the number of states). However, this result does \emph{not}
  1578 the number of states). However, this result does \emph{not}
  1555 transfer to our setting where we are interested in the \emph{size} of the
  1579 transfer to our setting where we are interested in the \emph{size} of the
  1556 derivatives. For example it is not true for our derivatives that the
  1580 derivatives. For example it is \emph{not} true for our derivatives that the
  1557 set of of derivatives $r \backslash s$ for a given $r$ and all strings
  1581 set of derivatives $r \backslash s$ for a given $r$ and all strings
  1558 $s$ is finite (even when simplified). This is because for our annotated regular expressions
  1582 $s$ is finite (even when simplified). This is because for our annotated regular expressions
  1559 the bitcode annotation is dependent on the number of iterations that
  1583 the bitcode annotation is dependent on the number of iterations that
  1560 are needed for STAR-regular expressions. This is not a problem for us: Since we intend to do lexing
  1584 are needed for @{text STAR}-regular expressions. This is not a problem for us: Since we intend to do lexing
  1561 by calculating (as fast as possible) derivatives, the bound on the size
  1585 by calculating (as fast as possible) derivatives, the bound on the size
  1562 of the derivatives is important, not the number of derivatives.
  1586 of the derivatives is important, not their number. % of derivatives.
  1563 
  1587 
  1564 
  1588 
  1565 *}
  1589 *}
  1566 
  1590 
  1567 
  1591 
  1572    We set out in this work to prove in Isabelle/HOL the correctness of
  1596    We set out in this work to prove in Isabelle/HOL the correctness of
  1573    the second POSIX lexing algorithm by Sulzmann and Lu
  1597    the second POSIX lexing algorithm by Sulzmann and Lu
  1574    \cite{Sulzmann2014}. This follows earlier work where we established
  1598    \cite{Sulzmann2014}. This follows earlier work where we established
  1575    the correctness of the first algorithm
  1599    the correctness of the first algorithm
  1576    \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
  1600    \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
  1577    introduce our own specification about what POSIX values are,
  1601    introduce our own specification for POSIX values,
  1578    because the informal definition given by Sulzmann and Lu did not
  1602    because the informal definition given by Sulzmann and Lu did not
  1579    stand up to formal proof. Also for the second algorithm we needed
  1603    stand up to formal proof. Also for the second algorithm we needed
  1580    to introduce our own definitions and proof ideas in order to establish the
  1604    to introduce our own definitions and proof ideas in order to
  1581    correctness.  Our interest in the second algorithm 
  1605    establish the correctness.  Our interest in the second algorithm
  1582    lies in the fact that by using bitcoded regular expressions and an aggressive
  1606    lies in the fact that by using bitcoded regular expressions and an
  1583    simplification method there is a chance that the derivatives
  1607    aggressive simplification method there is a chance that the
  1584    can be kept universally small  (we established in this paper that
  1608    derivatives can be kept universally small (we established in this
  1585    for a given $r$ they can be kept finitely bounded for all strings).
  1609    paper that for a given $r$ they can be kept finitely bounded for
       
  1610    \emph{all} strings).  Our formalisation is approximately 7500 lines of
       
  1611    Isabelle code. A little more than half of this code concerns the finiteness bound
       
  1612    obtained in Section 5. This slight ``bloat'' in the latter part is because
       
  1613    we had to introduce an intermediate datatype for annotated regular expressions and repeat many
       
  1614    definitions for this intermediate datatype. But overall we think this made our
       
  1615    formalisation work smoother. The code of our formalisation can be found at
       
  1616    \textcolor{darkblue}{\url{https://github.com/urbanchr/posix}}.
  1586    %This is important if one is after
  1617    %This is important if one is after
  1587    %an efficient POSIX lexing algorithm based on derivatives.
  1618    %an efficient POSIX lexing algorithm based on derivatives.
  1588 
  1619 
  1589    Having proved the correctness of the POSIX lexing algorithm, which
  1620    Having proved the correctness of the POSIX lexing algorithm, which
  1590    lessons have we learned? Well, we feel this is a very good example
  1621    lessons have we learned? Well, we feel this is a very good example
  1595    obscure, examples.
  1626    obscure, examples.
  1596    %We found that from an implementation
  1627    %We found that from an implementation
  1597    %point-of-view it is really important to have the formal proofs of
  1628    %point-of-view it is really important to have the formal proofs of
  1598    %the corresponding properties at hand.
  1629    %the corresponding properties at hand.
  1599 
  1630 
  1600    With the results reported here, we can of course only make a claim about the correctness
  1631    With the results reported here, we can of course only make a claim
  1601    of the algorithm and the sizes of the
  1632    about the correctness of the algorithm and the sizes of the
  1602    derivatives, not about the efficiency or runtime of our version of
  1633    derivatives, not about the efficiency or runtime of our version of
  1603    Sulzman and Lu's algorithm. But we found the size is an important
  1634    Sulzmann and Lu's algorithm. But we found the size is an important
  1604    first indicator about efficiency: clearly if the derivatives can
  1635    first indicator about efficiency: clearly if the derivatives can
  1605    grow to arbitrarily big sizes and the algorithm needs to traverse
  1636    grow to arbitrarily big sizes and the algorithm needs to traverse
  1606    the derivatives possibly several times, then the algorithm will be
  1637    the derivatives possibly several times, then the algorithm will be
  1607    slow---excruciatingly slow that is. Other works seems to make
  1638    slow---excruciatingly slow that is. Other works seem to make
  1608    stronger claims, but during our work we have developed a healthy
  1639    stronger claims, but during our formalisation work we have
  1609    suspicion when for example experimental data is used to back up
  1640    developed a healthy suspicion when for example experimental data is
  1610    efficiency claims. For instance Sulzmann and Lu write about their
  1641    used to back up efficiency claims. For instance Sulzmann and Lu
  1611    equivalent of @{term blexer_simp} \textit{``...we can incrementally
  1642    write about their equivalent of @{term blexer_simp} \textit{``...we
  1612    compute bitcoded parse trees in linear time in the size of the
  1643    can incrementally compute bitcoded parse trees in linear time in
  1613    input''} \cite[Page 14]{Sulzmann2014}.  Given the growth of the
  1644    the size of the input''} \cite[Page 14]{Sulzmann2014}.  Given the
  1614    derivatives in some cases even after aggressive simplification,
  1645    growth of the derivatives in some cases even after aggressive
  1615    this is a hard to believe claim. A similar claim about a
  1646    simplification, this is a hard to believe claim. A similar claim
  1616    theoretical runtime of @{text "O(n\<^sup>2)"} is made for the
  1647    about a theoretical runtime of @{text "O(n\<^sup>2)"} for one
  1617    Verbatim lexer, which calculates tokens according to POSIX
  1648    specific list of regular expressions is made for the Verbatim
       
  1649    lexer, which calculates tokens according to POSIX
  1618    rules~\cite{verbatim}. For this, Verbatim uses Brzozowski's
  1650    rules~\cite{verbatim}. For this, Verbatim uses Brzozowski's
  1619    derivatives like in our work.  The authors write: \textit{``The
  1651    derivatives like in our work.  About their empirical data, the authors write:
  1620    results of our empirical tests [..] confirm that Verbatim has
  1652    \textit{``The results of our empirical tests [..] confirm that
  1621    @{text "O(n\<^sup>2)"} time complexity.''}
  1653    Verbatim has @{text "O(n\<^sup>2)"} time complexity.''}
  1622    \cite[Section~VII]{verbatim}.  While their correctness proof for
  1654    \cite[Section~VII]{verbatim}.  While their correctness proof for
  1623    Verbatim is formalised in Coq, the claim about the runtime
  1655    Verbatim is formalised in Coq, the claim about the runtime
  1624    complexity is only supported by some emperical evidence obtained by
  1656    complexity is only supported by some empirical evidence obtained by
  1625    using the code extraction facilities of Coq.  Given our observation
  1657    using the code extraction facilities of Coq.  Given our observation
  1626    with the ``growth problem'' of derivatives, we tried out their
  1658    with the ``growth problem'' of derivatives, this runtime bound
  1627    extracted OCaml code with the example \mbox{@{text "(a +
  1659    is unlikely to hold universally: indeed we tried out their extracted OCaml
  1628    aa)\<^sup>*"}} as a single lexing rule, and it took for us around 5
  1660    code with the example \mbox{@{text "(a + aa)\<^sup>*"}} as a single
  1629    minutes to tokenise a string of 40 $a$'s and that increased to
  1661    lexing rule, and it took for us around 5 minutes to tokenise a
  1630    approximately 19 minutes when the string is 50 $a$'s long. Taking
  1662    string of 40 $a$'s and that increased to approximately 19 minutes
  1631    into account that derivatives are not simplified in the Verbatim
  1663    when the string is 50 $a$'s long. Taking into account that
  1632    lexer, such numbers are not surprising.  Clearly our result of
  1664    derivatives are not simplified in the Verbatim lexer, such numbers
  1633    having finite derivatives might sound rather weak in this context
  1665    are not surprising.  Clearly our result of having finite
  1634    but we think such effeciency claims really require further
  1666    derivatives might sound rather weak in this context but we think
  1635    scrutiny.
  1667    such efficiency claims really require further scrutiny.  The
  1636 
  1668    contribution of this paper is to make sure derivatives do not grow
  1637    The contribution of this paper is to make sure derivatives do not
  1669    arbitrarily big (universally). In the example \mbox{@{text "(a +
  1638    grow arbitrarily big (universially) In the example \mbox{@{text "(a
  1670    aa)\<^sup>*"}}, \emph{all} derivatives have a size of 17 or
  1639    + aa)\<^sup>*"}}, \emph{all} derivatives have a size of 17 or
       
  1640    less. The result is that lexing a string of, say, 50\,000 a's with
  1671    less. The result is that lexing a string of, say, 50\,000 a's with
  1641    the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes
  1672    the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes
  1642    approximately 10 seconds with our Scala implementation of the
  1673    approximately 10 seconds with our Scala implementation of the
  1643    presented algorithm.
  1674    presented algorithm.
  1644 
  1675 
  1661    derivatives is never greater than 5 in this example. Even in the
  1692    derivatives is never greater than 5 in this example. Even in the
  1662    example from Section 2, where Rust raises an error message, namely
  1693    example from Section 2, where Rust raises an error message, namely
  1663    \mbox{@{text "a\<^bsup>{1000}{100}{5}\<^esup>"}}, the maximum size for
  1694    \mbox{@{text "a\<^bsup>{1000}{100}{5}\<^esup>"}}, the maximum size for
  1664    our derivatives is a moderate 14.
  1695    our derivatives is a moderate 14.
  1665 
  1696 
       
  1697    Let us also return to the example @{text
       
  1698    "a\<^bsup>{0}{4294967295}\<^esup>"} which until recently Rust
       
  1699    deemed acceptable. But this was due to a bug. It turns out that it took Rust
       
  1700    more than 11 minutes to generate an automaton for this regular
       
  1701    expression and then to determine that a string of just one(!)
       
  1702    @{text a} does \emph{not} match this regular expression.  Therefore
       
  1703    it is probably a wise choice that in newer versions of Rust's
       
  1704    regular expression library such regular expressions are declared as
       
  1705    ``too big'' and raise an error message. While this is clearly
       
  1706    a contrived example, the safety guaranties Rust wants to provide necessitate
       
  1707    this conservative approach.
       
  1708    However, with the derivatives and simplifications we have shown
       
  1709    in this paper, the example can be solved with ease:
       
  1710    it essentially only involves operations on
       
  1711    integers and our Scala implementation takes only a few seconds to
       
  1712    find out that this string, or even much larger strings, do not match.
       
  1713 
  1666    Let us also compare our work to the verified Verbatim++ lexer where
  1714    Let us also compare our work to the verified Verbatim++ lexer where
  1667    the authors of the Verbatim lexer introduced a number of
  1715    the authors of the Verbatim lexer introduced a number of
  1668    improvements and optimisations, for example memoisation
  1716    improvements and optimisations, for example memoisation
  1669    \cite{verbatimpp}. However, unlike Verbatim, which works with
  1717    \cite{verbatimpp}. However, unlike Verbatim, which works with
  1670    derivatives like in our work, Verbatim++ compiles first a regular
  1718    derivatives like in our work, Verbatim++ compiles first a regular
  1679    translation using the traditional notion of DFAs so that we can improve on this. Therefore we
  1727    translation using the traditional notion of DFAs so that we can improve on this. Therefore we
  1680    prefer to stick with calculating derivatives, but attempt to make
  1728    prefer to stick with calculating derivatives, but attempt to make
  1681    this calculation (in the future) as fast as possible. What we can guaranty
  1729    this calculation (in the future) as fast as possible. What we can guaranty
  1682    with the presented work is that the maximum size of the derivatives
  1730    with the presented work is that the maximum size of the derivatives
  1683    for \mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}$\,\cdot\,$@{term "STAR a"}} is never bigger than 9. This means our Scala
  1731    for \mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}$\,\cdot\,$@{term "STAR a"}} is never bigger than 9. This means our Scala
  1684    implementation only needs a few seconds for this example and matching 50\,000 a's.
  1732    implementation again only needs a few seconds for this example and matching 50\,000 a's, say.
  1685    %
  1733    %
  1686    %
  1734    %
  1687    %Possible ideas are
  1735    %Possible ideas are
  1688    %zippers which have been used in the context of parsing of
  1736    %zippers which have been used in the context of parsing of
  1689    %context-free grammars \cite{zipperparser}.
  1737    %context-free grammars \cite{zipperparser}.
  1690    \medskip
  1738    %\\[-5mm]  %\smallskip
  1691 
  1739    
  1692    \noindent
  1740    %\noindent
  1693    Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
  1741    %Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
  1694    %\\[-10mm]
  1742    %%\\[-10mm]
  1695 
  1743 
  1696 
  1744 
  1697 %%\bibliographystyle{plain}
  1745 %%\bibliographystyle{plain}
  1698 \bibliography{root}
  1746 \bibliography{root}
  1699 *}
  1747 *}