thys3/Paper.thy
changeset 642 6c13f76c070b
parent 617 3ea6a38c33b5
child 643 9580bae0500d
    95 
    95 
    96 In the last fifteen or so years, Brzozowski's derivatives of regular
    96 In the last fifteen or so years, Brzozowski's derivatives of regular
    97 expressions have sparked quite a bit of interest in the functional
    97 expressions have sparked quite a bit of interest in the functional
    98 programming and theorem prover communities.
    98 programming and theorem prover communities.
    99 Derivatives of a
    99 Derivatives of a
   100 regular expression, written @{term "der c r"}, give a simple solution
    100 regular expression, written @{term "der c r"}, give a simple solution
   101 to the problem of matching a string @{term s} with a regular
   101 to the problem of matching a string @{term s} with a regular
   102 expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
   102 expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
   103 succession) all the characters of the string matches the empty string,
   103 succession) all the characters of the string matches the empty string,
   104 then @{term r} matches @{term s} (and {\em vice versa}).  
   104 then @{term r} matches @{term s} (and {\em vice versa}).  
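  For readers who prefer code, this matching idea can be transcribed
  almost literally into Scala (the language of the implementation we
  refer to later). The sketch below uses the plain textbook clauses for
  @{text nullable} and @{text der}; it is a sketch, not our optimised
  implementation.

    // regular expressions as a plain Scala datatype
    abstract class Rexp
    case object ZERO extends Rexp                 // matches nothing
    case object ONE extends Rexp                  // matches the empty string
    case class CHAR(c: Char) extends Rexp
    case class ALT(r1: Rexp, r2: Rexp) extends Rexp
    case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
    case class STAR(r: Rexp) extends Rexp

    // can r match the empty string?
    def nullable(r: Rexp): Boolean = r match {
      case ZERO => false
      case ONE => true
      case CHAR(_) => false
      case ALT(r1, r2) => nullable(r1) || nullable(r2)
      case SEQ(r1, r2) => nullable(r1) && nullable(r2)
      case STAR(_) => true
    }

    // Brzozowski derivative of r w.r.t. the character c
    def der(c: Char, r: Rexp): Rexp = r match {
      case ZERO => ZERO
      case ONE => ZERO
      case CHAR(d) => if (c == d) ONE else ZERO
      case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
      case SEQ(r1, r2) =>
        if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
        else SEQ(der(c, r1), r2)
      case STAR(r1) => SEQ(der(c, r1), STAR(r1))
    }

    // r matches s iff the iterated derivative is nullable
    def matches(r: Rexp, s: List[Char]): Boolean =
      nullable(s.foldLeft(r)((r1, c) => der(c, r1)))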
   105 The beauty of
   105 The beauty of
    278   ad hoc limits for bounded regular expressions: for example, in the
    278   ad hoc limits for bounded regular expressions: for example, in the
    279   regular expression matching library of the Go language and also in Google's RE2 library, the regular expression
    279   regular expression matching library of the Go language and also in Google's RE2 library, the regular expression
   280   @{term "NTIMES a 1001"} is not permitted, because no counter can be
   280   @{term "NTIMES a 1001"} is not permitted, because no counter can be
   281   above 1000; and in the built-in regular expression library in Rust
   281   above 1000; and in the built-in regular expression library in Rust
   282   expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error
   282   expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error
   283   message for being too big.  These problems can of course be solved in matching
    283   message for being too big. Up until recently,\footnote{until version 1.5.4 of the regex
       
    284   library in Rust; see also CVE-2022-24713.} Rust,

    285   however, happily generated an automaton for the regular expression
       
   286   @{text "a\<^bsup>{0}{4294967295}\<^esup>"}. This was due to a bug
       
   287   in the algorithm that decides when a regular expression is acceptable
       
   288   or too big according to Rust's classification. We shall come back to
       
   289   this example later in the paper. 
       
   290   These problems can of course be solved in matching
   284   algorithms where automata go beyond the classic notion and for
   291   algorithms where automata go beyond the classic notion and for
   285   instance include explicit counters (see~\cite{CountingSet2020}).
   292   instance include explicit counters (see for example~\cite{CountingSet2020}).
   286   The point here is that Brzozowski derivatives and the algorithms by
   293   The point here is that Brzozowski derivatives and the algorithms by
   287   Sulzmann and Lu can be straightforwardly extended to deal with
   294   Sulzmann and Lu can be straightforwardly extended to deal with
   288   bounded regular expressions and moreover the resulting code
   295   bounded regular expressions and moreover the resulting code
   289   still consists of only simple recursive functions and inductive
   296   still consists of only simple recursive functions and inductive
   290   datatypes. Finally, bounded regular expressions 
   297   datatypes. Finally, bounded regular expressions 
   291   do not destroy our finite boundedness property, which we shall
   298   do not destroy our finite boundedness property, which we shall
   292   prove later on.%, because during the lexing process counters will only be
   299   prove later on.
       
   300 
       
   301   %, because during the lexing process counters will only be
   293   %decremented.
   302   %decremented.
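  For concreteness, the extension for @{term "NTIMES r n"} essentially
  needs only two extra clauses (we show here the textbook formulation;
  the precise definitions are part of our Isabelle development):

  \begin{center}
  \begin{tabular}{lcl}
  $\textit{nullable}\,(r^{\{n\}})$ & $\dn$ & $\textit{if}\;n = 0\;\textit{then}\;\textit{true}\;\textit{else}\;\textit{nullable}\,(r)$\\
  $(r^{\{n\}})\backslash c$ & $\dn$ & $\textit{if}\;n = 0\;\textit{then}\;\textit{ZERO}\;\textit{else}\;(r\backslash c)\cdot r^{\{n-1\}}$
  \end{tabular}
  \end{center}

  \noindent
  Note that the counter @{text n} is only ever decremented during
  matching; no counter needs to be incremented or compared against a
  limit.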
   294   
   303   
   295 
   304 
   296   Central to Brzozowski's regular expression matcher are two functions
   305   Central to Brzozowski's regular expression matcher are two functions
   297   called @{text nullable} and \emph{derivative}. The latter is written
   306   called @{text nullable} and \emph{derivative}. The latter is written
   391   @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\
   400   @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\
   392   @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\qquad
   401   @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\qquad
   393   @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\qquad
   402   @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\qquad
   394   $\mprset{flushleft}\inferrule{
   403   $\mprset{flushleft}\inferrule{
   395   @{thm (prem 1) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}\\\\
   404   @{thm (prem 1) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}\\\\
   396   @{thm (prem 2) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}\\\\
   405   @{thm (prem 2) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}\quad
   397   @{thm (prem 3) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}
   406   @{thm (prem 3) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}
   398   }
   407   }
   399   {@{thm (concl) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}}
   408   {@{thm (concl) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}}
   400   $
   409   $
   401   \end{tabular}
   410   \end{tabular}
   435   Fig~\ref{POSIXrules}).
   444   Fig~\ref{POSIXrules}).
   436 
   445 
   437 \begin{figure}[t]
   446 \begin{figure}[t]
   438   \begin{center}\small%
   447   \begin{center}\small%
   439   \begin{tabular}{@ {\hspace{-2mm}}c@ {}}
   448   \begin{tabular}{@ {\hspace{-2mm}}c@ {}}
   440   \\[-4.5mm]
   449   \\[-8.5mm]
   441   @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad
   450   @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad
   442   @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad
   451   @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad
   443   @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad
   452   @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad
   444   @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\
   453   @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\
   445   $\mprset{flushleft}
   454   $\mprset{flushleft}
   446    \inferrule
   455    \inferrule
   447    {@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
   456    {@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
   448     @{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\
   457     @{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad
   449     @{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}
   458     @{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}
   450    {@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\<open>PS\<close>\medskip\smallskip\\
   459    {@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\<open>PS\<close>\medskip\\
   451   @{thm[mode=Axiom] Posix.intros(7)}\<open>P[]\<close>\qquad
   460   @{thm[mode=Axiom] Posix.intros(7)}\<open>P[]\<close>\qquad
   452   $\mprset{flushleft}
   461   $\mprset{flushleft}
   453    \inferrule
   462    \inferrule
   454    {@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
   463    {@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
   455     @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
   464     @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
   456     @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
   465     @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
   457     @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
   466     @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
   458    {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\medskip\smallskip\\
   467    {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\medskip\\
   459   \mprset{sep=4mm} 
   468   \mprset{sep=4mm} 
   460   @{thm[mode=Rule] Posix.intros(9)}\<open>Pn[]\<close>\quad 
   469   @{thm[mode=Rule] Posix.intros(9)}\<open>Pn[]\<close>\quad\; 
   461   $\mprset{flushleft}
   470   $\mprset{flushleft}
   462    \inferrule
   471    \inferrule
   463    {@{thm (prem 1) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
   472    {@{thm (prem 1) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
   464     @{thm (prem 2) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
   473     @{thm (prem 2) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
   465     @{thm (prem 3) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \\\\
   474     @{thm (prem 3) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \\\\
   485   \begin{tabular}{@ {}lcl@ {}}
   494   \begin{tabular}{@ {}lcl@ {}}
   486   @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\
   495   @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\
   487   @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
   496   @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
   488   @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
   497   @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
   489   @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
   498   @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
   490   @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\
   499   @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}
   491   \end{tabular}
   500   \end{tabular}
   492   \medskip\\
   501   %\end{tabular}
   493 
   502   %\end{center}
   494   %!\begin{tabular}{@ {}cc@ {}}
   503   \smallskip\\
       
   504 
       
   505   %\begin{center}
       
   506   %\begin{tabular}{@ {}cc@ {}}
   495   \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}}
   507   \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}}
   496   @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
   508   @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)[of "c"]}\\
   497   @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & 
   509   @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & 
   498       @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
   510       @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
   499   @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & 
   511   @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & 
   500       @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
   512       @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
   501   %!
   513   %!
    523   The function @{text mkeps} is run when the last derivative is nullable, that is,
    535   The function @{text mkeps} is run when the last derivative is nullable, that is,
    524   when the string to be matched is in the language of the regular expression. It generates
    536   when the string to be matched is in the language of the regular expression. It generates
   525   a value for how the last derivative can match the empty string. In case
   537   a value for how the last derivative can match the empty string. In case
   526   of @{term "NTIMES r n"} we use the function @{term replicate} in order to generate
   538   of @{term "NTIMES r n"} we use the function @{term replicate} in order to generate
   527   a list of exactly @{term n} copies, which is the length of the list we expect in this
   539   a list of exactly @{term n} copies, which is the length of the list we expect in this
   528   case.  The injection function
   540   case.  The injection function\footnote{While the character argument @{text c} is not
       
   541   strictly necessary in the @{text inj}-function for the fragment of regular expressions we
       
    542   use in this paper, it is necessary for extended regular expressions, for example for range regular expressions of the form @{text "[a-z]"}.
       
   543   We therefore keep this argument from the original formulation of @{text inj} by Sulzmann and Lu.}
   529   then calculates the corresponding value for each intermediate derivative until
   544   then calculates the corresponding value for each intermediate derivative until
   530   a value for the original regular expression is generated.
   545   a value for the original regular expression is generated.
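  As a concrete counterpart to these definitions, values and @{text
  mkeps} can be sketched in Scala as follows (extending the earlier
  sketch; @{term replicate} becomes @{text "List.fill"}):

    // values, following the datatype given above
    abstract class Val
    case object Empty extends Val
    case class Chr(c: Char) extends Val
    case class Sequ(v1: Val, v2: Val) extends Val
    case class Left(v: Val) extends Val
    case class Right(v: Val) extends Val
    case class Stars(vs: List[Val]) extends Val

    // bounded repetitions, with the clauses from the earlier display:
    //   nullable: case NTIMES(r1, n) => n == 0 || nullable(r1)
    //   der:      case NTIMES(r1, n) =>
    //               if (n == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, n - 1))
    case class NTIMES(r: Rexp, n: Int) extends Rexp

    // how a nullable regex matches the empty string
    def mkeps(r: Rexp): Val = r match {
      case ONE => Empty
      case ALT(r1, r2) =>
        if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
      case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
      case STAR(_) => Stars(Nil)
      case NTIMES(r1, n) => Stars(List.fill(n)(mkeps(r1)))  // replicate
    }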
   531   Graphically the algorithm by
   546   Graphically the algorithm by
   532   Sulzmann and Lu can be illustrated by the following picture %in Figure~\ref{Sulz}
   547   Sulzmann and Lu can be illustrated by the following picture %in Figure~\ref{Sulz}
   533   where the path from the left to the right involving @{term derivatives}/@{const
   548   where the path from the left to the right involving @{term derivatives}/@{const
   534   nullable} is the first phase of the algorithm (calculating successive
   549   nullable} is the first phase of the algorithm (calculating successive
   535   \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
   550   \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to
   536   left, the second phase.
   551   left, the second phase.
   537 %
   552 %
   538 \begin{center}
   553 \begin{center}
   539 \begin{tikzpicture}[scale=0.99,node distance=9mm,
   554 \begin{tikzpicture}[scale=0.85,node distance=8mm,
   540                     every node/.style={minimum size=6mm}]
   555                     every node/.style={minimum size=6mm}]
   541 \node (r1)  {@{term "r\<^sub>1"}};
   556 \node (r1)  {@{term "r\<^sub>1"}};
   542 \node (r2) [right=of r1]{@{term "r\<^sub>2"}};
   557 \node (r2) [right=of r1]{@{term "r\<^sub>2"}};
   543 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {@{term "der a DUMMY"}};
   558 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {@{term "der a DUMMY"}};
   544 \node (r3) [right=of r2]{@{term "r\<^sub>3"}};
   559 \node (r3) [right=of r2]{@{term "r\<^sub>3"}};
   552 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {\<open>inj r\<^sub>3 c\<close>};
   567 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {\<open>inj r\<^sub>3 c\<close>};
   553 \node (v2) [left=of v3]{@{term "v\<^sub>2"}};
   568 \node (v2) [left=of v3]{@{term "v\<^sub>2"}};
   554 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {\<open>inj r\<^sub>2 b\<close>};
   569 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {\<open>inj r\<^sub>2 b\<close>};
   555 \node (v1) [left=of v2] {@{term "v\<^sub>1"}};
   570 \node (v1) [left=of v2] {@{term "v\<^sub>1"}};
   556 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {\<open>inj r\<^sub>1 a\<close>};
   571 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {\<open>inj r\<^sub>1 a\<close>};
   557 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{@{term "mkeps"}}};
   572 \draw (r4) node[anchor=north west] {\;\raisebox{-5mm}{@{term "mkeps"}}};
   558 \end{tikzpicture}
   573 \end{tikzpicture}
   559 \end{center}
   574 \end{center}
   560 %
   575 %
   561 \noindent
   576 \noindent
   562   The picture shows the steps required when a
   577   The picture shows the steps required when a
   606 
   621 
   607   \begin{center}
   622   \begin{center}
   608   \begin{tabular}{lcl}
   623   \begin{tabular}{lcl}
   609   @{thm (lhs) lexer.simps(1)} & $\dn$ & @{thm (rhs) lexer.simps(1)}\\
   624   @{thm (lhs) lexer.simps(1)} & $\dn$ & @{thm (rhs) lexer.simps(1)}\\
   610   @{thm (lhs) lexer.simps(2)} & $\dn$ & @{text "case"} @{term "lexer (der c r) s"} @{text of}
   625   @{thm (lhs) lexer.simps(2)} & $\dn$ & @{text "case"} @{term "lexer (der c r) s"} @{text of}
   611                      @{term "None"}  @{text "\<Rightarrow>"} @{term None}\\
   626                      @{term "None"}  @{text "\<Rightarrow>"} @{term None}
   612                      & & \hspace{27mm}$|\;$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}                          
   627                      %%& & \hspace{27mm}
       
   628 		     $|\;$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}                          
   613   \end{tabular}
   629   \end{tabular}
   614   \end{center}
   630   \end{center}
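  In Scala the two phases can be sketched as below; the @{text inj}
  clauses not displayed above are filled in from the standard
  definition (the @{text NTIMES} clause mirrors the @{text STAR} one):

    // inj "injects" c back into a value for the derivative of r,
    // giving a value for r itself
    def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
      case (CHAR(_), Empty) => Chr(c)
      case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
      case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
      case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
      case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
      case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
      case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
      case (NTIMES(r1, _), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
    }

    // derivatives forward, injections backward
    def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
      case Nil => if (nullable(r)) Some(mkeps(r)) else None
      case c :: cs => lexer(der(c, r), cs).map(v => inj(r, c, v))
    }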
   615 
   631 
   616 
   632 
   617 We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that
   633 We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that
   625 
   641 
   626   \noindent
   642   \noindent
   627   With this in place we were able to prove:
   643   With this in place we were able to prove:
   628 
   644 
   629   \begin{proposition}\mbox{}\label{lexercorrect}
   645   \begin{proposition}\mbox{}\label{lexercorrect}
   630   \textrm{(1)} @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}.\\
   646   \textrm{(1)}\; @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}.\\
   631   \mbox{\hspace{23.5mm}}\textrm{(2)}\; @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}.
   647   \mbox{\hspace{29.5mm}}\textrm{(2)}\; @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}.
   632   %
   648   %
   633   % \smallskip\\
   649   % \smallskip\\
   634   %\begin{tabular}{ll}
   650   %\begin{tabular}{ll}
   635   %(1) & @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}\\
   651   %(1) & @{thm (lhs) lexer_correct_None} \;if and only if\; @{thm (rhs) lexer_correct_None}\\
   636   %(2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\
   652   %(2) & @{thm (lhs) lexer_correct_Some} \;if and only if\; @{thm (rhs) lexer_correct_Some}\\
   637   %\end{tabular}
   653   %\end{tabular}
   638   \end{proposition}
   654   \end{proposition}
   639 
   655 
   640   \noindent
   656   \noindent
   641   In fact we have shown that, in the success case, the generated POSIX value $v$ is
   657   In fact we have shown that, in the success case, the generated POSIX value $v$ is
   655   Sulzmann and Lu describe another algorithm that also generates POSIX
   671   Sulzmann and Lu describe another algorithm that also generates POSIX
   656   values but dispenses with the second phase where characters are
   672   values but dispenses with the second phase where characters are
    657   injected ``back'' into values. For this they annotate regular expressions
    673   injected ``back'' into values. For this they annotate regular expressions
    658   with bitcodes, which we define in Isabelle/HOL as the datatype\medskip
    674   with bitcodes, which we define in Isabelle/HOL as the datatype\medskip
   659 
   675 
   660   %!\begin{center}
   676   \begin{center}
   661   \noindent
   677   %\noindent
   662   \begin{minipage}{1.01\textwidth}
   678   %\begin{minipage}{0.9\textwidth}
   663   \,@{term breg} $\,::=\,$ @{term "AZERO"}
   679   \,@{term breg} $\,::=\,$ @{term "AZERO"}
   664                $\,\mid$ @{term "AONE bs"}
   680                $\,\mid$ @{term "AONE bs"}
   665                $\,\mid$ @{term "ACHAR bs c"}
   681                $\,\mid$ @{term "ACHAR bs c"}
   666                $\,\mid$ @{term "AALTs bs rs"}
   682                $\,\mid$ @{term "AALTs bs rs"}
   667                $\,\mid$ @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}
   683                $\,\mid$ @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}
   668                $\,\mid$ @{term "ASTAR bs r"}
   684                $\,\mid$ @{term "ASTAR bs r"}
   669 	       $\,\mid$ \mbox{@{term "ANTIMES bs r n"}}\hfill\mbox{}
   685 	       $\,\mid$ \mbox{@{term "ANTIMES bs r n"}}%\hfill\mbox{}
   670   \end{minipage}\medskip	       
   686   %\end{minipage}\medskip	       
   671   %!\end{center}
   687   \end{center}
   672 
   688 
   673   \noindent where @{text bs} stands for bitsequences; @{text r},
   689   \noindent where @{text bs} stands for bitsequences; @{text r},
   674   @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular
   690   @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular
   675   expressions; and @{text rs} for lists of bitcoded regular
   691   expressions; and @{text rs} for lists of bitcoded regular
   676   expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"}
   692   expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"}
   700   \end{tabular}
   716   \end{tabular}
   701   \end{tabular}
   717   \end{tabular}
   702   \end{center}
   718   \end{center}
   703    
   719    
   704   \noindent
   720   \noindent
   705   As can be seen, this coding is ``lossy'' in the sense that we do not
   721   As can be seen, this coding is ``lossy'' in the sense that it does not
    706   explicitly record character values or sequence values (for
    722   explicitly record character values or sequence values (for
   707   them we just append two bitsequences). However, the
   723   them it just appends two bitsequences). However, the
   708   different alternatives for @{text Left}, respectively @{text Right}, are recorded as @{text Z} and
   724   different alternatives for @{text Left}, respectively @{text Right}, are recorded as @{text Z} and
   709   @{text S} followed by some bitsequence. Similarly, we use @{text Z} to indicate
   725   @{text S} followed by some bitsequence. Similarly, we use @{text Z} to indicate
   710   if there is still a value coming in the list of @{text Stars}, whereas @{text S}
   726   if there is still a value coming in the list of @{text Stars}, whereas @{text S}
   711   indicates the end of the list. The lossiness makes the process of
   727   indicates the end of the list. The lossiness makes the process of
   712   decoding a bit more involved, but the point is that if we have a
   728   decoding a bit more involved, but the point is that if we have a
   931   can define Sulzmann and Lu's bitcoded lexer, which we call \textit{blexer}:
   947   can define Sulzmann and Lu's bitcoded lexer, which we call \textit{blexer}:
   932   %
   948   %
   933   \begin{center}
   949   \begin{center}
   934 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
   950 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
   935   $\textit{blexer}\;r\,s$ & $\dn$ &
   951   $\textit{blexer}\;r\,s$ & $\dn$ &
   936       $\textit{let}\;r_{der} = (r^\uparrow)\backslash s\;\textit{in}$\\                
   952       $\textit{let}\;r_{der} = (r^\uparrow)\backslash s\;\textit{in}$\\               
   937   & & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\
   953   & & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$
   938   & & $\;\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$\\
   954       $\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$
   939   & & $\;\;\;\textit{else}\;\textit{None}$
   955       $\;\textit{else}\;\textit{None}$
   940 \end{tabular}
   956 \end{tabular}
   941 \end{center}
   957 \end{center}
   942 
   958 
   943   \noindent
   959   \noindent
   944 This bitcoded lexer first internalises the regular expression $r$ and then
   960 This bitcoded lexer first internalises the regular expression $r$ and then
   989 is the same as if we first erase the bitcoded regular expression and
  1005 is the same as if we first erase the bitcoded regular expression and
   990 then perform the ``standard'' derivative operation.
  1006 then perform the ``standard'' derivative operation.
   991 
  1007 
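In Scala the annotated regular expressions and @{text bnullable} can be
sketched as follows (@{text bder}, @{text bmkeps} and @{text decode}
follow the definitions described above and are omitted in this sketch):

    // bits and bitcoded (annotated) regular expressions
    abstract class Bit
    case object Z extends Bit
    case object S extends Bit

    abstract class ARexp
    case object AZERO extends ARexp
    case class AONE(bs: List[Bit]) extends ARexp
    case class ACHAR(bs: List[Bit], c: Char) extends ARexp
    case class AALTS(bs: List[Bit], rs: List[ARexp]) extends ARexp
    case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
    case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp
    case class ANTIMES(bs: List[Bit], r: ARexp, n: Int) extends ARexp

    // bnullable mirrors nullable on the erased regular expression
    // (this is property (2) of the lemma below)
    def bnullable(r: ARexp): Boolean = r match {
      case AZERO => false
      case AONE(_) => true
      case ACHAR(_, _) => false
      case AALTS(_, rs) => rs.exists(bnullable)
      case ASEQ(_, r1, r2) => bnullable(r1) && bnullable(r2)
      case ASTAR(_, _) => true
      case ANTIMES(_, r1, n) => n == 0 || bnullable(r1)
    }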
   992 \begin{lemma}\label{bnullable}%\mbox{}\smallskip\\
  1008 \begin{lemma}\label{bnullable}%\mbox{}\smallskip\\
   993 \textit{(1)} $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\    
  1009 \textit{(1)} $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\    
   994 \mbox{\hspace{17mm}}\textit{(2)} $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
  1010 \mbox{\hspace{21.5mm}}\textit{(2)} $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
   995 \mbox{\hspace{17mm}}\textit{(3)} $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$
  1011 \mbox{\hspace{21.5mm}}\textit{(3)} $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$
   996 %\begin{tabular}{ll}
  1012 %\begin{tabular}{ll}
   997 %\textit{(1)} & $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\    
  1013 %\textit{(1)} & $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\    
   998 %\textit{(2)} & $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
  1014 %\textit{(2)} & $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
   999 %\textit{(3)} & $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$.
  1015 %\textit{(3)} & $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ \;provided\; $\textit{nullable}(r^\downarrow)$.
  1000 %\end{tabular}  
  1016 %\end{tabular}  
  1001 \end{lemma}
  1017 \end{lemma}
  1002 
  1018 
  1003 %\begin{proof}
  1019 %\begin{proof}
  1004 %  All properties are by induction on annotated regular expressions.
  1020 %  All properties are by induction on annotated regular expressions.
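In Scala, @{text flex} (with @{text der} and @{text inj} as in the
earlier sketches) amounts to composing the delayed injections:

    // flex delays the injections by accumulating them in a function
    def flex(r: Rexp, f: Val => Val, s: List[Char]): Val => Val = s match {
      case Nil => f
      case c :: cs => flex(der(c, r), (v: Val) => f(inj(r, c, v)), cs)
    }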
  1028 Using this function we can recast the success case in @{text lexer} 
  1044 Using this function we can recast the success case in @{text lexer} 
  1029 as follows:
  1045 as follows:
  1030 
  1046 
  1031 \begin{lemma}\label{flex}
  1047 \begin{lemma}\label{flex}
  1032 If @{text "lexer r s = Some v"} \;then\; @{text "v = "}$\,\textit{flex}\,r\,id\,s\,
  1048 If @{text "lexer r s = Some v"} \;then\; @{text "v = "}$\,\textit{flex}\,r\,id\,s\,
  1033       (\mkeps (r\backslash s))$.
  1049       (\textit{mkeps}\,(r\backslash s))$.
  1034 \end{lemma}
  1050 \end{lemma}
  1035 
  1051 
  1036 \noindent
  1052 \noindent
   1037 Note that we did not redefine \textit{lexer}; we just established that the
   1053 Note that we did not redefine \textit{lexer}; we just established that the
  1038 value generated by \textit{lexer} can also be obtained by a different
  1054 value generated by \textit{lexer} can also be obtained by a different
  1109      cannot make any further simplifications. This is a problem because
  1125      cannot make any further simplifications. This is a problem because
   1110      the outermost alternative contains two copies of the same
   1126      the outermost alternative contains two copies of the same
   1111      regular expression (underlined with $r$). These copies will
   1127      regular expression (underlined with $r$). These copies will
   1112      spawn new copies in later derivative steps, and these in turn spawn even more copies. This
   1128      spawn new copies in later derivative steps, and these in turn spawn even more copies. This
  1113      destroys any hope of taming the size of the derivatives.  But the
  1129      destroys any hope of taming the size of the derivatives.  But the
  1114      second copy of $r$ in \eqref{derivex} will never contribute to a
  1130      second copy of $r$ in~\eqref{derivex} will never contribute to a
  1115      value, because POSIX lexing will always prefer matching a string
  1131      value, because POSIX lexing will always prefer matching a string
  1116      with the first copy. So it could be safely removed without affecting the correctness of the algorithm.
  1132      with the first copy. So it could be safely removed without affecting the correctness of the algorithm.
  1117      The issue with the simple-minded
  1133      The issue with the simple-minded
  1118      simplification rules above is that the rule $r + r \Rightarrow r$
  1134      simplification rules above is that the rule $r + r \Rightarrow r$
   1119      will never be applicable because, as can be seen in this example, the
   1135      will never be applicable because, as can be seen in this example, the
  1128      expressions is that they can be easily modified such that simplification does not
  1144      expressions is that they can be easily modified such that simplification does not
  1129      interfere with the value constructions. For example we can ``flatten'', or
  1145      interfere with the value constructions. For example we can ``flatten'', or
  1130      de-nest, or spill out, @{text ALTs} as follows
  1146      de-nest, or spill out, @{text ALTs} as follows
  1131      %
  1147      %
  1132      \[
  1148      \[
  1133      @{term "ALTs bs\<^sub>1 (((ALTs bs\<^sub>2 rs\<^sub>2)) # rs\<^sub>1)"}
  1149      @{text "ALTs bs\<^sub>1 ((ALTs bs\<^sub>2 rs\<^sub>2) :: rs\<^sub>1)"}
  1134      \quad\xrightarrow{bsimp}\quad
  1150      \quad\xrightarrow{bsimp}\quad
  1135      @{text "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"}
  1151      @{text "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"}
  1136      \]
  1152      \]
  1137 
  1153 
  1138      \noindent
  1154      \noindent
  1169 
  1185 
  1170      \begin{center}
  1186      \begin{center}
  1171      \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
  1187      \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
  1172      @{thm (lhs) distinctWith.simps(1)} & $\dn$ & @{thm (rhs) distinctWith.simps(1)}\\
  1188      @{thm (lhs) distinctWith.simps(1)} & $\dn$ & @{thm (rhs) distinctWith.simps(1)}\\
  1173      @{thm (lhs) distinctWith.simps(2)} & $\dn$ &
  1189      @{thm (lhs) distinctWith.simps(2)} & $\dn$ &
  1174      @{text "if (\<exists> y \<in> acc. eq x y)"} \\
  1190      @{text "if (\<exists> y \<in> acc. eq x y)"}
  1175      & & @{text "then distinctWith xs eq acc"}\\
  1191      @{text "then distinctWith xs eq acc"}\\
  1176      & & @{text "else x :: distinctWith xs eq ({x} \<union> acc)"}
  1192      & & @{text "else x :: distinctWith xs eq ({x} \<union> acc)"}
  1177      \end{tabular}
  1193      \end{tabular}
  1178      \end{center}
  1194      \end{center}
  1179 
  1195 
  1180      \noindent where we scan the list from left to right (because we
  1196      \noindent where we scan the list from left to right (because we
  1188      @{term distinctWith} where @{text eq} is an equivalence that deletes bitsequences from bitcoded regular expressions
  1204      @{term distinctWith} where @{text eq} is an equivalence that deletes bitsequences from bitcoded regular expressions
  1189      before comparing the components. One way to define this in Isabelle/HOL is by the following recursive function from
  1205      before comparing the components. One way to define this in Isabelle/HOL is by the following recursive function from
  1190      bitcoded regular expressions to @{text bool}:
  1206      bitcoded regular expressions to @{text bool}:
  1191      %
  1207      %
  1192      \begin{center}
  1208      \begin{center}
  1193      \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}}
  1209      \begin{tabular}{@ {}cc@ {}}
       
  1210      \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {\hspace{1mm}}}
  1194      @{thm (lhs) eq1.simps(1)} & $\dn$ & @{thm (rhs) eq1.simps(1)}\\
  1211      @{thm (lhs) eq1.simps(1)} & $\dn$ & @{thm (rhs) eq1.simps(1)}\\
  1195      @{thm (lhs) eq1.simps(2)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(2)[of DUMMY DUMMY]}\\
  1212      @{thm (lhs) eq1.simps(2)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(2)[of DUMMY DUMMY]}\\
       
  1213      @{thm (lhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]}\\
       
  1214       @{thm (lhs) eq1.simps(5)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(5)[of DUMMY DUMMY]}\\
       
  1215      \mbox{}
       
  1216      \end{tabular}
       
  1217      &
       
  1218      \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}}
  1196      @{thm (lhs) eq1.simps(3)[of DUMMY "c" DUMMY "d"]} & $\dn$ & @{thm (rhs) eq1.simps(3)[of DUMMY "c" DUMMY "d"]}\\
  1219      @{thm (lhs) eq1.simps(3)[of DUMMY "c" DUMMY "d"]} & $\dn$ & @{thm (rhs) eq1.simps(3)[of DUMMY "c" DUMMY "d"]}\\
  1197      @{thm (lhs) eq1.simps(4)[of DUMMY "r\<^sub>1\<^sub>1" "r\<^sub>1\<^sub>2" DUMMY "r\<^sub>2\<^sub>1" "r\<^sub>2\<^sub>2"]} & $\dn$ &
  1220      @{thm (lhs) eq1.simps(4)[of DUMMY "r\<^sub>1\<^sub>1" "r\<^sub>1\<^sub>2" DUMMY "r\<^sub>2\<^sub>1" "r\<^sub>2\<^sub>2"]} & $\dn$ &
  1198      @{thm (rhs) eq1.simps(4)[of DUMMY "r\<^sub>1\<^sub>1" "r\<^sub>1\<^sub>2" DUMMY "r\<^sub>2\<^sub>1" "r\<^sub>2\<^sub>2"]}\\
  1221      @{thm (rhs) eq1.simps(4)[of DUMMY "r\<^sub>1\<^sub>1" "r\<^sub>1\<^sub>2" DUMMY "r\<^sub>2\<^sub>1" "r\<^sub>2\<^sub>2"]}\\
  1199      @{thm (lhs) eq1.simps(5)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(5)[of DUMMY DUMMY]}\\
       
  1200      @{thm (lhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]} & $\dn$ &
       
  1201      @{thm (rhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]}\\
       
  1202      @{thm (lhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(7)[of DUMMY "r\<^sub>1" DUMMY "r\<^sub>2"]}\\
       
  1203      @{thm (lhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]}\\
  1222      @{thm (lhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]} & $\dn$ & @{thm (rhs) eq1.simps(8)[of DUMMY "r\<^sub>1" "n\<^sub>1" DUMMY "r\<^sub>2" "n\<^sub>2"]}\\
       
  1223      @{thm (lhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]} & $\dn$ & \\
       
  1224      \multicolumn{3}{r}{@{thm (rhs) eq1.simps(6)[of DUMMY "r\<^sub>1" "rs\<^sub>1" DUMMY "r\<^sub>2" "rs\<^sub>2"]}}
       
  1225      \end{tabular}
  1204      \end{tabular}
  1226      \end{tabular}
  1205      \end{center}
  1227      \end{center}
  1206 
  1228 
  1207      \noindent
  1229      \noindent
  1208      where all other cases are set to @{text False}.
  1230      where all other cases are set to @{text False}.
  1210      but is needed in order to make the removal of unnecessary copies
  1232      but is needed in order to make the removal of unnecessary copies
   1211      work properly.
   1233      work properly.
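     A Scala sketch of @{term distinctWith}, a direct transcription of
     the clauses displayed earlier, is:

       // remove "duplicates" w.r.t. the equivalence eq; acc records
       // everything that has been seen so far
       def distinctWith[A](xs: List[A], eq: (A, A) => Boolean,
                           acc: Set[A]): List[A] = xs match {
         case Nil => Nil
         case x :: rest =>
           if (acc.exists(y => eq(x, y))) distinctWith(rest, eq, acc)
           else x :: distinctWith(rest, eq, acc + x)
       }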
  1212 
  1234 
   1213      Our simplification function depends on three more helper functions; one is called
   1235      Our simplification function depends on three more helper functions; one is called
  1214      @{text flts} and analyses lists of regular expressions coming from alternatives.
  1236      @{text flts} and analyses lists of regular expressions coming from alternatives.
  1215      It is defined as follows:
  1237      It is defined by four clauses as follows:
  1216 
  1238 
  1217      \begin{center}
  1239      \begin{center}
  1218      \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
  1240      \begin{tabular}{c}
  1219      \multicolumn{3}{@ {}c}{@{thm (lhs) flts.simps(1)} $\dn$ @{thm (rhs) flts.simps(1)} \qquad\qquad\qquad\qquad
  1241      @{thm (lhs) flts.simps(1)} $\dn$ @{thm (rhs) flts.simps(1)} \qquad
  1220      @{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)}}\smallskip\\
  1242      @{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)} \qquad
  1221      @{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\
  1243      @{text "flts ((ALTs bs' rs') :: rs"}
  1222      \multicolumn{3}{@ {}c}{@{text "flts (r :: rs)"} $\dn$ @{text "r :: flts rs"}
  1244      %@{ thm (lhs) flts.simps(3)[of "bs'" "rs'"]}
  1223      \hspace{5mm}(otherwise)}
  1245      $\dn$ @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\smallskip\\
       
  1246      @{text "flts (r :: rs)"} $\dn$ @{text "r :: flts rs"}\hspace{5mm}(otherwise)
  1224      \end{tabular}
  1247      \end{tabular}
  1225      \end{center}
  1248      \end{center}
  1226 
  1249 
  1227      \noindent
  1250      \noindent
  1228      The second clause of @{text flts} removes all instances of @{text ZERO} in alternatives and
  1251      The second clause of @{text flts} removes all instances of @{text ZERO} in alternatives and
  1229      the third ``de-nests'' alternatives (but retaining the
  1252      the third ``de-nests'' alternatives (but retains the
  1230      bitsequence @{text "bs'"} accumulated in the inner alternative). There are
  1253      bitsequence @{text "bs'"} accumulated in the inner alternative). There are
  1231      some corner cases to be considered when the resulting list inside an alternative is
  1254      some corner cases to be considered when the resulting list inside an alternative is
  1232      empty or a singleton list. We take care of those cases in the
  1255      empty or a singleton list. We take care of those cases in the
  1233      @{text "bsimpALTs"} function; similarly we define a helper function that simplifies
  1256      @{text "bsimpALTs"} function; similarly we define a helper function that simplifies
  1234      sequences according to the usual rules about @{text ZERO}s and @{text ONE}s:
  1257      sequences according to the usual rules about @{text ZERO}s and @{text ONE}s:
  1264      \end{tabular}
  1287      \end{tabular}
  1265      \end{center}
  1288      \end{center}
  1266 
  1289 
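     Putting these pieces together, the simplification phase can be
     sketched in Scala as follows. Here @{text fuse} is the function
     from the flattening rule earlier, @{text bsimpALTs} and @{text
     bsimpSEQ} render the corner cases just described, and the
     equivalence @{text eq1} is modelled simply by deleting all
     bitsequences before comparing (our rendering of the elided
     definitions, not a verbatim transcription):

       // fuse prepends a bitsequence onto the top-most annotation
       def fuse(bs: List[Bit], r: ARexp): ARexp = r match {
         case AZERO => AZERO
         case AONE(cs) => AONE(bs ++ cs)
         case ACHAR(cs, c) => ACHAR(bs ++ cs, c)
         case AALTS(cs, rs) => AALTS(bs ++ cs, rs)
         case ASEQ(cs, r1, r2) => ASEQ(bs ++ cs, r1, r2)
         case ASTAR(cs, r1) => ASTAR(bs ++ cs, r1)
         case ANTIMES(cs, r1, n) => ANTIMES(bs ++ cs, r1, n)
       }

       // flts: remove AZEROs and de-nest alternatives
       def flts(rs: List[ARexp]): List[ARexp] = rs match {
         case Nil => Nil
         case AZERO :: rest => flts(rest)
         case AALTS(bs, rs1) :: rest => rs1.map(fuse(bs, _)) ++ flts(rest)
         case r :: rest => r :: flts(rest)
       }

       // eq1 modelled as equality after deleting all bitsequences
       def strip(r: ARexp): ARexp = r match {
         case AZERO => AZERO
         case AONE(_) => AONE(Nil)
         case ACHAR(_, c) => ACHAR(Nil, c)
         case AALTS(_, rs) => AALTS(Nil, rs.map(strip))
         case ASEQ(_, r1, r2) => ASEQ(Nil, strip(r1), strip(r2))
         case ASTAR(_, r1) => ASTAR(Nil, strip(r1))
         case ANTIMES(_, r1, n) => ANTIMES(Nil, strip(r1), n)
       }
       def eq1(r1: ARexp, r2: ARexp): Boolean = strip(r1) == strip(r2)

       // corner cases: empty and singleton alternatives ...
       def bsimpALTs(bs: List[Bit], rs: List[ARexp]): ARexp = rs match {
         case Nil => AZERO
         case List(r) => fuse(bs, r)
         case rs1 => AALTS(bs, rs1)
       }
       // ... and the usual ZERO/ONE rules for sequences
       def bsimpSEQ(bs: List[Bit], r1: ARexp, r2: ARexp): ARexp =
         (r1, r2) match {
           case (AZERO, _) => AZERO
           case (_, AZERO) => AZERO
           case (AONE(cs), _) => fuse(bs ++ cs, r2)
           case _ => ASEQ(bs, r1, r2)
         }

       def bsimp(r: ARexp): ARexp = r match {
         case ASEQ(bs, r1, r2) => bsimpSEQ(bs, bsimp(r1), bsimp(r2))
         case AALTS(bs, rs) =>
           bsimpALTs(bs, distinctWith(flts(rs.map(bsimp)), eq1, Set()))
         case r1 => r1
       }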
  1267      \noindent
  1290      \noindent
  1268      We believe our recursive function @{term bsimp} simplifies bitcoded regular
  1291      We believe our recursive function @{term bsimp} simplifies bitcoded regular
  1269      expressions as intended by Sulzmann and Lu. There is no point in applying the
  1292      expressions as intended by Sulzmann and Lu with the small addition of removing ``useless'' @{text ONE}s
       
  1293      in sequence regular
       
  1294      expressions. There is no point in applying the
  1270      @{text bsimp} function repeatedly (like the simplification in their paper which needs to be
  1295      @{text bsimp} function repeatedly (like the simplification in their paper which needs to be
  1271      applied until a fixpoint is reached) because we can show that @{term bsimp} is idempotent,
  1296      applied until a fixpoint is reached) because we can show that @{term bsimp} is idempotent,
  1272      that is
  1297      that is
  1273 
  1298 
  1274      \begin{proposition}
  1299      \begin{proposition}
  1302 
  1327 
  1303      \begin{center}
  1328      \begin{center}
  1304 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
  1329 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
  1305   $\textit{blexer}^+\;r\,s$ & $\dn$ &
  1330   $\textit{blexer}^+\;r\,s$ & $\dn$ &
  1306       $\textit{let}\;r_{der} = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\                
  1331       $\textit{let}\;r_{der} = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\                
  1307       & & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\
  1332       & & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$
  1308       & & $\;\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$ $\;\;\textit{else}\;\textit{None}$
  1333       $\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$ $\;\textit{else}\;\textit{None}$
  1309 \end{tabular}
  1334 \end{tabular}
  1310 \end{center}
  1335 \end{center}
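       \noindent
       The only difference to @{text blexer} is that @{text bsimp} is
       interleaved with each derivative step; schematically (a sketch,
       with the two functions passed as parameters to keep it
       self-contained):

         // derivatives w.r.t. a string, simplifying after each step
         def bdersSimp(bder: (Char, ARexp) => ARexp,
                       bsimp: ARexp => ARexp)
                      (r: ARexp, s: List[Char]): ARexp =
           s.foldLeft(r)((r1, c) => bsimp(bder(c, r1)))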
  1311 
  1336 
  1312        \noindent
  1337        \noindent
  1313        Note that in $\textit{blexer}^+$ the derivative $r_{der}$ is calculated
  1338        Note that in $\textit{blexer}^+$ the derivative $r_{der}$ is calculated
  1341      expression anymore. So unless one can somehow
  1366      expression anymore. So unless one can somehow
  1342      synchronise the change in the simplified regular expressions with
  1367      synchronise the change in the simplified regular expressions with
  1343      the original POSIX value, there is no hope of appealing to @{text retrieve} in the
  1368      the original POSIX value, there is no hope of appealing to @{text retrieve} in the
  1344      correctness argument for @{term blexer_simp}.
  1369      correctness argument for @{term blexer_simp}.
  1345 
  1370 
  1346      We found it more helpful to introduce the rewriting systems shown in
  1371      For our proof we found it more helpful to introduce the rewriting systems shown in
  1347      Fig~\ref{SimpRewrites}. The idea is to generate 
  1372      Fig~\ref{SimpRewrites}. The idea is to generate 
  1348      simplified regular expressions in small steps (unlike the @{text bsimp}-function which
  1373      simplified regular expressions in small steps (unlike the @{text bsimp}-function which
  1349      does the same in a big step), and show that each of
  1374      does the same in a big step), and show that each of
  1350      the small steps preserves the bitcodes that lead to the POSIX value.
  1375      the small steps preserves the bitcodes that lead to the POSIX value.
  1351      The rewrite system is organised such that $\leadsto$ is for bitcoded regular
  1376      The rewrite system is organised such that $\leadsto$ is for bitcoded regular
   1509 where in (1) the set $\textit{Suf}(@{text "r"}_1, s)$ contains all the suffixes $s'$ of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable.
   1534 where in (1) the set $\textit{Suf}(@{text "r"}_1, s)$ contains all the suffixes $s'$ of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable.
  1510 In (3) we know that  $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is 
  1535 In (3) we know that  $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is 
  1511 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
  1536 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
  1512 than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands
  1537 than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands
  1513 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
  1538 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
  1514 We reason similarly for @{text STAR}.\smallskip
  1539 We reason similarly for @{text STAR} and @{text NT}.\smallskip
  1515 
  1540 
  1516 
  1541 
   1517 Clearly, the bound we give in this finiteness argument (Step (5)) is very loose and
   1542 Clearly, the bound we give in this finiteness argument (Step (5)) is very loose and
   1518 far from the actual bound we can expect. We can do better, but this does not improve
   1543 far from the actual bound we can expect. We can do better, but this does not improve
  1519 the finiteness property we are proving. If we are interested in a polynomial bound,
  1544 the finiteness property we are proving. If we are interested in a polynomial bound,
  1534 
  1559 
  1535 \noindent
  1560 \noindent
  1536 Essentially it matches the string with the longer @{text "Right"}-alternative in the
  1561 Essentially it matches the string with the longer @{text "Right"}-alternative in the
  1537 first sequence (and then the `rest' with the empty regular expression @{const ONE} from the second sequence). 
  1562 first sequence (and then the `rest' with the empty regular expression @{const ONE} from the second sequence). 
  1538 If we add the simplification above, then we obtain the following value
  1563 If we add the simplification above, then we obtain the following value
  1539 %
       
  1540 \[
       
  1541 @{term "Seq (Left (Char a)) (Left (Char b))"}
  1564 @{term "Seq (Left (Char a)) (Left (Char b))"}
  1542 \]
       
  1543 
       
  1544 \noindent
       
  1545 where the @{text "Left"}-alternatives get priority. However, this violates
  1565 where the @{text "Left"}-alternatives get priority. However, this violates
  1546 the POSIX rules and we have not been able to
  1566 the POSIX rules and we have not been able to
   1547 resolve this problem. Therefore we leave better bounds for future work.%!\\[-6.5mm]
   1567 resolve this problem. Therefore we leave better bounds for future work.%!\\[-6.5mm]
  1548 
  1568 
  1549 Note that while Antimirov was able to give a bound on the \emph{size}
  1569 Note that while Antimirov was able to give a bound on the \emph{size}
  1550 of his partial derivatives~\cite{Antimirov95}, Brzozowski gave a bound
  1570 of his partial derivatives~\cite{Antimirov95}, Brzozowski gave a bound
  1551 on the \emph{number} of derivatives, provided they are quotient via
  1571 on the \emph{number} of derivatives, provided they are quotient via
   1552 ACI rules \cite{Brzozowski1964}. Brzozowski's result is crucial when one
   1572 ACI rules \cite{Brzozowski1964}. Brzozowski's result is crucial when one
  1553 uses derivatives for obtaining an automaton (it essentially bounds
  1573 uses his derivatives for obtaining a DFA (it essentially bounds
  1554 the number of states). However, this result does \emph{not}
  1574 the number of states). However, this result does \emph{not}
  1555 transfer to our setting where we are interested in the \emph{size} of the
  1575 transfer to our setting where we are interested in the \emph{size} of the
  1556 derivatives. For example it is not true for our derivatives that the
  1576 derivatives. For example it is \emph{not} true for our derivatives that the
  1557 set of of derivatives $r \backslash s$ for a given $r$ and all strings
  1577 set of derivatives $r \backslash s$ for a given $r$ and all strings
  1558 $s$ is finite (even when simplified). This is because for our annotated regular expressions
  1578 $s$ is finite (even when simplified). This is because for our annotated regular expressions
  1559 the bitcode annotation is dependent on the number of iterations that
  1579 the bitcode annotation is dependent on the number of iterations that
  1560 are needed for STAR-regular expressions. This is not a problem for us: Since we intend to do lexing
  1580 are needed for @{text STAR}-regular expressions. This is not a problem for us: Since we intend to do lexing
  1561 by calculating (as fast as possible) derivatives, the bound on the size
  1581 by calculating (as fast as possible) derivatives, the bound on the size
  1562 of the derivatives is important, not the number of derivatives.
  1582 of the derivatives is important, not their number. % of derivatives.
  1563 
  1583 
  1564 
  1584 
  1565 *}
  1585 *}
  1566 
  1586 
  1567 
  1587 
  1575    the correctness of the first algorithm
  1595    the correctness of the first algorithm
  1576    \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
  1596    \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
  1577    introduce our own specification about what POSIX values are,
  1597    introduce our own specification about what POSIX values are,
  1578    because the informal definition given by Sulzmann and Lu did not
  1598    because the informal definition given by Sulzmann and Lu did not
  1579    stand up to formal proof. Also for the second algorithm we needed
  1599    stand up to formal proof. Also for the second algorithm we needed
  1580    to introduce our own definitions and proof ideas in order to establish the
  1600    to introduce our own definitions and proof ideas in order to
  1581    correctness.  Our interest in the second algorithm 
  1601    establish the correctness.  Our interest in the second algorithm
  1582    lies in the fact that by using bitcoded regular expressions and an aggressive
  1602    lies in the fact that by using bitcoded regular expressions and an
  1583    simplification method there is a chance that the derivatives
  1603    aggressive simplification method there is a chance that the
  1584    can be kept universally small  (we established in this paper that
  1604    derivatives can be kept universally small (we established in this
  1585    for a given $r$ they can be kept finitely bounded for all strings).
  1605    paper that for a given $r$ they can be kept finitely bounded for
       
  1606    all strings).  Our formalisation is approximately 7500 lines of
       
  1607    Isabelle code. A little more than half of this code concerns the finiteness bound
       
  1608    obtained in Section 5. This slight ``bloat'' in the latter part is because
       
  1609    we had to introduce an intermediate datatype for annotated regular expressions and repeat many
       
  1610    definitions for this intermediate datatype. But overall this made our
       
  1611    formalisation work smoother.
  1586    %This is important if one is after
  1612    %This is important if one is after
  1587    %an efficient POSIX lexing algorithm based on derivatives.
  1613    %an efficient POSIX lexing algorithm based on derivatives.
  1588 
  1614 
   1589    Having proved the correctness of the POSIX lexing algorithm, what
   1615    Having proved the correctness of the POSIX lexing algorithm, what
  1590    lessons have we learned? Well, we feel this is a very good example
  1616    lessons have we learned? Well, we feel this is a very good example
  1595    obscure, examples.
  1621    obscure, examples.
  1596    %We found that from an implementation
  1622    %We found that from an implementation
  1597    %point-of-view it is really important to have the formal proofs of
  1623    %point-of-view it is really important to have the formal proofs of
  1598    %the corresponding properties at hand.
  1624    %the corresponding properties at hand.
  1599 
  1625 
  1600    With the results reported here, we can of course only make a claim about the correctness
  1626    With the results reported here, we can of course only make a claim
  1601    of the algorithm and the sizes of the
  1627    about the correctness of the algorithm and the sizes of the
  1602    derivatives, not about the efficiency or runtime of our version of
  1628    derivatives, not about the efficiency or runtime of our version of
  1603    Sulzman and Lu's algorithm. But we found the size is an important
  1629    Sulzmann and Lu's algorithm. But we found the size is an important
  1604    first indicator about efficiency: clearly if the derivatives can
  1630    first indicator about efficiency: clearly if the derivatives can
  1605    grow to arbitrarily big sizes and the algorithm needs to traverse
  1631    grow to arbitrarily big sizes and the algorithm needs to traverse
  1606    the derivatives possibly several times, then the algorithm will be
  1632    the derivatives possibly several times, then the algorithm will be
  1607    slow---excruciatingly slow that is. Other works seems to make
  1633    slow---excruciatingly slow that is. Other works seem to make
  1608    stronger claims, but during our work we have developed a healthy
  1634    stronger claims, but during our formalisation work we have
   1609    suspicion when, for example, experimental data is used to back up
   1635    developed a healthy suspicion when, for example, experimental data is
  1610    efficiency claims. For instance Sulzmann and Lu write about their
  1636    used to back up efficiency claims. For instance Sulzmann and Lu
  1611    equivalent of @{term blexer_simp} \textit{``...we can incrementally
  1637    write about their equivalent of @{term blexer_simp} \textit{``...we
  1612    compute bitcoded parse trees in linear time in the size of the
  1638    can incrementally compute bitcoded parse trees in linear time in
  1613    input''} \cite[Page 14]{Sulzmann2014}.  Given the growth of the
  1639    the size of the input''} \cite[Page 14]{Sulzmann2014}.  Given the
  1614    derivatives in some cases even after aggressive simplification,
  1640    growth of the derivatives in some cases even after aggressive
   1615    this is a hard-to-believe claim. A similar claim about a
   1641    simplification, this is a hard-to-believe claim. A similar claim
  1616    theoretical runtime of @{text "O(n\<^sup>2)"} is made for the
  1642    about a theoretical runtime of @{text "O(n\<^sup>2)"} for one
  1617    Verbatim lexer, which calculates tokens according to POSIX
  1643    specific list of regular expressions is made for the Verbatim
       
  1644    lexer, which calculates tokens according to POSIX
  1618    rules~\cite{verbatim}. For this, Verbatim uses Brzozowski's
  1645    rules~\cite{verbatim}. For this, Verbatim uses Brzozowski's
  1619    derivatives like in our work.  The authors write: \textit{``The
  1646    derivatives like in our work.  About their empirical data, the authors write:
  1620    results of our empirical tests [..] confirm that Verbatim has
  1647    \textit{``The results of our empirical tests [..] confirm that
  1621    @{text "O(n\<^sup>2)"} time complexity.''}
  1648    Verbatim has @{text "O(n\<^sup>2)"} time complexity.''}
  1622    \cite[Section~VII]{verbatim}.  While their correctness proof for
  1649    \cite[Section~VII]{verbatim}.  While their correctness proof for
  1623    Verbatim is formalised in Coq, the claim about the runtime
  1650    Verbatim is formalised in Coq, the claim about the runtime
  1624    complexity is only supported by some emperical evidence obtained by
  1651    complexity is only supported by some empirical evidence obtained by
  1625    using the code extraction facilities of Coq.  Given our observation
  1652    using the code extraction facilities of Coq.  Given our observation
   1626    about the ``growth problem'' of derivatives, we tried out their
   1653    about the ``growth problem'' of derivatives, this runtime bound
  1627    extracted OCaml code with the example \mbox{@{text "(a +
  1654    is unlikely to hold universally: indeed we tried out their extracted OCaml
  1628    aa)\<^sup>*"}} as a single lexing rule, and it took for us around 5
  1655    code with the example \mbox{@{text "(a + aa)\<^sup>*"}} as a single
  1629    minutes to tokenise a string of 40 $a$'s and that increased to
  1656    lexing rule, and it took for us around 5 minutes to tokenise a
  1630    approximately 19 minutes when the string is 50 $a$'s long. Taking
  1657    string of 40 $a$'s and that increased to approximately 19 minutes
  1631    into account that derivatives are not simplified in the Verbatim
  1658    when the string is 50 $a$'s long. Taking into account that
  1632    lexer, such numbers are not surprising.  Clearly our result of
  1659    derivatives are not simplified in the Verbatim lexer, such numbers
  1633    having finite derivatives might sound rather weak in this context
  1660    are not surprising.  Clearly our result of having finite
  1634    but we think such effeciency claims really require further
  1661    derivatives might sound rather weak in this context but we think
  1635    scrutiny.
  1662    such efficiency claims really require further scrutiny.  The
  1636 
  1663    contribution of this paper is to make sure derivatives do not grow
  1637    The contribution of this paper is to make sure derivatives do not
  1664    arbitrarily big (universally). In the example \mbox{@{text "(a +
  1638    grow arbitrarily big (universially) In the example \mbox{@{text "(a
  1665    aa)\<^sup>*"}}, \emph{all} derivatives have a size of 17 or
  1639    + aa)\<^sup>*"}}, \emph{all} derivatives have a size of 17 or
       
  1640    less. The result is that lexing a string of, say, 50\,000 a's with
  1666    less. The result is that lexing a string of, say, 50\,000 a's with
  1641    the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes
  1667    the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes
  1642    approximately 10 seconds with our Scala implementation of the
  1668    approximately 10 seconds with our Scala implementation of the
  1643    presented algorithm.
  1669    presented algorithm.
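   As a hypothetical micro-benchmark with the naive matcher from the
   first sketch (no simplification), one can see the contrast directly:

     val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
     matches(r, List.fill(10)('a'))    // fine: evaluates to true
     // matches(r, List.fill(40)('a')) is already infeasible, since
     // unsimplified derivatives grow roughly exponentially for this
     // regular expression, whereas with the simplifications above
     // every derivative stays at size 17 or less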
  1644 
  1670 
  1661    derivatives is never greater than 5 in this example. Even in the
  1687    derivatives is never greater than 5 in this example. Even in the
  1662    example from Section 2, where Rust raises an error message, namely
  1688    example from Section 2, where Rust raises an error message, namely
  1663    \mbox{@{text "a\<^bsup>{1000}{100}{5}\<^esup>"}}, the maximum size for
  1689    \mbox{@{text "a\<^bsup>{1000}{100}{5}\<^esup>"}}, the maximum size for
  1664    our derivatives is a moderate 14.
  1690    our derivatives is a moderate 14.
  1665 
  1691 
       
  1692    Let us also return to the example @{text
       
  1693    "a\<^bsup>{0}{4294967295}\<^esup>"} which until recently Rust
       
  1694    deemed acceptable. But this was due to a bug. It turns out that it took Rust
       
  1695    more than 11 minutes to generate an automaton for this regular
       
  1696    expression and then to determine that a string of just one(!)
       
  1697    @{text a} does \emph{not} match this regular expression.  Therefore
       
  1698    it is probably a wise choice that in newer versions of Rust's
       
  1699    regular expression library such regular expressions are declared as
       
  1700    ``too big''. While this is clearly
       
   1701    a contrived example, the safety guarantees Rust wants to provide necessitate
       
  1702    this conservative approach.
       
  1703    However, with the derivatives and simplifications we have shown
       
  1704    in this paper, the example can be solved with ease:
       
  1705    it essentially only involves operations on
       
  1706    integers and our Scala implementation takes only a few seconds to
       
  1707    find out that this string, or even much larger strings, do not match.
       
  1708 
  1666    Let us also compare our work to the verified Verbatim++ lexer where
  1709    Let us also compare our work to the verified Verbatim++ lexer where
  1667    the authors of the Verbatim lexer introduced a number of
  1710    the authors of the Verbatim lexer introduced a number of
  1668    improvements and optimisations, for example memoisation
  1711    improvements and optimisations, for example memoisation
  1669    \cite{verbatimpp}. However, unlike Verbatim, which works with
  1712    \cite{verbatimpp}. However, unlike Verbatim, which works with
  1670    derivatives like in our work, Verbatim++ compiles first a regular
  1713    derivatives like in our work, Verbatim++ compiles first a regular
  1679    translation using the traditional notion of DFAs so that we can improve on this. Therefore we
  1722    translation using the traditional notion of DFAs so that we can improve on this. Therefore we
  1680    prefer to stick with calculating derivatives, but attempt to make
  1723    prefer to stick with calculating derivatives, but attempt to make
   1681    this calculation (in the future) as fast as possible. What we can guarantee
   1724    this calculation (in the future) as fast as possible. What we can guarantee
  1682    with the presented work is that the maximum size of the derivatives
  1725    with the presented work is that the maximum size of the derivatives
  1683    for \mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}$\,\cdot\,$@{term "STAR a"}} is never bigger than 9. This means our Scala
  1726    for \mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}$\,\cdot\,$@{term "STAR a"}} is never bigger than 9. This means our Scala
  1684    implementation only needs a few seconds for this example and matching 50\,000 a's.
  1727    implementation again only needs a few seconds for this example and matching 50\,000 a's, say.
  1685    %
  1728    %
  1686    %
  1729    %
  1687    %Possible ideas are
  1730    %Possible ideas are
  1688    %zippers which have been used in the context of parsing of
  1731    %zippers which have been used in the context of parsing of
  1689    %context-free grammars \cite{zipperparser}.
  1732    %context-free grammars \cite{zipperparser}.
  1690    \medskip
  1733    \\[-4mm]  %\smallskip
  1691 
  1734    
  1692    \noindent
  1735    \noindent
  1693    Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
  1736    Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
  1694    %\\[-10mm]
  1737    \\[-10mm]
  1695 
  1738 
  1696 
  1739 
  1697 %%\bibliographystyle{plain}
  1740 %%\bibliographystyle{plain}
  1698 \bibliography{root}
  1741 \bibliography{root}
  1699 *}
  1742 *}