lexing: comparison thys3/Paper.thy

equal deleted inserted replaced

-:d5e9bcb384ec
+:8881a09a06fd
 ``aggressive'' simplification method that keeps the size of the
 derivatives finitely bounded no matter what the length of the string is.
 They make some informal claims about the correctness and linear behaviour
 of this version, but do not provide any supporting proof arguments, not
 even ``pencil-and-paper'' arguments. They write about their bitcoded
-\emph{incremental parsing method} (that is the algorithm to be formalised
+\emph{incremental parsing method} (that is the algorithm to be fixed and formalised
 in this paper):
 %
 \begin{quote}\it
 ``Correctness Claim: We further claim that the incremental parsing
 method [..] in combination with the simplification steps [..]
 yields POSIX parse trees. We have tested this claim
 extensively [..] but yet
 have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
 \end{quote}
-\noindent{}\textbf{Contributions:} We have implemented in Isabelle/HOL
+\noindent{}\textbf{Contributions:} We fill this gap by implementing in Isabelle/HOL
-the derivative-based lexing algorithm of Sulzmann and Lu
+our version of the derivative-based lexing algorithm of Sulzmann and Lu
 \cite{Sulzmann2014} where regular expressions are annotated with
 bitsequences. We define the crucial simplification function as a
 recursive function, without the need of a fixpoint operation. One objective of
 the simplification function is to remove duplicates of regular
 expressions.  For this Sulzmann and Lu use in their paper the standard
 reason is that in the bitcoded setting, each copy generally has a
 different bitcode annotation---so @{text nub} would never ``fire''.
 Inspired by Scala's library for lists, we shall instead use a @{text
 distinctWith} function that finds duplicates under an ``erasing'' function
 which deletes bitcodes before comparing regular expressions.
-We shall also introduce our \emph{own} argument and definitions for
+We shall also introduce our \emph{own} arguments and definitions for
 establishing the correctness of the bitcoded algorithm when
 simplifications are included. Finally we
 establish that the size of derivatives can be finitely bounded.\medskip%\footnote{ In this paper, we shall first briefly introduce the basic notions
 %of regular expressions and describe the definition
 %of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves
 $r\backslash c$ for the derivative of the regular expression $r$
 w.r.t.~the character $c$. Both functions are defined by recursion over
 regular expressions.
 %
 \begin{center}
-\begin{tabular}{c @ {\hspace{-1mm}}c}
+\begin{tabular}{c @ {\hspace{-9mm}}c}
 \begin{tabular}{r@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
 @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
 @{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
 @{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
 @{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\
 \end{tabular}
 \end{center}
 \noindent
 We can extend this definition to give derivatives w.r.t.~strings,
-namely @{thm (lhs) ders.simps(1)}$\dn$@{thm (rhs) ders.simps(1)}
+namely as @{thm (lhs) ders.simps(1)}$\dn$@{thm (rhs) ders.simps(1)}
 and @{thm (lhs) ders.simps(2)}$\dn$@{thm (rhs) ders.simps(2)}.
 Using @{text nullable} and the derivative operation, we can
 define a simple regular expression matcher, namely
-$@{text "match s r"} \;\dn\; @{term nullable}(r\backslash s)$.
+$@{text "match s r"} \dn @{term nullable}(r\backslash s)$.
 This is essentially Brzozowski's algorithm from 1964. Its
 main virtue is that the algorithm can be easily implemented as a
 functional program (either in a functional programming language or in
 a theorem prover). The correctness of @{text match} amounts to
 establishing the property:%\footnote{It is a fun exercise to formally prove this property in a theorem prover.}
 Also Ribeiro and Du Bois give one in Agda~\cite{RibeiroAgda2017}.
 The novel idea of Sulzmann and Lu is to extend this algorithm for
 lexing, where it is important to find out which part of the string
 is matched by which part of the regular expression.
-For this Sulzmann and Lu presented two lexing algorithms in their paper
+For this Sulzmann and Lu presented two lexing algorithms in their
-\cite{Sulzmann2014}. The first algorithm consists of two phases: first a
+paper~\cite{Sulzmann2014}. The first algorithm consists of two phases: first a
 matching phase (which is Brzozowski's algorithm) and then a value
 construction phase. The values encode \emph{how} a regular expression
 matches a string. \emph{Values} are defined as the inductive datatype
 %
 \begin{center}
 Sulzmann and Lu also define inductively an
 inhabitation relation that associates values to regular expressions. Our
 version of this relation is defined the following six rules for the values:
 %
 \begin{center}
-\begin{tabular}{@ {}c@ {}}
+\begin{tabular}{@ {}l@ {}}
-@{thm[mode=Axiom] Prf.intros(4)}\qquad
+@{thm[mode=Axiom] Prf.intros(4)}\quad
+@{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\quad
+@{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>2" "r\<^sub>1"]}\quad
+@{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\
 @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\qquad
-@{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\qquad
-@{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>2" "r\<^sub>1"]}\qquad
-@{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\
 @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\qquad
 $\mprset{flushleft}\inferrule{
 @{thm (prem 1) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}\\\\
-@{thm (prem 2) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}\quad
+@{thm (prem 2) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}\\\\
 @{thm (prem 3) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}
 }
 {@{thm (concl) Prf.intros(7)[of "vs\<^sub>1" "r"  "vs\<^sub>2" "n"]}}
 $
 \end{tabular}
 @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\
 @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\
 @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\
 @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\
 @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\
-\end{tabular}\medskip\\
+\end{tabular}
+\medskip\\
-\begin{tabular}{@ {}cc@ {}}
+%!\begin{tabular}{@ {}cc@ {}}
 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}}
 @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\
 @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ &
 @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\
 @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ &
 @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
+%!
+@{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$
+& @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
+@{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$
+& @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
+@{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ &
+@{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\
+%!
 @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$
 & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\
 @{thm (lhs) injval.simps(8)[of "r" "n" "c" "v" "vs"]} & $\dn$
 & @{thm (rhs) injval.simps(8)[of "r" "n" "c" "v" "vs"]}
 \end{tabular}
-&
+%!&
-\begin{tabular}{@ {\hspace{-3mm}}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}}
+%!\begin{tabular}{@ {\hspace{-3mm}}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}}
-@{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$
-& @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
+%!\end{tabular}
-@{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$
+%!\end{tabular}
-& @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\
+\end{tabular}%!\smallskip
-@{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$\\
-\multicolumn{3}{@ {}r@ {}}{@{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}}\\ \mbox{}\\ \mbox{}
-\end{tabular}
-\end{tabular}
-\end{tabular}\smallskip
 \end{center}
 \noindent
 The function @{text mkeps} is run when the last derivative is nullable, that is
 the string to be matched is in the language of the regular expression. It generates
 \begin{center}
 \begin{tabular}{lcl}
 @{thm (lhs) lexer.simps(1)} & $\dn$ & @{thm (rhs) lexer.simps(1)}\\
 @{thm (lhs) lexer.simps(2)} & $\dn$ & @{text "case"} @{term "lexer (der c r) s"} @{text of}
 @{term "None"}  @{text "\<Rightarrow>"} @{term None}\\
-& & \hspace{24mm}$|\;$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}
+& & \hspace{27mm}$|\;$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"}
 \end{tabular}
 \end{center}
 We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that
 \noindent
 With this in place we were able to prove:
 \begin{proposition}\mbox{}\label{lexercorrect}
 \textrm{(1)} @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}.\\
-\mbox{\hspace{29mm}}\textrm{(2)}\; @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}.
+\mbox{\hspace{23.5mm}}\textrm{(2)}\; @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}.
 %
 % \smallskip\\
 %\begin{tabular}{ll}
 %(1) & @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}\\
 %(2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\
 In the second part of their paper \cite{Sulzmann2014},
 Sulzmann and Lu describe another algorithm that also generates POSIX
 values but dispenses with the second phase where characters are
 injected ``back'' into values. For this they annotate bitcodes to
-regular expressions, which we define in Isabelle/HOL as the datatype
+regular expressions, which we define in Isabelle/HOL as the datatype\medskip
-\begin{center}
+%!\begin{center}
-@{term breg} $\;::=\;$ @{term "AZERO"}
+\noindent
-$\;\mid\;$ @{term "AONE bs"}
+\begin{minipage}{1.01\textwidth}
-$\;\mid\;$ @{term "ACHAR bs c"}
+\,@{term breg} $\,::=\,$ @{term "AZERO"}
-$\;\mid\;$ @{term "AALTs bs rs"}
+$\,\mid$ @{term "AONE bs"}
-$\;\mid\;$ @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}
+$\,\mid$ @{term "ACHAR bs c"}
-$\;\mid\;$ @{term "ASTAR bs r"}
+$\,\mid$ @{term "AALTs bs rs"}
-	       $\;\mid\;$ @{term "ANTIMES bs r n"}
+$\,\mid$ @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}
-\end{center}
+$\,\mid$ @{term "ASTAR bs r"}
+	       $\,\mid$ \mbox{@{term "ANTIMES bs r n"}}\hfill\mbox{}
+\end{minipage}\medskip
+%!\end{center}
 \noindent where @{text bs} stands for bitsequences; @{text r},
 @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular
 expressions; and @{text rs} for lists of bitcoded regular
 expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"}
 Sulzmann and Lu follow Nielsen and Henglein \cite{NielsenHenglein2011}
 and define a coding
 function for how values can be coded into bitsequences.
 \begin{center}
-\begin{tabular}{cc}
+\begin{tabular}{c@ {\hspace{5mm}}c}
 \begin{tabular}{lcl}
 @{thm (lhs) code.simps(1)} & $\dn$ & @{thm (rhs) code.simps(1)}\\
 @{thm (lhs) code.simps(2)} & $\dn$ & @{thm (rhs) code.simps(2)}\\
 @{thm (lhs) code.simps(3)} & $\dn$ & @{thm (rhs) code.simps(3)}\\
 @{thm (lhs) code.simps(4)} & $\dn$ & @{thm (rhs) code.simps(4)}
 @{text S} followed by some bitsequence. Similarly, we use @{text Z} to indicate
 if there is still a value coming in the list of @{text Stars}, whereas @{text S}
 indicates the end of the list. The lossiness makes the process of
 decoding a bit more involved, but the point is that if we have a
 regular expression \emph{and} a bitsequence of a corresponding value,
-then we can always decode the value accurately (see Fig.~\ref{decode}).
+then we can always decode the value accurately~(see Fig.~\ref{decode}).
 The function \textit{decode} checks whether all of the bitsequence is
 consumed and returns the corresponding value as @{term "Some v"}; otherwise
 it fails with @{text "None"}. We can establish that for a value $v$
 inhabited by a regular expression $r$, the decoding of its
 bitsequence never fails (see also \cite{NielsenHenglein2011}).
 %The decoding can be
 %defined by using two functions called $\textit{decode}'$ and
 %\textit{decode}:
-\begin{figure}
+\begin{figure}[t]
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
 $\textit{decode}'\,bs\,(\ONE)$ & $\;\dn\;$ & $(\Empty, bs)$\\
 $\textit{decode}'\,bs\,(c)$ & $\;\dn\;$ & $(\Char\,c, bs)$\\
 $\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
 $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
 $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ &
 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
 & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$
 \hspace{2mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\\
-$\textit{decode}'\,bs\,(r^{\{n\}})$ & $\dn$ & $\textit{decode}'\,bs\,r^*$\smallskip\medskip\\
+$\textit{decode}'\,bs\,(r^{\{n\}})$ & $\dn$ & $\textit{decode}'\,bs\,r^*$\smallskip\smallskip\\
 $\textit{decode}\,bs\,r$ & $\dn$ &
 $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
-& & $\;\;\;\,\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;\textit{else}\;\textit{None}$\\[-4mm]
+& & $\;\;\;\,\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;\textit{else}\;\textit{None}$\\[-5mm]
 \end{tabular}
 \end{center}
 \caption{Two functions, called $\textit{decode}'$ and \textit{decode}, for decoding a value from a bitsequence with the help of a regular expression.\\[-5mm]}\label{decode}
 \end{figure}
 Sulzmann and Lu define the function \emph{internalise}
 in order to transform (standard) regular expressions into annotated
 regular expressions. We write this operation as $r^\uparrow$.
 This internalisation uses the following
-\emph{fuse} function.
+\emph{fuse} function.\medskip
 %
-\begin{center}
+%!\begin{center}
-\begin{tabular}{@ {}cc@ {}}
+\noindent\begin{minipage}{\textwidth}
+\begin{tabular}{@ {}c@ {\hspace{3mm}}c@ {}}
 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
 $\textit{fuse}\,bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
 $\textit{fuse}\,bs\,(\textit{ONE}\,bs')$ & $\dn$ &
 $\textit{ONE}\,(bs\,@\,bs')$\\
 $\textit{fuse}\,bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ &
 $\textit{STAR}\,(bs\,@\,bs')\,r$\\
 $\textit{fuse}\,bs\,(\textit{NT}\,bs'\,r\,n)$ & $\dn$ &
 $\textit{NT}\,(bs\,@\,bs')\,r\,n$
 \end{tabular}
 \end{tabular}
-\end{center}
+\end{minipage}\medskip
+%!\end{center}
 \noindent
 A regular expression can then be \emph{internalised} into a bitcoded
 regular expression as follows:
 %
+%!\begin{center}
+%!\begin{tabular}{@ {}c@ {\hspace{2mm}}c@ {\hspace{2mm}}c@ {}}
+%!\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+%!$(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
+%!$(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$
+%!\end{tabular}
+%!&
+%!\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+%!$(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
+%!$(r^*)^\uparrow$ & $\dn$ & $\textit{STAR}\;[]\,r^\uparrow$\\
+%!$(r^{\{n\}})^\uparrow$ & $\dn$ &
+%!      $\textit{NT}\;[]\,r^\uparrow\,n$
+%!\end{tabular}
+%!&
+%!\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+%!$(r_1 + r_2)^\uparrow$ & $\dn$ &
+%!       $\textit{ALT}\;[]\,(\textit{fuse}\,[\Z]\,r_1^\uparrow)\,
+%!                          (\textit{fuse}\,[\S]\,r_2^\uparrow)$\\
+%!$(r_1\cdot r_2)^\uparrow$ & $\dn$ &
+%!       $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$
+%!\end{tabular}
+%!\end{tabular}
+%!\end{center}
+%!
 \begin{center}
-\begin{tabular}{@ {}ccc@ {}}
+\begin{tabular}{@ {}c@ {\hspace{7mm}}c@ {}}
 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
 $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
-$(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$
+$(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
-\end{tabular}
-&
-\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
 $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
 $(r^*)^\uparrow$ & $\dn$ & $\textit{STAR}\;[]\,r^\uparrow$\\
-$(r^{\{n\}})^\uparrow$ & $\dn$ &
-$\textit{NT}\;[]\,r^\uparrow\,n$
 \end{tabular}
 &
 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
 $(r_1 + r_2)^\uparrow$ & $\dn$ &
 $\textit{ALT}\;[]\,(\textit{fuse}\,[\Z]\,r_1^\uparrow)\,
 (\textit{fuse}\,[\S]\,r_2^\uparrow)$\\
 $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
-$\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$
+$\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\
+$(r^{\{n\}})^\uparrow$ & $\dn$ &
+$\textit{NT}\;[]\,r^\uparrow\,n$
 \end{tabular}
 \end{tabular}
 \end{center}
 \noindent
 the functions \textit{bnullable} and \textit{bmkeps}(\textit{s}), which are the
 ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on
 bitcoded regular expressions.
 %
 \begin{center}
-\begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{0mm}}c@ {}}
+\begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{6mm}}c@ {}}
 \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l}
 $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{False}$\\
 $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{True}$\\
 $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{False}$\\
-$\textit{bnullable}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &
+$\textit{bnullable}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &\\
-$\exists\, r \in \rs. \,\textit{bnullable}\,r$\\
+\multicolumn{3}{l}{$\quad\exists\, r \in \rs. \,\textit{bnullable}\,r$}\\
-$\textit{bnullable}\,(\textit{SEQ}\,bs\,r_1\,r_2)$ & $\dn$ &
+$\textit{bnullable}\,(\textit{SEQ}\,bs\,r_1\,r_2)$ & $\dn$ &\\
-$\textit{bnullable}\,r_1\wedge \textit{bnullable}\,r_2$\\
+\multicolumn{3}{l}{$\quad\textit{bnullable}\,r_1\wedge \textit{bnullable}\,r_2$}\\
 $\textit{bnullable}\,(\textit{STAR}\,bs\,r)$ & $\dn$ &
 $\textit{True}$\\
 $\textit{bnullable}\,(\textit{NT}\,bs\,r\,n)$ & $\dn$ &\\
-\multicolumn{3}{l}{$\textit{if}\;n = 0\;\textit{then}\;\textit{True}\;
+\multicolumn{3}{l}{$\quad\textit{if}\;n = 0\;\textit{then}\;\textit{True}\;
 \textit{else}\;\textit{bnullable}\,r$}\\
 \end{tabular}
 &
 \begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l@ {}}
 $\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$\\
 $\textit{bmkeps}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &
 $bs\,@\,\textit{bmkepss}\,\rs$\\
 $\textit{bmkeps}\,(\textit{SEQ}\,bs\,r_1\,r_2)$ & $\dn$ &\\
-\multicolumn{3}{l}{$bs \,@\,\textit{bmkeps}\,r_1\,@\, \textit{bmkeps}\,r_2$}\\
+\multicolumn{3}{l}{$\quad{}bs \,@\,\textit{bmkeps}\,r_1\,@\, \textit{bmkeps}\,r_2$}\\
 $\textit{bmkeps}\,(\textit{STAR}\,bs\,r)$ & $\dn$ &
 $bs \,@\, [\S]$\\
 $\textit{bmkeps}\,(\textit{NT}\,bs\,r\,n)$ & $\dn$ &\\
-\multicolumn{3}{l@ {}}{$\textit{if}\;n=0\;\textit{then}\;bs \,@\, [\S]$}\\
+\multicolumn{3}{l@ {}}{$\quad\textit{if}\;n=0\;\textit{then}\;bs \,@\, [\S]$}\\
-\multicolumn{3}{l@ {}}{$\textit{else}\,bs \,@\, [\Z] \,@\, \textit{bmkeps}\,r\,@\,
+\multicolumn{3}{l@ {}}{$\quad\textit{else}\,bs \,@\, [\Z] \,@\, \textit{bmkeps}\,r\,@\,
 \textit{bmkeps}\,(@{term "ANTIMES [] r (n - 1)"})$}\\
 $\textit{bmkepss}\,(r\!::\!\rs)$ & $\dn$ &\\
-\multicolumn{3}{l}{$\textit{if}\;\textit{bnullable}\,r\,\textit{then}\;\textit{bmkeps}\,r\;
+\multicolumn{3}{l}{$\quad\textit{if}\;\textit{bnullable}\,r\,\textit{then}\;\textit{bmkeps}\,r\;
 \textit{else}\;\textit{bmkepss}\,\rs$}
 \end{tabular}
 \end{tabular}
 \end{center}
 bitcoded regular expression. This derivative function calculates the
 derivative but at the same time also the incremental part of the bitsequences
 that contribute to constructing a POSIX value.
 %
 \begin{center}
-\begin{tabular}{@ {}lcl@ {}}
+\begin{tabular}{@ {}l@ {\hspace{-1mm}}cl@ {}}
 $(\textit{ZERO})\backslash c$ & $\;\dn\;$ & $\textit{ZERO}$\\
 $(\textit{ONE}\;bs)\backslash c$ & $\;\dn\;$ & $\textit{ZERO}$\\
 $(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ &
 $\textit{if}\;c=d\; \;\textit{then}\;
 \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
 can define Sulzmann and Lu's bitcoded lexer, which we call \textit{blexer}:
 %
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
 $\textit{blexer}\;r\,s$ & $\dn$ &
-$\textit{let}\;r_{der} = (r^\uparrow)\backslash s\;\textit{in}$
+$\textit{let}\;r_{der} = (r^\uparrow)\backslash s\;\textit{in}$\\
-$\;\;\textit{if}\; \textit{bnullable}(r_{der}) \;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r
+& & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\
-\;\;\textit{else}\;\textit{None}$
+& & $\;\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$\\
+& & $\;\;\;\textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
 \noindent
 This bitcoded lexer first internalises the regular expression $r$ and then
 is the same as if we first erase the bitcoded regular expression and
 then perform the ``standard'' derivative operation.
 \begin{lemma}\label{bnullable}%\mbox{}\smallskip\\
 \textit{(1)} $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\
-\mbox{\hspace{22mm}}\textit{(2)} $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
+\mbox{\hspace{17mm}}\textit{(2)} $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
-\mbox{\hspace{22mm}}\textit{(3)} $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$
+\mbox{\hspace{17mm}}\textit{(3)} $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$
 %\begin{tabular}{ll}
 %\textit{(1)} & $(r\backslash s)^\downarrow = (r^\downarrow)\backslash s$\\
 %\textit{(2)} & $\textit{bnullable}(r)$ iff $\textit{nullable}(r^\downarrow)$\\
 %\textit{(3)} & $\textit{bmkeps}(r) = \textit{retrieve}\,r\,(\textit{mkeps}\,(r^\downarrow))$ provided $\textit{nullable}(r^\downarrow)$.
 %\end{tabular}
 that $\textit{flex}$ assembles. The lemma establishes that this is the same
 value as if we build the annotated derivative $r^\uparrow\backslash s$
 and then retrieve the corresponding bitcoded version, followed by a
 decoding step.
-\begin{lemma}[Main Lemma]\label{mainlemma}\it
+\begin{lemma}[Main Lemma]\label{mainlemma}\mbox{}\\\it
 If $\vdash v : r\backslash s$ then
 $\textit{Some}\,(\textit{flex}\,r\,\textit{id}\,s\,v) =
 \textit{decode}(\textit{retrieve}\,(r^\uparrow \backslash s)\,v)\,r$
 \end{lemma}
 spawn new copies in later derivative steps and they in turn even more copies. This
 destroys any hope of taming the size of the derivatives.  But the
 second copy of $r$ in \eqref{derivex} will never contribute to a
 value, because POSIX lexing will always prefer matching a string
 with the first copy. So it could be safely removed without affecting the correctness of the algorithm.
-The dilemma with the simple-minded
+The issue with the simple-minded
 simplification rules above is that the rule $r + r \Rightarrow r$
 will never be applicable because as can be seen in this example the
 regular expressions are not next to each other but separated by another regular expression.
 But here is where Sulzmann and Lu's representation of generalised
 acc"}} where @{text eq} is an user-defined equivalence relation that
 compares two elements in @{text rs}. We define this function in
 Isabelle/HOL as
 \begin{center}
-\begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}}
+\begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
 @{thm (lhs) distinctWith.simps(1)} & $\dn$ & @{thm (rhs) distinctWith.simps(1)}\\
 @{thm (lhs) distinctWith.simps(2)} & $\dn$ &
-@{text "if (\<exists> y \<in> acc. eq x y)"} & @{text "then distinctWith xs eq acc"}\\
+@{text "if (\<exists> y \<in> acc. eq x y)"} \\
-& & & @{text "else x :: distinctWith xs eq ({x} \<union> acc)"}
+& & @{text "then distinctWith xs eq acc"}\\
+& & @{text "else x :: distinctWith xs eq ({x} \<union> acc)"}
 \end{tabular}
 \end{center}
 \noindent where we scan the list from left to right (because we
 have to remove later copies). In @{text distinctWith}, @{text eq} is intended to be an
 \begin{center}
 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
 \multicolumn{3}{@ {}c}{@{thm (lhs) flts.simps(1)} $\dn$ @{thm (rhs) flts.simps(1)} \qquad\qquad\qquad\qquad
 @{thm (lhs) flts.simps(2)} $\dn$ @{thm (rhs) flts.simps(2)}}\smallskip\\
 @{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\
+\multicolumn{3}{@ {}c}{@{text "flts (r :: rs)"} $\dn$ @{text "r :: flts rs"}
+\hspace{5mm}(otherwise)}
 \end{tabular}
 \end{center}
 \noindent
 The second clause of @{text flts} removes all instances of @{text ZERO} in alternatives and
 empty or a singleton list. We take care of those cases in the
 @{text "bsimpALTs"} function; similarly we define a helper function that simplifies
 sequences according to the usual rules about @{text ZERO}s and @{text ONE}s:
 \begin{center}
-\begin{tabular}{c@ {\hspace{5mm}}c}
+\begin{tabular}{@ {}c@ {\hspace{4mm}}c@ {}}
-\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+\begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
 @{text "bsimpALTs bs []"}  & $\dn$ & @{text "ZERO"}\\
 @{text "bsimpALTs bs [r]"} & $\dn$ & @{text "fuse bs r"}\\
 @{text "bsimpALTs bs rs"}  & $\dn$ & @{text "ALTs bs rs"}\\
 \mbox{}\\
 \end{tabular}
 &
-\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+\begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
 @{text "bsimpSEQ bs _ ZERO"}  & $\dn$ & @{text "ZERO"}\\
 @{text "bsimpSEQ bs ZERO _"} & $\dn$ & @{text "ZERO"}\\
 @{text "bsimpSEQ bs\<^sub>1 (ONE bs\<^sub>2) r\<^sub>2"}
 & $\dn$ & @{text "fuse (bs\<^sub>1 @ bs\<^sub>2) r\<^sub>2"}\\
 @{text "bsimpSEQ bs r\<^sub>1 r\<^sub>2"} & $\dn$ &  @{text "SEQ bs r\<^sub>1 r\<^sub>2"}
 and use it in the improved lexing algorithm defined as
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
 $\textit{blexer}^+\;r\,s$ & $\dn$ &
-$\textit{let}\;r_{der} = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$
+$\textit{let}\;r_{der} = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\
-$\;\;\textit{if}\; \textit{bnullable}(r_{der}) \;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r
+& & $\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\
-\;\;\textit{else}\;\textit{None}$
+& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$\\
+& & $\;\;\textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
 \noindent The remaining task is to show that @{term blexer} and
 @{term "blexer_simp"} generate the same answers.
 expressions to be removed provided a regular expression earlier in the list can
 match the same strings.}\label{SimpRewrites}
 \end{figure}
 *}
-section {* Finiteness of Derivatives *}
+section {* Finite Bound for the Size of Derivatives *}
 text {*
 In this section let us sketch our argument for why the size of the simplified
 derivatives with the aggressive simplification function can be finitely bounded. Suppose
 Note that the bound $N$ is a bound for \emph{all} strings, no matter how long they are.
 We establish this bound by induction on $r$. The base cases for @{term AZERO},
 @{term "AONE bs"} and @{term "ACHAR bs c"} are straightforward. The interesting case is
 for sequences of the form @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}. In this case our induction
 hypotheses state $\exists N_1. \forall s. \; \llbracket{}@{term "bders_simp r\<^sub>1 s"}\rrbracket \leq N_1$ and
-$\exists N_2. \forall s. \; \llbracket@{term "bders_simp r\<^sub>2 s"}\rrbracket \leq N_2$. We can reason as follows
+$\exists N_2. \forall s. \; \llbracket@{term "bders_simp r\<^sub>2 s"}\rrbracket \leq N_2$. We can reason as follows\medskip
 %
-\begin{center}
+%!\begin{center}
+\noindent\begin{minipage}{\textwidth}
 \begin{tabular}{lcll}
 & & $ \llbracket@{term "bders_simp (ASEQ bs r\<^sub>1 r\<^sub>2) s"}\rrbracket$\\
 & $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;((@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}) ::
-[@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in \textit{Suffix}(@{text r}_1, s)]))\rrbracket $ & (1) \\
+[@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in \textit{Suf}(@{text r}_1, s)]))\rrbracket $ & (1) \\
 & $\leq$ &
 $\llbracket\textit{distinctWith}\,((@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}) ::
-[@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in \textit{Suffix}(@{text r}_1, s)])\,\approx\;@{term "{}"}\rrbracket + 1 $ & (2) \\
+[@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in \textit{Suf}(@{text r}_1, s)])\,\approx\;@{term "{}"}\rrbracket + 1 $ & (2) \\
 & $\leq$ & $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket +
-\llbracket\textit{distinctWith}\,[@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in \textit{Suffix}(@{text r}_1, s)]\,\approx\;@{term "{}"}\rrbracket + 1 $ & (3) \\
+\llbracket\textit{distinctWith}\,[@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in \textit{Suf}(@{text r}_1, s)]\,\approx\;@{term "{}"}\rrbracket + 1 $ & (3) \\
 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
-\llbracket\textit{distinctWith}\,[@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in \textit{Suffix}(@{text r}_1, s)]\,\approx\;@{term "{}"}\rrbracket$ & (4)\\
+\llbracket\textit{distinctWith}\,[@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in \textit{Suf}(@{text r}_1, s)]\,\approx\;@{term "{}"}\rrbracket$ & (4)\\
 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
 \end{tabular}
-\end{center}
+\end{minipage}\medskip
+%!\end{center}
 % tell Chengsong about Indian paper of closed forms of derivatives
 \noindent
-where in (1) the $\textit{Suffix}(@{text "r"}_1, s)$ are all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$).
+where in (1) the $\textit{Suf}(@{text "r"}_1, s)$ are all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$).
 In (3) we know that  $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is
 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
 than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands
 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
 We reason similarly for @{text STAR}.\smallskip
 \]
 \noindent
 where the @{text "Left"}-alternatives get priority. However, this violates
 the POSIX rules and we have not been able to
-reconcile this problem. Therefore we leave better bounds for future work.\\[-6.5mm]
+reconcile this problem. Therefore we leave better bounds for future work.%!\\[-6.5mm]
+Note that while Antimirov was able to give a bound on the \emph{size}
+of his partial derivatives~\cite{Antimirov95}, Brzozowski gave a bound
+on the \emph{number} of derivatives, provided they are quotient via
+ACI rules \cite{Brzozowski1964}. Brozozowski's result is crucial when one
+uses derivatives for obtaining an automaton (it essentially bounds
+the number of states). However, this result does \emph{not}
+transfer to our setting where we are interested in the \emph{size} of the
+derivatives. For example it is not true for our derivatives that the
+set of of derivatives $r \backslash s$ for a given $r$ and all strings
+$s$ is finite. This is because for our annotated regular expressions
+the bitcode annotation is dependent on the number of iterations that
+are needed for STAR-regular expressions. Since we want to do lexing
+by calculating (as fast as possible) derivatives, the bound on the size
+of the derivatives is important, not the number of derivatives.
 *}
 section {* Conclusion *}
 %\\[-10mm]
 %%\bibliographystyle{plain}
 \bibliography{root}
+*}
+(*<*)
+text {*
 \newpage
 \appendix
 \section{Some Proofs}
 By straightforward mutual induction on the definition of $\leadsto$ and $\stackrel{s}{\leadsto}$.
 The case for $LD$ holds because @{term "L (erase (bder c r\<^sub>2)) \<subseteq> L (erase (bder c r\<^sub>1))"}
 if and only if @{term "L (erase (r\<^sub>2)) \<subseteq> L (erase (r\<^sub>1))"}.
 \end{proof}
 *}
+(*>*)
 (*<*)
 end
 (*>*)

changeset 615	8881a09a06fd
parent 599	a5f666410101
child 616	8907d4b6316d