--- a/thys2/Paper/Paper.thy	Sat Feb 05 18:24:37 2022 +0000
+++ b/thys2/Paper/Paper.thy	Sun Feb 06 00:02:04 2022 +0000
@@ -247,6 +247,7 @@
 
 text {*
 
+  In the second part of their paper \cite{Sulzmann2014},
   Sulzmann and Lu describe another algorithm that generates POSIX
   values but dispences with the second phase where characters are
   injected ``back'' into values. For this they annotate bitcodes to
@@ -254,8 +255,7 @@
 
   \begin{center}
   \begin{tabular}{lcl}
-  @{term breg} & $::=$ & @{term "AZERO"}\\
-               & $\mid$ & @{term "AONE bs"}\\
+  @{term breg} & $::=$ & @{term "AZERO"} $\quad\mid\quad$ @{term "AONE bs"}\\
                & $\mid$ & @{term "ACHAR bs c"}\\
                & $\mid$ & @{term "AALTs bs rs"}\\
                & $\mid$ & @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}\\
@@ -264,46 +264,243 @@
   \end{center}
 
   \noindent where @{text bs} stands for a bitsequences; @{text r},
-  @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for annotated regular
-  expressions; and @{text rs} for a list of annotated regular
-  expressions.  In contrast to Sulzmann and Lu we generalise the
-  alternative regular expressions to lists, instead of just having
-  binary regular expressions.  The idea with annotated regular
-  expressions is to incrementally generate the value information by
-  recording bitsequences. Sulzmann and Lu then  
-  define a coding function for how values can be coded into bitsequences.
+  @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular
+  expressions; and @{text rs} for lists of bitcoded regular
+  expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"}
+  is just an abbreviation for @{text "ALTs bs [r\<^sub>1, r\<^sub>2]"}. 
+  For bitsequences we just use lists made up of the
+  constants @{text Z} and @{text S}.  The idea with bitcoded regular
+  expressions is to incrementally generate the value information (for
+  example @{text Left} and @{text Right}) as bitsequences
+  as part of the regular expression constructors. 
+  Sulzmann and Lu then define a coding
+  function for how values can be coded into bitsequences.
 
   \begin{center}
+  \begin{tabular}{cc}
   \begin{tabular}{lcl}
   @{thm (lhs) code.simps(1)} & $\dn$ & @{thm (rhs) code.simps(1)}\\
   @{thm (lhs) code.simps(2)} & $\dn$ & @{thm (rhs) code.simps(2)}\\
   @{thm (lhs) code.simps(3)} & $\dn$ & @{thm (rhs) code.simps(3)}\\
-  @{thm (lhs) code.simps(4)} & $\dn$ & @{thm (rhs) code.simps(4)}\\
+  @{thm (lhs) code.simps(4)} & $\dn$ & @{thm (rhs) code.simps(4)}
+  \end{tabular}
+  &
+  \begin{tabular}{lcl}
   @{thm (lhs) code.simps(5)[of "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) code.simps(5)[of "v\<^sub>1" "v\<^sub>2"]}\\
   @{thm (lhs) code.simps(6)} & $\dn$ & @{thm (rhs) code.simps(6)}\\
-  @{thm (lhs) code.simps(7)} & $\dn$ & @{thm (rhs) code.simps(7)}
+  @{thm (lhs) code.simps(7)} & $\dn$ & @{thm (rhs) code.simps(7)}\\
+  \mbox{\phantom{XX}}\\
+  \end{tabular}
   \end{tabular}
   \end{center}
+   
+  \noindent
+  As can be seen, this coding is ``lossy'' in the sense that we do not
+  record explicitly character values and also not sequence values (for
+  them we just append two bitsequences). We do, however, record the
+  different alternatives for @{text Left}, respectively @{text Right}, as @{text Z} and
+  @{text S} followed by some bitsequence. Similarly, we use @{text Z} to indicate
+  if there is still a value coming in the list of @{text Stars}, whereas @{text S}
+  indicates the end of the list. The lossiness makes the process of
+  decoding a bit more involved, but the point is that if we have a
+  regular expression \emph{and} a bitsequence of a corresponding value,
+  then we can always decode the value accurately. The decoding can be
+  defined by using two functions called $\textit{decode}'$ and
+  \textit{decode}:
 
+  \begin{center}
+  \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
+  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
+  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
+  $\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
+     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
+       (\Left\,v, bs_1)$\\
+  $\textit{decode}'\,(\S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
+     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
+       (\Right\,v, bs_1)$\\                           
+  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
+        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
+  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$
+        \hspace{2mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
+  $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
+  $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & 
+         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
+  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$
+        \hspace{2mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
+  $\textit{decode}\,bs\,r$ & $\dn$ &
+     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
+  & & \hspace{7mm}$\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
+       \textit{else}\;\textit{None}$   
+  \end{tabular}    
+  \end{center}
+
+  \noindent
+  The function \textit{decode} checks whether all of the bitsequence is
+  consumed and returns the corresponding value as @{term "Some v"}; otherwise
+  it fails with @{text "None"}. We can establish that for a value $v$
+  inhabited by a regular expression $r$, the decoding of its
+  bitsequence never fails.
+
+\begin{lemma}\label{codedecode}\it
+  If $\;\vdash v : r$ then
+  $\;\textit{decode}\,(\textit{code}\, v)\,r = \textit{Some}\, v$.
+\end{lemma}
+
+\begin{proof}
+  This follows from the property that
+  $\textit{decode}'\,((\textit{code}\,v) \,@\, bs)\,r = (v, bs)$ holds
+  for any bit-sequence $bs$ and $\vdash v : r$. This property can be
+  easily proved by induction on $\vdash v : r$.
+\end{proof}  
+
+  Sulzmann and Lu define the function \emph{internalise}
+  in order to transform standard regular expressions into annotated
+  regular expressions. We write this operation as $r^\uparrow$.
+  This internalisation uses the following
+  \emph{fuse} function.	     
+
+  \begin{center}
+  \begin{tabular}{lcl}
+  $\textit{fuse}\,bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
+  $\textit{fuse}\,bs\,(\textit{ONE}\,bs')$ & $\dn$ &
+     $\textit{ONE}\,(bs\,@\,bs')$\\
+  $\textit{fuse}\,bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ &
+     $\textit{CHAR}\,(bs\,@\,bs')\,c$\\
+  $\textit{fuse}\,bs\,(\textit{ALTs}\,bs'\,rs)$ & $\dn$ &
+     $\textit{ALTs}\,(bs\,@\,bs')\,rs$\\
+  $\textit{fuse}\,bs\,(\textit{SEQ}\,bs'\,r_1\,r_2)$ & $\dn$ &
+     $\textit{SEQ}\,(bs\,@\,bs')\,r_1\,r_2$\\
+  $\textit{fuse}\,bs\,(\textit{STAR}\,bs'\,r)$ & $\dn$ &
+     $\textit{STAR}\,(bs\,@\,bs')\,r$
+  \end{tabular}    
+  \end{center}    
+
+  \noindent
+  A regular expression can then be \emph{internalised} into a bitcoded
+  regular expression as follows.
+
+  \begin{center}
+  \begin{tabular}{lcl}
+  $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
+  $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
+  $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
+  $(r_1 + r_2)^\uparrow$ & $\dn$ &
+         $\textit{ALT}\;[]\,(\textit{fuse}\,[\Z]\,r_1^\uparrow)\,
+                            (\textit{fuse}\,[\S]\,r_2^\uparrow)$\\
+  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
+         $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\
+  $(r^*)^\uparrow$ & $\dn$ &
+         $\textit{STAR}\;[]\,r^\uparrow$\\
+  \end{tabular}    
+  \end{center}    
+
+  \noindent
+  There is also an \emph{erase}-function, written $a^\downarrow$, which
+  transforms a bitcoded regular expression into a (standard) regular
+  expression by just erasing the annotated bitsequences. We omit the
+  straightforward definition. For defining the algorithm, we also need
+  the functions \textit{bnullable} and \textit{bmkeps}, which are the
+  ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on
+  bitcoded regular expressions, instead of regular expressions.
+
+  \begin{center}
+  \begin{tabular}{lcl}
+  $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{false}$ \textbf{fix}\\
+  $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{true}$\\
+  $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{false}$\\
+  $\textit{bnullable}\,(\textit{ALT}\,bs\,a_1\,a_2)$ & $\dn$ &
+     $\textit{bnullable}\,a_1\vee \textit{bnullable}\,a_2$\\
+  $\textit{bnullable}\,(\textit{SEQ}\,bs\,a_1\,a_2)$ & $\dn$ &
+     $\textit{bnullable}\,a_1\wedge \textit{bnullable}\,a_2$\\
+  $\textit{bnullable}\,(\textit{STAR}\,bs\,a)$ & $\dn$ &
+     $\textit{true}$
+  \end{tabular}    
+  \end{center}    
+
+  \begin{center}
+  \begin{tabular}{lcl}
+  $\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$ \textbf{fix}\\
+  $\textit{bmkeps}\,(\textit{ALT}\,bs\,a_1\,a_2)$ & $\dn$ &
+     $\textit{if}\;\textit{bnullable}\,a_1$\\
+  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a_1$\\
+  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,a_2$\\
+  $\textit{bmkeps}\,(\textit{SEQ}\,bs\,a_1\,a_2)$ & $\dn$ &
+     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
+  $\textit{bmkeps}\,(\textit{STAR}\,bs\,a)$ & $\dn$ &
+     $bs \,@\, [\S]$
+  \end{tabular}    
+  \end{center}    
+ 
+
+  \noindent
+  The key function in the bitcoded algorithm is the derivative of an
+  annotated regular expression. This derivative calculates the
+  derivative but at the same time also the incremental part that
+  contributes to constructing a value.	
+
+  \begin{center}
+  \begin{tabular}{@ {}lcl@ {}}
+  $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$ \textbf{fix}\\  
+  $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
+  $(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ &
+        $\textit{if}\;c=d\; \;\textit{then}\;
+         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
+  $(\textit{ALT}\;bs\,a_1\,a_2)\backslash c$ & $\dn$ &
+        $\textit{ALT}\,bs\,(a_1\backslash c)\,(a_2\backslash c)$\\
+  $(\textit{SEQ}\;bs\,a_1\,a_2)\backslash c$ & $\dn$ &
+     $\textit{if}\;\textit{bnullable}\,a_1$\\
+  & &$\textit{then}\;\textit{ALT}\,bs\,(\textit{SEQ}\,[]\,(a_1\backslash c)\,a_2)$\\
+  & &$\phantom{\textit{then}\;\textit{ALT}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\backslash c))$\\
+  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\backslash c)\,a_2$\\
+  $(\textit{STAR}\,bs\,a)\backslash c$ & $\dn$ &
+      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\backslash c))\,
+       (\textit{STAR}\,[]\,r)$
+  \end{tabular}    
+  \end{center}
+
+
+  \noindent
+  This function can also be extended to strings, written $a\backslash s$,
+  just like the standard derivative.  We omit the details. Finally we
+  can define Sulzmann and Lu's bitcoded lexer, which we call \textit{blexer}: 
+
+  \noindent
+This bitcoded lexer first internalises the regular expression $r$ and then
+builds the annotated derivative according to $s$. If the derivative is
+nullable, then it extracts the bitcoded value using the
+$\textit{bmkeps}$ function. Finally it decodes the bitcoded value.  If
+the derivative is \emph{not} nullable, then $\textit{None}$ is
+returned. The task is to show that this way of calculating a value
+generates the same result as with \textit{lexer}.
+
+Before we can proceed we need to define a function, called
+\textit{retrieve}, which Sulzmann and Lu introduced for the proof.
+
+\textbf{fix}
+
+\noindent
+The idea behind this function is to retrieve a possibly partial
+bitcode from an annotated regular expression, where the retrieval is
+guided by a value.  For example if the value is $\Left$ then we
+descend into the left-hand side of an alternative (annotated) regular
+expression in order to assemble the bitcode. Similarly for
+$\Right$. The property we can show is that for a given $v$ and $r$
+with $\vdash v : r$, the retrieved bitsequence from the internalised
+regular expression is equal to the bitcoded version of $v$.
+
+\begin{lemma}\label{retrievecode}
+If $\vdash v : r$ then $\textit{code}\, v = \textit{retrieve}\,(r^\uparrow)\,v$.
+\end{lemma}
+
+*}
+
+text {*
   There is also a corresponding decoding function that takes a bitsequence
   and generates back a value. However, since the bitsequences are a ``lossy''
   coding (@{term Seq}s are not coded) the decoding function depends also
   on a regular expression in order to decode values. 
 
-  \begin{center}
-  \begin{tabular}{lcll}
-  %@{thm (lhs) decode'.simps(1)} & $\dn$ & @{thm (rhs) decode'.simps(1)}\\
-  @{thm (lhs) decode'.simps(2)} & $\dn$ & @{thm (rhs) decode'.simps(2)}\\
-  @{thm (lhs) decode'.simps(3)} & $\dn$ & @{thm (rhs) decode'.simps(3)}\\
-  @{thm (lhs) decode'.simps(4)} & $\dn$ & @{thm (rhs) decode'.simps(4)}\\
-  @{thm (lhs) decode'.simps(5)} & $\dn$ & @{thm (rhs) decode'.simps(5)}\\
-  @{thm (lhs) decode'.simps(6)} & $\dn$ & @{thm (rhs) decode'.simps(6)}\\
-  @{thm (lhs) decode'.simps(7)} & $\dn$ & @{thm (rhs) decode'.simps(7)}\\
-  @{thm (lhs) decode'.simps(8)} & $\dn$ & @{thm (rhs) decode'.simps(8)}\\
-  @{thm (lhs) decode'.simps(9)} & $\dn$ & @{thm (rhs) decode'.simps(9)}\\
-  @{thm (lhs) decode'.simps(10)} & $\dn$ & @{thm (rhs) decode'.simps(10)} & fix
-  \end{tabular}
-  \end{center}
+
 
 
   The idea of the bitcodes is to annotate them to regular expressions and generate values
changeset 416	57182b36ec01
parent 410	9261d980225d
child 418	41a2a3b63853