thys2/Paper/Paper.thy
changeset 416 57182b36ec01
parent 410 9261d980225d
child 418 41a2a3b63853
equal deleted inserted replaced
415:5c96fe5306a7 416:57182b36ec01
   245 
   245 
   246 section {* Bitcoded Regular Expressions and Derivatives *}
   246 section {* Bitcoded Regular Expressions and Derivatives *}
   247 
   247 
   248 text {*
   248 text {*
   249 
   249 
       
   250   In the second part of their paper \cite{Sulzmann2014},
   250   Sulzmann and Lu describe another algorithm that generates POSIX
   251   Sulzmann and Lu describe another algorithm that generates POSIX
   251   values but dispences with the second phase where characters are
   252   values but dispences with the second phase where characters are
   252   injected ``back'' into values. For this they annotate bitcodes to
   253   injected ``back'' into values. For this they annotate bitcodes to
   253   regular expressions, which we define in Isabelle/HOL as the datatype
   254   regular expressions, which we define in Isabelle/HOL as the datatype
   254 
   255 
   255   \begin{center}
   256   \begin{center}
   256   \begin{tabular}{lcl}
   257   \begin{tabular}{lcl}
   257   @{term breg} & $::=$ & @{term "AZERO"}\\
   258   @{term breg} & $::=$ & @{term "AZERO"} $\quad\mid\quad$ @{term "AONE bs"}\\
   258                & $\mid$ & @{term "AONE bs"}\\
       
   259                & $\mid$ & @{term "ACHAR bs c"}\\
   259                & $\mid$ & @{term "ACHAR bs c"}\\
   260                & $\mid$ & @{term "AALTs bs rs"}\\
   260                & $\mid$ & @{term "AALTs bs rs"}\\
   261                & $\mid$ & @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}\\
   261                & $\mid$ & @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}\\
   262                & $\mid$ & @{term "ASTAR bs r"}
   262                & $\mid$ & @{term "ASTAR bs r"}
   263   \end{tabular}
   263   \end{tabular}
   264   \end{center}
   264   \end{center}
   265 
   265 
   266   \noindent where @{text bs} stands for a bitsequences; @{text r},
   266   \noindent where @{text bs} stands for a bitsequences; @{text r},
   267   @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for annotated regular
   267   @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular
   268   expressions; and @{text rs} for a list of annotated regular
   268   expressions; and @{text rs} for lists of bitcoded regular
   269   expressions.  In contrast to Sulzmann and Lu we generalise the
   269   expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"}
   270   alternative regular expressions to lists, instead of just having
   270   is just an abbreviation for @{text "ALTs bs [r\<^sub>1, r\<^sub>2]"}. 
   271   binary regular expressions.  The idea with annotated regular
   271   For bitsequences we just use lists made up of the
   272   expressions is to incrementally generate the value information by
   272   constants @{text Z} and @{text S}.  The idea with bitcoded regular
   273   recording bitsequences. Sulzmann and Lu then  
   273   expressions is to incrementally generate the value information (for
   274   define a coding function for how values can be coded into bitsequences.
   274   example @{text Left} and @{text Right}) as bitsequences
   275 
   275   as part of the regular expression constructors. 
   276   \begin{center}
   276   Sulzmann and Lu then define a coding
       
   277   function for how values can be coded into bitsequences.
       
   278 
       
   279   \begin{center}
       
   280   \begin{tabular}{cc}
   277   \begin{tabular}{lcl}
   281   \begin{tabular}{lcl}
   278   @{thm (lhs) code.simps(1)} & $\dn$ & @{thm (rhs) code.simps(1)}\\
   282   @{thm (lhs) code.simps(1)} & $\dn$ & @{thm (rhs) code.simps(1)}\\
   279   @{thm (lhs) code.simps(2)} & $\dn$ & @{thm (rhs) code.simps(2)}\\
   283   @{thm (lhs) code.simps(2)} & $\dn$ & @{thm (rhs) code.simps(2)}\\
   280   @{thm (lhs) code.simps(3)} & $\dn$ & @{thm (rhs) code.simps(3)}\\
   284   @{thm (lhs) code.simps(3)} & $\dn$ & @{thm (rhs) code.simps(3)}\\
   281   @{thm (lhs) code.simps(4)} & $\dn$ & @{thm (rhs) code.simps(4)}\\
   285   @{thm (lhs) code.simps(4)} & $\dn$ & @{thm (rhs) code.simps(4)}
       
   286   \end{tabular}
       
   287   &
       
   288   \begin{tabular}{lcl}
   282   @{thm (lhs) code.simps(5)[of "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) code.simps(5)[of "v\<^sub>1" "v\<^sub>2"]}\\
   289   @{thm (lhs) code.simps(5)[of "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) code.simps(5)[of "v\<^sub>1" "v\<^sub>2"]}\\
   283   @{thm (lhs) code.simps(6)} & $\dn$ & @{thm (rhs) code.simps(6)}\\
   290   @{thm (lhs) code.simps(6)} & $\dn$ & @{thm (rhs) code.simps(6)}\\
   284   @{thm (lhs) code.simps(7)} & $\dn$ & @{thm (rhs) code.simps(7)}
   291   @{thm (lhs) code.simps(7)} & $\dn$ & @{thm (rhs) code.simps(7)}\\
   285   \end{tabular}
   292   \mbox{\phantom{XX}}\\
   286   \end{center}
   293   \end{tabular}
   287 
   294   \end{tabular}
       
   295   \end{center}
       
   296    
       
   297   \noindent
       
   298   As can be seen, this coding is ``lossy'' in the sense that we do not
       
   299   record explicitly character values and also not sequence values (for
       
   300   them we just append two bitsequences). We do, however, record the
       
   301   different alternatives for @{text Left}, respectively @{text Right}, as @{text Z} and
       
   302   @{text S} followed by some bitsequence. Similarly, we use @{text Z} to indicate
       
   303   if there is still a value coming in the list of @{text Stars}, whereas @{text S}
       
   304   indicates the end of the list. The lossiness makes the process of
       
   305   decoding a bit more involved, but the point is that if we have a
       
   306   regular expression \emph{and} a bitsequence of a corresponding value,
       
   307   then we can always decode the value accurately. The decoding can be
       
   308   defined by using two functions called $\textit{decode}'$ and
       
   309   \textit{decode}:
       
   310 
       
   311   \begin{center}
       
   312   \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
       
   313   $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
       
   314   $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
       
   315   $\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
       
   316      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       
   317        (\Left\,v, bs_1)$\\
       
   318   $\textit{decode}'\,(\S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
       
   319      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       
   320        (\Right\,v, bs_1)$\\                           
       
   321   $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
       
   322         $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
       
   323   & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$
       
   324         \hspace{2mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
       
   325   $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
       
   326   $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & 
       
   327          $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
       
   328   & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$
       
   329         \hspace{2mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
       
   330   $\textit{decode}\,bs\,r$ & $\dn$ &
       
   331      $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
       
   332   & & \hspace{7mm}$\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       
   333        \textit{else}\;\textit{None}$   
       
   334   \end{tabular}    
       
   335   \end{center}
       
   336 
       
   337   \noindent
       
   338   The function \textit{decode} checks whether all of the bitsequence is
       
   339   consumed and returns the corresponding value as @{term "Some v"}; otherwise
       
   340   it fails with @{text "None"}. We can establish that for a value $v$
       
   341   inhabited by a regular expression $r$, the decoding of its
       
   342   bitsequence never fails.
       
   343 
       
   344 \begin{lemma}\label{codedecode}\it
       
   345   If $\;\vdash v : r$ then
       
   346   $\;\textit{decode}\,(\textit{code}\, v)\,r = \textit{Some}\, v$.
       
   347 \end{lemma}
       
   348 
       
   349 \begin{proof}
       
   350   This follows from the property that
       
   351   $\textit{decode}'\,((\textit{code}\,v) \,@\, bs)\,r = (v, bs)$ holds
       
   352   for any bit-sequence $bs$ and $\vdash v : r$. This property can be
       
   353   easily proved by induction on $\vdash v : r$.
       
   354 \end{proof}  
       
   355 
       
   356   Sulzmann and Lu define the function \emph{internalise}
       
   357   in order to transform standard regular expressions into annotated
       
   358   regular expressions. We write this operation as $r^\uparrow$.
       
   359   This internalisation uses the following
       
   360   \emph{fuse} function.	     
       
   361 
       
   362   \begin{center}
       
   363   \begin{tabular}{lcl}
       
   364   $\textit{fuse}\,bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
       
   365   $\textit{fuse}\,bs\,(\textit{ONE}\,bs')$ & $\dn$ &
       
   366      $\textit{ONE}\,(bs\,@\,bs')$\\
       
   367   $\textit{fuse}\,bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ &
       
   368      $\textit{CHAR}\,(bs\,@\,bs')\,c$\\
       
   369   $\textit{fuse}\,bs\,(\textit{ALTs}\,bs'\,rs)$ & $\dn$ &
       
   370      $\textit{ALTs}\,(bs\,@\,bs')\,rs$\\
       
   371   $\textit{fuse}\,bs\,(\textit{SEQ}\,bs'\,r_1\,r_2)$ & $\dn$ &
       
   372      $\textit{SEQ}\,(bs\,@\,bs')\,r_1\,r_2$\\
       
   373   $\textit{fuse}\,bs\,(\textit{STAR}\,bs'\,r)$ & $\dn$ &
       
   374      $\textit{STAR}\,(bs\,@\,bs')\,r$
       
   375   \end{tabular}    
       
   376   \end{center}    
       
   377 
       
   378   \noindent
       
   379   A regular expression can then be \emph{internalised} into a bitcoded
       
   380   regular expression as follows.
       
   381 
       
   382   \begin{center}
       
   383   \begin{tabular}{lcl}
       
   384   $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
       
   385   $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
       
   386   $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
       
   387   $(r_1 + r_2)^\uparrow$ & $\dn$ &
       
   388          $\textit{ALT}\;[]\,(\textit{fuse}\,[\Z]\,r_1^\uparrow)\,
       
   389                             (\textit{fuse}\,[\S]\,r_2^\uparrow)$\\
       
   390   $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
       
   391          $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\
       
   392   $(r^*)^\uparrow$ & $\dn$ &
       
   393          $\textit{STAR}\;[]\,r^\uparrow$\\
       
   394   \end{tabular}    
       
   395   \end{center}    
       
   396 
       
   397   \noindent
       
   398   There is also an \emph{erase}-function, written $a^\downarrow$, which
       
   399   transforms a bitcoded regular expression into a (standard) regular
       
   400   expression by just erasing the annotated bitsequences. We omit the
       
   401   straightforward definition. For defining the algorithm, we also need
       
   402   the functions \textit{bnullable} and \textit{bmkeps}, which are the
       
   403   ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on
       
   404   bitcoded regular expressions, instead of regular expressions.
       
   405 
       
   406   \begin{center}
       
   407   \begin{tabular}{lcl}
       
   408   $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{false}$ \textbf{fix}\\
       
   409   $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{true}$\\
       
   410   $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{false}$\\
       
   411   $\textit{bnullable}\,(\textit{ALT}\,bs\,a_1\,a_2)$ & $\dn$ &
       
   412      $\textit{bnullable}\,a_1\vee \textit{bnullable}\,a_2$\\
       
   413   $\textit{bnullable}\,(\textit{SEQ}\,bs\,a_1\,a_2)$ & $\dn$ &
       
   414      $\textit{bnullable}\,a_1\wedge \textit{bnullable}\,a_2$\\
       
   415   $\textit{bnullable}\,(\textit{STAR}\,bs\,a)$ & $\dn$ &
       
   416      $\textit{true}$
       
   417   \end{tabular}    
       
   418   \end{center}    
       
   419 
       
   420   \begin{center}
       
   421   \begin{tabular}{lcl}
       
   422   $\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$ \textbf{fix}\\
       
   423   $\textit{bmkeps}\,(\textit{ALT}\,bs\,a_1\,a_2)$ & $\dn$ &
       
   424      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
   425   & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a_1$\\
       
   426   & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,a_2$\\
       
   427   $\textit{bmkeps}\,(\textit{SEQ}\,bs\,a_1\,a_2)$ & $\dn$ &
       
   428      $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
       
   429   $\textit{bmkeps}\,(\textit{STAR}\,bs\,a)$ & $\dn$ &
       
   430      $bs \,@\, [\S]$
       
   431   \end{tabular}    
       
   432   \end{center}    
       
   433  
       
   434 
       
   435   \noindent
       
   436   The key function in the bitcoded algorithm is the derivative of an
       
   437   annotated regular expression. This derivative calculates the
       
   438   derivative but at the same time also the incremental part that
       
   439   contributes to constructing a value.	
       
   440 
       
   441   \begin{center}
       
   442   \begin{tabular}{@ {}lcl@ {}}
       
   443   $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$ \textbf{fix}\\  
       
   444   $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
   445   $(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ &
       
   446         $\textit{if}\;c=d\; \;\textit{then}\;
       
   447          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
       
   448   $(\textit{ALT}\;bs\,a_1\,a_2)\backslash c$ & $\dn$ &
       
   449         $\textit{ALT}\,bs\,(a_1\backslash c)\,(a_2\backslash c)$\\
       
   450   $(\textit{SEQ}\;bs\,a_1\,a_2)\backslash c$ & $\dn$ &
       
   451      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
   452   & &$\textit{then}\;\textit{ALT}\,bs\,(\textit{SEQ}\,[]\,(a_1\backslash c)\,a_2)$\\
       
   453   & &$\phantom{\textit{then}\;\textit{ALT}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\backslash c))$\\
       
   454   & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\backslash c)\,a_2$\\
       
   455   $(\textit{STAR}\,bs\,a)\backslash c$ & $\dn$ &
       
   456       $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\backslash c))\,
       
   457        (\textit{STAR}\,[]\,r)$
       
   458   \end{tabular}    
       
   459   \end{center}
       
   460 
       
   461 
       
   462   \noindent
       
   463   This function can also be extended to strings, written $a\backslash s$,
       
   464   just like the standard derivative.  We omit the details. Finally we
       
   465   can define Sulzmann and Lu's bitcoded lexer, which we call \textit{blexer}: 
       
   466 
       
   467   \noindent
       
   468 This bitcoded lexer first internalises the regular expression $r$ and then
       
   469 builds the annotated derivative according to $s$. If the derivative is
       
   470 nullable, then it extracts the bitcoded value using the
       
   471 $\textit{bmkeps}$ function. Finally it decodes the bitcoded value.  If
       
   472 the derivative is \emph{not} nullable, then $\textit{None}$ is
       
   473 returned. The task is to show that this way of calculating a value
       
   474 generates the same result as with \textit{lexer}.
       
   475 
       
   476 Before we can proceed we need to define a function, called
       
   477 \textit{retrieve}, which Sulzmann and Lu introduced for the proof.
       
   478 
       
   479 \textbf{fix}
       
   480 
       
   481 \noindent
       
   482 The idea behind this function is to retrieve a possibly partial
       
   483 bitcode from an annotated regular expression, where the retrieval is
       
   484 guided by a value.  For example if the value is $\Left$ then we
       
   485 descend into the left-hand side of an alternative (annotated) regular
       
   486 expression in order to assemble the bitcode. Similarly for
       
   487 $\Right$. The property we can show is that for a given $v$ and $r$
       
   488 with $\vdash v : r$, the retrieved bitsequence from the internalised
       
   489 regular expression is equal to the bitcoded version of $v$.
       
   490 
       
   491 \begin{lemma}\label{retrievecode}
       
   492 If $\vdash v : r$ then $\textit{code}\, v = \textit{retrieve}\,(r^\uparrow)\,v$.
       
   493 \end{lemma}
       
   494 
       
   495 *}
       
   496 
       
   497 text {*
   288   There is also a corresponding decoding function that takes a bitsequence
   498   There is also a corresponding decoding function that takes a bitsequence
   289   and generates back a value. However, since the bitsequences are a ``lossy''
   499   and generates back a value. However, since the bitsequences are a ``lossy''
   290   coding (@{term Seq}s are not coded) the decoding function depends also
   500   coding (@{term Seq}s are not coded) the decoding function depends also
   291   on a regular expression in order to decode values. 
   501   on a regular expression in order to decode values. 
   292 
   502 
   293   \begin{center}
   503 
   294   \begin{tabular}{lcll}
       
   295   %@{thm (lhs) decode'.simps(1)} & $\dn$ & @{thm (rhs) decode'.simps(1)}\\
       
   296   @{thm (lhs) decode'.simps(2)} & $\dn$ & @{thm (rhs) decode'.simps(2)}\\
       
   297   @{thm (lhs) decode'.simps(3)} & $\dn$ & @{thm (rhs) decode'.simps(3)}\\
       
   298   @{thm (lhs) decode'.simps(4)} & $\dn$ & @{thm (rhs) decode'.simps(4)}\\
       
   299   @{thm (lhs) decode'.simps(5)} & $\dn$ & @{thm (rhs) decode'.simps(5)}\\
       
   300   @{thm (lhs) decode'.simps(6)} & $\dn$ & @{thm (rhs) decode'.simps(6)}\\
       
   301   @{thm (lhs) decode'.simps(7)} & $\dn$ & @{thm (rhs) decode'.simps(7)}\\
       
   302   @{thm (lhs) decode'.simps(8)} & $\dn$ & @{thm (rhs) decode'.simps(8)}\\
       
   303   @{thm (lhs) decode'.simps(9)} & $\dn$ & @{thm (rhs) decode'.simps(9)}\\
       
   304   @{thm (lhs) decode'.simps(10)} & $\dn$ & @{thm (rhs) decode'.simps(10)} & fix
       
   305   \end{tabular}
       
   306   \end{center}
       
   307 
   504 
   308 
   505 
   309   The idea of the bitcodes is to annotate them to regular expressions and generate values
   506   The idea of the bitcodes is to annotate them to regular expressions and generate values
   310   incrementally. The bitcodes can be read off from the @{text breg} and then decoded into a value.
   507   incrementally. The bitcodes can be read off from the @{text breg} and then decoded into a value.
   311 
   508