diff -r 5c96fe5306a7 -r 57182b36ec01 thys2/Paper/Paper.thy --- a/thys2/Paper/Paper.thy Sat Feb 05 18:24:37 2022 +0000 +++ b/thys2/Paper/Paper.thy Sun Feb 06 00:02:04 2022 +0000 @@ -247,6 +247,7 @@ text {* + In the second part of their paper \cite{Sulzmann2014}, Sulzmann and Lu describe another algorithm that generates POSIX values but dispences with the second phase where characters are injected ``back'' into values. For this they annotate bitcodes to @@ -254,8 +255,7 @@ \begin{center} \begin{tabular}{lcl} - @{term breg} & $::=$ & @{term "AZERO"}\\ - & $\mid$ & @{term "AONE bs"}\\ + @{term breg} & $::=$ & @{term "AZERO"} $\quad\mid\quad$ @{term "AONE bs"}\\ & $\mid$ & @{term "ACHAR bs c"}\\ & $\mid$ & @{term "AALTs bs rs"}\\ & $\mid$ & @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}\\ @@ -264,46 +264,243 @@ \end{center} \noindent where @{text bs} stands for a bitsequences; @{text r}, - @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for annotated regular - expressions; and @{text rs} for a list of annotated regular - expressions. In contrast to Sulzmann and Lu we generalise the - alternative regular expressions to lists, instead of just having - binary regular expressions. The idea with annotated regular - expressions is to incrementally generate the value information by - recording bitsequences. Sulzmann and Lu then - define a coding function for how values can be coded into bitsequences. + @{text "r\<^sub>1"} and @{text "r\<^sub>2"} for bitcoded regular + expressions; and @{text rs} for lists of bitcoded regular + expressions. The binary alternative @{text "ALT bs r\<^sub>1 r\<^sub>2"} + is just an abbreviation for @{text "ALTs bs [r\<^sub>1, r\<^sub>2]"}. + For bitsequences we just use lists made up of the + constants @{text Z} and @{text S}. The idea with bitcoded regular + expressions is to incrementally generate the value information (for + example @{text Left} and @{text Right}) as bitsequences + as part of the regular expression constructors. + Sulzmann and Lu then define a coding + function for how values can be coded into bitsequences. \begin{center} + \begin{tabular}{cc} \begin{tabular}{lcl} @{thm (lhs) code.simps(1)} & $\dn$ & @{thm (rhs) code.simps(1)}\\ @{thm (lhs) code.simps(2)} & $\dn$ & @{thm (rhs) code.simps(2)}\\ @{thm (lhs) code.simps(3)} & $\dn$ & @{thm (rhs) code.simps(3)}\\ - @{thm (lhs) code.simps(4)} & $\dn$ & @{thm (rhs) code.simps(4)}\\ + @{thm (lhs) code.simps(4)} & $\dn$ & @{thm (rhs) code.simps(4)} + \end{tabular} + & + \begin{tabular}{lcl} @{thm (lhs) code.simps(5)[of "v\<^sub>1" "v\<^sub>2"]} & $\dn$ & @{thm (rhs) code.simps(5)[of "v\<^sub>1" "v\<^sub>2"]}\\ @{thm (lhs) code.simps(6)} & $\dn$ & @{thm (rhs) code.simps(6)}\\ - @{thm (lhs) code.simps(7)} & $\dn$ & @{thm (rhs) code.simps(7)} + @{thm (lhs) code.simps(7)} & $\dn$ & @{thm (rhs) code.simps(7)}\\ + \mbox{\phantom{XX}}\\ + \end{tabular} \end{tabular} \end{center} + + \noindent + As can be seen, this coding is ``lossy'' in the sense that we do not + record explicitly character values and also not sequence values (for + them we just append two bitsequences). We do, however, record the + different alternatives for @{text Left}, respectively @{text Right}, as @{text Z} and + @{text S} followed by some bitsequence. Similarly, we use @{text Z} to indicate + if there is still a value coming in the list of @{text Stars}, whereas @{text S} + indicates the end of the list. The lossiness makes the process of + decoding a bit more involved, but the point is that if we have a + regular expression \emph{and} a bitsequence of a corresponding value, + then we can always decode the value accurately. The decoding can be + defined by using two functions called $\textit{decode}'$ and + \textit{decode}: + \begin{center} + \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}} + $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ + $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ + $\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & + $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\; + (\Left\,v, bs_1)$\\ + $\textit{decode}'\,(\S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & + $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\; + (\Right\,v, bs_1)$\\ + $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ & + $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\ + & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$ + \hspace{2mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\ + $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\ + $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & + $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\ + & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$ + \hspace{2mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\ + $\textit{decode}\,bs\,r$ & $\dn$ & + $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\ + & & \hspace{7mm}$\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; + \textit{else}\;\textit{None}$ + \end{tabular} + \end{center} + + \noindent + The function \textit{decode} checks whether all of the bitsequence is + consumed and returns the corresponding value as @{term "Some v"}; otherwise + it fails with @{text "None"}. We can establish that for a value $v$ + inhabited by a regular expression $r$, the decoding of its + bitsequence never fails. + +\begin{lemma}\label{codedecode}\it + If $\;\vdash v : r$ then + $\;\textit{decode}\,(\textit{code}\, v)\,r = \textit{Some}\, v$. +\end{lemma} + +\begin{proof} + This follows from the property that + $\textit{decode}'\,((\textit{code}\,v) \,@\, bs)\,r = (v, bs)$ holds + for any bit-sequence $bs$ and $\vdash v : r$. This property can be + easily proved by induction on $\vdash v : r$. +\end{proof} + + Sulzmann and Lu define the function \emph{internalise} + in order to transform standard regular expressions into annotated + regular expressions. We write this operation as $r^\uparrow$. + This internalisation uses the following + \emph{fuse} function. + + \begin{center} + \begin{tabular}{lcl} + $\textit{fuse}\,bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\ + $\textit{fuse}\,bs\,(\textit{ONE}\,bs')$ & $\dn$ & + $\textit{ONE}\,(bs\,@\,bs')$\\ + $\textit{fuse}\,bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ & + $\textit{CHAR}\,(bs\,@\,bs')\,c$\\ + $\textit{fuse}\,bs\,(\textit{ALTs}\,bs'\,rs)$ & $\dn$ & + $\textit{ALTs}\,(bs\,@\,bs')\,rs$\\ + $\textit{fuse}\,bs\,(\textit{SEQ}\,bs'\,r_1\,r_2)$ & $\dn$ & + $\textit{SEQ}\,(bs\,@\,bs')\,r_1\,r_2$\\ + $\textit{fuse}\,bs\,(\textit{STAR}\,bs'\,r)$ & $\dn$ & + $\textit{STAR}\,(bs\,@\,bs')\,r$ + \end{tabular} + \end{center} + + \noindent + A regular expression can then be \emph{internalised} into a bitcoded + regular expression as follows. + + \begin{center} + \begin{tabular}{lcl} + $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\ + $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\ + $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\ + $(r_1 + r_2)^\uparrow$ & $\dn$ & + $\textit{ALT}\;[]\,(\textit{fuse}\,[\Z]\,r_1^\uparrow)\, + (\textit{fuse}\,[\S]\,r_2^\uparrow)$\\ + $(r_1\cdot r_2)^\uparrow$ & $\dn$ & + $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\ + $(r^*)^\uparrow$ & $\dn$ & + $\textit{STAR}\;[]\,r^\uparrow$\\ + \end{tabular} + \end{center} + + \noindent + There is also an \emph{erase}-function, written $a^\downarrow$, which + transforms a bitcoded regular expression into a (standard) regular + expression by just erasing the annotated bitsequences. We omit the + straightforward definition. For defining the algorithm, we also need + the functions \textit{bnullable} and \textit{bmkeps}, which are the + ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on + bitcoded regular expressions, instead of regular expressions. + + \begin{center} + \begin{tabular}{lcl} + $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{false}$ \textbf{fix}\\ + $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{true}$\\ + $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{false}$\\ + $\textit{bnullable}\,(\textit{ALT}\,bs\,a_1\,a_2)$ & $\dn$ & + $\textit{bnullable}\,a_1\vee \textit{bnullable}\,a_2$\\ + $\textit{bnullable}\,(\textit{SEQ}\,bs\,a_1\,a_2)$ & $\dn$ & + $\textit{bnullable}\,a_1\wedge \textit{bnullable}\,a_2$\\ + $\textit{bnullable}\,(\textit{STAR}\,bs\,a)$ & $\dn$ & + $\textit{true}$ + \end{tabular} + \end{center} + + \begin{center} + \begin{tabular}{lcl} + $\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$ \textbf{fix}\\ + $\textit{bmkeps}\,(\textit{ALT}\,bs\,a_1\,a_2)$ & $\dn$ & + $\textit{if}\;\textit{bnullable}\,a_1$\\ + & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a_1$\\ + & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,a_2$\\ + $\textit{bmkeps}\,(\textit{SEQ}\,bs\,a_1\,a_2)$ & $\dn$ & + $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\ + $\textit{bmkeps}\,(\textit{STAR}\,bs\,a)$ & $\dn$ & + $bs \,@\, [\S]$ + \end{tabular} + \end{center} + + + \noindent + The key function in the bitcoded algorithm is the derivative of an + annotated regular expression. This derivative calculates the + derivative but at the same time also the incremental part that + contributes to constructing a value. + + \begin{center} + \begin{tabular}{@ {}lcl@ {}} + $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$ \textbf{fix}\\ + $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\ + $(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ & + $\textit{if}\;c=d\; \;\textit{then}\; + \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\ + $(\textit{ALT}\;bs\,a_1\,a_2)\backslash c$ & $\dn$ & + $\textit{ALT}\,bs\,(a_1\backslash c)\,(a_2\backslash c)$\\ + $(\textit{SEQ}\;bs\,a_1\,a_2)\backslash c$ & $\dn$ & + $\textit{if}\;\textit{bnullable}\,a_1$\\ + & &$\textit{then}\;\textit{ALT}\,bs\,(\textit{SEQ}\,[]\,(a_1\backslash c)\,a_2)$\\ + & &$\phantom{\textit{then}\;\textit{ALT}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\backslash c))$\\ + & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\backslash c)\,a_2$\\ + $(\textit{STAR}\,bs\,a)\backslash c$ & $\dn$ & + $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\backslash c))\, + (\textit{STAR}\,[]\,r)$ + \end{tabular} + \end{center} + + + \noindent + This function can also be extended to strings, written $a\backslash s$, + just like the standard derivative. We omit the details. Finally we + can define Sulzmann and Lu's bitcoded lexer, which we call \textit{blexer}: + + \noindent +This bitcoded lexer first internalises the regular expression $r$ and then +builds the annotated derivative according to $s$. If the derivative is +nullable, then it extracts the bitcoded value using the +$\textit{bmkeps}$ function. Finally it decodes the bitcoded value. If +the derivative is \emph{not} nullable, then $\textit{None}$ is +returned. The task is to show that this way of calculating a value +generates the same result as with \textit{lexer}. + +Before we can proceed we need to define a function, called +\textit{retrieve}, which Sulzmann and Lu introduced for the proof. + +\textbf{fix} + +\noindent +The idea behind this function is to retrieve a possibly partial +bitcode from an annotated regular expression, where the retrieval is +guided by a value. For example if the value is $\Left$ then we +descend into the left-hand side of an alternative (annotated) regular +expression in order to assemble the bitcode. Similarly for +$\Right$. The property we can show is that for a given $v$ and $r$ +with $\vdash v : r$, the retrieved bitsequence from the internalised +regular expression is equal to the bitcoded version of $v$. + +\begin{lemma}\label{retrievecode} +If $\vdash v : r$ then $\textit{code}\, v = \textit{retrieve}\,(r^\uparrow)\,v$. +\end{lemma} + +*} + +text {* There is also a corresponding decoding function that takes a bitsequence and generates back a value. However, since the bitsequences are a ``lossy'' coding (@{term Seq}s are not coded) the decoding function depends also on a regular expression in order to decode values. - \begin{center} - \begin{tabular}{lcll} - %@{thm (lhs) decode'.simps(1)} & $\dn$ & @{thm (rhs) decode'.simps(1)}\\ - @{thm (lhs) decode'.simps(2)} & $\dn$ & @{thm (rhs) decode'.simps(2)}\\ - @{thm (lhs) decode'.simps(3)} & $\dn$ & @{thm (rhs) decode'.simps(3)}\\ - @{thm (lhs) decode'.simps(4)} & $\dn$ & @{thm (rhs) decode'.simps(4)}\\ - @{thm (lhs) decode'.simps(5)} & $\dn$ & @{thm (rhs) decode'.simps(5)}\\ - @{thm (lhs) decode'.simps(6)} & $\dn$ & @{thm (rhs) decode'.simps(6)}\\ - @{thm (lhs) decode'.simps(7)} & $\dn$ & @{thm (rhs) decode'.simps(7)}\\ - @{thm (lhs) decode'.simps(8)} & $\dn$ & @{thm (rhs) decode'.simps(8)}\\ - @{thm (lhs) decode'.simps(9)} & $\dn$ & @{thm (rhs) decode'.simps(9)}\\ - @{thm (lhs) decode'.simps(10)} & $\dn$ & @{thm (rhs) decode'.simps(10)} & fix - \end{tabular} - \end{center} + The idea of the bitcodes is to annotate them to regular expressions and generate values