diff -r 692911c0b981 -r 3178f0e948ac ChengsongTanPhdThesis/Chapters/Bitcoded1.tex --- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Thu Jul 21 20:22:35 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Tue Aug 02 22:11:22 2022 +0100 @@ -178,11 +178,6 @@ Because of the lossiness, the process of decoding a bitlist requires additionally a regular expression. The function $\decode$ is defined as: -We define the reverse operation of $\code$, which is $\decode$. -As expected, $\decode$ not only requires the bit-codes, -but also a regular expression to guide the decoding and -fill the gaps of characters: - %\begin{definition}[Bitdecoding of Values]\mbox{} \begin{center} @@ -214,27 +209,35 @@ %\end{definition} \noindent -The function $\decode'$ returns a pair consisting of a partially decoded value and some leftover: -$\decode'$ does most of the job while $\decode$ throws -away leftover bit-codes and returns the value only. +The function $\decode'$ returns a pair consisting of +a partially decoded value and some leftover bit list that cannot +be decide yet. +The function $\decode'$ succeeds if the left-over +bit-sequence is empty. $\decode$ is terminating as $\decode'$ is terminating. -We have the property that $\decode$ and $\code$ are +$\decode'$ is terminating +because at least one of $\decode'$'s parameters will go down in terms +of size. +Assuming we have a value $v$ and regular expression $r$ +with $\vdash v:r$, +then we have the property that $\decode$ and $\code$ are reverse operations of one another: \begin{lemma} -\[\vdash v : r \implies \decode \; (\code \; v) \; r = \textit{Some}(v) \] +\[If \vdash v : r \; then \;\decode \; (\code \; v) \; r = \textit{Some}(v) \] \end{lemma} \begin{proof} By proving a more general version of the lemma, on $\decode'$: \[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \] -Then setting $ds$ to be $[]$ and unfolding $\decode$ definition -we get the lemma. +Then setting $ds$ to be $[]$ and unfolding $\decode$ definition, +we obtain the property. \end{proof} With the $\code$ and $\decode$ functions in hand, we know how to -switch between bit-codes and value--the two different representations of -lexing information. -The next step is to integrate this information into the working regular expression. +switch between bit-codes and values. +The next step is to integrate this information into regular expression. Attaching bits to the front of regular expressions is the solution Sulzamann and Lu -gave for storing partial values on the fly: +gave for storing partial values in regular expressions. +Annotated regular expressions are therefore defined as the Isabelle +datatype: \begin{center} \begin{tabular}{lcl} @@ -249,16 +252,18 @@ %(in \textit{ALTS}) \noindent -We call these regular expressions carrying bit-codes \emph{Annotated regular expressions}. -$bs$ stands for bit-codes, $a$ for $\mathbf{a}$nnotated regular -expressions and $as$ for a list of annotated regular expressions. -The alternative constructor ($\sum$) has been generalised to -accept a list of annotated regular expressions rather than just 2. +where $bs$ stands for bit-codes, $a$ for $\mathbf{a}$nnotated regular +expressions and $as$ for lists of annotated regular expressions. +The alternative constructor, written, $\sum$, has been generalised to +accept a list of annotated regular expressions rather than just two. +Why is it generalised? This is because when we open up nested +alternatives, there could be more than two elements at the same level +after de-duplication, which can no longer be stored in a binary +constructor. - -The first thing we define related to bit-coded regular expressions -is how we move bits, for instance pasting it at the front of an annotated regular expression. -The operation $\fuse$ is just to attach bit-codes +The first operation we define related to bit-coded regular expressions +is how we move bits to the inside of regular expressions. +Called $\fuse$, this operation is attaches bit-codes to the front of an annotated regular expression: \begin{center} \begin{tabular}{lcl} @@ -277,14 +282,12 @@ \end{center} \noindent -With that we are able to define $\internalise$. +With \emph{fuse} we are able to define the $\internalise$ function +that translates a ``standard'' regular expression into an +annotated regular expression. +This function will be applied before we start +with the derivative phase of the algorithm. -To do lexing using annotated regular expressions, we shall first -transform the usual (un-annotated) regular expressions into annotated -regular expressions. This operation is called \emph{internalisation} and -defined as follows: - -%\begin{definition} \begin{center} \begin{tabular}{lcl} $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\ @@ -303,7 +306,7 @@ \noindent We use an up arrow with postfix notation -to denote the operation, +to denote this operation. for convenience. The $\textit{internalise} \; r$ notation is more cumbersome. The opposite of $\textit{internalise}$ is @@ -312,16 +315,17 @@ regular expressions is transformed to the binary alternatives for plain regular expressions. \begin{center} - \begin{tabular}{lcr} - $\erase \; \ZERO$ & $\dn$ & $\ZERO$\\ - $\erase \; _{bs}\ONE$ & $\dn$ & $\ONE$\\ - $\erase \; _{bs}\mathbf{c}$ & $\dn$ & $\mathbf{c}$\\ - $\erase \; _{bs} a_1 \cdot a_2$ & $\dn$ & $\erase \; r_1\cdot\erase\; r_2$\\ - $\erase \; _{bs} [] $ & $\dn$ & $\ZERO$\\ - $\erase \; _{bs} [a] $ & $\dn$ & $\erase \; a$\\ - $\erase \; _{bs} \sum [a_1, \; a_2]$ & $\dn$ & $\erase \; a_1 +\erase \; a_2$\\ - $\erase \; _{bs} \sum (a :: as)$ & $\dn$ & $\erase \; a + \erase \; _{[]} \sum as$\\ - $\erase \; _{bs} a^*$ & $\dn$ & $(\erase \; a)^*$ + \begin{tabular}{lcl} + $\ZERO_\downarrow$ & $\dn$ & $\ZERO$\\ + $( _{bs}\ONE )_\downarrow$ & $\dn$ & $\ONE$\\ + $( _{bs}\mathbf{c} )_\downarrow$ & $\dn$ & $\mathbf{c}$\\ + $( _{bs} a_1 \cdot a_2 )_\downarrow$ & $\dn$ & + $ (a_1) _\downarrow \cdot (a_2) _\downarrow$\\ + $( _{bs} [])_\downarrow $ & $\dn$ & $\ZERO $\\ + $( _{bs} [a] )_\downarrow$ & $\dn$ & $a_\downarrow$\\ + $_{bs} \sum [a_1, \; a_2]$ & $\dn$ & $ (a_1) _\downarrow + ( a_2 ) _\downarrow $\\ + $(_{bs} \sum (a :: as))_\downarrow$ & $\dn$ & $ a_\downarrow + \; (_{[]} \sum as)_\downarrow$\\ + $( _{bs} a^* )_\downarrow$ & $\dn$ & $(a_\downarrow)^*$ \end{tabular} \end{center} \noindent @@ -331,13 +335,12 @@ testing whether they contain empty string in their lauguage requires a dedicated function $\bnullable$ which simply calls $\erase$ first before testing whether it is $\nullable$. -\begin{center} - \begin{tabular}{lcr} - $\bnullable \; a $ & $\dn$ & $\nullable \; (a_\downarrow)$ - \end{tabular} -\end{center} +\begin{definition} + $\bnullable \; a \dn \nullable \; (a_\downarrow)$ +\end{definition} The function for collecting the -bitcodes of a $\bnullable$ annotated regular expression +bitcodes at the end of the derivative +phase from a (b)nullable regular expression is a generalised version of the $\textit{mkeps}$ function for annotated regular expressions, called $\textit{bmkeps}$: @@ -359,74 +362,105 @@ %\end{definition} \noindent -This function completes the value information by travelling along the +$\bmkeps$ completes the value information by travelling along the path of the regular expression that corresponds to a POSIX value and -collecting all the bitcodes, and using $S$ to indicate the end of star +collecting all the bitcodes, and attaching $S$ to indicate the end of star iterations. -The most central question is how these partial lexing information -represented as bit-codes is augmented and carried around -during a derivative is taken. -This is done by adding bitcodes to the -derivatives, for example when one more star iteratoin is taken (we -call the operation of derivatives on annotated regular expressions $\bder$ -because it is derivatives on regular expressiones with \emph{b}itcodes), +Now we give out the central part of this lexing algorithm, +the $\bder$ function (stands for \emph{b}itcoded-derivative). +For most time we use the infix notation $(\_\backslash\_)$ +to mean $\bder$ for brevity when +there is no danger of confusion with derivatives on plain regular expressions. +For example, we write $( _{[]}r^* ) \backslash c$ instead of $\bder \;c \; _{[]}r^*$, +as the bitcodes at the front of $r^*$ indicates that it is +a bit-coded regular expression, not a plain one. +$\bder$ tells us how regular expressions can be recursively traversed, +where the bitcodes are augmented and carried around +when a derivative is taken. +\begin{center} + \begin{tabular}{@{}lcl@{}} + $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ + $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ + $(_{bs}{\bf d})\,\backslash c$ & $\dn$ & + $\textit{if}\;c=d\; \;\textit{then}\; + _{bs}\ONE\;\textit{else}\;\ZERO$\\ + $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ & + $_{bs}\sum\;(\textit{map} \; (\_\backslash c) \; as )$\\ + $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ & + $\textit{if}\;\textit{bnullable}\,a_1$\\ + & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\ + & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ + & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\ + $(_{bs}a^*)\,\backslash c$ & $\dn$ & + $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot + (_{[]}r^*))$ +\end{tabular} +\end{center} +\noindent +We give the intuition behind some of the more involved cases in +$\bder$. For example, +in the \emph{star} case, +a derivative on $_{bs}a^*$ means +that one more star iteratoin needs to be taken. we need to unfold it into a sequence, and attach an additional bit $Z$ to the front of $r \backslash c$ -to indicate one more star iteration. -\begin{center} - \begin{tabular}{@{}lcl@{}} - $\bder \; c\; (_{bs}a^*) $ & $\dn$ & - $_{bs}(\textit{fuse}\, [Z] \; \bder \; c \; a)\cdot - (_{[]}a^*))$ -\end{tabular} -\end{center} +as a record to indicate one new star iteration is unfolded. \noindent -For most time we use the infix notation $\backslash$ to mean $\bder$ for brevity when -there is no danger of confusion with derivatives on plain regular expressions, -for example, the above can be expressed as \begin{center} \begin{tabular}{@{}lcl@{}} $(_{bs}a^*)\,\backslash c$ & $\dn$ & - $_{bs}(\textit{fuse}\, [Z] \; a\,\backslash c)\cdot + $_{bs}(\underbrace{\textit{fuse}\, [Z] \; a\,\backslash c}_{\text{One more iteration}})\cdot (_{[]}a^*))$ \end{tabular} \end{center} \noindent -Using the picture we used earlier to depict this, the transformation when -taking a derivative w.r.t a star is like below: +This information will be recovered later by the $\decode$ function. +The intuition is that the bit $Z$ will be decoded at the right location, +because we accumulate bits from left to right (a rigorous proof will be given +later). -\begin{tabular}{@{}l@{\hspace{1mm}}l@{\hspace{0mm}}c@{}} -\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] - \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] +\begin{tikzpicture}[ > = stealth, % arrow head style + shorten > = 1pt, % don't touch arrow head to node + semithick % line style + ] + + \tikzstyle{every state}=[ + draw = black, + thin, + fill = cyan!29, + minimum size = 7mm + ] + \begin{scope}[node distance=1cm and 0cm, every node/.style=state] + \node (k) [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] {$bs$ \nodepart{two} $a^*$ }; -%\caption{term 1 \ref{term:1}'s matching configuration} -\end{tikzpicture} -& + \node (l) [below =of k, rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] + { $bs$ + [Z] + \nodepart{two} $(a\backslash c )\cdot a^*$ }; + \end{scope} + \path[->] + (k) edge (l); +\end{tikzpicture} + +Pictorially the process looks like below. +Like before, the red region denotes +previous lexing information (stored as bitcodes in $bs$). + \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] - \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] - {$v_{\text{previous iterations}}$ - \nodepart{two} $a^*$}; -%\caption{term 1 \ref{term:1}'s matching configuration} -\end{tikzpicture} -\\ -\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] - \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] + \begin{scope}[node distance=1cm] + \node (a) [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] + {$bs$ + \nodepart{two} $a^*$ }; + \node (b) [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] { $bs$ + [Z] \nodepart{two} $(a\backslash c )\cdot a^*$ }; %\caption{term 1 \ref{term:1}'s matching configuration} + \end{scope} \end{tikzpicture} -& -\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] - \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] - {$v_{\text{previous iterations}}$ + 1 more iteration - \nodepart{two} $(a\backslash c )\cdot a^*$ }; -%\caption{term 1 \ref{term:1}'s matching configuration} -\end{tikzpicture} -\end{tabular} + \noindent Another place in the $\bder$ function where it differs from normal derivatives (on un-annotated regular expressions) @@ -474,25 +508,6 @@ The rest of the clauses of $\bder$ is rather similar to $\der$, and is put together here as a wholesome definition for $\bder$: -\begin{center} - \begin{tabular}{@{}lcl@{}} - $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ - $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ - $(_{bs}{\bf d})\,\backslash c$ & $\dn$ & - $\textit{if}\;c=d\; \;\textit{then}\; - _{bs}\ONE\;\textit{else}\;\ZERO$\\ - $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ & - $_{bs}\sum\;(\textit{map} \; (\_\backslash c) \; as )$\\ - $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ & - $\textit{if}\;\textit{bnullable}\,a_1$\\ - & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\ - & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ - & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\ - $(_{bs}a^*)\,\backslash c$ & $\dn$ & - $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot - (_{[]}r^*))$ -\end{tabular} -\end{center} \noindent Generalising the derivative operation with bitcodes to strings, we have \begin{center}