--- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Thu Jul 21 20:22:35 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Tue Aug 02 22:11:22 2022 +0100
@@ -178,11 +178,6 @@
Because of the lossiness, the process of decoding a bitlist requires additionally
a regular expression. The function $\decode$ is defined as:
-We define the reverse operation of $\code$, which is $\decode$.
-As expected, $\decode$ not only requires the bit-codes,
-but also a regular expression to guide the decoding and
-fill the gaps of characters:
-
%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
@@ -214,27 +209,35 @@
%\end{definition}
\noindent
-The function $\decode'$ returns a pair consisting of a partially decoded value and some leftover:
-$\decode'$ does most of the job while $\decode$ throws
-away leftover bit-codes and returns the value only.
+The function $\decode'$ returns a pair consisting of
+a partially decoded value and some leftover bit list that cannot
+be decide yet.
+The function $\decode'$ succeeds if the left-over
+bit-sequence is empty.
$\decode$ is terminating as $\decode'$ is terminating.
-We have the property that $\decode$ and $\code$ are
+$\decode'$ is terminating
+because at least one of $\decode'$'s parameters will go down in terms
+of size.
+Assuming we have a value $v$ and regular expression $r$
+with $\vdash v:r$,
+then we have the property that $\decode$ and $\code$ are
reverse operations of one another:
\begin{lemma}
-\[\vdash v : r \implies \decode \; (\code \; v) \; r = \textit{Some}(v) \]
+\[If \vdash v : r \; then \;\decode \; (\code \; v) \; r = \textit{Some}(v) \]
\end{lemma}
\begin{proof}
By proving a more general version of the lemma, on $\decode'$:
\[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \]
-Then setting $ds$ to be $[]$ and unfolding $\decode$ definition
-we get the lemma.
+Then setting $ds$ to be $[]$ and unfolding $\decode$ definition,
+we obtain the property.
\end{proof}
With the $\code$ and $\decode$ functions in hand, we know how to
-switch between bit-codes and value--the two different representations of
-lexing information.
-The next step is to integrate this information into the working regular expression.
+switch between bit-codes and values.
+The next step is to integrate this information into regular expression.
Attaching bits to the front of regular expressions is the solution Sulzamann and Lu
-gave for storing partial values on the fly:
+gave for storing partial values in regular expressions.
+Annotated regular expressions are therefore defined as the Isabelle
+datatype:
\begin{center}
\begin{tabular}{lcl}
@@ -249,16 +252,18 @@
%(in \textit{ALTS})
\noindent
-We call these regular expressions carrying bit-codes \emph{Annotated regular expressions}.
-$bs$ stands for bit-codes, $a$ for $\mathbf{a}$nnotated regular
-expressions and $as$ for a list of annotated regular expressions.
-The alternative constructor ($\sum$) has been generalised to
-accept a list of annotated regular expressions rather than just 2.
+where $bs$ stands for bit-codes, $a$ for $\mathbf{a}$nnotated regular
+expressions and $as$ for lists of annotated regular expressions.
+The alternative constructor, written, $\sum$, has been generalised to
+accept a list of annotated regular expressions rather than just two.
+Why is it generalised? This is because when we open up nested
+alternatives, there could be more than two elements at the same level
+after de-duplication, which can no longer be stored in a binary
+constructor.
-
-The first thing we define related to bit-coded regular expressions
-is how we move bits, for instance pasting it at the front of an annotated regular expression.
-The operation $\fuse$ is just to attach bit-codes
+The first operation we define related to bit-coded regular expressions
+is how we move bits to the inside of regular expressions.
+Called $\fuse$, this operation is attaches bit-codes
to the front of an annotated regular expression:
\begin{center}
\begin{tabular}{lcl}
@@ -277,14 +282,12 @@
\end{center}
\noindent
-With that we are able to define $\internalise$.
+With \emph{fuse} we are able to define the $\internalise$ function
+that translates a ``standard'' regular expression into an
+annotated regular expression.
+This function will be applied before we start
+with the derivative phase of the algorithm.
-To do lexing using annotated regular expressions, we shall first
-transform the usual (un-annotated) regular expressions into annotated
-regular expressions. This operation is called \emph{internalisation} and
-defined as follows:
-
-%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
$(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
@@ -303,7 +306,7 @@
\noindent
We use an up arrow with postfix notation
-to denote the operation,
+to denote this operation.
for convenience. The $\textit{internalise} \; r$
notation is more cumbersome.
The opposite of $\textit{internalise}$ is
@@ -312,16 +315,17 @@
regular expressions is transformed to the binary alternatives
for plain regular expressions.
\begin{center}
- \begin{tabular}{lcr}
- $\erase \; \ZERO$ & $\dn$ & $\ZERO$\\
- $\erase \; _{bs}\ONE$ & $\dn$ & $\ONE$\\
- $\erase \; _{bs}\mathbf{c}$ & $\dn$ & $\mathbf{c}$\\
- $\erase \; _{bs} a_1 \cdot a_2$ & $\dn$ & $\erase \; r_1\cdot\erase\; r_2$\\
- $\erase \; _{bs} [] $ & $\dn$ & $\ZERO$\\
- $\erase \; _{bs} [a] $ & $\dn$ & $\erase \; a$\\
- $\erase \; _{bs} \sum [a_1, \; a_2]$ & $\dn$ & $\erase \; a_1 +\erase \; a_2$\\
- $\erase \; _{bs} \sum (a :: as)$ & $\dn$ & $\erase \; a + \erase \; _{[]} \sum as$\\
- $\erase \; _{bs} a^*$ & $\dn$ & $(\erase \; a)^*$
+ \begin{tabular}{lcl}
+ $\ZERO_\downarrow$ & $\dn$ & $\ZERO$\\
+ $( _{bs}\ONE )_\downarrow$ & $\dn$ & $\ONE$\\
+ $( _{bs}\mathbf{c} )_\downarrow$ & $\dn$ & $\mathbf{c}$\\
+ $( _{bs} a_1 \cdot a_2 )_\downarrow$ & $\dn$ &
+ $ (a_1) _\downarrow \cdot (a_2) _\downarrow$\\
+ $( _{bs} [])_\downarrow $ & $\dn$ & $\ZERO $\\
+ $( _{bs} [a] )_\downarrow$ & $\dn$ & $a_\downarrow$\\
+ $_{bs} \sum [a_1, \; a_2]$ & $\dn$ & $ (a_1) _\downarrow + ( a_2 ) _\downarrow $\\
+ $(_{bs} \sum (a :: as))_\downarrow$ & $\dn$ & $ a_\downarrow + \; (_{[]} \sum as)_\downarrow$\\
+ $( _{bs} a^* )_\downarrow$ & $\dn$ & $(a_\downarrow)^*$
\end{tabular}
\end{center}
\noindent
@@ -331,13 +335,12 @@
testing whether they contain empty string in their lauguage requires
a dedicated function $\bnullable$
which simply calls $\erase$ first before testing whether it is $\nullable$.
-\begin{center}
- \begin{tabular}{lcr}
- $\bnullable \; a $ & $\dn$ & $\nullable \; (a_\downarrow)$
- \end{tabular}
-\end{center}
+\begin{definition}
+ $\bnullable \; a \dn \nullable \; (a_\downarrow)$
+\end{definition}
The function for collecting the
-bitcodes of a $\bnullable$ annotated regular expression
+bitcodes at the end of the derivative
+phase from a (b)nullable regular expression
is a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:
@@ -359,74 +362,105 @@
%\end{definition}
\noindent
-This function completes the value information by travelling along the
+$\bmkeps$ completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value and
-collecting all the bitcodes, and using $S$ to indicate the end of star
+collecting all the bitcodes, and attaching $S$ to indicate the end of star
iterations.
-The most central question is how these partial lexing information
-represented as bit-codes is augmented and carried around
-during a derivative is taken.
-This is done by adding bitcodes to the
-derivatives, for example when one more star iteratoin is taken (we
-call the operation of derivatives on annotated regular expressions $\bder$
-because it is derivatives on regular expressiones with \emph{b}itcodes),
+Now we give out the central part of this lexing algorithm,
+the $\bder$ function (stands for \emph{b}itcoded-derivative).
+For most time we use the infix notation $(\_\backslash\_)$
+to mean $\bder$ for brevity when
+there is no danger of confusion with derivatives on plain regular expressions.
+For example, we write $( _{[]}r^* ) \backslash c$ instead of $\bder \;c \; _{[]}r^*$,
+as the bitcodes at the front of $r^*$ indicates that it is
+a bit-coded regular expression, not a plain one.
+$\bder$ tells us how regular expressions can be recursively traversed,
+where the bitcodes are augmented and carried around
+when a derivative is taken.
+\begin{center}
+ \begin{tabular}{@{}lcl@{}}
+ $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
+ $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
+ $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
+ $\textit{if}\;c=d\; \;\textit{then}\;
+ _{bs}\ONE\;\textit{else}\;\ZERO$\\
+ $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
+ $_{bs}\sum\;(\textit{map} \; (\_\backslash c) \; as )$\\
+ $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
+ $\textit{if}\;\textit{bnullable}\,a_1$\\
+ & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
+ & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
+ & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
+ $(_{bs}a^*)\,\backslash c$ & $\dn$ &
+ $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot
+ (_{[]}r^*))$
+\end{tabular}
+\end{center}
+\noindent
+We give the intuition behind some of the more involved cases in
+$\bder$. For example,
+in the \emph{star} case,
+a derivative on $_{bs}a^*$ means
+that one more star iteratoin needs to be taken.
we need to unfold it into a sequence,
and attach an additional bit $Z$ to the front of $r \backslash c$
-to indicate one more star iteration.
-\begin{center}
- \begin{tabular}{@{}lcl@{}}
- $\bder \; c\; (_{bs}a^*) $ & $\dn$ &
- $_{bs}(\textit{fuse}\, [Z] \; \bder \; c \; a)\cdot
- (_{[]}a^*))$
-\end{tabular}
-\end{center}
+as a record to indicate one new star iteration is unfolded.
\noindent
-For most time we use the infix notation $\backslash$ to mean $\bder$ for brevity when
-there is no danger of confusion with derivatives on plain regular expressions,
-for example, the above can be expressed as
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(_{bs}a^*)\,\backslash c$ & $\dn$ &
- $_{bs}(\textit{fuse}\, [Z] \; a\,\backslash c)\cdot
+ $_{bs}(\underbrace{\textit{fuse}\, [Z] \; a\,\backslash c}_{\text{One more iteration}})\cdot
(_{[]}a^*))$
\end{tabular}
\end{center}
\noindent
-Using the picture we used earlier to depict this, the transformation when
-taking a derivative w.r.t a star is like below:
+This information will be recovered later by the $\decode$ function.
+The intuition is that the bit $Z$ will be decoded at the right location,
+because we accumulate bits from left to right (a rigorous proof will be given
+later).
-\begin{tabular}{@{}l@{\hspace{1mm}}l@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
- \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+\begin{tikzpicture}[ > = stealth, % arrow head style
+ shorten > = 1pt, % don't touch arrow head to node
+ semithick % line style
+ ]
+
+ \tikzstyle{every state}=[
+ draw = black,
+ thin,
+ fill = cyan!29,
+ minimum size = 7mm
+ ]
+ \begin{scope}[node distance=1cm and 0cm, every node/.style=state]
+ \node (k) [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
{$bs$
\nodepart{two} $a^*$ };
-%\caption{term 1 \ref{term:1}'s matching configuration}
-\end{tikzpicture}
-&
+ \node (l) [below =of k, rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+ { $bs$ + [Z]
+ \nodepart{two} $(a\backslash c )\cdot a^*$ };
+ \end{scope}
+ \path[->]
+ (k) edge (l);
+\end{tikzpicture}
+
+Pictorially the process looks like below.
+Like before, the red region denotes
+previous lexing information (stored as bitcodes in $bs$).
+
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
- \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
- {$v_{\text{previous iterations}}$
- \nodepart{two} $a^*$};
-%\caption{term 1 \ref{term:1}'s matching configuration}
-\end{tikzpicture}
-\\
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
- \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+ \begin{scope}[node distance=1cm]
+ \node (a) [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+ {$bs$
+ \nodepart{two} $a^*$ };
+ \node (b) [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
{ $bs$ + [Z]
\nodepart{two} $(a\backslash c )\cdot a^*$ };
%\caption{term 1 \ref{term:1}'s matching configuration}
+ \end{scope}
\end{tikzpicture}
-&
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
- \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
- {$v_{\text{previous iterations}}$ + 1 more iteration
- \nodepart{two} $(a\backslash c )\cdot a^*$ };
-%\caption{term 1 \ref{term:1}'s matching configuration}
-\end{tikzpicture}
-\end{tabular}
+
\noindent
Another place in the $\bder$ function where it differs
from normal derivatives (on un-annotated regular expressions)
@@ -474,25 +508,6 @@
The rest of the clauses of $\bder$ is rather similar to
$\der$, and is put together here as a wholesome definition
for $\bder$:
-\begin{center}
- \begin{tabular}{@{}lcl@{}}
- $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
- $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
- $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
- $\textit{if}\;c=d\; \;\textit{then}\;
- _{bs}\ONE\;\textit{else}\;\ZERO$\\
- $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
- $_{bs}\sum\;(\textit{map} \; (\_\backslash c) \; as )$\\
- $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
- $\textit{if}\;\textit{bnullable}\,a_1$\\
- & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
- & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
- & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
- $(_{bs}a^*)\,\backslash c$ & $\dn$ &
- $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot
- (_{[]}r^*))$
-\end{tabular}
-\end{center}
\noindent
Generalising the derivative operation with bitcodes to strings, we have
\begin{center}