--- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Wed Aug 17 22:59:15 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Sat Aug 20 23:49:17 2022 +0100
@@ -307,6 +307,13 @@
which condenses all the partial lexing information
(that was originally stored in $r_i, c_{i+1}$ pairs)
into bitcodes.
+The bitcodes are then carried with the regular
+expression, and augmented or moved around
+as the lexing goes on.
+It becomes unnecessary
+to remember all the
+intermediate expresssions, but only the most recent one
+with this bit-carrying regular expression.
Annotated regular expressions
are defined as the following
Isabelle datatype \footnote{ We use subscript notation to indicate
@@ -565,12 +572,23 @@
\end{tabular}
\end{center}
\noindent
-$\bmkeps$ retrieves the value $v$'s
-information in the format
-of bitcodes, by travelling along the
-path of the regular expression that corresponds to a POSIX match,
-collecting all the bitcodes, and attaching $S$ to indicate the end of star
-iterations. \\
+$\bmkeps$, just like $\mkeps$,
+visits a regular expression tree respecting
+the POSIX rules. The difference, however, is that
+it does not create values, but only bitcodes.
+It traverses each child of the sequence regular expression
+from left to right and creates a bitcode by stitching
+together bitcodes obtained from the children expressions.
+In the case of alternative regular expressions,
+it looks for the leftmost
+$\nullable$ branch
+to visit and ignores other siblings.
+%Whenever there is some bitcodes attached to a
+%node, it returns the bitcodes concatenated with whatever
+%child recursive calls return.
+The only time when $\bmkeps$ creates new bitcodes
+is when it completes a star's iterations by attaching a $S$ to the end of the bitcode
+list it returns.\\
The bitcodes extracted by $\bmkeps$ need to be
$\decode$d (with the guidance of a plain regular expression):
%\begin{definition}[Bitdecoding of Values]\mbox{}
@@ -813,115 +831,229 @@
\end{center}
\noindent
+$\blexer$ first attaches bitcodes to a
+plain regular expression, and then do successive derivatives
+with respect to the input string $s$, and
+then test whether the result is nullable.
+If yes, then extract the bitcodes out of the nullable expression,
+and decodes the bitcodes into a lexical value.
+If there does not exists a match between $r$ and $s$ the lexer
+outputs $\None$ indicating a failed lex.\\
Ausaf and Urban formally proved the correctness of the $\blexer$, namely
\begin{property}
$\blexer \;r \; s = \lexer \; r \; s$.
\end{property}
+\noindent
This was claimed but not formalised in Sulzmann and Lu's work.
We introduce the proof later, after we give all
the needed auxiliary functions and definitions.
-But before this we shall first walk the reader
+\subsection{An Example $\blexer$ Run}
+Before introducing the proof we shall first walk the reader
through a concrete example of our $\blexer$ calculating the right
lexical information through bit-coded regular expressions.\\
-Consider the regular expression $(aa)^* \cdot bc$ matching the string $aabc$
-and assume we have just read the first character $a$:
-\begin{center}
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
- \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
- {$\ONE \cdot a \cdot (aa)^* \cdot bc$
- \nodepart{two} $[] \iff \Seq \; (\Stars \; [\Seq\; a \; ??, \;]), ??$};
-\end{tikzpicture}
-\end{center}
-\noindent
-We use the red area for (annotated) regular expressions and the blue
-area the (partial) bitcodes and (partial) values.
-In the injection-based lexing algorithm, we ``neglect" the red area
-by putting all the characters we have consumed and
-intermediate regular expressions on the stack when
-we go from left to right in the derivative phase.
-The red area grows till the string is exhausted.
-During the injection phase, the value in the blue area
-is built up incrementally, while the red area shrinks.
-Before we have recovered all characters and intermediate
-derivative regular expressions from the stack,
-what values these characters and regular expressions correspond
-to are unknown:
-\begin{center}
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
- \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},]
- {$(\ONE \cdot \ONE) \cdot (aa)^* \cdot bc $ correspond to:$???$
- \nodepart{two} $b c$ corresponds to $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
-%\caption{term 1 \ref{term:1}'s matching configuration}
-\end{tikzpicture}
-\end{center}
+Consider the regular expression $(aa)^* \cdot (b+c)$ matching the string $aab$.
+We give again the bird's eye view of this particular example
+in each stage of the algorithm:
+
+\tikzset{three sided/.style={
+ draw=none,
+ append after command={
+ [-,shorten <= -0.5\pgflinewidth]
+ ([shift={(-1.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north east)
+ edge([shift={( 0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north west)
+ ([shift={( 0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north west)
+ edge([shift={( 0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south west)
+ ([shift={( 0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south west)
+ edge([shift={(-1.0\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south east)
+ }
+ }
+}
+
+\tikzset{three sided1/.style={
+ draw=none,
+ append after command={
+ [-,shorten <= -0.5\pgflinewidth]
+ ([shift={(1.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north west)
+ edge([shift={(-0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north east)
+ ([shift={(-0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north east)
+ edge([shift={(-0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south east)
+ ([shift={(-0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south east)
+ edge([shift={(1.0\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south west)
+ }
+ }
+}
+
+\begin{figure}[H]
+ \begin{tikzpicture}[->, >=stealth', shorten >= 1pt, auto, thick]
+ \node [rectangle, draw] (r) at (-6, -1) {$(aa)^*(b+c)$};
+ \node [rectangle, draw] (a) at (-6, 4) {$(aa)^*(_{Z}b + _{S}c)$};
+ \path (r)
+ edge [] node {$\internalise$} (a);
+ \node [rectangle, draw] (a1) at (-3, 1) {$(_{Z}(\ONE \cdot a) \cdot (aa)^*) (_{Z}b + _Sc)$};
+ \path (a)
+ edge [] node {$\backslash a$} (a1);
+
+ \node [rectangle, draw, three sided] (a21) at (-2.5, 4) {$(_{Z}\ONE \cdot (aa)^*)$};
+ \node [rectangle, draw, three sided1] (a22) at (-0.8, 4) {$(_{Z}b + _{S}c)$};
+ \path (a1)
+ edge [] node {$\backslash a$} (a21);
+ \node [rectangle, draw] (a3) at (0.5, 2) {$_{ZS}(_{Z}\ONE + \ZERO)$};
+ \path (a22)
+ edge [] node {$\backslash b$} (a3);
+ \path (a21)
+ edge [dashed, bend right] node {} (a3);
+ \node [rectangle, draw] (bs) at (2, 4) {$ZSZ$};
+ \path (a3)
+ edge [below] node {$\bmkeps$} (bs);
+ \node [rectangle, draw] (v) at (3, 0) {$\Seq \; (\Stars\; [\Seq \; a \; a]) \; (\Left \; b)$};
+ \path (bs)
+ edge [] node {$\decode$} (v);
+
+
+ \end{tikzpicture}
+ \caption{$\blexer \;\;\;\; (aa)^*(b+c) \;\;\;\; aab$}
+\end{figure}
\noindent
-However, they should be calculable,
-as characters and regular expression shapes
-after taking derivative w.r.t those characters
-have already been known, therefore in our example,
-we know that the value starts with two $a$s,
-and makes up to an iteration in a Kleene star:
-(We have put the injection-based lexing's partial
-result in the right part of the split rectangle
-to contrast it with the partial valued produced
-in a forward manner)
-\begin{center}
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
- \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
- {$\stackrel{Bitcoded}{\longrightarrow} \Seq(\Stars[\Char(a), \Char(a)], ???)$
- \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$ $\stackrel{Inj}{\longleftarrow}$};
-%\caption{term 1 \ref{term:1}'s matching configuration}
-\end{tikzpicture}
-\end{center}
+The one dashed arrow indicates that $_Z(\ONE \cdot (aa)^*)$
+turned into $ZS$ after derivative w.r.t $b$
+is taken, which calls $\bmkeps$ on the nuallable $_Z\ONE\cdot (aa)^*$
+before processing $_Zb+_Sc$.\\
+The annotated regular expressions
+would look too cumbersome if we explicitly indicate all the
+locations where bitcodes are attached.
+For example,
+$(aa)^*\cdot (b+c)$ would
+look like $_{[]}(_{[]}(_{[]}a \cdot _{[]}a)^*\cdot _{[]}(_{[]}b+_{[]}c))$
+after
+internalise.
+Therefore for readability we omit bitcodes if they are empty.
+This applies to all example annotated
+regular expressions in this thesis.\\
+%and assume we have just read the first character $a$:
+%\begin{center}
+%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+% \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+% {$(_{[Z]}(\ONE \cdot a) \cdot (aa)^* )\cdot bc$
+% \nodepart{two} $[Z] \iff \Seq \; (\Stars \; [\Seq\; a \; ??, \;??]) \; ??$};
+%\end{tikzpicture}
+%\end{center}
+%\noindent
+%We use the red area for (annotated) regular expressions and the blue
+%area the (partially calculated) bitcodes
+%and its corresponding (partial) values.
+%The first derivative
+%generates a $Z$ bitcode to indicate
+%a new iteration has been started.
+%This bitcode is attached to the front of
+%the unrolled star iteration $\ONE\cdot a$
+%for later decoding.
+%\begin{center}
+%\begin{tikzpicture}[]
+% \node [rectangle split, rectangle split horizontal,
+% rectangle split parts=2, rectangle split part fill={red!30,blue!20}, draw, rounded corners, inner sep=10pt]
+% (der2) at (0,0)
+% {$(_{[Z]}(\ONE \cdot \ONE) \cdot (aa)^*) \cdot bc $
+% \nodepart{two} $[Z] \iff \Seq \; (\Stars \; [\Seq \; a\;a, ??]) \; ??$};
+%
+%\node [draw=none, minimum size = 0.1, ] (r) at (-7, 0) {$a_1$};
+%\path
+% (r)
+% edge [->, >=stealth',shorten >=1pt, above] node {$\backslash a$} (der2);
+%%\caption{term 1 \ref{term:1}'s matching configuration}
+%\end{tikzpicture}
+%\end{center}
+%\noindent
+%After we take derivative with respect to
+%second input character $a$, the annotated
+%regular expression has the second $a$ chopped off.
+%The second derivative does not involve any
+%new bitcodes being generated, because
+%there are no new iterations or bifurcations
+%in the regular expression requiring any $S$ or $Z$ marker
+%to indicate choices.
+%\begin{center}
+%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+% \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+% {$(_{[Z]}(\ONE \cdot \ONE) \cdot (aa)^*) \cdot (\ONE \cdot c) $
+% \nodepart{two} $[Z] \iff \Seq \; (\Stars \; [\Seq \; a\;a, ??]) \; ??$};
+%%\caption{term 1 \ref{term:1}'s matching configuration}
+%\end{tikzpicture}
+%\end{center}
+%\noindent
+%
+%
+%\begin{center}
+%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+% \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+% {$\stackrel{Bitcoded}{\longrightarrow} \Seq(\Stars[\Char(a), \Char(a)], ???)$
+% \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$ $\stackrel{Inj}{\longleftarrow}$};
+%%\caption{term 1 \ref{term:1}'s matching configuration}
+%\end{tikzpicture}
+%\end{center}
+%\noindent
+%If we do this kind of "attachment"
+%and each time augment the attached partially
+%constructed value when taking off a
+%character:
+%\begin{center}
+%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+% \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] (spPoint)
+% {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
+% \nodepart{two} Remaining: $b c$};
+%\end{tikzpicture}\\
+%$\downarrow$\\
+%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+% \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+% {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$
+% \nodepart{two} Remaining: $c$};
+%\end{tikzpicture}\\
+%$\downarrow$\\
+%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+% \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+% {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$
+% \nodepart{two} EOF};
+%\end{tikzpicture}
+%\end{center}
\noindent
-If we do this kind of "attachment"
-and each time augment the attached partially
-constructed value when taking off a
-character:
-\begin{center}
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
- \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] (spPoint)
- {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
- \nodepart{two} Remaining: $b c$};
-\end{tikzpicture}\\
-$\downarrow$\\
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
- \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
- {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$
- \nodepart{two} Remaining: $c$};
-\end{tikzpicture}\\
-$\downarrow$\\
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
- \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
- {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$
- \nodepart{two} EOF};
-\end{tikzpicture}
-\end{center}
-\noindent
-In the end we could recover the value without a backward phase.
-But (partial) values are a bit clumsy to stick together with a regular expression, so
-we instead use bit-codes to encode them.
-
-Bits and bitcodes (lists of bits) are defined as:
-\begin{center}
- $b ::= S \mid Z \qquad
-bs ::= [] \mid b::bs
-$
-\end{center}
-
-\noindent
-
+In the next section we introduce the correctness proof
+found by Ausaf and Urban
+of the bitcoded lexer.
%-----------------------------------
% SUBSECTION 1
%-----------------------------------
-\section{Specifications of Some Helper Functions}
+\section{Correctness Proof of $\textit{Blexer}$}
+Why is $\blexer$ correct?
+In other words, why is it the case that
+$\blexer$ outputs the same value as $\lexer$?
+Intuitively,
+that is because
+\begin{itemize}
+ \item
+ $\blexer$ follows an almost identical
+ path as that of $\lexer$,
+ for example $r_1, r_2, \ldots$ and $a_1, a_2, \ldots$ being produced,
+ which are the same up to the application of $\erase$.
+ \item
+ The bit-encodings work properly,
+ allowing the possibility of
+ pulling out the right lexical value from an
+ annotated regular expression at
+ any stage of the algorithm.
+\end{itemize}
+We will elaborate on this, with the help of
+some helper functions such as $\retrieve$ and
+$\flex$.
+\subsection{Specifications of Some Helper Functions}
The functions we introduce will give a more detailed glimpse into
-the lexing process, which might not be possible
-using $\lexer$ or $\blexer$ themselves.
-The first function we shall look at is $\retrieve$.
-\subsection{$\textit{Retrieve}$}
-Our bit-coded lexer "retrieve"s the bitcodes using $\bmkeps$
-after we finished doing all the derivatives:
+the lexing process, which is not be possible
+using $\lexer$ or $\blexer$ alone.
+\subsubsection{$\textit{Retrieve}$}
+The first function we shall introduce is $\retrieve$.
+Sulzmann and Lu gave its definition, and
+Ausaf and Urban found
+its usage in mechanised proofs.
+Our bit-coded lexer ``retrieve''s the bitcodes using $\bmkeps$,
+after all the derivatives has been taken:
\begin{center}
\begin{tabular}{lcl}
& & $\ldots$\\
@@ -931,22 +1063,24 @@
\end{tabular}
\end{center}
\noindent
-Recall that $\bmkeps$ looks for the leftmost branch of an alternative
-and completes a star's iterations by attaching a $Z$ at the end of the bitcodes
-extracted. It "retrieves" a sequence by visiting both children and then stitch together
-two bitcodes using concatenation. After the entire tree structure of the regular
-expression has been traversed using the above manner, we get a bitcode encoding the
-lexing result.
+$\bmkeps$ retrieves the value $v$'s
+information in the format
+of bitcodes, by travelling along the
+path of the regular expression that corresponds to a POSIX match,
+collecting all the bitcodes.
We know that this "retrieved" bitcode leads to the correct value after decoding,
-which is $v_0$ in the bird's eye view of the injection-based lexing diagram.
-Now assume we keep every other data structure in the diagram \ref{InjFigure},
-and only replace all the plain regular expression by their annotated counterparts,
-computed during a $\blexer$ run.
-Then we obtain a diagram for the annotated regular expression derivatives and
-their corresponding values, though the values are never calculated in $\blexer$.
-We have that $a_n$ contains all the lexing result information.
+which is $v_0$ in the injection-based lexing diagram.
+As an observation we pointed at the beginning of this section,
+the annotated regular expressions generated in successive derivative steps
+in $\blexer$ after $\erase$ has the same structure
+as those appeared in $\lexer$.
+We redraw the diagram below to visualise this fact.
+We pretend that all the values are
+ready despite they are only calculated in $\lexer$.
+In general we have $\vdash v_i:(a_i)_\downarrow$.
\vspace{20mm}
-\begin{center}%\label{graph:injLexer}
+\begin{figure}[H]%\label{graph:injLexer}
+\begin{center}
\begin{tikzcd}[
every matrix/.append style = {name=p},
remember picture, overlay,
@@ -954,6 +1088,7 @@
a_0 \arrow[r, "\backslash c_0"] \arrow[d] & a_1 \arrow[r, "\backslash c_1"] \arrow[d] & a_2 \arrow[r, dashed] \arrow[d] & a_n \arrow[d] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
+\end{center}
\begin{tikzpicture}[
remember picture, overlay,
E/.style = {ellipse, draw=blue, dashed,
@@ -961,25 +1096,32 @@
]
\node[E = (p-1-1) (p-2-1)] {};
\node[E = (p-1-4) (p-2-4)] {};
+\node[E = (p-1-2) (p-2-2)] {};
+\node[E = (p-1-3) (p-2-3)] {};
\end{tikzpicture}
-\end{center}
+\end{figure}
\vspace{20mm}
\noindent
-On the other hand, $v_0$ also encodes the correct lexing result, as we have proven for $\lexer$.
-Encircled in the diagram are the two pairs $v_0, a_0$ and $v_n, a_n$, which both
-encode the correct lexical result. Though for the leftmost pair, we have
-the information condensed in $v_0$ the value part, whereas for the rightmost pair,
-the information is concentrated on $a_n$.
-We know that in the intermediate steps the pairs $v_i, a_i$, must in some way encode the complete
-lexing information as well. Therefore, we need a unified approach to extract such lexing result
-from a value $v_i$ and its annotated regular expression $a_i$.
-And the function $f$ must satisfy these requirements:
+We encircle in the diagram all the pairs $v_i, a_i$ to show that these values
+and regular expressions correspond to each other.
+For the leftmost pair, we have that the
+lexical information is condensed in
+$v_0$--the value part, whereas for the rightmost pair,
+the lexing result is in the bitcodes of $a_n$.
+$\bmkeps$ is able to extract from $a_n$ the result
+by looking for nullable parts of the regular expression,
+however for regular expressions $a_i$ in general
+they might not necessarily be nullable and therefore
+needs some ``help'' finding the POSIX bit-encoding.
+The most straightforward ``help'' comes from $a_i$'s corresponding
+value $v_i$, and this suggests a function $f$ satisfying the
+following properties:
\begin{itemize}
\item
- $f \; a_i\;v_i = f \; a_n \; v_n = \decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$
+ $f \; a_i\;v_i = f \; a_n \; v_n = \bmkeps \; a_n$%\decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$
\item
- $f \; a_i\;v_i = f \; a_0 \; v_0 = v_0 = \decode \;(\code \; v_0) \; (\erase \; a_0)$
+ $f \; a_i\;v_i = f \; a_0 \; v_0 = \decode \;(\code \; v_0) \; (\erase \; a_0)$
\end{itemize}
\noindent
If we factor out the common part $\decode \; \_ \; (\erase \; a_0)$,