# HG changeset patch # User Chengsong # Date 1661035757 -3600 # Node ID 9db2500629be353842773c0ce2cdbcac0b286e78 # Parent e0f0a81f907be87ed77adfc7c597a616b1227d35 chap3 almost done diff -r e0f0a81f907b -r 9db2500629be ChengsongTanPhdThesis/Chapters/Bitcoded1.tex --- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Wed Aug 17 22:59:15 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Sat Aug 20 23:49:17 2022 +0100 @@ -307,6 +307,13 @@ which condenses all the partial lexing information (that was originally stored in $r_i, c_{i+1}$ pairs) into bitcodes. +The bitcodes are then carried with the regular +expression, and augmented or moved around +as the lexing goes on. +It becomes unnecessary +to remember all the +intermediate expresssions, but only the most recent one +with this bit-carrying regular expression. Annotated regular expressions are defined as the following Isabelle datatype \footnote{ We use subscript notation to indicate @@ -565,12 +572,23 @@ \end{tabular} \end{center} \noindent -$\bmkeps$ retrieves the value $v$'s -information in the format -of bitcodes, by travelling along the -path of the regular expression that corresponds to a POSIX match, -collecting all the bitcodes, and attaching $S$ to indicate the end of star -iterations. \\ +$\bmkeps$, just like $\mkeps$, +visits a regular expression tree respecting +the POSIX rules. The difference, however, is that +it does not create values, but only bitcodes. +It traverses each child of the sequence regular expression +from left to right and creates a bitcode by stitching +together bitcodes obtained from the children expressions. +In the case of alternative regular expressions, +it looks for the leftmost +$\nullable$ branch +to visit and ignores other siblings. +%Whenever there is some bitcodes attached to a +%node, it returns the bitcodes concatenated with whatever +%child recursive calls return. +The only time when $\bmkeps$ creates new bitcodes +is when it completes a star's iterations by attaching a $S$ to the end of the bitcode +list it returns.\\ The bitcodes extracted by $\bmkeps$ need to be $\decode$d (with the guidance of a plain regular expression): %\begin{definition}[Bitdecoding of Values]\mbox{} @@ -813,115 +831,229 @@ \end{center} \noindent +$\blexer$ first attaches bitcodes to a +plain regular expression, and then do successive derivatives +with respect to the input string $s$, and +then test whether the result is nullable. +If yes, then extract the bitcodes out of the nullable expression, +and decodes the bitcodes into a lexical value. +If there does not exists a match between $r$ and $s$ the lexer +outputs $\None$ indicating a failed lex.\\ Ausaf and Urban formally proved the correctness of the $\blexer$, namely \begin{property} $\blexer \;r \; s = \lexer \; r \; s$. \end{property} +\noindent This was claimed but not formalised in Sulzmann and Lu's work. We introduce the proof later, after we give all the needed auxiliary functions and definitions. -But before this we shall first walk the reader +\subsection{An Example $\blexer$ Run} +Before introducing the proof we shall first walk the reader through a concrete example of our $\blexer$ calculating the right lexical information through bit-coded regular expressions.\\ -Consider the regular expression $(aa)^* \cdot bc$ matching the string $aabc$ -and assume we have just read the first character $a$: -\begin{center} -\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] - \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] - {$\ONE \cdot a \cdot (aa)^* \cdot bc$ - \nodepart{two} $[] \iff \Seq \; (\Stars \; [\Seq\; a \; ??, \;]), ??$}; -\end{tikzpicture} -\end{center} -\noindent -We use the red area for (annotated) regular expressions and the blue -area the (partial) bitcodes and (partial) values. -In the injection-based lexing algorithm, we ``neglect" the red area -by putting all the characters we have consumed and -intermediate regular expressions on the stack when -we go from left to right in the derivative phase. -The red area grows till the string is exhausted. -During the injection phase, the value in the blue area -is built up incrementally, while the red area shrinks. -Before we have recovered all characters and intermediate -derivative regular expressions from the stack, -what values these characters and regular expressions correspond -to are unknown: -\begin{center} -\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] - \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},] - {$(\ONE \cdot \ONE) \cdot (aa)^* \cdot bc $ correspond to:$???$ - \nodepart{two} $b c$ corresponds to $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$}; -%\caption{term 1 \ref{term:1}'s matching configuration} -\end{tikzpicture} -\end{center} +Consider the regular expression $(aa)^* \cdot (b+c)$ matching the string $aab$. +We give again the bird's eye view of this particular example +in each stage of the algorithm: + +\tikzset{three sided/.style={ + draw=none, + append after command={ + [-,shorten <= -0.5\pgflinewidth] + ([shift={(-1.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north east) + edge([shift={( 0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north west) + ([shift={( 0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north west) + edge([shift={( 0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south west) + ([shift={( 0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south west) + edge([shift={(-1.0\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south east) + } + } +} + +\tikzset{three sided1/.style={ + draw=none, + append after command={ + [-,shorten <= -0.5\pgflinewidth] + ([shift={(1.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north west) + edge([shift={(-0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north east) + ([shift={(-0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north east) + edge([shift={(-0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south east) + ([shift={(-0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south east) + edge([shift={(1.0\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south west) + } + } +} + +\begin{figure}[H] + \begin{tikzpicture}[->, >=stealth', shorten >= 1pt, auto, thick] + \node [rectangle, draw] (r) at (-6, -1) {$(aa)^*(b+c)$}; + \node [rectangle, draw] (a) at (-6, 4) {$(aa)^*(_{Z}b + _{S}c)$}; + \path (r) + edge [] node {$\internalise$} (a); + \node [rectangle, draw] (a1) at (-3, 1) {$(_{Z}(\ONE \cdot a) \cdot (aa)^*) (_{Z}b + _Sc)$}; + \path (a) + edge [] node {$\backslash a$} (a1); + + \node [rectangle, draw, three sided] (a21) at (-2.5, 4) {$(_{Z}\ONE \cdot (aa)^*)$}; + \node [rectangle, draw, three sided1] (a22) at (-0.8, 4) {$(_{Z}b + _{S}c)$}; + \path (a1) + edge [] node {$\backslash a$} (a21); + \node [rectangle, draw] (a3) at (0.5, 2) {$_{ZS}(_{Z}\ONE + \ZERO)$}; + \path (a22) + edge [] node {$\backslash b$} (a3); + \path (a21) + edge [dashed, bend right] node {} (a3); + \node [rectangle, draw] (bs) at (2, 4) {$ZSZ$}; + \path (a3) + edge [below] node {$\bmkeps$} (bs); + \node [rectangle, draw] (v) at (3, 0) {$\Seq \; (\Stars\; [\Seq \; a \; a]) \; (\Left \; b)$}; + \path (bs) + edge [] node {$\decode$} (v); + + + \end{tikzpicture} + \caption{$\blexer \;\;\;\; (aa)^*(b+c) \;\;\;\; aab$} +\end{figure} \noindent -However, they should be calculable, -as characters and regular expression shapes -after taking derivative w.r.t those characters -have already been known, therefore in our example, -we know that the value starts with two $a$s, -and makes up to an iteration in a Kleene star: -(We have put the injection-based lexing's partial -result in the right part of the split rectangle -to contrast it with the partial valued produced -in a forward manner) -\begin{center} -\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] - \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] - {$\stackrel{Bitcoded}{\longrightarrow} \Seq(\Stars[\Char(a), \Char(a)], ???)$ - \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$ $\stackrel{Inj}{\longleftarrow}$}; -%\caption{term 1 \ref{term:1}'s matching configuration} -\end{tikzpicture} -\end{center} +The one dashed arrow indicates that $_Z(\ONE \cdot (aa)^*)$ +turned into $ZS$ after derivative w.r.t $b$ +is taken, which calls $\bmkeps$ on the nuallable $_Z\ONE\cdot (aa)^*$ +before processing $_Zb+_Sc$.\\ +The annotated regular expressions +would look too cumbersome if we explicitly indicate all the +locations where bitcodes are attached. +For example, +$(aa)^*\cdot (b+c)$ would +look like $_{[]}(_{[]}(_{[]}a \cdot _{[]}a)^*\cdot _{[]}(_{[]}b+_{[]}c))$ +after +internalise. +Therefore for readability we omit bitcodes if they are empty. +This applies to all example annotated +regular expressions in this thesis.\\ +%and assume we have just read the first character $a$: +%\begin{center} +%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] +% \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] +% {$(_{[Z]}(\ONE \cdot a) \cdot (aa)^* )\cdot bc$ +% \nodepart{two} $[Z] \iff \Seq \; (\Stars \; [\Seq\; a \; ??, \;??]) \; ??$}; +%\end{tikzpicture} +%\end{center} +%\noindent +%We use the red area for (annotated) regular expressions and the blue +%area the (partially calculated) bitcodes +%and its corresponding (partial) values. +%The first derivative +%generates a $Z$ bitcode to indicate +%a new iteration has been started. +%This bitcode is attached to the front of +%the unrolled star iteration $\ONE\cdot a$ +%for later decoding. +%\begin{center} +%\begin{tikzpicture}[] +% \node [rectangle split, rectangle split horizontal, +% rectangle split parts=2, rectangle split part fill={red!30,blue!20}, draw, rounded corners, inner sep=10pt] +% (der2) at (0,0) +% {$(_{[Z]}(\ONE \cdot \ONE) \cdot (aa)^*) \cdot bc $ +% \nodepart{two} $[Z] \iff \Seq \; (\Stars \; [\Seq \; a\;a, ??]) \; ??$}; +% +%\node [draw=none, minimum size = 0.1, ] (r) at (-7, 0) {$a_1$}; +%\path +% (r) +% edge [->, >=stealth',shorten >=1pt, above] node {$\backslash a$} (der2); +%%\caption{term 1 \ref{term:1}'s matching configuration} +%\end{tikzpicture} +%\end{center} +%\noindent +%After we take derivative with respect to +%second input character $a$, the annotated +%regular expression has the second $a$ chopped off. +%The second derivative does not involve any +%new bitcodes being generated, because +%there are no new iterations or bifurcations +%in the regular expression requiring any $S$ or $Z$ marker +%to indicate choices. +%\begin{center} +%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] +% \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] +% {$(_{[Z]}(\ONE \cdot \ONE) \cdot (aa)^*) \cdot (\ONE \cdot c) $ +% \nodepart{two} $[Z] \iff \Seq \; (\Stars \; [\Seq \; a\;a, ??]) \; ??$}; +%%\caption{term 1 \ref{term:1}'s matching configuration} +%\end{tikzpicture} +%\end{center} +%\noindent +% +% +%\begin{center} +%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] +% \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] +% {$\stackrel{Bitcoded}{\longrightarrow} \Seq(\Stars[\Char(a), \Char(a)], ???)$ +% \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$ $\stackrel{Inj}{\longleftarrow}$}; +%%\caption{term 1 \ref{term:1}'s matching configuration} +%\end{tikzpicture} +%\end{center} +%\noindent +%If we do this kind of "attachment" +%and each time augment the attached partially +%constructed value when taking off a +%character: +%\begin{center} +%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] +% \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] (spPoint) +% {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$ +% \nodepart{two} Remaining: $b c$}; +%\end{tikzpicture}\\ +%$\downarrow$\\ +%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] +% \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] +% {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$ +% \nodepart{two} Remaining: $c$}; +%\end{tikzpicture}\\ +%$\downarrow$\\ +%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] +% \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] +% {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$ +% \nodepart{two} EOF}; +%\end{tikzpicture} +%\end{center} \noindent -If we do this kind of "attachment" -and each time augment the attached partially -constructed value when taking off a -character: -\begin{center} -\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] - \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] (spPoint) - {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$ - \nodepart{two} Remaining: $b c$}; -\end{tikzpicture}\\ -$\downarrow$\\ -\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] - \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] - {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$ - \nodepart{two} Remaining: $c$}; -\end{tikzpicture}\\ -$\downarrow$\\ -\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] - \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] - {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$ - \nodepart{two} EOF}; -\end{tikzpicture} -\end{center} -\noindent -In the end we could recover the value without a backward phase. -But (partial) values are a bit clumsy to stick together with a regular expression, so -we instead use bit-codes to encode them. - -Bits and bitcodes (lists of bits) are defined as: -\begin{center} - $b ::= S \mid Z \qquad -bs ::= [] \mid b::bs -$ -\end{center} - -\noindent - +In the next section we introduce the correctness proof +found by Ausaf and Urban +of the bitcoded lexer. %----------------------------------- % SUBSECTION 1 %----------------------------------- -\section{Specifications of Some Helper Functions} +\section{Correctness Proof of $\textit{Blexer}$} +Why is $\blexer$ correct? +In other words, why is it the case that +$\blexer$ outputs the same value as $\lexer$? +Intuitively, +that is because +\begin{itemize} + \item + $\blexer$ follows an almost identical + path as that of $\lexer$, + for example $r_1, r_2, \ldots$ and $a_1, a_2, \ldots$ being produced, + which are the same up to the application of $\erase$. + \item + The bit-encodings work properly, + allowing the possibility of + pulling out the right lexical value from an + annotated regular expression at + any stage of the algorithm. +\end{itemize} +We will elaborate on this, with the help of +some helper functions such as $\retrieve$ and +$\flex$. +\subsection{Specifications of Some Helper Functions} The functions we introduce will give a more detailed glimpse into -the lexing process, which might not be possible -using $\lexer$ or $\blexer$ themselves. -The first function we shall look at is $\retrieve$. -\subsection{$\textit{Retrieve}$} -Our bit-coded lexer "retrieve"s the bitcodes using $\bmkeps$ -after we finished doing all the derivatives: +the lexing process, which is not be possible +using $\lexer$ or $\blexer$ alone. +\subsubsection{$\textit{Retrieve}$} +The first function we shall introduce is $\retrieve$. +Sulzmann and Lu gave its definition, and +Ausaf and Urban found +its usage in mechanised proofs. +Our bit-coded lexer ``retrieve''s the bitcodes using $\bmkeps$, +after all the derivatives has been taken: \begin{center} \begin{tabular}{lcl} & & $\ldots$\\ @@ -931,22 +1063,24 @@ \end{tabular} \end{center} \noindent -Recall that $\bmkeps$ looks for the leftmost branch of an alternative -and completes a star's iterations by attaching a $Z$ at the end of the bitcodes -extracted. It "retrieves" a sequence by visiting both children and then stitch together -two bitcodes using concatenation. After the entire tree structure of the regular -expression has been traversed using the above manner, we get a bitcode encoding the -lexing result. +$\bmkeps$ retrieves the value $v$'s +information in the format +of bitcodes, by travelling along the +path of the regular expression that corresponds to a POSIX match, +collecting all the bitcodes. We know that this "retrieved" bitcode leads to the correct value after decoding, -which is $v_0$ in the bird's eye view of the injection-based lexing diagram. -Now assume we keep every other data structure in the diagram \ref{InjFigure}, -and only replace all the plain regular expression by their annotated counterparts, -computed during a $\blexer$ run. -Then we obtain a diagram for the annotated regular expression derivatives and -their corresponding values, though the values are never calculated in $\blexer$. -We have that $a_n$ contains all the lexing result information. +which is $v_0$ in the injection-based lexing diagram. +As an observation we pointed at the beginning of this section, +the annotated regular expressions generated in successive derivative steps +in $\blexer$ after $\erase$ has the same structure +as those appeared in $\lexer$. +We redraw the diagram below to visualise this fact. +We pretend that all the values are +ready despite they are only calculated in $\lexer$. +In general we have $\vdash v_i:(a_i)_\downarrow$. \vspace{20mm} -\begin{center}%\label{graph:injLexer} +\begin{figure}[H]%\label{graph:injLexer} +\begin{center} \begin{tikzcd}[ every matrix/.append style = {name=p}, remember picture, overlay, @@ -954,6 +1088,7 @@ a_0 \arrow[r, "\backslash c_0"] \arrow[d] & a_1 \arrow[r, "\backslash c_1"] \arrow[d] & a_2 \arrow[r, dashed] \arrow[d] & a_n \arrow[d] \\ v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] \end{tikzcd} +\end{center} \begin{tikzpicture}[ remember picture, overlay, E/.style = {ellipse, draw=blue, dashed, @@ -961,25 +1096,32 @@ ] \node[E = (p-1-1) (p-2-1)] {}; \node[E = (p-1-4) (p-2-4)] {}; +\node[E = (p-1-2) (p-2-2)] {}; +\node[E = (p-1-3) (p-2-3)] {}; \end{tikzpicture} -\end{center} +\end{figure} \vspace{20mm} \noindent -On the other hand, $v_0$ also encodes the correct lexing result, as we have proven for $\lexer$. -Encircled in the diagram are the two pairs $v_0, a_0$ and $v_n, a_n$, which both -encode the correct lexical result. Though for the leftmost pair, we have -the information condensed in $v_0$ the value part, whereas for the rightmost pair, -the information is concentrated on $a_n$. -We know that in the intermediate steps the pairs $v_i, a_i$, must in some way encode the complete -lexing information as well. Therefore, we need a unified approach to extract such lexing result -from a value $v_i$ and its annotated regular expression $a_i$. -And the function $f$ must satisfy these requirements: +We encircle in the diagram all the pairs $v_i, a_i$ to show that these values +and regular expressions correspond to each other. +For the leftmost pair, we have that the +lexical information is condensed in +$v_0$--the value part, whereas for the rightmost pair, +the lexing result is in the bitcodes of $a_n$. +$\bmkeps$ is able to extract from $a_n$ the result +by looking for nullable parts of the regular expression, +however for regular expressions $a_i$ in general +they might not necessarily be nullable and therefore +needs some ``help'' finding the POSIX bit-encoding. +The most straightforward ``help'' comes from $a_i$'s corresponding +value $v_i$, and this suggests a function $f$ satisfying the +following properties: \begin{itemize} \item - $f \; a_i\;v_i = f \; a_n \; v_n = \decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$ + $f \; a_i\;v_i = f \; a_n \; v_n = \bmkeps \; a_n$%\decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$ \item - $f \; a_i\;v_i = f \; a_0 \; v_0 = v_0 = \decode \;(\code \; v_0) \; (\erase \; a_0)$ + $f \; a_i\;v_i = f \; a_0 \; v_0 = \decode \;(\code \; v_0) \; (\erase \; a_0)$ \end{itemize} \noindent If we factor out the common part $\decode \; \_ \; (\erase \; a_0)$,