lexing: comparison ChengsongTanPhdThesis/Chapters/Bitcoded1.tex

-:e0f0a81f907b
+:9db2500629be
 Sulzmann and Lu chose to  define a new datatype
 called \emph{annotated regular expression},
 which condenses all the partial lexing information
 (that was originally stored in $r_i, c_{i+1}$ pairs)
 into bitcodes.
+The bitcodes are then carried with the regular
+expression, and augmented or moved around
+as the lexing goes on.
+It becomes unnecessary
+to remember all the
+intermediate expressions; only the most recent
+bit-carrying regular expression needs to be kept.
 Annotated regular expressions
 are defined as the following
 Isabelle datatype\footnote{We use subscript notation to indicate
 	that the bitcodes are auxiliary information that does not
 interfere with the structure of the regular expressions.}:
 $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
 $bs \,@\, [Z]$
 \end{tabular}
 \end{center}
 \noindent
-$\bmkeps$ retrieves the value $v$'s
+$\bmkeps$, just like $\mkeps$,
-information in the format
+visits a regular expression tree respecting
-of bitcodes, by travelling along the
+the POSIX rules. The difference, however, is that
-path of the regular expression that corresponds to a POSIX match,
+it does not create values, but only bitcodes.
-collecting all the bitcodes, and attaching $S$ to indicate the end of star
+It traverses each child of the sequence regular expression
-iterations. \\
+from left to right and creates a bitcode by stitching
+together bitcodes obtained from the children expressions.
+In the case of alternative regular expressions,
+it looks for the leftmost
+$\nullable$ branch
+to visit and ignores other siblings.
+%Whenever there is some bitcodes attached to a
+%node, it returns the bitcodes concatenated with whatever
+%child recursive calls return.
+The only time when $\bmkeps$ creates new bitcodes
+is when it completes a star's iterations by attaching an $S$ to the end of the bitcode
+list it returns.\\
 The bitcodes extracted by $\bmkeps$ need to be
 $\decode$d (with the guidance of a plain regular expression):
 %\begin{definition}[Bitdecoding of Values]\mbox{}
 \begin{center}
 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
 & & $\;\;\textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
 \noindent
+$\blexer$ first attaches bitcodes to a
+plain regular expression, then takes successive derivatives
+with respect to the input string $s$, and
+finally tests whether the result is nullable.
+If it is, $\blexer$ extracts the bitcodes from the nullable expression
+and decodes them into a lexical value.
+If there is no match between $r$ and $s$, the lexer
+outputs $\None$, indicating a failed lex.\\
 Ausaf and Urban formally proved the correctness of the $\blexer$, namely
 \begin{property}
 $\blexer \;r \; s = \lexer \; r \; s$.
 \end{property}
+\noindent
 This was claimed but not formalised in Sulzmann and Lu's work.
 We introduce the proof later, after we give all
 the needed auxiliary functions and definitions.
-But before this we shall first walk the reader
+\subsection{An Example $\blexer$ Run}
+Before introducing the proof we shall first walk the reader
 through a concrete example of our $\blexer$ calculating the right
 lexical information through bit-coded regular expressions.\\
-Consider the regular expression $(aa)^* \cdot bc$ matching the string $aabc$
+Consider the regular expression $(aa)^* \cdot (b+c)$ matching the string $aab$.
-and assume we have just read the first character $a$:
+We give again the bird's eye view of this particular example
-\begin{center}
+in each stage of the algorithm:
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
-\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+\tikzset{three sided/.style={
-	    {$\ONE \cdot a \cdot (aa)^* \cdot bc$
+draw=none,
-	    \nodepart{two} $[] \iff \Seq \; (\Stars \; [\Seq\; a \; ??, \;]), ??$};
+append after command={
-\end{tikzpicture}
+[-,shorten <= -0.5\pgflinewidth]
-\end{center}
+([shift={(-1.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north east)
-\noindent
+edge([shift={( 0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north west)
-We use the red area for (annotated) regular expressions and the blue
+([shift={( 0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north west)
-area the (partial) bitcodes and (partial) values.
+edge([shift={( 0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south west)
-In the injection-based lexing algorithm, we ``neglect" the red area
+([shift={( 0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south west)
-by putting all the characters we have consumed and
+edge([shift={(-1.0\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south east)
-intermediate regular expressions on the stack when
+}
-we go from left to right in the derivative phase.
+}
-The red area grows till the string is exhausted.
+}
-During the injection phase, the value in the blue area
-is built up incrementally, while the red area shrinks.
+\tikzset{three sided1/.style={
-Before we have recovered all characters and intermediate
+draw=none,
-derivative regular expressions from the stack,
+append after command={
-what values these characters and regular expressions correspond
+[-,shorten <= -0.5\pgflinewidth]
-to are unknown:
+([shift={(1.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north west)
-\begin{center}
+edge([shift={(-0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north east)
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+([shift={(-0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north east)
-\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},]
+edge([shift={(-0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south east)
-	    {$(\ONE \cdot \ONE) \cdot (aa)^* \cdot bc $ correspond to:$???$
+([shift={(-0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south east)
-\nodepart{two}  $b c$ corresponds to  $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
+edge([shift={(1.0\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south west)
-%\caption{term 1 \ref{term:1}'s matching configuration}
+}
-\end{tikzpicture}
+}
-\end{center}
+}
-\noindent
-However, they should be calculable,
+\begin{figure}[H]
-as characters and regular expression shapes
+	\begin{tikzpicture}[->, >=stealth', shorten >= 1pt, auto, thick]
-after taking derivative w.r.t those characters
+		\node [rectangle, draw] (r) at (-6, -1) {$(aa)^*(b+c)$};
-have already been known, therefore in our example,
+		\node [rectangle, draw] (a) at (-6, 4)	  {$(aa)^*(_{Z}b + _{S}c)$};
-we know that the value starts with two $a$s,
+		\path	(r)
-and makes up to an iteration in a Kleene star:
+			edge [] node {$\internalise$} (a);
-(We have put the injection-based lexing's partial
+		\node [rectangle, draw] (a1) at (-3, 1) {$(_{Z}(\ONE \cdot a) \cdot (aa)^*) (_{Z}b + _Sc)$};
-result in the right part of the split rectangle
+		\path	(a)
-to contrast it with the partial valued produced
+			edge [] node {$\backslash a$} (a1);
-in a forward manner)
-\begin{center}
+		\node [rectangle, draw, three sided] (a21) at (-2.5, 4) {$(_{Z}\ONE \cdot (aa)^*)$};
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+		\node [rectangle, draw, three sided1] (a22) at (-0.8, 4) {$(_{Z}b + _{S}c)$};
-\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+		\path	(a1)
-	    {$\stackrel{Bitcoded}{\longrightarrow} \Seq(\Stars[\Char(a), \Char(a)], ???)$
+			edge [] node {$\backslash a$} (a21);
-	\nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$  $\stackrel{Inj}{\longleftarrow}$};
+		\node [rectangle, draw] (a3) at (0.5, 2) {$_{ZS}(_{Z}\ONE + \ZERO)$};
-%\caption{term 1 \ref{term:1}'s matching configuration}
+		\path	(a22)
-\end{tikzpicture}
+			edge [] node {$\backslash b$} (a3);
-\end{center}
+		\path	(a21)
-\noindent
+			edge [dashed, bend right] node {} (a3);
-If we do this kind of "attachment"
+		\node [rectangle, draw] (bs) at (2, 4) {$ZSZ$};
-and each time augment the attached partially
+		\path	(a3)
-constructed value when taking off a
+			edge [below] node {$\bmkeps$} (bs);
-character:
+		\node [rectangle, draw] (v) at (3, 0) {$\Seq \; (\Stars\; [\Seq \; a \; a]) \; (\Left \; b)$};
-\begin{center}
+		\path 	(bs)
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+			edge [] node {$\decode$} (v);
-	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] (spPoint)
-{$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
-\nodepart{two} Remaining: $b c$};
+	\end{tikzpicture}
-\end{tikzpicture}\\
+	\caption{$\blexer \;\;\;\; (aa)^*(b+c) \;\;\;\; aab$}
-$\downarrow$\\
+\end{figure}
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+\noindent
-\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+The one dashed arrow indicates that $_Z(\ONE \cdot (aa)^*)$
-{$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$
+turned into $ZS$ after the derivative w.r.t.\ $b$
-\nodepart{two} Remaining: $c$};
+is taken, which calls $\bmkeps$ on the nullable $_Z\ONE\cdot (aa)^*$
-\end{tikzpicture}\\
+before processing $_Zb+_Sc$.\\
-$\downarrow$\\
+The annotated regular expressions
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+would look too cumbersome if we explicitly indicate all the
-\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+locations where bitcodes are attached.
-{$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$
+For example,
-\nodepart{two} EOF};
+$(aa)^*\cdot (b+c)$ would
-\end{tikzpicture}
+look like $_{[]}(_{[]}(_{[]}a \cdot _{[]}a)^*\cdot _{[]}(_{[]}b+_{[]}c))$
-\end{center}
+after
-\noindent
+internalise.
-In the end we could recover the value without a backward phase.
+Therefore for readability we omit bitcodes if they are empty.
-But (partial) values are a bit clumsy to stick together with a regular expression, so
+This applies to all example annotated
-we instead use bit-codes to encode them.
+regular expressions in this thesis.\\
+%and assume we have just read the first character $a$:
-Bits and bitcodes (lists of bits) are defined as:
+%\begin{center}
-\begin{center}
+%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
-		$b ::=   S \mid  Z \qquad
+%    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
-bs ::= [] \mid b::bs
+%	    {$(_{[Z]}(\ONE \cdot a) \cdot (aa)^* )\cdot bc$
-$
+%	    \nodepart{two} $[Z] \iff \Seq \; (\Stars \; [\Seq\; a \; ??, \;??]) \; ??$};
-\end{center}
+%\end{tikzpicture}
+%\end{center}
-\noindent
+%\noindent
+%We use the red area for (annotated) regular expressions and the blue
+%area the (partially calculated) bitcodes
+%and its corresponding (partial) values.
+%The first derivative
+%generates a $Z$ bitcode to indicate
+%a new iteration has been started.
+%This bitcode is attached to the front of
+%the unrolled star iteration $\ONE\cdot a$
+%for later decoding.
+%\begin{center}
+%\begin{tikzpicture}[]
+%    \node [rectangle split, rectangle split horizontal,
+%	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, draw, rounded corners, inner sep=10pt]
+%	    (der2) at (0,0)
+%	    {$(_{[Z]}(\ONE \cdot \ONE) \cdot (aa)^*) \cdot bc $
+%	    \nodepart{two} $[Z] \iff \Seq \; (\Stars \; [\Seq \; a\;a, ??]) \; ??$};
+%
+%\node [draw=none, minimum size = 0.1, ] (r) at (-7, 0) {$a_1$};
+%\path
+%	(r)
+%	edge [->, >=stealth',shorten >=1pt, above] node {$\backslash a$} (der2);
+%%\caption{term 1 \ref{term:1}'s matching configuration}
+%\end{tikzpicture}
+%\end{center}
+%\noindent
+%After we take derivative with respect to
+%second input character $a$, the annotated
+%regular expression has the second $a$ chopped off.
+%The second derivative does not involve any
+%new bitcodes being generated, because
+%there are no new iterations or bifurcations
+%in the regular expression requiring any $S$ or $Z$ marker
+%to indicate choices.
+%\begin{center}
+%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+%    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+%	    {$(_{[Z]}(\ONE \cdot \ONE) \cdot (aa)^*) \cdot (\ONE \cdot c) $
+%	    \nodepart{two} $[Z] \iff \Seq \; (\Stars \; [\Seq \; a\;a, ??]) \; ??$};
+%%\caption{term 1 \ref{term:1}'s matching configuration}
+%\end{tikzpicture}
+%\end{center}
+%\noindent
+%
+%
+%\begin{center}
+%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+%    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+%	    {$\stackrel{Bitcoded}{\longrightarrow} \Seq(\Stars[\Char(a), \Char(a)], ???)$
+%	\nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$  $\stackrel{Inj}{\longleftarrow}$};
+%%\caption{term 1 \ref{term:1}'s matching configuration}
+%\end{tikzpicture}
+%\end{center}
+%\noindent
+%If we do this kind of "attachment"
+%and each time augment the attached partially
+%constructed value when taking off a
+%character:
+%\begin{center}
+%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+%	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] (spPoint)
+%        {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
+%         \nodepart{two} Remaining: $b c$};
+%\end{tikzpicture}\\
+%$\downarrow$\\
+%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+%    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+%        {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$
+%         \nodepart{two} Remaining: $c$};
+%\end{tikzpicture}\\
+%$\downarrow$\\
+%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+%    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+%        {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$
+%         \nodepart{two} EOF};
+%\end{tikzpicture}
+%\end{center}
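+\noindent
+As a sketch of the last two steps in the figure, and assuming the
+convention suggested there that $Z$ opens a star iteration (or selects a
+left alternative) while $S$ closes a star (or selects a right alternative),
+the bitcode $ZSZ$ can be read off as follows:
+\begin{center}
+\begin{tabular}{lcl}
+$\bmkeps\,(_{ZS}(_{Z}\ONE + \ZERO))$ & $=$ & $[Z,S] \,@\, \bmkeps\,(_{Z}\ONE)$\\
+& $=$ & $[Z,S] \,@\, [Z]$\\
+& $=$ & $ZSZ$
+\end{tabular}
+\end{center}
+\noindent
+Decoding $ZSZ$ against $(aa)^*\cdot(b+c)$ would then use the first $Z$
+to start the iteration $\Seq\;a\;a$ (which itself needs no bits),
+the $S$ to end the star, and the final $Z$ to choose $\Left\;b$.\\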
+\noindent
+In the next section we introduce the correctness proof
+found by Ausaf and Urban
+of the bitcoded lexer.
 %-----------------------------------
 %	SUBSECTION 1
 %-----------------------------------
-\section{Specifications of Some Helper Functions}
+\section{Correctness Proof of $\textit{Blexer}$}
+Why is $\blexer$ correct?
+In other words, why is it the case that
+$\blexer$ outputs the same value as $\lexer$?
+Intuitively,
+that is because
+\begin{itemize}
+	\item
+		$\blexer$ follows almost the same
+		path as $\lexer$:
+		the regular expressions $r_1, r_2, \ldots$ and $a_1, a_2, \ldots$ they produce
+		are the same up to the application of $\erase$.
+	\item
+		The bit-encodings work properly,
+		allowing the possibility of
+		pulling out the right lexical value from an
+		annotated regular expression at
+		any stage of the algorithm.
+\end{itemize}
+We will elaborate on this, with the help of
+some helper functions such as $\retrieve$ and
+$\flex$.
+\subsection{Specifications of Some Helper Functions}
 The functions we introduce will give a more detailed glimpse into
-the lexing process, which might not be possible
+the lexing process, which is not possible
-using $\lexer$ or $\blexer$ themselves.
+using $\lexer$ or $\blexer$ alone.
-The first function we shall look at is $\retrieve$.
+\subsubsection{$\textit{Retrieve}$}
-\subsection{$\textit{Retrieve}$}
+The first function we shall introduce is $\retrieve$.
-Our bit-coded lexer "retrieve"s the bitcodes using $\bmkeps$
+Sulzmann and Lu gave its definition, and
-after we finished doing all the derivatives:
+Ausaf and Urban found
+a use for it in mechanised proofs.
+Our bit-coded lexer ``retrieve''s the bitcodes using $\bmkeps$,
+after all the derivatives have been taken:
 \begin{center}
 \begin{tabular}{lcl}
 	& & $\ldots$\\
 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
 & & $\ldots$
 \end{tabular}
 \end{center}
 \noindent
-Recall that $\bmkeps$ looks for the leftmost branch of an alternative
+$\bmkeps$ retrieves the value $v$'s
-and completes a star's iterations by attaching a $Z$ at the end of the bitcodes
+information in the format
-extracted. It "retrieves" a sequence by visiting both children and then stitch together
+of bitcodes, by travelling along the
-two bitcodes using concatenation. After the entire tree structure of the regular
+path of the regular expression that corresponds to a POSIX match,
-expression has been traversed using the above manner, we get a bitcode encoding the
+collecting all the bitcodes.
-lexing result.
 We know that this ``retrieved'' bitcode leads to the correct value after decoding,
-which is $v_0$ in the bird's eye view of the injection-based lexing diagram.
+which is $v_0$ in the injection-based lexing diagram.
-Now assume we keep every other data structure in the diagram \ref{InjFigure},
+As we observed at the beginning of this section,
-and only replace all the plain regular expression by their annotated counterparts,
+the annotated regular expressions generated in successive derivative steps
-computed during a $\blexer$ run.
+in $\blexer$ have, after $\erase$, the same structure
-Then we obtain a diagram for the annotated regular expression derivatives and
+as those that appear in $\lexer$.
-their corresponding values, though the values are never calculated in $\blexer$.
+We redraw the diagram below to visualise this fact.
-We have that $a_n$ contains all the lexing result information.
+We pretend that all the values are
+ready, even though they are only calculated in $\lexer$.
+In general we have $\vdash v_i:(a_i)_\downarrow$.
 \vspace{20mm}
-\begin{center}%\label{graph:injLexer}
+\begin{figure}[H]%\label{graph:injLexer}
+\begin{center}
 \begin{tikzcd}[
 	every matrix/.append style = {name=p},
 	remember picture, overlay,
 	]
 	a_0 \arrow[r, "\backslash c_0"]  \arrow[d] & a_1 \arrow[r, "\backslash c_1"] \arrow[d] & a_2 \arrow[r, dashed] \arrow[d] & a_n \arrow[d] \\
 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
 \end{tikzcd}
+\end{center}
 \begin{tikzpicture}[
 	remember picture, overlay,
 E/.style = {ellipse, draw=blue, dashed,
 inner xsep=4mm,inner ysep=-4mm, rotate=90, fit=#1}
 ]
 \node[E = (p-1-1) (p-2-1)] {};
 \node[E = (p-1-4) (p-2-4)] {};
+\node[E = (p-1-2) (p-2-2)] {};
+\node[E = (p-1-3) (p-2-3)] {};
 \end{tikzpicture}
-\end{center}
+\end{figure}
 \vspace{20mm}
 \noindent
-On the other hand, $v_0$ also encodes the correct lexing result, as we have proven for $\lexer$.
+We encircle in the diagram  all the pairs $v_i, a_i$ to show that these values
-Encircled in the diagram  are the two pairs $v_0, a_0$ and $v_n, a_n$, which both
+and regular expressions correspond to each other.
-encode the correct lexical result. Though for the leftmost pair, we have
+For the leftmost pair, we have that the
-the information condensed in $v_0$ the value part, whereas for the rightmost pair,
+lexical information is condensed in
-the information is concentrated on $a_n$.
+$v_0$, the value part, whereas for the rightmost pair,
-We know that in the intermediate steps the pairs $v_i, a_i$, must in some way encode the complete
+the lexing result is in the bitcodes of $a_n$.
-lexing information as well. Therefore, we need a unified approach to extract such lexing result
+$\bmkeps$ is able to extract from $a_n$ the result
-from a value $v_i$ and its annotated regular expression $a_i$.
+by looking for nullable parts of the regular expression,
-And the function $f$ must satisfy these requirements:
+however, a regular expression $a_i$ is in general
+not necessarily nullable, so $\bmkeps$
+needs some ``help'' finding the POSIX bit-encoding.
+The most straightforward ``help'' comes from $a_i$'s corresponding
+value $v_i$, and this suggests a function $f$ satisfying the
+following properties:
 \begin{itemize}
 	\item
-		$f \; a_i\;v_i = f \; a_n \; v_n = \decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$
+		$f \; a_i\;v_i = f \; a_n \; v_n = \bmkeps \; a_n$%\decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$
 	\item
-		$f \; a_i\;v_i = f \; a_0 \; v_0 = v_0 = \decode \;(\code \; v_0) \; (\erase \; a_0)$
+		$f \; a_i\;v_i = f \; a_0 \; v_0 = \decode \;(\code \; v_0) \; (\erase \; a_0)$
 \end{itemize}
 \noindent
 If we factor out the common part $\decode \; \_ \; (\erase \; a_0)$,
 the core of the function $f$ is something that produces the bitcodes
 $\code \; v_0$.
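+For the example run of $(aa)^*\cdot(b+c)$ on $aab$ this can be checked by hand.
+As a sketch, assuming the usual clauses of $\code$ (an assumption, as they are
+not repeated here), in which characters contribute no bits, $\Left$
+contributes a $Z$, each star iteration is prefixed by a $Z$ and the end of
+a star is marked by an $S$, we would get
+\begin{center}
+$\code \; (\Seq \; (\Stars\; [\Seq \; a \; a]) \; (\Left \; b))
+\;=\; [Z] \,@\, [] \,@\, [S] \,@\, [Z] \;=\; ZSZ$,
+\end{center}
+\noindent
+which agrees with the bitcode that $\bmkeps$ extracts from the final
+annotated regular expression in that run.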

changeset 581	9db2500629be
parent 580	e0f0a81f907b
child 582	3e19073e91f4