lexing: changeset 581:9db2500629be

--- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Wed Aug 17 22:59:15 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Sat Aug 20 23:49:17 2022 +0100
@@ -307,6 +307,13 @@
 which condenses all the partial lexing information
 (that was originally stored in $r_i, c_{i+1}$ pairs)
 into bitcodes.
+The bitcodes are then carried with the regular
+expression, and augmented or moved around
+as the lexing goes on.
+It becomes unnecessary
+to remember all the 
+intermediate expresssions, but only the most recent one
+with this bit-carrying regular expression.
 Annotated regular expressions 
 are defined as the following 
 Isabelle datatype \footnote{ We use subscript notation to indicate
@@ -565,12 +572,23 @@
 \end{tabular}    
 \end{center}    
 \noindent
-$\bmkeps$ retrieves the value $v$'s
-information in the format
-of bitcodes, by travelling along the
-path of the regular expression that corresponds to a POSIX match,
-collecting all the bitcodes, and attaching $S$ to indicate the end of star
-iterations. \\
+$\bmkeps$, just like $\mkeps$, 
+visits a regular expression tree respecting
+the POSIX rules. The difference, however, is that
+it does not create values, but only bitcodes.
+It traverses each child of the sequence regular expression
+from left to right and creates a bitcode by stitching
+together bitcodes obtained from the children expressions.
+In the case of alternative regular expressions, 
+it looks for the leftmost
+$\nullable$ branch
+to visit and ignores other siblings.
+%Whenever there is some bitcodes attached to a
+%node, it returns the bitcodes concatenated with whatever
+%child recursive calls return.
+The only time when $\bmkeps$ creates new bitcodes
+is when it completes a star's iterations by attaching a $S$ to the end of the bitcode
+list it returns.\\
 The bitcodes extracted by $\bmkeps$ need to be 
 $\decode$d (with the guidance of a plain regular expression):
 %\begin{definition}[Bitdecoding of Values]\mbox{}
@@ -813,115 +831,229 @@
 \end{center}
 
 \noindent
+$\blexer$ first attaches bitcodes to a
+plain regular expression, and then do successive derivatives
+with respect to the input string $s$, and
+then test whether the result is nullable.
+If yes, then extract the bitcodes out of the nullable expression,
+and decodes the bitcodes into a lexical value.
+If there does not exists a match between $r$ and $s$ the lexer
+outputs $\None$ indicating a failed lex.\\
 Ausaf and Urban formally proved the correctness of the $\blexer$, namely
 \begin{property}
 $\blexer \;r \; s = \lexer \; r \; s$.
 \end{property}
+\noindent
 This was claimed but not formalised in Sulzmann and Lu's work.
 We introduce the proof later, after we give all
 the needed auxiliary functions and definitions.
-But before this we shall first walk the reader 
+\subsection{An Example $\blexer$ Run}
+Before introducing the proof we shall first walk the reader 
 through a concrete example of our $\blexer$ calculating the right 
 lexical information through bit-coded regular expressions.\\
-Consider the regular expression $(aa)^* \cdot bc$ matching the string $aabc$
-and assume we have just read the first character $a$:
-\begin{center}
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
-    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
-	    {$\ONE \cdot a \cdot (aa)^* \cdot bc$ 
-	    \nodepart{two} $[] \iff \Seq \; (\Stars \; [\Seq\; a \; ??, \;]), ??$};
-\end{tikzpicture}
-\end{center}
-\noindent
-We use the red area for (annotated) regular expressions and the blue 
-area the (partial) bitcodes and (partial) values.
-In the injection-based lexing algorithm, we ``neglect" the red area
-by putting all the characters we have consumed and
-intermediate regular expressions on the stack when 
-we go from left to right in the derivative phase.
-The red area grows till the string is exhausted.
-During the injection phase, the value in the blue area
-is built up incrementally, while the red area shrinks.
-Before we have recovered all characters and intermediate
-derivative regular expressions from the stack,
-what values these characters and regular expressions correspond 
-to are unknown: 
-\begin{center}
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
-    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},]
-	    {$(\ONE \cdot \ONE) \cdot (aa)^* \cdot bc $ correspond to:$???$
-         \nodepart{two}  $b c$ corresponds to  $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
-%\caption{term 1 \ref{term:1}'s matching configuration}
-\end{tikzpicture}
-\end{center}
+Consider the regular expression $(aa)^* \cdot (b+c)$ matching the string $aab$.
+We give again the bird's eye view of this particular example 
+in each stage of the algorithm:
+
+\tikzset{three sided/.style={
+        draw=none,
+        append after command={
+            [-,shorten <= -0.5\pgflinewidth]
+            ([shift={(-1.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north east)
+        edge([shift={( 0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north west) 
+            ([shift={( 0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north west)
+        edge([shift={( 0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south west)            
+            ([shift={( 0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south west)
+        edge([shift={(-1.0\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south east)
+        }
+    }
+}
+
+\tikzset{three sided1/.style={
+        draw=none,
+        append after command={
+            [-,shorten <= -0.5\pgflinewidth]
+            ([shift={(1.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north west)
+        edge([shift={(-0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north east) 
+            ([shift={(-0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north east)
+        edge([shift={(-0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south east)            
+            ([shift={(-0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south east)
+        edge([shift={(1.0\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south west)
+        }
+    }
+}
+
+\begin{figure}[H]
+	\begin{tikzpicture}[->, >=stealth', shorten >= 1pt, auto, thick]
+		\node [rectangle, draw] (r) at (-6, -1) {$(aa)^*(b+c)$};
+		\node [rectangle, draw] (a) at (-6, 4)	  {$(aa)^*(_{Z}b + _{S}c)$};
+		\path	(r)
+			edge [] node {$\internalise$} (a);
+		\node [rectangle, draw] (a1) at (-3, 1) {$(_{Z}(\ONE \cdot a) \cdot (aa)^*) (_{Z}b + _Sc)$};
+		\path	(a)
+			edge [] node {$\backslash a$} (a1);
+
+		\node [rectangle, draw, three sided] (a21) at (-2.5, 4) {$(_{Z}\ONE \cdot (aa)^*)$};
+		\node [rectangle, draw, three sided1] (a22) at (-0.8, 4) {$(_{Z}b + _{S}c)$};
+		\path	(a1)
+			edge [] node {$\backslash a$} (a21);
+		\node [rectangle, draw] (a3) at (0.5, 2) {$_{ZS}(_{Z}\ONE + \ZERO)$};
+		\path	(a22)
+			edge [] node {$\backslash b$} (a3);
+		\path	(a21)
+			edge [dashed, bend right] node {} (a3);
+		\node [rectangle, draw] (bs) at (2, 4) {$ZSZ$};
+		\path	(a3)
+			edge [below] node {$\bmkeps$} (bs);
+		\node [rectangle, draw] (v) at (3, 0) {$\Seq \; (\Stars\; [\Seq \; a \; a]) \; (\Left \; b)$};
+		\path 	(bs)
+			edge [] node {$\decode$} (v);
+
+
+	\end{tikzpicture}
+	\caption{$\blexer \;\;\;\; (aa)^*(b+c) \;\;\;\; aab$}
+\end{figure}
 \noindent
-However, they should be calculable,
-as characters and regular expression shapes
-after taking derivative w.r.t those characters
-have already been known, therefore in our example,
-we know that the value starts with two $a$s,
-and makes up to an iteration in a Kleene star:
-(We have put the injection-based lexing's partial 
-result in the right part of the split rectangle
-to contrast it with the partial valued produced
-in a forward manner)
-\begin{center}
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
-    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
-	    {$\stackrel{Bitcoded}{\longrightarrow} \Seq(\Stars[\Char(a), \Char(a)], ???)$
-	\nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$  $\stackrel{Inj}{\longleftarrow}$};
-%\caption{term 1 \ref{term:1}'s matching configuration}
-\end{tikzpicture}
-\end{center}
+The one dashed arrow indicates that $_Z(\ONE \cdot (aa)^*)$
+turned into $ZS$ after derivative w.r.t $b$
+is taken, which calls $\bmkeps$ on the nuallable $_Z\ONE\cdot (aa)^*$
+before processing $_Zb+_Sc$.\\
+The annotated regular expressions
+would look too cumbersome if we explicitly indicate all the
+locations where bitcodes are attached.
+For example,
+$(aa)^*\cdot (b+c)$ would 
+look like $_{[]}(_{[]}(_{[]}a \cdot _{[]}a)^*\cdot _{[]}(_{[]}b+_{[]}c))$ 
+after 
+internalise.
+Therefore for readability we omit bitcodes if they are empty. 
+This applies to all example annotated 
+regular expressions in this thesis.\\
+%and assume we have just read the first character $a$:
+%\begin{center}
+%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+%    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+%	    {$(_{[Z]}(\ONE \cdot a) \cdot (aa)^* )\cdot bc$ 
+%	    \nodepart{two} $[Z] \iff \Seq \; (\Stars \; [\Seq\; a \; ??, \;??]) \; ??$};
+%\end{tikzpicture}
+%\end{center}
+%\noindent
+%We use the red area for (annotated) regular expressions and the blue 
+%area the (partially calculated) bitcodes 
+%and its corresponding (partial) values.
+%The first derivative 
+%generates a $Z$ bitcode to indicate
+%a new iteration has been started.
+%This bitcode is attached to the front of
+%the unrolled star iteration $\ONE\cdot a$
+%for later decoding.
+%\begin{center}
+%\begin{tikzpicture}[]
+%    \node [rectangle split, rectangle split horizontal, 
+%	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, draw, rounded corners, inner sep=10pt]
+%	    (der2) at (0,0)
+%	    {$(_{[Z]}(\ONE \cdot \ONE) \cdot (aa)^*) \cdot bc $ 
+%	    \nodepart{two} $[Z] \iff \Seq \; (\Stars \; [\Seq \; a\;a, ??]) \; ??$};
+%
+%\node [draw=none, minimum size = 0.1, ] (r) at (-7, 0) {$a_1$};
+%\path
+%	(r)
+%	edge [->, >=stealth',shorten >=1pt, above] node {$\backslash a$} (der2);
+%%\caption{term 1 \ref{term:1}'s matching configuration}
+%\end{tikzpicture}
+%\end{center}
+%\noindent
+%After we take derivative with respect to 
+%second input character $a$, the annotated
+%regular expression has the second $a$ chopped off.
+%The second derivative does not involve any 
+%new bitcodes being generated, because
+%there are no new iterations or bifurcations
+%in the regular expression requiring any $S$ or $Z$ marker
+%to indicate choices.
+%\begin{center}
+%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+%    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+%	    {$(_{[Z]}(\ONE \cdot \ONE) \cdot (aa)^*) \cdot (\ONE \cdot c) $ 
+%	    \nodepart{two} $[Z] \iff \Seq \; (\Stars \; [\Seq \; a\;a, ??]) \; ??$};
+%%\caption{term 1 \ref{term:1}'s matching configuration}
+%\end{tikzpicture}
+%\end{center}
+%\noindent
+%
+%
+%\begin{center}
+%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+%    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+%	    {$\stackrel{Bitcoded}{\longrightarrow} \Seq(\Stars[\Char(a), \Char(a)], ???)$
+%	\nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$  $\stackrel{Inj}{\longleftarrow}$};
+%%\caption{term 1 \ref{term:1}'s matching configuration}
+%\end{tikzpicture}
+%\end{center}
+%\noindent
+%If we do this kind of "attachment"
+%and each time augment the attached partially
+%constructed value when taking off a 
+%character:
+%\begin{center}
+%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+%	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] (spPoint)
+%        {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
+%         \nodepart{two} Remaining: $b c$};
+%\end{tikzpicture}\\
+%$\downarrow$\\
+%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+%    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+%        {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$
+%         \nodepart{two} Remaining: $c$};
+%\end{tikzpicture}\\
+%$\downarrow$\\
+%\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
+%    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
+%        {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$
+%         \nodepart{two} EOF};
+%\end{tikzpicture}
+%\end{center}
 \noindent
-If we do this kind of "attachment"
-and each time augment the attached partially
-constructed value when taking off a 
-character:
-\begin{center}
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
-	\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] (spPoint)
-        {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
-         \nodepart{two} Remaining: $b c$};
-\end{tikzpicture}\\
-$\downarrow$\\
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
-    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
-        {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$
-         \nodepart{two} Remaining: $c$};
-\end{tikzpicture}\\
-$\downarrow$\\
-\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
-    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
-        {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$
-         \nodepart{two} EOF};
-\end{tikzpicture}
-\end{center}
-\noindent
-In the end we could recover the value without a backward phase.
-But (partial) values are a bit clumsy to stick together with a regular expression, so 
-we instead use bit-codes to encode them.
-
-Bits and bitcodes (lists of bits) are defined as:
-\begin{center}
-		$b ::=   S \mid  Z \qquad
-bs ::= [] \mid b::bs    
-$
-\end{center}
-
-\noindent
-
+In the next section we introduce the correctness proof
+found by Ausaf and Urban
+of the bitcoded lexer.
 %-----------------------------------
 %	SUBSECTION 1
 %-----------------------------------
-\section{Specifications of Some Helper Functions}
+\section{Correctness Proof of $\textit{Blexer}$}
+Why is $\blexer$ correct?
+In other words, why is it the case that 
+$\blexer$ outputs the same value as $\lexer$?
+Intuitively,
+that is because 
+\begin{itemize}
+	\item
+		$\blexer$ follows an almost identical
+		path as that of $\lexer$, 
+		for example $r_1, r_2, \ldots$ and $a_1, a_2, \ldots$ being produced,
+		which are the same up to the application of $\erase$.
+	\item
+		The bit-encodings work properly,
+		allowing the possibility of 
+		pulling out the right lexical value from an
+		annotated regular expression at 
+		any stage of the algorithm.
+\end{itemize}
+We will elaborate on this, with the help of
+some helper functions such as $\retrieve$ and
+$\flex$.
+\subsection{Specifications of Some Helper Functions}
 The functions we introduce will give a more detailed glimpse into 
-the lexing process, which might not be possible
-using $\lexer$ or $\blexer$ themselves.
-The first function we shall look at is $\retrieve$.
-\subsection{$\textit{Retrieve}$}
-Our bit-coded lexer "retrieve"s the bitcodes using $\bmkeps$
-after we finished doing all the derivatives:
+the lexing process, which is not be possible
+using $\lexer$ or $\blexer$ alone.
+\subsubsection{$\textit{Retrieve}$}
+The first function we shall introduce is $\retrieve$.
+Sulzmann and Lu gave its definition, and
+Ausaf and Urban found
+its usage in mechanised proofs.
+Our bit-coded lexer ``retrieve''s the bitcodes using $\bmkeps$,
+after all the derivatives has been taken:
 \begin{center}
 \begin{tabular}{lcl}
 	& & $\ldots$\\
@@ -931,22 +1063,24 @@
 \end{tabular}
 \end{center}
 \noindent
-Recall that $\bmkeps$ looks for the leftmost branch of an alternative
-and completes a star's iterations by attaching a $Z$ at the end of the bitcodes
-extracted. It "retrieves" a sequence by visiting both children and then stitch together 
-two bitcodes using concatenation. After the entire tree structure of the regular 
-expression has been traversed using the above manner, we get a bitcode encoding the 
-lexing result.
+$\bmkeps$ retrieves the value $v$'s
+information in the format
+of bitcodes, by travelling along the
+path of the regular expression that corresponds to a POSIX match,
+collecting all the bitcodes.
 We know that this "retrieved" bitcode leads to the correct value after decoding,
-which is $v_0$ in the bird's eye view of the injection-based lexing diagram.
-Now assume we keep every other data structure in the diagram \ref{InjFigure},
-and only replace all the plain regular expression by their annotated counterparts,
-computed during a $\blexer$ run.
-Then we obtain a diagram for the annotated regular expression derivatives and
-their corresponding values, though the values are never calculated in $\blexer$.
-We have that $a_n$ contains all the lexing result information.
+which is $v_0$ in the injection-based lexing diagram.
+As an observation we pointed at the beginning of this section,
+the annotated regular expressions generated in successive derivative steps
+in $\blexer$ after $\erase$ has the same structure 
+as those appeared in $\lexer$.
+We redraw the diagram below to visualise this fact.
+We pretend that all the values are 
+ready despite they are only calculated in $\lexer$.
+In general we have $\vdash v_i:(a_i)_\downarrow$.
 \vspace{20mm}
-\begin{center}%\label{graph:injLexer}
+\begin{figure}[H]%\label{graph:injLexer}
+\begin{center}
 \begin{tikzcd}[
 	every matrix/.append style = {name=p},
 	remember picture, overlay,
@@ -954,6 +1088,7 @@
 	a_0 \arrow[r, "\backslash c_0"]  \arrow[d] & a_1 \arrow[r, "\backslash c_1"] \arrow[d] & a_2 \arrow[r, dashed] \arrow[d] & a_n \arrow[d] \\
 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
 \end{tikzcd}
+\end{center}
 \begin{tikzpicture}[
 	remember picture, overlay,
 E/.style = {ellipse, draw=blue, dashed,
@@ -961,25 +1096,32 @@
                         ]
 \node[E = (p-1-1) (p-2-1)] {};
 \node[E = (p-1-4) (p-2-4)] {};
+\node[E = (p-1-2) (p-2-2)] {};
+\node[E = (p-1-3) (p-2-3)] {};
 \end{tikzpicture}
 
-\end{center}
+\end{figure}
 \vspace{20mm}
 \noindent
-On the other hand, $v_0$ also encodes the correct lexing result, as we have proven for $\lexer$.
-Encircled in the diagram  are the two pairs $v_0, a_0$ and $v_n, a_n$, which both 
-encode the correct lexical result. Though for the leftmost pair, we have
-the information condensed in $v_0$ the value part, whereas for the rightmost pair,
-the information is concentrated on $a_n$.
-We know that in the intermediate steps the pairs $v_i, a_i$, must in some way encode the complete
-lexing information as well. Therefore, we need a unified approach to extract such lexing result
-from a value $v_i$ and its annotated regular expression $a_i$. 
-And the function $f$ must satisfy these requirements:
+We encircle in the diagram  all the pairs $v_i, a_i$ to show that these values
+and regular expressions correspond to each other.
+For the leftmost pair, we have that the 
+lexical information is condensed in 
+$v_0$--the value part, whereas for the rightmost pair,
+the lexing result is in the bitcodes of $a_n$.
+$\bmkeps$ is able to extract from $a_n$ the result
+by looking for nullable parts of the regular expression,
+however for regular expressions $a_i$ in general
+they might not necessarily be nullable and therefore
+needs some ``help'' finding the POSIX bit-encoding.
+The most straightforward ``help'' comes from $a_i$'s corresponding
+value $v_i$, and this suggests a function $f$ satisfying the
+following properties:
 \begin{itemize}
 	\item
-		$f \; a_i\;v_i = f \; a_n \; v_n = \decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$
+		$f \; a_i\;v_i = f \; a_n \; v_n = \bmkeps \; a_n$%\decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$
 	\item
-		$f \; a_i\;v_i = f \; a_0 \; v_0 = v_0 = \decode \;(\code \; v_0) \; (\erase \; a_0)$
+		$f \; a_i\;v_i = f \; a_0 \; v_0 = \decode \;(\code \; v_0) \; (\erase \; a_0)$
 \end{itemize}
 \noindent
 If we factor out the common part $\decode \; \_ \; (\erase \; a_0)$,
author	Chengsong
	Sat, 20 Aug 2022 23:49:17 +0100
changeset 581	9db2500629be
parent 580	e0f0a81f907b
child 582	3e19073e91f4