lexing: comparison ChengsongTanPhdThesis/Chapters/Bitcoded1.tex

equal deleted inserted replaced

-:e3752aac8ec2
+:dd9dde2d902b
 \end{center}
 \noindent
 The first problem with this algorithm is that
 for the function $\inj$ to work properly
 we cannot destroy the structure of a regular expression,
-and therefore canno simplify aggressively.
+and therefore cannot simplify too aggressively.
 For example,
 \[
 	r + (r + a) \rightarrow r + a
 \]
-cannot be done because that would require
+cannot be applied because that would require
 breaking up the inner alternative.
-The $\lexer$ function therefore only enables
+The $\lexer$ plus $\textit{simp}$ therefore only enables
 same-level de-duplications like
 \[
 	r + r \rightarrow r.
 \]
 Secondly, the algorithm recursively calls $\lexer$ on
 \path   (c1)
 	edge [bend right, dashed] node {} (stack);
 \end{tikzpicture}
-\caption{First Derivative Taken}
+\caption{First derivative taken}
 \end{figure}
 \begin{figure}[H]
 \path   (r1)
 	edge [] node {} (r2);
 \end{tikzpicture}
-\caption{Second Derivative Taken}
+\caption{Second derivative taken}
 \end{figure}
 \noindent
 As the number of derivative steps increase,
 the stack would increase:
 \path   (ldots)
 	edge [bend left, dashed] node {} (stack);
 \path   (5.03, 4.9)
 	edge [bend left, dashed] node {} (stack);
 \end{tikzpicture}
-\caption{More Derivatives Taken}
+\caption{More derivatives taken}
 \end{figure}
 \begin{figure}[H]
 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick]
 \path (r2)
 	edge [dashed, ] node {} (stack);
 \path (rn)
 	edge [dashed, bend left] node {} (stack);
 \end{tikzpicture}
-\caption{Before Injection Phase Starts}
+\caption{Before injection phase starts}
 \end{figure}
 \noindent
 After all derivatives have been taken, the stack grows to a maximum size
 which condenses all the partial lexing information
 (that was originally stored in $r_i, c_{i+1}$ pairs)
 into bitcodes.
 The bitcodes are then carried with the regular
 expression, and augmented or moved around
-as the lexing goes on.
+as the lexing proceeds.
 It becomes unnecessary
 to remember all the
 intermediate expresssions, but only the most recent one
 with this bit-carrying regular expression.
 Annotated regular expressions
 are defined as the following
-Isabelle datatype \footnote{ We use subscript notation to indicate
+Isabelle datatype:\footnote{ We use subscript notation to indicate
 	that the bitcodes are auxiliary information that do not
-interfere with the structure of the regular expressions }:
+interfere with the structure of the regular expressions }
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{a}$ & $::=$  & $\ZERO$\\
 & $\mid$ & $_{bs}\ONE$\\
 & $\mid$ & $_{bs}{\bf c}$\\
 \end{tabular}
 \end{center}
 \noindent
 where $bs$ stands for bit-codes, $a$  for $\mathbf{a}$nnotated regular
 expressions and $as$ for lists of annotated regular expressions.
-The alternative constructor, written, $\sum$, has been generalised to
+The alternative constructor, written $\sum$, has been generalised to
 take a list of annotated regular expressions rather than just two.
 Why is it generalised? This is because when we analyse nested
 alternatives, there can be more than two elements at the same level
 after de-duplication, which can no longer be stored in a binary
 constructor.
 during lexing.
 For example,
 $(_{Z}a+_{S}b) \backslash a$ gives us
 $_{Z}\ONE + \ZERO$, where the $Z$ bitcode will
 later be used to decode that a left branch was
-selected at the time when the part $a+b$ was anallysed by
+selected at the time when the part $a+b$ was analysed by
 derivative.
 \subsection{A Bird's Eye View of the Bit-coded Lexer}
 Before we give the details of the functions and definitions
 related to our
 $\blexer$ (\emph{b}-itcoded lexer), we first provide a high-level
 \end{tikzpicture}
 \caption{A bird's eye view of $\blexer$. The $_{bsi}a_{i}$s stand
 	for the annotated regular expressions in each derivative step.
 	The characters used in each derivative is written as $c_i$.
-The relative size increase reflect the size increase in each derivative step.}
+The size of the derivatives typically increase during each derivative step,
+therefore we draw the larger circles to indicate that.
+The process starts with an internalise phase and concludes with a decoding phase.}
 \end{figure}
 \noindent
 The plain regular expressions
 are first ``lifted'' to an annotated regular expression,
 with the function called \emph{internalise} ($r \rightarrow _{bs}a$).
 The most notable improvements of $\blexer$
 over $\lexer$ are
 \begin{itemize}
 	\item
-		An absence of the second injection phase.
+		the absence of the second injection phase.
 	\item
-		One does not need to store each pair of the
+		one does not need to store each pair of the
 		intermediate regular expressions $_{bs_i}a_i$ and  $c_{i+1}$.
 		The previous annotated regular expression $_{bs_i}a_i$'s information is passed
 		on without loss to its successor $_{bs_{i+1}}a_{i+1}$,
 		and $c_{i+1}$'s information is stored in $L\;r$.\footnote{
 		which will be used during the decode phase, where we use $r$ as
 	a source of information.}
 	\item
-		Finally, simplification works much better--one can maintain correctness
+		simplification works much smoother--one can maintain correctness
 		while applying quite aggressive simplifications.
 \end{itemize}
 %In the next section we will
 %give the operations needed in $\blexer$,
 %with some remarks on the idea behind their definitions.
 		$(_{bs} \sum (a :: as))_\downarrow$ & $\dn$ & $ a_\downarrow + \; (_{[]} \sum as)_\downarrow$\\
 		$( _{bs} a^* )_\downarrow$ & $\dn$ & $(a_\downarrow)^*$
 	\end{tabular}
 \end{center}
 \noindent
-We also abbreviate the $\erase\; a$ operation
+Where we abbreviate the $\erase\; a$ operation
-as $a_\downarrow$, for conciseness.
+as $(a)_\downarrow$, for conciseness.
 Determining whether an annotated regular expression
-contains the empty string in its lauguage requires
+contains the empty string in its language requires
 a dedicated function $\bnullable$.
 In our formalisation this function is defined by simply calling $\erase$
 before $\nullable$.
-$\bnullable$ simply calls $\erase$ before $\nullable$.
 \begin{definition}
 		$\bnullable \; a \dn  \nullable \; (a_\downarrow)$
 \end{definition}
 The function for collecting
 bitcodes from a
 from left to right and creates a bitcode by stitching
 together bitcodes obtained from the children expressions.
 In the case of alternative regular expressions,
 it looks for the leftmost
 $\nullable$ branch
-to visit and ignores other siblings.
+to visit and ignores the other siblings.
 %Whenever there is some bitcodes attached to a
 %node, it returns the bitcodes concatenated with whatever
 %child recursive calls return.
 The only time when $\bmkeps$ creates new bitcodes
 is when it completes a star's iterations by attaching a $S$ to the end of the bitcode
 list it returns.
 The bitcodes extracted by $\bmkeps$ need to be
-$\decode$d (with the guidance of a plain regular expression):
+$\decode$d (with the guidance of a ``plain'' regular expression):
 %\begin{definition}[Bitdecoding of Values]\mbox{}
 \begin{center}
 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
 terminates is marked by $Z$.
 This coding is lossy, as it throws away the information about
 characters, and does not encode the ``boundary'' between two
 sequence values. Moreover, with only the bitcode we cannot even tell
 whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$,
-but this swill not be necessary in our proof.
+but this will not be necessary in our correctness proof.
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
 $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\
 \[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \]
 Then setting $ds$ to be $[]$ and unfolding $\decode$ definition,
 we obtain the property.
 \end{proof}
 \noindent
-Now we define the central part of Sulzmann and Lu's lexing algorithm,
+Now we define the central part of Sulzmann and Lu's second lexing algorithm,
 the $\bder$ function (which stands for \emph{b}itcoded-derivative):
 \begin{center}
 \begin{tabular}{@{}lcl@{}}
 $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
 $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
 \noindent
 The $\bder$ function tells us how regular expressions can be recursively traversed,
 where the bitcodes are augmented and carried around
 when a derivative is taken.
 We give the intuition behind some of the more involved cases in
-$\bder$. \\
+$\bder$.
 For example,
 in the \emph{star} case,
 a derivative of $_{bs}a^*$ means
 that one more star iteration needs to be taken.
 We therefore need to unfold it into a sequence,
 in $a_1$ before collapsing
 it (as it has been fully matched by string prior to $c$),
 and attach the collected bit-codes to the front of $a_2$
 before throwing away $a_1$.
 In the injection-based lexer, $r_1$ is immediately thrown away in
-in the $\textit{if}$ branch,
+the $\textit{if}$ branch,
 the information $r_1$ stores is therefore lost:
 \begin{center}
 	\begin{tabular}{lcl}
 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
 with respect to the input string $s$, and
 then test whether the result is (b)nullable.
 If yes, then extract the bitcodes from the nullable expression,
 and decodes the bitcodes into a lexical value.
 If there does not exists a match between $r$ and $s$ the lexer
-outputs $\None$ indicating a mismatch.\\
+outputs $\None$ indicating a mismatch.
 Ausaf and Urban formally proved the correctness of the $\blexer$, namely
 \begin{property}
 $\blexer \;r \; s = \lexer \; r \; s$.
 \end{property}
 \noindent
 We introduce the proof later, after we give all
 the needed auxiliary functions and definitions.
 \subsection{An Example $\blexer$ Run}
 Before introducing the proof we shall first walk
 through a concrete example of our $\blexer$ calculating the right
-lexical information through bit-coded regular expressions.\\
+lexical information through bit-coded regular expressions.
 Consider the regular expression $(aa)^* \cdot (b+c)$ matching the string $aab$.
 We give again the bird's eye view of this particular example
 in each stage of the algorithm:
 \tikzset{three sided/.style={
 		\path 	(bs)
 			edge [] node {$\decode$} (v);
 	\end{tikzpicture}
-	\caption{$\blexer \;\;\;\; (aa)^*(b+c) \;\;\;\; aab$}
+	\caption{$\blexer$ with the regular expression $(aa)^*(b+c)$ and $aab$}
 \end{figure}
 \noindent
 The one dashed arrow indicates that $_Z(\ONE \cdot (aa)^*)$
 turned into $ZS$ after derivative w.r.t $b$
-is taken, which calls $\bmkeps$ on the nuallable $_Z\ONE\cdot (aa)^*$
+is taken, which calls $\bmkeps$ on the nullable $_Z\ONE\cdot (aa)^*$
-before processing $_Zb+_Sc$.\\
+before processing $_Zb+_Sc$.
 The annotated regular expressions
 would look overwhelming if we explicitly indicate all the
 locations where bitcodes are attached.
 For example,
 $(aa)^*\cdot (b+c)$ would
 look like $_{[]}(_{[]}(_{[]}a \cdot _{[]}a)^*\cdot _{[]}(_{[]}b+_{[]}c))$
 after
 internalise.
 Therefore for readability we omit bitcodes if they are empty.
 This applies to all annotated
-regular expressions in this thesis.\\
+regular expressions in this thesis.
 %and assume we have just read the first character $a$:
 %\begin{center}
 %\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
 %    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
 %	    {$(_{[Z]}(\ONE \cdot a) \cdot (aa)^* )\cdot bc$
 %    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
 %        {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$
 %         \nodepart{two} EOF};
 %\end{tikzpicture}
 %\end{center}
-\noindent
 In the next section we introduce the correctness proof
 found by Ausaf and Urban
 of the bitcoded lexer.
 %-----------------------------------
 %	SUBSECTION 1
 the lexing process, which is not be possible
 using $\lexer$ or $\blexer$ alone.
 \subsubsection{$\textit{Retrieve}$}
 The first function we shall introduce is $\retrieve$.
 Sulzmann and Lu gave its definition, and
-Ausaf and Urban found
+Ausaf and Urban found it
 useful in their mechanised proofs.
 Our bit-coded lexer ``retrieve''s the bitcodes using $\bmkeps$,
 after all the derivatives have been taken:
 \begin{center}
 \begin{tabular}{lcl}
 These pairs all contain adequate information to
 obtain the final lexing result.
 For example, in the leftmost pair the
 lexical information is condensed in
 $v_0$, whereas for the rightmost pair,
-the lexing result hides is in the bitcodes of $a_n$.\\
+the lexing result hides is in the bitcodes of $a_n$.
 The function $\bmkeps$ is able to extract from $a_n$ the result
 by looking for nullable parts of the regular expression.
 However for regular expressions $a_i$ in general,
 they might not necessarily be nullable.
 For those regular expressions that are not nullable,
 The function $\retrieve$ allows connecting
 between the intermediate
 results $a_i$ and $a_{i+1}$ in $\blexer$.
 For plain regular expressions something similar is required.
-\subsection{$\flex$}
+\subsection{The Function $\flex$}
 Ausaf and Urban defined an auxiliary function called $\flex$ for $\lexer$,
 defined as
 \begin{center}
 \begin{tabular}{lcl}
 $\flex \; r \; f \; [] \; v$       &  $=$ &   $f\; v$\\
 $\flex \; r \; f \; (c :: s) \; v$ &  $=$ &   $\flex \; r \; (\lambda v. \, f (\inj \; r\; c\; v)) \;\; s \; v$
 \end{tabular}
 \end{center}
 which accumulates the characters that need to be injected back,
-and does the injection in a stack-like manner (last taken derivative first injected).
+and does the injection in a stack-like manner (the last taken derivative is first injected).
 The function
 $\flex$ can calculate what $\lexer$ calculates, given the input regular expression
 $r$, the identity function $id$, the input string $s$ and the value
 $v_n= \mkeps \; (r\backslash s)$:
 \begin{lemma}
 %----------------------------------------------------------------------------------------
 %	SECTION  correctness proof
 %----------------------------------------------------------------------------------------
 \section{Correctness of the  Bit-coded Algorithm (Without Simplification)}
 We can first show that
-$\flex$ and $\blexer$ calculates the same thing.
+$\flex$ and $\blexer$ calculates the same value.
 \begin{lemma}\label{flex_retrieve}
 	If $\vdash v: (r\backslash s)$, then\\
 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
 \end{lemma}
 \begin{proof}

changeset 638	dd9dde2d902b
parent 624	8ffa28fce271
child 639	80cc6dc4c98b