lexing: comparison ChengsongTanPhdThesis/Chapters/Bitcoded1.tex

equal deleted inserted replaced

-:80cc6dc4c98b
+:bd1354127574
 introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of
 derivatives of
 regular expressions.
 We have implemented their algorithm in Scala and Isabelle,
 and found problems
-in their algorithm such as de-duplication not working properly and redundant
+in their algorithm, such as de-duplication not working properly and redundant
 fixpoint construction.
 \section{The Motivation Behind Using Bitcodes}
 Let us give again the definition of $\lexer$ from Chapter \ref{Inj}:
 \begin{center}
 \begin{tabular}{lcl}
 \end{tikzpicture}
 \caption{Second derivative taken}
 \end{figure}
 \noindent
-As the number of derivative steps increase,
+As the number of derivative steps increases,
 the stack would increase:
 \begin{figure}[H]
 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick]
 %\draw (-6,-6) grid (6,6);
 The bitcodes are then carried with the regular
 expression, and augmented or moved around
 as the lexing proceeds.
 It becomes unnecessary
 to remember all the
-intermediate expresssions, but only the most recent one
+intermediate expressions, but only the most recent one
 with this bit-carrying regular expression.
 Annotated regular expressions
 are defined as the following
 Isabelle datatype:\footnote{ We use subscript notation to indicate
-	that the bitcodes are auxiliary information that do not
+	that the bitcodes are auxiliary information that does not
 interfere with the structure of the regular expressions }
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{a}$ & $::=$  & $\ZERO$\\
 & $\mid$ & $_{bs}\ONE$\\
 \end{tikzpicture}
 \caption{A bird's eye view of $\blexer$. The $_{bsi}a_{i}$s stand
 	for the annotated regular expressions in each derivative step.
-	The characters used in each derivative is written as $c_i$.
+	The characters used in each derivative are written as $c_i$.
-The size of the derivatives typically increase during each derivative step,
+The size of the derivatives typically increases during each derivative step.
-therefore we draw the larger circles to indicate that.
 The process starts with an internalise phase and concludes with a decoding phase.}
 \end{figure}
 \noindent
 The plain regular expressions
 are first ``lifted'' to an annotated regular expression,
 \end{center}
 \noindent
 The opposite of $\textit{internalise}$ is
 $\erase$, where all bit-codes are removed,
 and the alternative operator $\sum$ for annotated
-regular expressions is transformed to the binary version
+regular expressions is transformed into the binary version
 in (plain) regular expressions. This can be defined as follows:
 \begin{center}\label{eraseDef}
 	\begin{tabular}{lcl}
 		$\ZERO_\downarrow$ & $\dn$ & $\ZERO$\\
 		$( _{bs}\ONE )_\downarrow$ & $\dn$ & $\ONE$\\
 %because at least one of $\decode'$'s parameters will go down in terms
 %of size.\\
 The reverse operation of $\decode$ is $\code$.
 This function encodes a value into a bitcode by converting
 $\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty
-star iteration by $S$. The border where a star iteration
+star iteration by $S$. $ Z$ marks the border where a star iteration
-terminates is marked by $Z$.
+terminates.
 This coding is lossy, as it throws away the information about
 characters, and does not encode the ``boundary'' between two
 sequence values. Moreover, with only the bitcode we cannot even tell
 whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$,
-but this will not be necessary in our correctness proof.
+but this will not be necessary for our correctness proof.
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
 $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\
 \[If \vdash v : r \; then \;\decode \; (\code \; v) \; r = \textit{Some}(v) \]
 \end{lemma}
 \begin{proof}
 By proving a more general version of the lemma, on $\decode'$:
 \[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \]
-Then setting $ds$ to be $[]$ and unfolding $\decode$ definition,
+Then we obtain the property by setting $ds$ to be $[]$ and unfolding the $\decode$ definition.
-we obtain the property.
 \end{proof}
 \noindent
 Now we define the central part of Sulzmann and Lu's second lexing algorithm,
 the $\bder$ function (which stands for \emph{b}itcoded-derivative):
 \begin{center}
 plain regular expression $r$, and then builds successive derivatives
 with respect to the input string $s$, and
 then test whether the result is (b)nullable.
 If yes, then extract the bitcodes from the nullable expression,
 and decodes the bitcodes into a lexical value.
-If there does not exists a match between $r$ and $s$ the lexer
+If there does not exist a match between $r$ and $s$, the lexer
 outputs $\None$ indicating a mismatch.
 Ausaf and Urban formally proved the correctness of the $\blexer$, namely
 \begin{property}
 $\blexer \;r \; s = \lexer \; r \; s$.
 \end{property}
 We will elaborate on this, with the help of
 some helper functions such as $\retrieve$ and
 $\flex$.
 \subsection{Specifications of Some Helper Functions}
 The functions we introduce will give a more detailed view into
-the lexing process, which is not be possible
+the lexing process, which is not possible
 using $\lexer$ or $\blexer$ alone.
 \subsubsection{$\textit{Retrieve}$}
 The first function we shall introduce is $\retrieve$.
 Sulzmann and Lu gave its definition, and
 Ausaf and Urban found it
 provided that the corresponding values are used respectively:
 \begin{lemma}\label{retrieveStepwise}
 	$\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
 \end{lemma}
 \begin{proof}
-	By induction on $r$, generalising over $v$.
+	We do an induction on $r$, generalising over $v$.
 	The induction principle in our formalisation
 	is function $\erase$'s cases.
 \end{proof}
 \noindent
 Before we move on to the next
 $\flex \; r \; f \; [] \; v$       &  $=$ &   $f\; v$\\
 $\flex \; r \; f \; (c :: s) \; v$ &  $=$ &   $\flex \; r \; (\lambda v. \, f (\inj \; r\; c\; v)) \;\; s \; v$
 \end{tabular}
 \end{center}
 which accumulates the characters that need to be injected back,
-and does the injection in a stack-like manner (the last taken derivative is first injected).
+and does the injection in a stack-like manner (the last character being chopped off in the derivatives phase is first injected).
 The function
 $\flex$ can calculate what $\lexer$ calculates, given the input regular expression
 $r$, the identity function $id$, the input string $s$ and the value
 $v_n= \mkeps \; (r\backslash s)$:
 \begin{lemma}

changeset 640	bd1354127574
parent 639	80cc6dc4c98b
child 649	ef2b8abcbc55