%and then give the algorithm and its variant and discuss
%why more aggressive simplifications are needed.

In this chapter, we define the basic notions
for regular languages and regular expressions.
This is essentially a description in ``English''
of our formalisation in Isabelle/HOL.
We also give the definition of what $\POSIX$ lexing means,
followed by a lexing algorithm by Sulzmann and Lu\parencite{Sulzmann2014}
that produces output conforming
to the $\POSIX$ standard.
\footnote{In what follows
we choose to use the Isabelle-style notation
for function applications, where
the parameters of a function are not enclosed
inside a pair of parentheses (e.g. $f \;x \;y$
instead of $f(x,\;y)$). This is mainly
to make the text visually more concise.}

\section{Basic Concepts}
Usually, formal language theory starts with an alphabet
denoting a set of characters.
Here we just use the datatype of characters from Isabelle,
\begin{center}
\begin{tabular}{lcl}
$\textit{s}$ & $\dn$ & $[] \; |\; c :: s$
\end{tabular}
\end{center}
where $c$ is a variable ranging over characters.
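For example, the string $abc$ stands for the list $a\!::\!b\!::\!c\!::\![\,]$.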
Strings can be concatenated to form longer strings in the same
way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
We omit the precise
recursive definition here.
We overload this concatenation operator for two sets of strings:
\begin{tabular}{lcl}
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
\end{tabular}
\end{center}
\noindent
This can be generalised to ``chopping off'' a string
from all strings within a set $A$,
namely:
\begin{center}
\begin{tabular}{lcl}
$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
\end{tabular}
\end{center}
\noindent
which is essentially the left quotient $A \backslash L$ of $A$
by the singleton language $L = \{s\}$
in formal language theory.
However, for the purposes here, the $\textit{Ders}$ definition with
a single string is sufficient.
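To make these operations concrete, the following small Scala sketch
(an illustration with assumed names, not the formalised definitions)
implements $\textit{Der}$ and $\textit{Ders}$ directly, representing
languages as sets of character lists:
\begin{verbatim}
// a sketch: languages as sets of character lists
type Lang = Set[List[Char]]

// Der c A = { s | c::s is in A }
def Der(c: Char, A: Lang): Lang =
  for (s <- A if s.startsWith(List(c))) yield s.tail

// Ders s A = { s' | s @ s' is in A }
def Ders(s: List[Char], A: Lang): Lang =
  for (w <- A if w.startsWith(s)) yield w.drop(s.length)

// e.g. Der('a', Set("abc".toList, "bcd".toList))
//      evaluates to Set("bc".toList)
\end{verbatim}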

The reason for defining derivatives
\]
\end{lemma}
\noindent
This lemma states that if $A$ contains the empty string, $\Der$ can ``pierce'' through it
and get to $B$.
The language derivative of $A*$ can be described using the language derivative
of $A$:
\begin{lemma}
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c \;A) @ (A*)$\\
\end{lemma}
\begin{proof}
There are two inclusions to prove:
\begin{itemize}
\item{$\subseteq$}:\\
The set
\[ \{s \mid c :: s \in A*\} \]
is included in the set
and the rest of the string will
still be in $A*$.
\item{$\supseteq$}:\\
Note that
\[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \]
holds.
Also the following holds:
\[ \Der \; c \; (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
where the $\textit{RHS}$ can be rewritten
as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
which of course contains $(\Der \; c \; A) \,@\, A*$.
\end{itemize}
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ (L \; r_1) @ (L \; r_2)$\\
$L \; (r^*)$ & $\dn$ & $ (L\;r)*$
\end{tabular}
\end{center}
\noindent
Now with language derivatives and the language interpretation of
regular expressions in place, we are ready to define derivatives on regular expressions.
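For the code sketches in the rest of this chapter we assume the
following Scala representation of regular expressions; it simply
mirrors the six constructors used above (the constructor names are
ours, not the formalisation's):
\begin{verbatim}
// a sketch of the regular expression datatype
sealed trait Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches only []
case class CHAR(c: Char) extends Rexp            // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // r*
\end{verbatim}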
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
%Recall, the language derivative acts on a set of strings
%and essentially chops off a particular character from
%all strings in that set, Brzozowski defined a derivative operation on regular expressions
%so that after derivative $L(r\backslash c)$
%will look as if it was obtained by doing a language derivative on $L(r)$:
Recall that the language derivative acts on a
language (set of strings).
One can decide whether a string $s$ belongs
to a language $S$ by taking the derivative of $S$ with respect to
that string and then checking whether the empty
string is in the derivative:
\end{property}
\noindent
Now we give the recursive definition of the derivative on
regular expressions, so that it satisfies the properties above.
The derivative function, written $r\backslash c$,
takes a regular expression $r$ and a character $c$, and
returns a new regular expression whose language is
the language derivative of $L \; r$ with respect to $c$:
\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
$d \backslash c$ & $\dn$ &
contains the empty string, then the second component of the sequence
needs to be considered, as its derivative will contribute to the
result of this derivative:
\begin{center}
\begin{tabular}{lcl}
$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ &
$\textit{if}\;\,([] \in L(r_1))\;
\textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
\end{tabular}
\end{center}
\noindent
Notice how this closely resembles
\Der \; c\; B$\\
& & $\textit{else}\; (\Der \; c \; A) @ B$\\
\end{tabular}
\end{center}
\noindent
The derivative of the star regular expression $r^*$
unwraps one iteration of $r$, turns it into $r\backslash c$,
and attaches the original $r^*$
after $r\backslash c$, so that
we can further unfold it as many times as needed:
\[
(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
\]
Again,
the structure is the same as the language derivative of the Kleene star:
\[
\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c \;A) @ (A*)
\]
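As a small worked example using the sequence and star clauses just given,
the derivative of $(a\cdot b)^*$ with respect to $a$ is
\[
((a\cdot b)^*) \backslash a \;=\; ((a\cdot b)\backslash a)\cdot (a\cdot b)^*
\;=\; (\ONE \cdot b)\cdot (a\cdot b)^*,
\]
where only the $\textit{else}$ branch of the sequence clause applies
because $[] \notin L(a)$.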
In the above definition of $(r_1\cdot r_2) \backslash c$,
the $\textit{if}$ clause's
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
$r \backslash_s [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
When there is no ambiguity, we will
omit the subscript and use $\backslash$ instead
of $\backslash_s$ to denote
string derivatives for brevity.
Brzozowski's regular-expression matching algorithm can then be described as:

\begin{definition}
$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
\end{definition}

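A compact Scala sketch of this matcher (an illustration, not the
Isabelle/HOL definitions) is given below; it reuses the regular
expression datatype sketched earlier and fills in the character and
alternative clauses of $\nullable$ and $\backslash$ in the standard way:
\begin{verbatim}
// nullable r  tests whether [] is in L r
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// the derivative r \ c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// the string derivative r \_s s
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

// match s r = nullable (r \ s)
def matcher(s: String, r: Rexp): Boolean =
  nullable(ders(s.toList, r))

// e.g. matcher("ab", STAR(SEQ(CHAR('a'), CHAR('b')))) == true
\end{verbatim}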
Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$,
this algorithm can be presented graphically as follows:

\begin{equation}\label{graph:successive_ders}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] &
r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] &
\;\textrm{true}/\textrm{false}
\end{tikzcd}
\end{equation}

\noindent
It can be
relatively easily shown that this matcher is correct:
\begin{lemma}
$\textit{match} \; s\; r = \textit{true} \; \textit{iff} \; s \in L(r)$
\end{lemma}
\begin{proof}
By induction on $s$, using the property of derivatives
stated in lemma \ref{derDer}.
\end{proof}
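\noindent
To see the matcher at work on a small example, consider matching the
string $ab$ against $(a\cdot b)^*$. The two derivative steps yield
\[
((a\cdot b)^*) \backslash a = (\ONE\cdot b)\cdot (a\cdot b)^*
\qquad\textrm{and}\qquad
((\ONE\cdot b)\cdot (a\cdot b)^*) \backslash b = (\ZERO\cdot b + \ONE)\cdot (a\cdot b)^*,
\]
and the final regular expression is nullable, so the matcher returns
$\textit{true}$. Note how even this tiny example accumulates ``junk''
such as $\ZERO\cdot b$, which already hints at the performance problem
discussed next.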
\noindent
\begin{center}
\begin{figure}
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
ylabel={time in secs},
ymode = log,
legend pos=north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
\end{axis}
\end{tikzpicture}
\caption{Matching the regular expression $(a^*)^*b$ against strings of the form
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$
using Brzozowski's original algorithm}\label{NaiveMatcher}
\end{figure}
\end{center}
\noindent
If we implement the above algorithm naively, however,
the algorithm can be excruciatingly slow, as shown in
Figure \ref{NaiveMatcher}.
Note that both axes are in logarithmic scale.
Around two dozen characters
are already enough to make the matcher blow up on the regular expression
$(a^*)^*b$.
To improve this situation, we need to introduce simplification
rewrite rules for the intermediate results,
such as $r + r \rightarrow r$,
and make sure those rules do not change the
language of the regular expression.
One simple-minded simplification function
& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
$\simp \; r$ & $\dn$ & $r \quad\quad (\textit{otherwise})$
\end{tabular}
\end{center}
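Below is a Scala sketch of this simplification together with
$\textit{ders}\_\textit{simp}$, which interleaves simplification with
the derivative steps (it builds on the datatype and $\textit{der}$ from
the earlier sketches; the sequence rules and the $r + r \rightarrow r$
rule are filled in in the obvious, language-preserving way):
\begin{verbatim}
// a sketch of the simple-minded simplification
def simp(r: Rexp): Rexp = r match {
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _) => ZERO                 // 0 . r  ~>  0
    case (_, ZERO) => ZERO                 // r . 0  ~>  0
    case (ONE, r2s) => r2s                 // 1 . r  ~>  r
    case (r1s, r2s) => SEQ(r1s, r2s)
  }
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, r2s) => r2s                // 0 + r  ~>  r
    case (r1s, ZERO) => r1s                // r + 0  ~>  r
    case (r1s, r2s) =>
      if (r1s == r2s) r1s else ALT(r1s, r2s)   // r + r  ~>  r
  }
  case _ => r                              // otherwise unchanged
}

// derivative w.r.t. a string, simplifying after each step
def ders_simp(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders_simp(cs, simp(der(c, r)))
}

def matcher_simp(s: String, r: Rexp): Boolean =
  nullable(ders_simp(s.toList, r))
\end{verbatim}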
If we repeatedly apply this simplification
function during the matching algorithm,
against
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ using $\textit{matcher}_{simp}$}\label{BetterMatcher}
\end{figure}
\noindent
The running time of $\textit{ders}\_\textit{simp}$
on the same example as in Figure \ref{NaiveMatcher}
is now ``tame'' in terms of the length of inputs,
as shown in Figure \ref{BetterMatcher}.

So far the story is: use Brzozowski derivatives,
simplify where possible, and at the end test
whether the empty string is recognised by the final derivative.
But what if we want to
do lexing instead of just getting a true/false answer?
Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the definition of the original derivative) solution for this.

\section{Values and the Lexing Algorithm by Sulzmann and Lu}
In this section, we present a two-phase regular expression lexing
The $[]$ case is proven by lemma \ref{mePosix}, and the inductive case
by lemma \ref{injPosix}.
\end{proof}
\noindent
As we did earlier in this chapter on the matcher, one can
introduce simplification on the regular expression.
However, now one needs to do a backward phase and make sure
the values align with the regular expressions.
Therefore one has to
be careful not to break the correctness, as the injection
function heavily relies on the structure of the regular expressions and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications in
each run during the injection phase.
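To illustrate the general idea only (a rough, hypothetical sketch with
invented names, not the definitions used in this thesis), such a
simplification can return, besides the simplified regular expression, a
rectification function that maps values for the simplified regular
expression back to values for the original one; for alternatives this
could look as follows:
\begin{verbatim}
// hypothetical value constructors for alternatives
sealed trait Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val

// simplification returning the simplified regex together with a
// rectification function on values
def simp_rect(r: Rexp): (Rexp, Val => Val) = r match {
  case ALT(r1, r2) =>
    val (r1s, f1) = simp_rect(r1)
    val (r2s, f2) = simp_rect(r2)
    (r1s, r2s) match {
      case (ZERO, _) => (r2s, v => Right(f2(v)))  // value stems from r2
      case (_, ZERO) => (r1s, v => Left(f1(v)))   // value stems from r1
      case _ => (ALT(r1s, r2s), {
        case Left(v) => Left(f1(v))
        case Right(v) => Right(f2(v))
      })
    }
  case _ => (r, v => v)    // remaining cases omitted in this sketch
}
\end{verbatim}
\noindent
The rectification functions recorded in this way can then be applied in
each run of the injection phase, as described above.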
With extra care