lexing: comparison ChengsongTanPhdThesis/Chapters/Inj.tex


-:e3752aac8ec2
+:dd9dde2d902b
 In this chapter, we define the basic notions
 for regular languages and regular expressions.
 This is essentially a description in ``English''
 of the functions and datatypes of our formalisation in Isabelle/HOL.
 We also define what $\POSIX$ lexing means,
-followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014}
+followed by the first lexing algorithm by Sulzmann and Lu \parencite{Sulzmann2014}
 that produces the output conforming
 to the $\POSIX$ standard\footnote{In what follows
 we choose to use the Isabelle-style notation
 for function applications, where
 the parameters of a function are not enclosed
 The reason for defining derivatives
 is that they provide another approach
 to test membership of a string in
 a set of strings.
 For example, to test whether the string
-$bar$ is contained in the set $\{foo, bar, brak\}$, one takes derivative of the set with
+$bar$ is contained in the set $\{foo, bar, brak\}$, one can take the derivative of the set with
 respect to the string $bar$:
 \begin{center}
 \begin{tabular}{lll}
 	$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ &
 	$\{ar, rak\}$ \\
 				 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
 \end{tabular}
 \end{center}
 \noindent
 and in the end, test whether the set
-has the empty string.\footnote{We use the infix notation $A\backslash c$
+contains the empty string.\footnote{We use the infix notation $A\backslash c$
-	instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
+	instead of $\Der \; c \; A$ for brevity, as it will always be
+	clear from the context that we are operating
 on languages rather than regular expressions.}
 In general, if we have a language $S$,
 then we can test whether $s$ is in $S$
 by testing whether $[] \in S \backslash s$.
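The membership test via language derivatives can be sketched in a few lines of Python for finite sets of strings. This is only an illustration and not part of the Isabelle/HOL formalisation; the names \texttt{der\_lang}, \texttt{ders\_lang} and \texttt{member} are made up for this sketch.

```python
def der_lang(c, lang):
    """Derivative of a finite language w.r.t. character c:
    keep the strings starting with c and chop that character off."""
    return {s[1:] for s in lang if s.startswith(c)}


def ders_lang(s, lang):
    """Derivative w.r.t. a string: take character derivatives left to right."""
    for c in s:
        lang = der_lang(c, lang)
    return lang


def member(s, lang):
    """s is in lang iff the empty string is in the derivative of lang w.r.t. s."""
    return "" in ders_lang(s, lang)


S = {"foo", "bar", "brak"}
first_step = der_lang("b", S)  # the first step of the example: {ar, rak}
```

The sketch only works for finite sets given explicitly, which is why the regular-expression version of the derivative is needed for languages in general.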
 We use $\ZERO$ for the regular expression that
 matches no string, and $\ONE$ for the regular
 expression that matches only the empty string.\footnote{
 Some authors
 also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
-but we prefer our notation.}
+but we prefer this notation.}
 The sequence regular expression is written $r_1\cdot r_2$
 and sometimes we omit the dot if it is clear which
 regular expression is meant; the alternative
 is written $r_1 + r_2$.
 The \emph{language} or meaning of
 \end{center}
 \noindent
 %Now with language derivatives of a language and regular expressions and
 %their language interpretations in place, we are ready to define derivatives on regular expressions.
 With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
-We do so by first introducing what properties it should satisfy.
+We do so by first introducing what properties they should satisfy.
 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
 %Recall, the language derivative acts on a set of strings
 %and essentially chops off a particular character from
 %all strings in that set, Brzozowski defined a derivative operation on regular expressions
 Brzozowski noticed that $\Der$
 can be ``mirrored'' on regular expressions which
 he calls the derivative of a regular expression $r$
 with respect to a character $c$, written
 $r \backslash c$. This infix operator
-takes an original regular expression $r$ as input
+takes a regular expression $r$ as input
-and a character as a right operand and
+and a character as a right operand.
-outputs a result, which is a new regular expression.
 The derivative operation on regular expression
 is defined such that the language of the derivative result
 coincides with the language of the original
 regular expression being taken
-derivative with respect to the same character:
+derivative with respect to the same character, namely
 \begin{property}
 \[
 	L \; (r \backslash c) = \Der \; c \; (L \; r)
 \]
 	\{\ldots,\;s_1,\;\ldots\}$}
 }\\
 \vspace{ 3mm }
 The derivatives on regular expression can again be
-generalised to a string.
+generalised to strings.
 One could compute $r_{start} \backslash s$  and test membership of $s$
 in $L \; r_{start}$ by checking
 whether the empty string is in the language of
-$r_{end}$ ($r_{start}\backslash s$).\\
+$r_{end}$ (that is, $r_{start}\backslash s$).\\
 \vspace{2mm}
 \Longstack{
 	\notate{$r_{start}$}{4}{
 		\Longstack{$L \; r_{start} = \{\ldots, \;$
 	\notate{$r_{end}$}{1}{
 	$L \; r_{end} = \{\ldots, \; [], \ldots\}$}
 }
+We have the property that
 \begin{property}
 	$s \in L \; r_{start} \iff [] \in L \; r_{end}$
 \end{property}
 \noindent
 Next, we give the recursive definition of derivative on
 regular expressions so that it satisfies the properties above.
-The derivative function, written $r\backslash c$,
+%The derivative function, written $r\backslash c$,
-takes a regular expression $r$ and character $c$, and
+%takes a regular expression $r$ and character $c$, and
-returns a new regular expression representing
+%returns a new regular expression representing
-the original regular expression's language $L \; r$
+%the original regular expression's language $L \; r$
-being taken the language derivative with respect to $c$.
+%being taken the language derivative with respect to $c$.
 \begin{table}
 	\begin{center}
 \begin{tabular}{lcl}
 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
 result of this derivative:
 \begin{center}
 	\begin{tabular}{lcl}
 		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ &
 		$\textit{if}\;\,([] \in L(r_1))\;
-		\textit{then} \; r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\
+		\textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
 		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
 	\end{tabular}
 \end{center}
 \noindent
 Notice how this closely resembles
 would require both children to have the empty string
 to compose an empty string, and the Kleene star
 is always nullable because it naturally
 contains the empty string.
 \noindent
-We have the following correspondence between
+We have the following two correspondences between
 derivatives on regular expressions and
 derivatives on a set of strings:
 \begin{lemma}\label{derDer}
 	\mbox{}
 	\begin{itemize}
 \end{lemma}
 \begin{proof}
 	By induction on $r$.
 \end{proof}
 \noindent
-which is the main property of derivatives
+which are the main properties of derivatives
-that enables us to reason about the correctness of
+that enable us later to reason about the correctness of
 derivative-based matching.
 We can generalise the derivative operation shown above for single characters
 to strings as follows:
 \begin{center}
 If we implement the above algorithm naively, however,
 the algorithm can be excruciatingly slow, as shown in
 \ref{NaiveMatcher}.
 Note that both axes are in logarithmic scale.
 Around two dozen characters
-would already ``explode'' the matcher with the regular expression
+already suffice to make this algorithm ``explode'' with the regular expression
 $(a^*)^*b$.
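The growth can be observed with a small Python sketch of the unsimplified matcher. Regular expressions are encoded as tagged tuples, an encoding chosen for this illustration only; \texttt{size} and \texttt{derivative\_sizes} are hypothetical helpers that count nodes and are not taken from the formalisation.

```python
ZERO = ("zero",)
ONE = ("one",)


def nullable(r):
    """True iff the regular expression r matches the empty string."""
    tag = r[0]
    if tag in ("zero", "char"):
        return False
    if tag in ("one", "star"):
        return True
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # seq


def der(c, r):
    """Brzozowski derivative of r w.r.t. the character c (no simplification)."""
    tag = r[0]
    if tag in ("zero", "one"):
        return ZERO
    if tag == "char":
        return ONE if r[1] == c else ZERO
    if tag == "alt":
        return ("alt", der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        left = ("seq", der(c, r[1]), r[2])
        return ("alt", left, der(c, r[2])) if nullable(r[1]) else left
    return ("seq", der(c, r[1]), r)  # star


def ders(s, r):
    """Derivative w.r.t. a string, one character at a time."""
    for c in s:
        r = der(c, r)
    return r


def matcher(r, s):
    """r matches s iff the derivative of r w.r.t. s is nullable."""
    return nullable(ders(s, r))


def size(r):
    """Number of nodes in the regular expression tree."""
    return 1 + sum(size(x) for x in r[1:] if isinstance(x, tuple))


def derivative_sizes(r, s):
    """Sizes of r and of each successive derivative along s."""
    sizes = [size(r)]
    for c in s:
        r = der(c, r)
        sizes.append(size(r))
    return sizes


# (a*)* b in the tuple encoding
EVIL = ("seq", ("star", ("star", ("char", "a"))), ("char", "b"))
```

Calling \texttt{derivative\_sizes(EVIL, "aaaaaa")} shows the size increasing with every character, since no intermediate result is ever simplified.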
 To improve this situation, we need to introduce simplification
 rules for the intermediate results,
-such as $r + r \rightarrow r$,
+such as $r + r \rightarrow r$ or $\ONE \cdot r \rightarrow r$,
 and make sure those rules do not change the
 language of the regular expression.
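As a rough illustration of such language-preserving rules, the following Python sketch applies $r + r \rightarrow r$ and $\ONE \cdot r \rightarrow r$ bottom-up, together with the analogous rules for $\ZERO$. It reuses a tagged-tuple encoding of regular expressions that is an assumption of this sketch, and it is not the $\simp$ function of the formalisation.

```python
ZERO = ("zero",)
ONE = ("one",)


def simp(r):
    """Simplify r bottom-up with rules that leave its language unchanged."""
    tag = r[0]
    if tag == "alt":
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ZERO:          # 0 + r -> r
            return r2
        if r2 == ZERO:          # r + 0 -> r
            return r1
        if r1 == r2:            # r + r -> r
            return r1
        return ("alt", r1, r2)
    if tag == "seq":
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ZERO or r2 == ZERO:  # 0 . r -> 0 and r . 0 -> 0
            return ZERO
        if r1 == ONE:           # 1 . r -> r
            return r2
        if r2 == ONE:           # r . 1 -> r
            return r1
        return ("seq", r1, r2)
    return r                    # zero, one, char and star are left as they are
```

Each rule replaces a regular expression by one with the same language, so interleaving \texttt{simp} with derivative steps does not change which strings are matched.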
-One simpled-minded simplification function
+One simple-minded simplification function
 that achieves these requirements
 is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
 \begin{center}
 	\begin{tabular}{lcl}
 		$\simp \; r_1 \cdot r_2 $ & $ \dn$ &
 and the values both flatten to $abc$.
 Lexers therefore have to disambiguate and choose only
 one of the values to be generated. $\POSIX$ is one of the
 disambiguation strategies that is widely adopted.
-Ausaf et al.\parencite{AusafDyckhoffUrban2016}
+Ausaf et al. \cite{AusafDyckhoffUrban2016}
 formalised the property
 as a ternary relation.
 The $\POSIX$ value $v$ for a regular expression
 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified
 in the following rules\footnote{The names of the rules are used
 where identifiers are defined as usual (letters
 followed by letters, numbers or underscores),
 then a match with a keyword (if)
 followed by
 an identifier (foo) would be incorrect.
-POSIX lexing would generate what we want.
+POSIX lexing generates what is intended by lexing.
 \noindent
 We know that a POSIX
 value for regular expression $r$ is inhabited by $r$.
 \begin{lemma}
 derivative-based matching
 to a lexing algorithm by a second phase
 after the initial phase of successive derivatives.
 This second phase generates a POSIX value
 if the regular expression matches the string.
-Two functions are involved: $\inj$ and $\mkeps$.
+The algorithm uses two functions called $\inj$ and $\mkeps$.
 The function $\mkeps$ constructs a POSIX value from the last
 derivative $r_n$:
 \begin{ceqn}
 \begin{equation}\label{graph:mkeps}
 \begin{tikzcd}
 we give the $\Stars$ constructor an empty list, meaning
 no iteration is taken.
 The result of $\mkeps$ on a $\nullable$ $r$
 is a POSIX value for $r$ and the empty string:
 \begin{lemma}\label{mePosix}
-$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$
+$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; r)$
 \end{lemma}
 \begin{proof}
 	By induction on $r$.
 \end{proof}
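A minimal Python sketch of $\mkeps$ is given below, following the usual presentation of Sulzmann and Lu's function. Regular expressions and values are encoded as tagged tuples, and the value constructor names \texttt{Empty}, \texttt{Left} and \texttt{Right} are assumptions of this sketch; only \texttt{Seq} and \texttt{Stars} appear in the text above.

```python
def nullable(r):
    """True iff the regular expression r matches the empty string."""
    tag = r[0]
    if tag in ("zero", "char"):
        return False
    if tag in ("one", "star"):
        return True
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # seq


def mkeps(r):
    """Build a value showing how the nullable r matches the empty string."""
    tag = r[0]
    if tag == "one":
        return ("Empty",)
    if tag == "alt":
        # prefer the left alternative whenever it can match the empty string
        if nullable(r[1]):
            return ("Left", mkeps(r[1]))
        return ("Right", mkeps(r[2]))
    if tag == "seq":
        return ("Seq", mkeps(r[1]), mkeps(r[2]))
    if tag == "star":
        return ("Stars", [])  # no iteration is taken
    raise ValueError("mkeps is only defined for nullable regular expressions")
```

The preference for the left alternative is the choice that makes the result the POSIX value for the empty string, as stated in the lemma.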
 \noindent
 	\Stars \; [
 	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
 		   \Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])
 		  ]
 \]
-And $\derssimp \; aa \; (a^*a^*)^*$ would be
+And $\derssimp \; aa \; (a^*a^*)^*$ is
 \[
 	((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* +
 	(a^*a^* + a^*)\cdot(a^*a^*)^*.
 \]
 which removes two out of the seven terms corresponding to the
 	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]].
 \]
 At any moment, the subterms in a regular expression
 that will potentially result in a POSIX value are only
 a minority among the many other terms,
-and one can remove ones that are not possible to
+and one can remove the ones that are not possible to
 be POSIX.
 In the above example,
 \begin{equation}\label{eqn:growth2}
 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* +
 	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.
 \end{center}
 Other terms with an underlying value, such as
 \[
 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
 \]
-is too hopeless to contribute a POSIX lexical value,
+do not contribute a POSIX lexical value,
-and is therefore thrown away.
+and therefore can be thrown away.
 Ausaf et al. \cite{AusafDyckhoffUrban2016}
 have come up with some simplification steps, however those steps
-are not yet sufficiently strong, so they achieve the above effects.
+are not yet sufficiently strong to achieve the above effects.
 And even with these relatively mild simplifications, the proof
 is already quite a bit more complicated than the theorem \ref{lexerCorrectness}.
-One would prove something like this:
+One would need to prove something like this:
 \[
 	\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow  v  \;\;
 	\textit{then}\;\; (r, c::s) \rightarrow
 	\inj\;\, r\,  \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v).
 \]
 instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
 not only has to return a simplified regular expression,
 but also what specific simplifications
-has been done as a function on values
+have been done as a function on values
 showing how one can transform the value
 underlying the simplified regular expression
 to the unsimplified one.
 We therefore choose a slightly different approach

changeset 638	dd9dde2d902b
parent 637	e3752aac8ec2
child 639	80cc6dc4c98b