lexing: comparison ChengsongTanPhdThesis/Chapters/Inj.tex


-:0bcb4a7cb40c
+:e3752aac8ec2
 %why more aggressive simplifications are needed.
 In this chapter, we define the basic notions
 for regular languages and regular expressions.
 This is essentially a description in ``English''
-of our formalisation in Isabelle/HOL.
+of the functions and datatypes of our formalisation in Isabelle/HOL.
-We also give the definition of what $\POSIX$ lexing means,
+We also define what $\POSIX$ lexing means,
 followed by a lexing algorithm by Sulzmann and Lu \parencite{Sulzmann2014}
 that produces the output conforming
 to the $\POSIX$ standard\footnote{In what follows
 we choose to use the Isabelle-style notation
 for function applications, where
 to make the text visually more concise.}.
 \section{Basic Concepts}
 Formal language theory usually starts with an alphabet
 denoting a set of characters.
-Here we just use the datatype of characters from Isabelle,
+Here we use the datatype of characters from Isabelle,
 which roughly corresponds to the ASCII characters.
 In what follows, we shall leave the information about the alphabet
 implicit.
 Then using the usual bracket notation for lists,
 we can define strings made up of characters as:
 \end{tabular}
 \end{center}
 where $c$ is a variable ranging over characters.
 The $::$ stands for list cons and $[]$ for the empty
 list.
-A singleton list is sometimes written as $[c]$ for brevity.
+For brevity, a singleton list is sometimes written as $[c]$.
 Strings can be concatenated to form longer strings in the same
-way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
+way we concatenate two lists, which we shall write as $s_1 @ s_2$.
 We omit the precise
 recursive definition here.
 We overload this concatenation operator for two sets of strings:
 \begin{center}
 \begin{tabular}{lcl}
 				 & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\
 				 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
 \end{tabular}
 \end{center}
 \noindent
-and in the end test whether the set
+and in the end, test whether the set
 has the empty string.\footnote{We use the infix notation $A\backslash c$
 	instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
 on languages rather than regular expressions.}
 In general, if we have a language $S$,
 \begin{proof}
 There are two inclusions to prove:
 \begin{itemize}
 \item{$\subseteq$}:\\
 The set
-\[ \{s \mid c :: s \in A*\} \]
+\[ S_1 = \{s \mid c :: s \in A*\} \]
 is enclosed in the set
-\[ \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
+\[ S_2 = \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \}. \]
-because whenever you have a string starting with a character
+This is because for any string $c::s$ satisfying $c::s \in A*$,
-in the language of a Kleene star $A*$,
+%whenever you have a string starting with a character
-then that character together with some sub-string
+%in the language of a Kleene star $A*$,
-immediately after it will form the first iteration,
+%then that
-and the rest of the string will
+the character $c$, together with a prefix of $s$
-be still in $A*$.
+%immediately after $c$
+forms the first iteration of $A*$,
+and the rest of $s$ is also in $A*$.
+This coincides with the definition of $S_2$.
 \item{$\supseteq$}:\\
 Note that
 \[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
 holds.
 Also the following holds:
 			 \mid  r_1 + r_2
 			 \mid r^*
 \]
 \noindent
 We call them basic because we will introduce
-additional constructors in later chapters such as negation
+additional constructors in later chapters, such as negation
 and bounded repetitions.
 We use $\ZERO$ for the regular expression that
 matches no string, and $\ONE$ for the regular
 expression that matches only the empty string.\footnote{
 Some authors
 \end{tabular}
 \end{center}
 \noindent
 %Now with language derivatives of a language and regular expressions and
 %their language interpretations in place, we are ready to define derivatives on regular expressions.
-With $L$ we are ready to introduce Brzozowski derivatives on regular expressions.
+With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
 We do so by first introducing what properties it should satisfy.
 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
 %Recall, the language derivative acts on a set of strings
 %and essentially chops off a particular character from
 \begin{property}
 	$s \in L \; r_{start} \iff [] \in L \; r_{end}$
 \end{property}
 \noindent
-Next we give the recursive definition of derivative on
+Next, we give the recursive definition of derivative on
-regular expressions, so that it satisfies the properties above.
+regular expressions so that it satisfies the properties above.
 The derivative function, written $r\backslash c$,
 takes a regular expression $r$ and character $c$, and
 returns a new regular expression representing
 the language derivative of the original regular expression's language $L \; r$
 with respect to $c$.
 we can further unfold it as many times as needed:
 \[
 	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
 \]
 Again,
-the structure is the same as the language derivative of Kleene star:
+the structure is the same as the language derivative of the Kleene star:
 \[
 	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
 \]
 In the above definition of $(r_1\cdot r_2) \backslash c$,
 the $\textit{if}$ clause's
 \begin{definition}
 $\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
 \end{definition}
 \noindent
-Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$,
+Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$,
-this algorithm presented graphically is as follows:
+this algorithm, presented graphically, is as follows:
 \begin{equation}\label{matcher}
 \begin{tikzcd}
 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] &
 r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] &
 easy to show that this matcher is correct, namely
 \begin{lemma}
 	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
 \end{lemma}
 \begin{proof}
-	By induction on $s$ using property of derivatives:
+	By induction on $s$ using the property of derivatives:
 	lemma \ref{derDer}.
 \end{proof}
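 \noindent
 As a small worked illustration (not part of the formalisation), consider
 matching the string $ab$ against $a \cdot b$, assuming the usual derivative
 clauses for characters and sequences:
 \begin{center}
 \begin{tabular}{lcl}
 $(a \cdot b) \backslash a$ & $=$ & $(a \backslash a) \cdot b \;=\; \ONE \cdot b$\\
 $(\ONE \cdot b) \backslash b$ & $=$ & $(\ONE \backslash b) \cdot b + b \backslash b \;=\; \ZERO \cdot b + \ONE$
 \end{tabular}
 \end{center}
 \noindent
 The first step uses the non-nullable branch of the sequence clause, the second the nullable one.
 Since $\nullable \; (\ZERO \cdot b + \ONE)$ holds because of the right summand $\ONE$,
 we obtain $\textit{match} \; (ab) \; (a \cdot b) = \textit{true}$.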
 \begin{figure}
 \begin{center}
 \begin{tikzpicture}
 \noindent
 If we implement the above algorithm naively, however,
 the algorithm can be excruciatingly slow, as shown in
 Figure \ref{NaiveMatcher}.
 Note that both axes are in logarithmic scale.
-Around two dozens characters
+Around two dozen characters
 would already ``explode'' the matcher with the regular expression
 $(a^*)^*b$.
 To improve this situation, we need to introduce simplification
 rules for the intermediate results,
 such as $r + r \rightarrow r$,
 The running time of $\textit{ders}\_\textit{simp}$
 on the same example of Figure \ref{NaiveMatcher}
 is now ``tame''  in terms of the length of inputs,
 as shown in Figure \ref{BetterMatcher}.
-So far the story is use Brzozowski derivatives and
+So far, the story is: use Brzozowski derivatives and
-simplify as much as possible and at the end test
+simplify as much as possible, and at the end test
 whether the empty string is recognised
 by the final derivative.
 But what if we want to
 do lexing instead of just getting a true/false answer?
 Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and
 multiple values for it. For example, both
 $\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
 and the values both flatten to $abc$.
 Lexers therefore have to disambiguate and choose only
-one of the values be generated. $\POSIX$ is one of the
+one of the values to be generated. $\POSIX$ is one of the
 disambiguation strategies that is widely adopted.
 Ausaf et al.~\parencite{AusafDyckhoffUrban2016}
 formalised the property
 as a ternary relation.
 %
 %
 %\end{tikzpicture}
 %\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
 %\end{figure}
-The above $\POSIX$ rules follows the intuition described below:
+The above $\POSIX$ rules follow the intuition described below:
 \begin{itemize}
 	\item (Left Priority)\\
-		Match the leftmost regular expression when multiple options of matching
+		Match the leftmost regular expression when multiple options for matching
 		are available. See P+L and P+R, where in P+R $s$ cannot
 		be in $L \; r_1$.
 	\item (Maximum munch)\\
 		Always match a subpart as much as possible before proceeding
 		to the next part of the string.
 		For example, when the string $s$ matches
 		$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
 		Then the split that matches a longer string for the first part
 		$r_{part1}$ is preferred by this maximum munch rule.
-		This is caused by the side-condition
+		The side-condition
 		\begin{center}
 		$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
 		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
 		\end{center}
-		in PS.
+		in PS causes this.
 		%(See
 		%\ref{munch} for an illustration).
 \end{itemize}
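 \noindent
 As an illustration of maximum munch, take again the regular expression
 $(ab+a)(bc+c)$ and the string $abc$. The split $s_1 = a$, $s_2 = bc$
 violates the side-condition, because $s_3 = b$ and $s_4 = c$ satisfy
 \begin{center}
 $s_3 \neq [] \land s_3 @ s_4 = s_2 \land s_1 @ s_3 = ab \in L \; (ab+a) \land s_4 = c \in L \; (bc+c)$.
 \end{center}
 The split $s_1 = ab$, $s_2 = c$ admits no such extension, since the only
 candidate $s_3 = c$ would require $abc \in L \; (ab+a)$. Hence the value
 $\Seq(\Left \; ab)(\Right \; c)$ is the $\POSIX$ one.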
 \noindent
 These disambiguation strategies can be
 \[
 (s_1'', r_1) \rightarrow v_1''
 \;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s
 \]
 cannot possibly form a $\POSIX$ value either, because
-by definition there is a candidate
+by definition, there is a candidate
-with longer initial string
+with a longer initial string
 $s_1$. Therefore, we know that the POSIX
 value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
 $s$ must have the
 property that
 \[
 then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$,
 which means this ``other'' $\POSIX$ value $\Seq(v_{10}, v_{20})$
 is the same as $\Seq(v_1, v_2)$.
 \end{proof}
 \noindent
-We have now defined what a POSIX value is and shown that it is unique,
+We have now defined what a POSIX value is and shown that it is unique.
-the problem is to generate
+The problem is to generate
 such a value in a lexing algorithm using derivatives.
 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
 Sulzmann and Lu extended Brzozowski's
 	By induction on $r$.
 \end{proof}
 \noindent
 After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
 in reverse order as they were chopped off in the derivative phase.
-The fucntion for this is called $\inj$. This function
+The function for this is called $\inj$. This function
 operates on values, unlike $\backslash$ which operates on regular expressions.
 In the diagram below, $v_i$ stands for the (POSIX) value
 for how the regular expression
 $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
 of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
 \caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016},
 	matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$.
 	The first phase involves taking successive derivatives w.r.t the characters $c_0$,
 	$c_1$, and so on. These are the same operations as they have appeared in the matcher
 	\ref{matcher}. When the final derivative regular expression is nullable (contains the empty string),
-	then the second phase starts. First $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
+	then the second phase starts. First, $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
-	the empty string , by always selecting the leftmost
+	the empty string, by always selecting the leftmost
-	nullable regular expression. After that $\inj$ ``injects'' back the character in reverse order as they
+	nullable regular expression. After that, $\inj$ ``injects'' back the characters in the reverse order in which they
 	appeared in the string, always preserving POSIXness.}\label{graph:inj}
 \end{figure}
 \noindent
 The function $\textit{inj}$ as defined by Sulzmann and Lu
 takes three arguments: a regular
 \end{center}
 \noindent
 The function recurses on
 the shape of regular
-expressionsw  and values.
+expressions and values.
 Intuitively, each clause analyses
 how $r_i$ could have transformed when being
 derived by $c$, identifying which subpart
 of $v_{i+1}$ has the ``hole''
 to inject the character back into.
 Once the character is
 injected back to that sub-value,
-$\inj$ assembles all parts together
+$\inj$ assembles all parts
 to form a new value.
 For instance, the last clause is an
 injection into a sequence value $v_{i+1}$
 whose second child
-value is a star, and the shape of the
+value is a star and the shape of the
 regular expression $r_i$ before injection
 is a star.
 We therefore know
 the derivative
 starts on a star and ends as a sequence:
 \[
 	\vdash v: r\backslash c.
 \]
 Finally,
 $\inj \; r \;c \; v$ is prepended
-to the previous list of iterations, and then
+to the previous list of iterations and then
 wrapped under the $\Stars$
 constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
 Recall that lemma
 \ref{mePosix} tells us that
 					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
 					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
 				\end{tabular}
 			\end{center}
 			We know that there exists a unique pair of
-			$s_a$ and $s_b$ satisfaying
+			$s_a$ and $s_b$ satisfying
 				$(a \backslash c, s_a) \rightarrow v_a$,
 				$(b , s_b) \rightarrow v_b$, and
 				$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in
 				L \; (a\backslash c) \land
 				s_4 \in L \; b$.
 				\begin{tabular}{lcl}
 					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
 					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
 				\end{tabular}
 			\end{center}
-			With a similar reasoning,
+			With similar reasoning,
 			\[
 				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
 			\]
 			again holds.
 			We know that $a$ must be nullable,
 			allowing us to call $\mkeps$ and get
 			\[
 				(a, []) \rightarrow \mkeps \; a.
 			\]
-			Also by inductive hypothesis
+			Also, by inductive hypothesis
 			\[
 				(b, c::s) \rightarrow \inj\; b \; c \; v_b
 			\]
 			holds.
 			In addition, we know that
 			\[
 				\nexists s_3 \; s_4. \; s_3 \neq [] \land
 				s_3 @s_4 = c::s  \land s_3 \in L \; a
 				\land s_4 \in L \; b.
 			\]
-			(Which basically says there cannot be a longer
+			(Which says there cannot be a longer
 			initial split for $s$ other than the empty string.)
 			Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
 			as the POSIX value for $a\cdot b$.
 	\end{itemize}
 	The star case can be proven similarly.
 		\end{tabular}
 	\end{center}
 \end{theorem}
 \begin{proof}
 By induction on $s$, generalising over an arbitrary regular expression $r$.
-The $[]$ case is proven by  an application of lemma \ref{mePosix}, and the inductive case
+The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case
 by lemma \ref{injPosix}.
 \end{proof}
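 \noindent
 As a minimal worked illustration of the two phases, consider lexing the
 string $[a]$ with $a^*$. We assume here the standard clauses of $\mkeps$
 and $\inj$, and write $\textit{Empty}$ for the value of $\ONE$
 (a name chosen for this sketch):
 \begin{center}
 \begin{tabular}{lcl}
 $a^* \backslash a$ & $=$ & $(a \backslash a) \cdot a^* \;=\; \ONE \cdot a^*$\\
 $\mkeps \; (\ONE \cdot a^*)$ & $=$ & $\Seq \; \textit{Empty} \; (\Stars \, [])$\\
 $\inj \; (a^*) \; a \; (\Seq \; \textit{Empty} \; (\Stars \, []))$ & $=$ & $\Stars \, [\inj \; a \; a \; \textit{Empty}] \;=\; \Stars \, [\Char \; a]$
 \end{tabular}
 \end{center}
 \noindent
 The first line is the derivative phase, the second generates the value for
 the empty string, and the third injects the character $a$ back, yielding
 a value with a single star iteration.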
 \noindent
 As we did earlier in this chapter with the matcher, one can
 introduce simplification on the regular expression in each derivative step.
-However, now one needs to do a backward phase and make sure
+However, for lexing, one needs to do a backward phase (w.r.t.\ the forward derivative phase)
-the values align with the regular expression.
+and ensure that
+the values align with the regular expression at each step.
 Therefore one has to
 be careful not to break the correctness, as the injection
 function heavily relies on the structure of
 the regular expressions and values being aligned.
 This can be achieved by recording some extra rectification functions
-during the derivatives step, and applying these rectifications in
+during the derivatives step and applying these rectifications in
 each run during the injection phase.
 With extra care
 one can show that POSIXness will not be affected
 by the simplifications listed here \cite{AusafDyckhoffUrban2016}.
 \begin{center}
 		$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
 	\end{tabular}
 \end{center}
-However, with the simple-minded simplification rules allowed
+However, one can still end up
-in an injection-based lexer, one can still end up
+with exploding derivatives,
-with exploding derivatives.
+even with the simple-minded simplification rules allowed
-\section{A Case Requring More Aggressive Simplifications}
+in an injection-based lexer.
+\section{A Case Requiring More Aggressive Simplifications}
 For example, when starting with the regular
 expression $(a^* \cdot a^*)^*$ and building just over
 a dozen successive derivatives
 w.r.t.~the character $a$, one obtains a derivative regular expression
 with millions of nodes (when viewed as a tree)
-even with the simple-minded simplification.
+even with the mentioned simplifications.
 \begin{figure}[H]
 \begin{center}
 \begin{tikzpicture}
 \begin{axis}[
 xlabel={$n$},
 	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
 \]
 as an example.
 This is a highly ambiguous regular expression, with
 many ways to split up the string into multiple segments for
-different star iteratioins,
+different star iterations,
 and for each segment
 multiple ways of splitting between
 the two $a^*$ sub-expressions.
 When $n$ is equal to $1$, there are two lexical values for
 the match:
 distinct lexical values that cannot be eliminated by
 the simple-minded simplification of $\derssimp$.
 A lexer without a good enough strategy to
 deduplicate will naturally
 have an exponential runtime on highly
-ambiguous regular expressions, because there
+ambiguous regular expressions because there
 are exponentially many matches.
 For this particular example, it seems
 that the number of distinct matches grows
 proportionally to $(2n)!/(n!(n+1)!)$ ($n$ being the input length).
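 To give a feel for this growth rate, the expression $(2n)!/(n!(n+1)!)$
 evaluates as follows for some small $n$:
 \begin{center}
 \begin{tabular}{l|ccccccc}
 $n$ & 1 & 2 & 3 & 4 & 5 & 10 & 20\\
 \hline
 $(2n)!/(n!(n+1)!)$ & 1 & 2 & 5 & 14 & 42 & 16796 & 6564120420
 \end{tabular}
 \end{center}
 These are the Catalan numbers, which grow roughly like $4^n / n^{3/2}$.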
 On the other hand, the
 $\POSIX$ value for $r= (a^*\cdot a^*)^*$  and
 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is
 \[
 	\Stars\,
-	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]]
+	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]].
 \]
-and at any moment the  subterms in a regular expression
+At any moment, the  subterms in a regular expression
-that will result in a POSIX value is only
+that will potentially result in a POSIX value are only
 a minority among the many other terms,
-and one can remove ones that are absolutely not possible to
+and one can remove ones that are not possible to
 be POSIX.
 In the above example,
 \begin{equation}\label{eqn:growth2}
 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* +
 	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.
 	\begin{tabular}{lr}
 		$\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$  & $(\text{term 1})$\\
 		$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]  $ &  $(\text{term 2})$
 	\end{tabular}
 \end{center}
-Other terms with an underlying value such as
+Other terms with an underlying value, such as
 \[
 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
 \]
-is simply too hopeless to contribute a POSIX lexical value,
+are too hopeless to contribute a POSIX lexical value,
 and are therefore thrown away.
 Ausaf et al. \cite{AusafDyckhoffUrban2016}
 have come up with some simplification steps; however, those steps
-are not yet sufficiently strong so that they achieve the above effects.
+are not yet sufficiently strong to achieve the above effects.
-And even with these relatively mild simplifications the proof
+And even with these relatively mild simplifications, the proof
-is already quite a bit complicated than the theorem \ref{lexerCorrectness}.
+is already quite a bit more complicated than that of theorem \ref{lexerCorrectness}.
-One would prove something like:
+One would prove something like this:
 \[
 	\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow  v  \;\;
 	\textit{then}\;\; (r, c::s) \rightarrow
-	\inj\;\, r\,  \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v)
+	\inj\;\, r\,  \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v).
 \]
 This replaces the simple lemma \ref{injPosix}: now $\textit{simp}$
 not only has to return a simplified regular expression,
 but also what specific simplifications
 have been done as a function on values

changeset 637	e3752aac8ec2
parent 628	7af4e2420a8c
child 638	dd9dde2d902b