--- a/ChengsongTanPhdThesis/Chapters/Inj.tex Sat Dec 24 11:55:04 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex Fri Dec 30 01:52:32 2022 +0000
@@ -13,7 +13,7 @@
This is essentially a description in ``English''
the functions and datatypes of our formalisation in Isabelle/HOL.
We also define what $\POSIX$ lexing means,
-followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014}
+followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014}
that produces the output conforming
to the $\POSIX$ standard\footnote{In what follows
we choose to use the Isabelle-style notation
@@ -109,7 +109,7 @@
to test membership of a string in
a set of strings.
For example, to test whether the string
-$bar$ is contained in the set $\{foo, bar, brak\}$, one takes derivative of the set with
+$bar$ is contained in the set $\{foo, bar, brak\}$, one can take derivative of the set with
respect to the string $bar$:
\begin{center}
\begin{tabular}{lll}
@@ -122,8 +122,9 @@
\end{center}
\noindent
and in the end, test whether the set
-has the empty string.\footnote{We use the infix notation $A\backslash c$
- instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
+contains the empty string.\footnote{We use the infix notation $A\backslash c$
+ instead of $\Der \; c \; A$ for brevity, as it will always be
+ clear from the context that we are operating
on languages rather than regular expressions.}
In general, if we have a language $S$,
@@ -204,7 +205,7 @@
expression that matches only the empty string.\footnote{
Some authors
also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
-but we prefer our notation.}
+but we prefer this notation.}
The sequence regular expression is written $r_1\cdot r_2$
and sometimes we omit the dot if it is clear which
regular expression is meant; the alternative
@@ -227,7 +228,7 @@
%Now with language derivatives of a language and regular expressions and
%their language interpretations in place, we are ready to define derivatives on regular expressions.
With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
-We do so by first introducing what properties it should satisfy.
+We do so by first introducing what properties they should satisfy.
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
%Recall, the language derivative acts on a set of strings
@@ -291,14 +292,13 @@
he calls the derivative of a regular expression $r$
with respect to a character $c$, written
$r \backslash c$. This infix operator
-takes an original regular expression $r$ as input
-and a character as a right operand and
-outputs a result, which is a new regular expression.
+takes regular expression $r$ as input
+and a character as a right operand.
The derivative operation on regular expression
is defined such that the language of the derivative result
coincides with the language of the original
regular expression being taken
-derivative with respect to the same character:
+derivative with respect to the same characters, namely
\begin{property}
\[
@@ -335,11 +335,11 @@
\vspace{ 3mm }
The derivatives on regular expression can again be
-generalised to a string.
+generalised to strings.
One could compute $r_{start} \backslash s$ and test membership of $s$
in $L \; r_{start}$ by checking
whether the empty string is in the language of
-$r_{end}$ ($r_{start}\backslash s$).\\
+$r_{end}$ (that is $r_{start}\backslash s$).\\
\vspace{2mm}
\Longstack{
@@ -378,18 +378,18 @@
}
-
+We have the property that
\begin{property}
$s \in L \; r_{start} \iff [] \in L \; r_{end}$
\end{property}
\noindent
Next, we give the recursive definition of derivative on
regular expressions so that it satisfies the properties above.
-The derivative function, written $r\backslash c$,
-takes a regular expression $r$ and character $c$, and
-returns a new regular expression representing
-the original regular expression's language $L \; r$
-being taken the language derivative with respect to $c$.
+%The derivative function, written $r\backslash c$,
+%takes a regular expression $r$ and character $c$, and
+%returns a new regular expression representing
+%the original regular expression's language $L \; r$
+%being taken the language derivative with respect to $c$.
\begin{table}
\begin{center}
\begin{tabular}{lcl}
@@ -418,7 +418,7 @@
\begin{tabular}{lcl}
$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ &
$\textit{if}\;\,([] \in L(r_1))\;
- \textit{then} \; r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\
+ \textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
\end{tabular}
\end{center}
@@ -483,7 +483,7 @@
is always nullable because it naturally
contains the empty string.
\noindent
-We have the following correspondence between
+We have the following two correspondences between
derivatives on regular expressions and
derivatives on a set of strings:
\begin{lemma}\label{derDer}
@@ -499,8 +499,8 @@
By induction on $r$.
\end{proof}
\noindent
-which is the main property of derivatives
-that enables us to reason about the correctness of
+which are the main properties of derivatives
+that enables us later to reason about the correctness of
derivative-based matching.
We can generalise the derivative operation shown above for single characters
to strings as follows:
@@ -570,14 +570,14 @@
\ref{NaiveMatcher}.
Note that both axes are in logarithmic scale.
Around two dozen characters
-would already ``explode'' the matcher with the regular expression
+this algorithm already ``explodes'' with the regular expression
$(a^*)^*b$.
To improve this situation, we need to introduce simplification
rules for the intermediate results,
-such as $r + r \rightarrow r$,
+such as $r + r \rightarrow r$ or $\ONE \cdot r \rightarrow r$,
and make sure those rules do not change the
language of the regular expression.
-One simpled-minded simplification function
+One simple-minded simplification function
that achieves these requirements
is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
\begin{center}
@@ -733,7 +733,7 @@
one of the values to be generated. $\POSIX$ is one of the
disambiguation strategies that is widely adopted.
-Ausaf et al.\parencite{AusafDyckhoffUrban2016}
+Ausaf et al. \cite{AusafDyckhoffUrban2016}
formalised the property
as a ternary relation.
The $\POSIX$ value $v$ for a regular expression
@@ -843,7 +843,7 @@
then a match with a keyword (if)
followed by
an identifier (foo) would be incorrect.
-POSIX lexing would generate what we want.
+POSIX lexing generates what is included by lexing.
\noindent
We know that a POSIX
@@ -915,7 +915,7 @@
after the initial phase of successive derivatives.
This second phase generates a POSIX value
if the regular expression matches the string.
-Two functions are involved: $\inj$ and $\mkeps$.
+The algorithm uses two functions called $\inj$ and $\mkeps$.
The function $\mkeps$ constructs a POSIX value from the last
derivative $r_n$:
\begin{ceqn}
@@ -958,7 +958,7 @@
The result of $\mkeps$ on a $\nullable$ $r$
is a POSIX value for $r$ and the empty string:
\begin{lemma}\label{mePosix}
-$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$
+$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; r)$
\end{lemma}
\begin{proof}
By induction on $r$.
@@ -1322,7 +1322,7 @@
\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])
]
\]
-And $\derssimp \; aa \; (a^*a^*)^*$ would be
+And $\derssimp \; aa \; (a^*a^*)^*$ is
\[
((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* +
(a^*a^* + a^*)\cdot(a^*a^*)^*.
@@ -1352,7 +1352,7 @@
At any moment, the subterms in a regular expression
that will potentially result in a POSIX value is only
a minority among the many other terms,
-and one can remove ones that are not possible to
+and one can remove the ones that are not possible to
be POSIX.
In the above example,
\begin{equation}\label{eqn:growth2}
@@ -1379,15 +1379,15 @@
\[
\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
\]
-is too hopeless to contribute a POSIX lexical value,
-and is therefore thrown away.
+do not to contribute a POSIX lexical value,
+and therefore can be thrown away.
Ausaf et al. \cite{AusafDyckhoffUrban2016}
have come up with some simplification steps, however those steps
-are not yet sufficiently strong, so they achieve the above effects.
+are not yet sufficiently strong, to achieve the above effects.
And even with these relatively mild simplifications, the proof
is already quite a bit more complicated than the theorem \ref{lexerCorrectness}.
-One would prove something like this:
+One would need to prove something like this:
\[
\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow v \;\;
\textit{then}\;\; (r, c::s) \rightarrow
@@ -1396,7 +1396,7 @@
instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
not only has to return a simplified regular expression,
but also what specific simplifications
-has been done as a function on values
+have been done as a function on values
showing how one can transform the value
underlying the simplified regular expression
to the unsimplified one.