--- a/ChengsongTanPhdThesis/Chapters/Chapter2.tex Mon May 30 20:36:15 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Chapter2.tex Tue May 31 17:36:45 2022 +0100
@@ -353,7 +353,7 @@
-Brzozowski was quick in finding that during this process a lot useless
-$\ONE$s and $\ZERO$s are generated and therefore not optimal.
+Brzozowski was quick to notice that this process generates a lot of
+useless $\ONE$s and $\ZERO$s and is therefore not optimal.
-He also introduced some "similarity rules" such
+He also introduced some ``similarity rules'', such
as $P+(Q+R) = (P+Q)+R$ to merge syntactically
different but language-equivalent sub-regexes to further decrease the size
of the intermediate regexes.
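+
+The effect of such rules can be made concrete with a small Scala
+sketch (the datatype and function names below are ours, not
+Brzozowski's): nested alternatives are flattened using associativity
+and duplicates are removed, so that, for example, $P+(Q+R)$ and
+$(P+Q)+R$ normalise to the same regex.
+
+\begin{lstlisting}[language=Scala]
+abstract class Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+// collect the leaves of a nested alternative (associativity)
+def alts(r: Rexp): List[Rexp] = r match {
+  case ALT(r1, r2) => alts(r1) ++ alts(r2)
+  case r           => List(r)
+}
+
+// rebuild the alternative without duplicates (idempotence)
+// and without useless ZEROs
+def simAlt(r: Rexp): Rexp =
+  alts(r).distinct.filter(_ != ZERO) match {
+    case Nil     => ZERO
+    case r :: rs => rs.foldLeft(r: Rexp)(ALT(_, _))
+  }
+\end{lstlisting}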
@@ -368,7 +368,7 @@
If we want the size of derivatives in the algorithm to
-stay even lower, we would need more aggressive simplifications.
+stay even lower, we need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
-deleting duplicates whenever possible. For example, the parentheses in
+delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
@@ -376,9 +376,9 @@
a very tight size bound, possibly as low
as that of the \emph{partial derivatives}\parencite{Antimirov1995}.
-Building derivatives and then simplify them.
-So far so good. But what if we want to
-do lexing instead of just getting a YES/NO answer?
+So we build derivatives and then simplify them.
+So far, so good. But what if we want to
+do lexing instead of just getting a YES/NO answer?
-This requires us to go back again to the world
-without simplification first for a moment.
+This requires us to go back for a moment to the world
+without simplification.
Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and
@@ -422,7 +422,7 @@
\noindent
-Building on top of Sulzmann and Lu's attempt to formalize the
+Building on top of Sulzmann and Lu's attempt to formalise the
notion of POSIX lexing rules \parencite{Sulzmann2014},
-Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
+Ausaf and Urban \parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
@@ -440,8 +440,8 @@
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
-$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
+$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a)),\,\Right(Char(b))]$,
$\ldots$, and vice versa.
-Even for the regular expression matching a certain string, there could
-still be more than one value corresponding to it.
+Even when a regular expression matches a particular string, there could
+be more than one value corresponding to it.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
If we do not allow any empty iterations in its lexical values,
@@ -463,12 +463,12 @@
that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
-A lexer aimed at keeping all the possible values will naturally
+A lexer that keeps all possible values will naturally
have an exponential runtime on ambiguous regular expressions.
Somehow one has to decide which lexical value to keep and
output in a lexing algorithm.
In practice, we are usually
-interested about POSIX values, which by intuition always
+interested in POSIX values, which intuitively always
\begin{itemize}
\item
match the leftmost regular expression when multiple options of matching
@@ -513,7 +513,7 @@
The contribution of Sulzmann and Lu is an extension of Brzozowski's
-algorithm by a second phase (the first phase being building successive
+algorithm by a second phase (the first phase being the construction of successive
derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value
-is generated in case the regular expression matches the string.
+is generated if the regular expression matches the string.
How can we construct a value out of regular expressions and character
sequences only?
Two functions are involved: $\inj$ and $\mkeps$.
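+
+Before looking at these two functions, it helps to fix a datatype for
+values. The following Scala sketch mirrors the constructors used
+above ($\Stars$, $\Left$, $\Right$ and so on); the names are
+illustrative and the Rexp datatype from the earlier sketch is assumed.
+
+\begin{lstlisting}[language=Scala]
+// lexical values, mirroring the shape of regular expressions
+abstract class Val
+case object Empty extends Val                  // for ONE
+case class Chr(c: Char) extends Val            // for CHAR
+case class Sequ(v1: Val, v2: Val) extends Val  // for SEQ
+case class Left(v: Val) extends Val            // left branch of ALT
+case class Right(v: Val) extends Val           // right branch of ALT
+case class Stars(vs: List[Val]) extends Val    // iterations of STAR
+\end{lstlisting}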
@@ -545,9 +545,9 @@
\noindent
-We favour the left to match an empty string in case there is a choice.
+We favour the left to match an empty string if there is a choice.
-When there is a star for us to match the empty string,
-we simply give the $\Stars$ constructor an empty list, meaning
+When a star has to match the empty string,
+we give the $\Stars$ constructor an empty list, meaning
no iterations are taken.
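+
+A minimal Scala sketch of $\mkeps$, over the datatypes of the earlier
+sketches, makes both conventions visible: the alternative clause is
+biased to the left and the star clause returns an empty list of
+iterations.
+
+\begin{lstlisting}[language=Scala]
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO        => false
+  case ONE         => true
+  case CHAR(_)     => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_)     => true
+}
+
+// value for how a nullable regex matches the empty string
+def mkeps(r: Rexp): Val = r match {
+  case ONE         => Empty
+  case ALT(r1, r2) => // favour the left alternative
+    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
+  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
+  case STAR(_)     => Stars(Nil) // no iterations taken
+}
+\end{lstlisting}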
@@ -555,7 +555,7 @@
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
-($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
+($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
-matches $s$. The POSIX value is maintained throught out the process.
+matches $s$. The POSIX value is maintained throughout the process.
-For this Sulzmann and Lu defined a function that reverses
+For this, Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
@@ -593,7 +593,7 @@
\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.
-The clauses basically do one thing--identifying the ``holes'' on
+The clauses do one thing---identifying the ``hole'' in a
value to inject the character back into.
For instance, in the last clause for injecting back to a value
that would turn into a new star value that corresponds to a star,
@@ -602,7 +602,7 @@
-with the first character being chopped off--an iteration of the star
+with the first character being chopped off---an iteration of the star
that had just been unfolded. This value is followed by the already
matched star iterations we collected before. So we inject the character
-back to the first value and form a new value with this new iteration
+back to the first value and form a new value with this latest iteration
being added to the previous list of iterations, all under the $\Stars$
top level.
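+
+In code, the star clause just described, together with the remaining
+clauses, looks roughly as follows; this is again only a sketch over
+the earlier datatypes, where the value $v$ passed in describes how
+the derivative matched the rest of the string.
+
+\begin{lstlisting}[language=Scala]
+// inject character c back into value v for the derivative of r;
+// the result is a value describing how r itself matches
+def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
+  case (STAR(r1), Sequ(v1, Stars(vs))) =>
+    // v1: the freshly unfolded iteration with c chopped off;
+    // inject c into it and cons it onto the earlier iterations
+    Stars(inj(r1, c, v1) :: vs)
+  case (SEQ(r1, _), Sequ(v1, v2))       => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, r2), Right(v2))         => Sequ(mkeps(r1), inj(r2, c, v2))
+  case (ALT(r1, _), Left(v1))           => Left(inj(r1, c, v1))
+  case (ALT(_, r2), Right(v2))          => Right(inj(r2, c, v2))
+  case (CHAR(d), Empty)                 => Chr(d)
+}
+\end{lstlisting}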
@@ -635,10 +635,10 @@
-to the characters $c_0$, $c_1$ until we exhaust the string and obtain
+to the characters $c_0$, $c_1$, $\ldots$ until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
-$r$ and no value needs to be generated. If yes, we start building the
+$r$, and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
-algorithm from the right to left. For the first value $v_n$, we call the
+algorithm from right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as
@@ -647,17 +647,17 @@
We have mentioned before that derivatives without simplification
-can get clumsy, and this is true for values as well--they reflect
+can get clumsy, and this is true for values as well---they reflect
-the regular expressions size by definition.
+the size of the regular expression by definition.
-One can introduce simplification on the regex and values, but have to
-be careful in not breaking the correctness as the injection
+One can introduce simplification on the regexes and values but has to
+be careful not to break the correctness, as the injection
function heavily relies on the structure of the regexes and values
-being correct and match each other.
+being correct and matching each other.
-It can be achieved by recording some extra rectification functions
-during the derivatives step, and applying these rectifications in
+This can be achieved by recording some extra rectification functions
+during the derivative step and applying these rectifications in
each run during the injection phase.
And we can prove that the POSIX value of how
-regular expressions match strings will not be affected---although is much harder
+regular expressions match strings will not be affected---although it is much harder
to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.
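+
+The following Scala sketch illustrates the idea of rectification (the
+function names are ours and only a few of the possible simplification
+clauses are shown): simplification returns, next to the smaller
+regex, a function that repairs a value for the simplified regex into
+a value for the original one.
+
+\begin{lstlisting}[language=Scala]
+// simplify a regex, returning also a rectification function that
+// maps values of the simplified regex back to values of r
+def simp(r: Rexp): (Rexp, Val => Val) = r match {
+  case ALT(r1, r2) =>
+    val (s1, f1) = simp(r1)
+    val (s2, f2) = simp(r2)
+    (s1, s2) match {
+      case (ZERO, _) => (s2, v => Right(f2(v)))  // 0 + r = r
+      case (_, ZERO) => (s1, v => Left(f1(v)))   // r + 0 = r
+      case _ => (ALT(s1, s2), {                  // rectify branchwise
+          case Left(v)  => Left(f1(v))
+          case Right(v) => Right(f2(v))
+        })
+    }
+  case SEQ(r1, r2) =>
+    val (s1, f1) = simp(r1)
+    val (s2, f2) = simp(r2)
+    (s1, s2) match {
+      case (ONE, _) => (s2, v => Sequ(f1(Empty), f2(v))) // 1 . r = r
+      case _ => (SEQ(s1, s2), {
+          case Sequ(v1, v2) => Sequ(f1(v1), f2(v2))
+        })
+    }
+  case r => (r, v => v)
+}
+\end{lstlisting}
+
+The rectification functions recorded at each derivative step are then
+applied, in reverse order, during the injection phase.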
@@ -665,20 +665,20 @@
%Brzozowski, after giving the derivatives and simplification,
-%did not explore lexing with simplification or he may well be
-%stuck on an efficient simplificaiton with a proof.
-%He went on to explore the use of derivatives together with
-%automaton, and did not try lexing using derivatives.
+%did not explore lexing with simplification, or he may well have been
+%stuck on an efficient simplification with a proof.
+%He went on to examine the use of derivatives together with
+%automata, and did not try lexing using derivatives.
-We want to get rid of complex and fragile rectification of values.
+We want to get rid of the complex and fragile rectification of values.
-Can we not create those intermediate values $v_1,\ldots v_n$,
-and get the lexing information that should be already there while
-doing derivatives in one pass, without a second phase of injection?
+Can we avoid creating those intermediate values $v_1, \ldots, v_n$
+altogether and obtain the lexing information, which should already
+be there, while doing derivatives in one pass, without a second
+injection phase?
-In the meantime, can we make sure that simplifications
+At the same time, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?
Sulzmann and Lu solved this problem by
-introducing additional informtaion to the
+introducing additional information to the
regular expressions called \emph{bitcodes}.
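+
+As a preview of the representation, the sketch below shows one way of
+annotating regular expressions with bitcodes (the constructor names
+are illustrative): every node carries a list of bits that records the
+value information which $\mkeps$ and $\inj$ would otherwise have to
+reconstruct.
+
+\begin{lstlisting}[language=Scala]
+// bits recording value information: Z for "left"/"no more
+// iterations", S for "right"/"one more star iteration"
+abstract class Bit
+case object Z extends Bit
+case object S extends Bit
+
+// annotated (bitcoded) regular expressions
+abstract class ARexp
+case object AZERO extends ARexp
+case class AONE(bs: List[Bit]) extends ARexp
+case class ACHAR(bs: List[Bit], c: Char) extends ARexp
+case class AALTS(bs: List[Bit], rs: List[ARexp]) extends ARexp
+case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
+case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp
+\end{lstlisting}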