--- a/ChengsongTanPhdThesis/Chapters/Chapter2.tex Mon May 30 20:36:15 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Chapter2.tex Tue May 31 17:36:45 2022 +0100
@@ -353,7 +353,7 @@
-Brzozowski was quick in finding that during this process a lot useless
-$\ONE$s and $\ZERO$s are generated and therefore not optimal.
+Brzozowski was quick to notice that this process generates a lot of
+useless $\ONE$s and $\ZERO$s and is therefore not optimal.
-He also introduced some "similarity rules" such
+He also introduced some ``similarity rules'', such
as $P+(Q+R) = (P+Q)+R$ to merge syntactically
different but language-equivalent sub-regexes to further decrease the size
of the intermediate regexes.
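+
+The effect of such rules can be made concrete with a small Scala
+sketch (the datatype and function names below are ours, not
+Brzozowski's): nested alternatives are flattened using associativity
+and duplicates are removed, so that, for example, $P+(Q+R)$ and
+$(P+Q)+R$ normalise to the same regex.
+
+\begin{lstlisting}[language=Scala]
+abstract class Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+// collect the leaves of a nested alternative (associativity)
+def alts(r: Rexp): List[Rexp] = r match {
+  case ALT(r1, r2) => alts(r1) ++ alts(r2)
+  case r           => List(r)
+}
+
+// rebuild the alternative without duplicates (idempotence)
+// and without useless ZEROs
+def simAlt(r: Rexp): Rexp =
+  alts(r).distinct.filter(_ != ZERO) match {
+    case Nil     => ZERO
+    case r :: rs => rs.foldLeft(r: Rexp)(ALT(_, _))
+  }
+\end{lstlisting}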
@@ -368,7 +368,7 @@
If we want the size of derivatives in the algorithm to
-stay even lower, we would need more aggressive simplifications.
+stay even lower, we need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
-deleting duplicates whenever possible. For example, the parentheses in
+delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
@@ -376,9 +376,9 @@
a very tight size bound, possibly as low
as that of the \emph{partial derivatives}\parencite{Antimirov1995}.
-Building derivatives and then simplify them.
-So far so good. But what if we want to
-do lexing instead of just getting a YES/NO answer?
+So we build derivatives and then simplify them.
+So far, so good. But what if we want to
+do lexing instead of just getting a YES/NO answer?
-This requires us to go back again to the world
-without simplification first for a moment.
+This requires us to go back for a moment to the world
+without simplification.
Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and
@@ -422,7 +422,7 @@
\noindent
-Building on top of Sulzmann and Lu's attempt to formalize the
+Building on top of Sulzmann and Lu's attempt to formalise the
notion of POSIX lexing rules \parencite{Sulzmann2014},
-Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
+Ausaf and Urban \parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
@@ -440,8 +440,8 @@
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
-$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
+$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a)),\,\Right(Char(b))]$,
$\ldots$, and vice versa.
-Even for the regular expression matching a certain string, there could
-still be more than one value corresponding to it.
+Even when a regular expression matches a particular string, there could
+be more than one value corresponding to it.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
If we do not allow any empty iterations in its lexical values,
@@ -463,12 +463,12 @@
that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
-A lexer aimed at keeping all the possible values will naturally
+A lexer that keeps all possible values will naturally
have an exponential runtime on ambiguous regular expressions.
Somehow one has to decide which lexical value to keep and
output in a lexing algorithm.
In practice, we are usually
-interested about POSIX values, which by intuition always
+interested in POSIX values, which intuitively always
\begin{itemize}
\item
match the leftmost regular expression when multiple options of matching
@@ -513,7 +513,7 @@
The contribution of Sulzmann and Lu is an extension of Brzozowski's
-algorithm by a second phase (the first phase being building successive
+algorithm by a second phase (the first phase being the construction of successive
derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value
-is generated in case the regular expression matches the string.
+is generated if the regular expression matches the string.
How can we construct a value out of regular expressions and character
sequences only?
Two functions are involved: $\inj$ and $\mkeps$.
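+
+Before looking at these two functions, it helps to fix a datatype for
+values. The following Scala sketch mirrors the constructors used
+above ($\Stars$, $\Left$, $\Right$ and so on); the names are
+illustrative and the Rexp datatype from the earlier sketch is assumed.
+
+\begin{lstlisting}[language=Scala]
+// lexical values, mirroring the shape of regular expressions
+abstract class Val
+case object Empty extends Val                  // for ONE
+case class Chr(c: Char) extends Val            // for CHAR
+case class Sequ(v1: Val, v2: Val) extends Val  // for SEQ
+case class Left(v: Val) extends Val            // left branch of ALT
+case class Right(v: Val) extends Val           // right branch of ALT
+case class Stars(vs: List[Val]) extends Val    // iterations of STAR
+\end{lstlisting}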
@@ -545,9 +545,9 @@
\noindent
-We favour the left to match an empty string in case there is a choice.
+We favour the left to match an empty string if there is a choice.
-When there is a star for us to match the empty string,
-we simply give the $\Stars$ constructor an empty list, meaning
+When a star has to match the empty string,
+we give the $\Stars$ constructor an empty list, meaning
no iterations are taken.
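+
+A minimal Scala sketch of $\mkeps$, over the datatypes of the earlier
+sketches, makes both conventions visible: the alternative clause is
+biased to the left and the star clause returns an empty list of
+iterations.
+
+\begin{lstlisting}[language=Scala]
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO        => false
+  case ONE         => true
+  case CHAR(_)     => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_)     => true
+}
+
+// value for how a nullable regex matches the empty string
+def mkeps(r: Rexp): Val = r match {
+  case ONE         => Empty
+  case ALT(r1, r2) => // favour the left alternative
+    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
+  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
+  case STAR(_)     => Stars(Nil) // no iterations taken
+}
+\end{lstlisting}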
@@ -555,7 +555,7 @@
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
-($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
+($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
-matches $s$. The POSIX value is maintained throught out the process.
+matches $s$. The POSIX value is maintained throughout the process.
-For this Sulzmann and Lu defined a function that reverses
+For this, Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
@@ -593,7 +593,7 @@
\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.
-The clauses basically do one thing--identifying the ``holes'' on
+The clauses do one thing---identifying the ``hole'' in a
value to inject the character back into.
For instance, in the last clause for injecting back to a value
that would turn into a new star value that corresponds to a star,
@@ -602,7 +602,7 @@
-with the first character being chopped off--an iteration of the star
+with the first character being chopped off---an iteration of the star
that had just been unfolded. This value is followed by the already
matched star iterations we collected before. So we inject the character
-back to the first value and form a new value with this new iteration
+back to the first value and form a new value with this latest iteration
being added to the previous list of iterations, all under the $\Stars$
top level.
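+
+In code, the star clause just described, together with the remaining
+clauses, looks roughly as follows; this is again only a sketch over
+the earlier datatypes, where the value $v$ passed in describes how
+the derivative matched the rest of the string.
+
+\begin{lstlisting}[language=Scala]
+// inject character c back into value v for the derivative of r;
+// the result is a value describing how r itself matches
+def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
+  case (STAR(r1), Sequ(v1, Stars(vs))) =>
+    // v1: the freshly unfolded iteration with c chopped off;
+    // inject c into it and cons it onto the earlier iterations
+    Stars(inj(r1, c, v1) :: vs)
+  case (SEQ(r1, _), Sequ(v1, v2))       => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, r2), Right(v2))         => Sequ(mkeps(r1), inj(r2, c, v2))
+  case (ALT(r1, _), Left(v1))           => Left(inj(r1, c, v1))
+  case (ALT(_, r2), Right(v2))          => Right(inj(r2, c, v2))
+  case (CHAR(d), Empty)                 => Chr(d)
+}
+\end{lstlisting}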
@@ -635,10 +635,10 @@
-to the characters $c_0$, $c_1$ until we exhaust the string and obtain
+to the characters $c_0$, $c_1$, $\ldots$ until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
-$r$ and no value needs to be generated. If yes, we start building the
+$r$, and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
-algorithm from the right to left. For the first value $v_n$, we call the
+algorithm from right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as
@@ -647,17 +647,17 @@
We have mentioned before that derivatives without simplification
-can get clumsy, and this is true for values as well--they reflect
+can get clumsy, and this is true for values as well---they reflect
-the regular expressions size by definition.
+the size of the regular expression by definition.
-One can introduce simplification on the regex and values, but have to
-be careful in not breaking the correctness as the injection
+One can introduce simplification on the regexes and values but has to
+be careful not to break the correctness, as the injection
function heavily relies on the structure of the regexes and values
-being correct and match each other.
+being correct and matching each other.
-It can be achieved by recording some extra rectification functions
-during the derivatives step, and applying these rectifications in
+This can be achieved by recording some extra rectification functions
+during the derivative step and applying these rectifications in
each run during the injection phase.
And we can prove that the POSIX value of how
-regular expressions match strings will not be affected---although is much harder
+regular expressions match strings will not be affected---although it is much harder
to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.
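+
+The following Scala sketch illustrates the idea of rectification (the
+function names are ours and only a few of the possible simplification
+clauses are shown): simplification returns, next to the smaller
+regex, a function that repairs a value for the simplified regex into
+a value for the original one.
+
+\begin{lstlisting}[language=Scala]
+// simplify a regex, returning also a rectification function that
+// maps values of the simplified regex back to values of r
+def simp(r: Rexp): (Rexp, Val => Val) = r match {
+  case ALT(r1, r2) =>
+    val (s1, f1) = simp(r1)
+    val (s2, f2) = simp(r2)
+    (s1, s2) match {
+      case (ZERO, _) => (s2, v => Right(f2(v)))  // 0 + r = r
+      case (_, ZERO) => (s1, v => Left(f1(v)))   // r + 0 = r
+      case _ => (ALT(s1, s2), {                  // rectify branchwise
+          case Left(v)  => Left(f1(v))
+          case Right(v) => Right(f2(v))
+        })
+    }
+  case SEQ(r1, r2) =>
+    val (s1, f1) = simp(r1)
+    val (s2, f2) = simp(r2)
+    (s1, s2) match {
+      case (ONE, _) => (s2, v => Sequ(f1(Empty), f2(v))) // 1 . r = r
+      case _ => (SEQ(s1, s2), {
+          case Sequ(v1, v2) => Sequ(f1(v1), f2(v2))
+        })
+    }
+  case r => (r, v => v)
+}
+\end{lstlisting}
+
+The rectification functions recorded at each derivative step are then
+applied, in reverse order, during the injection phase.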
@@ -665,20 +665,20 @@
%Brzozowski, after giving the derivatives and simplification,
-%did not explore lexing with simplification or he may well be
-%stuck on an efficient simplificaiton with a proof.
-%He went on to explore the use of derivatives together with
-%automaton, and did not try lexing using derivatives.
+%did not explore lexing with simplification, or he may well have been
+%stuck on an efficient simplification with a proof.
+%He went on to examine the use of derivatives together with
+%automata, and did not try lexing using derivatives.
-We want to get rid of complex and fragile rectification of values.
+We want to get rid of the complex and fragile rectification of values.
-Can we not create those intermediate values $v_1,\ldots v_n$,
-and get the lexing information that should be already there while
-doing derivatives in one pass, without a second phase of injection?
+Can we avoid creating those intermediate values $v_1, \ldots, v_n$
+altogether and obtain the lexing information, which should already
+be there, while doing derivatives in one pass, without a second
+injection phase?
-In the meantime, can we make sure that simplifications
+At the same time, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?
Sulzmann and Lu solved this problem by
-introducing additional informtaion to the
+introducing additional information to the
regular expressions called \emph{bitcodes}.
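+
+As a preview of the representation, the sketch below shows one way of
+annotating regular expressions with bitcodes (the constructor names
+are illustrative): every node carries a list of bits that records the
+value information which $\mkeps$ and $\inj$ would otherwise have to
+reconstruct.
+
+\begin{lstlisting}[language=Scala]
+// bits recording value information: Z for "left"/"no more
+// iterations", S for "right"/"one more star iteration"
+abstract class Bit
+case object Z extends Bit
+case object S extends Bit
+
+// annotated (bitcoded) regular expressions
+abstract class ARexp
+case object AZERO extends ARexp
+case class AONE(bs: List[Bit]) extends ARexp
+case class ACHAR(bs: List[Bit], c: Char) extends ARexp
+case class AALTS(bs: List[Bit], rs: List[ARexp]) extends ARexp
+case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
+case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp
+\end{lstlisting}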