lexing: comparison ChengsongTanPhdThesis/Chapters/Chapter2.tex

equal deleted inserted replaced

-:823d9b19d21c
+:c334f0b3ef52
 consequently the bigger the size of the derivative the slower the
 algorithm.
 Brzozowski was quick in finding that during this process a lot useless
 $\ONE$s and $\ZERO$s are generated and therefore not optimal.
-He also introduced some "similarity rules" such
+He also introduced some "similarity rules", such
 as $P+(Q+R) = (P+Q)+R$ to merge syntactically
 different but language-equivalent sub-regexes to further decrease the size
 of the intermediate regexes.
 More simplifications are possible, such as deleting duplicates
 a matcher with simpler regular expressions.
 If we want the size of derivatives in the algorithm to
 stay even lower, we would need more aggressive simplifications.
 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
-deleting duplicates whenever possible. For example, the parentheses in
+delete duplicates whenever possible. For example, the parentheses in
 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
 $a^*+a+\ONE$.  These more aggressive simplification rules are for
 a very tight size bound, possibly as low
 as that of the \emph{partial derivatives}\parencite{Antimirov1995}.
-Building derivatives and then simplify them.
+Building derivatives and then simplifying them.
-So far so good. But what if we want to
+So far, so good. But what if we want to
-do lexing instead of just  getting a YES/NO answer?
+do lexing instead of just getting a YES/NO answer?
 This requires us to go back again to the world
 without simplification first for a moment.
 Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and
 elegant(arguably as beautiful as the original
 derivatives definition) solution for this.
 	\end{tabular}
 \end{center}
 \noindent
-Building on top of Sulzmann and Lu's attempt to formalize the
+Building on top of Sulzmann and Lu's attempt to formalise the
 notion of POSIX lexing rules \parencite{Sulzmann2014},
 Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
 POSIX matching as a ternary relation recursively defined in a
 natural deduction style.
 With the formally-specified rules for what a POSIX matching is,
 One regular expression can have multiple lexical values. For example
 for the regular expression $(a+b)^*$, it has a infinite list of
 values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
 $\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
 $\ldots$, and vice versa.
-Even for the regular expression matching a certain string, there could
+Even for the regular expression matching a particular string, there could
-still be more than one value corresponding to it.
+be more than one value corresponding to it.
 Take the example where $r= (a^*\cdot a^*)^*$ and the string
 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
 If we do not allow any empty iterations in its lexical values,
 there will be $n - 1$ "splitting points" on $s$ we can choose to
 split or not so that each sub-string
 between the two $a^*$s.
 It is not surprising there are exponentially many lexical values
 that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and
 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
-A lexer aimed at keeping all the possible values will naturally
+A lexer to keep all the possible values will naturally
 have an exponential runtime on ambiguous regular expressions.
 Somehow one has to decide which lexical value to keep and
 output in a lexing algorithm.
 In practice, we are usually
-interested about POSIX values, which by intuition always
+interested in POSIX values, which by intuition always
 \begin{itemize}
 \item
 match the leftmost regular expression when multiple options of matching
 are available
 \item
 The contribution of Sulzmann and Lu is an extension of Brzozowski's
 algorithm by a second phase (the first phase being building successive
 derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value
-is generated in case the regular expression matches  the string.
+is generated if the regular expression matches the string.
 How can we construct a value out of regular expressions and character
 sequences only?
 Two functions are involved: $\inj$ and $\mkeps$.
 The function $\mkeps$ constructs a value from the last
 one of all the successive derivatives:
 		\end{tabular}
 	\end{center}
 \noindent
-We favour the left to match an empty string in case there is a choice.
+We favour the left to match an empty string if there is a choice.
 When there is a star for us to match the empty string,
-we simply give the $\Stars$ constructor an empty list, meaning
+we give the $\Stars$ constructor an empty list, meaning
 no iterations are taken.
 After the $\mkeps$-call, we inject back the characters one by one in order to build
 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
 After injecting back $n$ characters, we get the lexical value for how $r_0$
-matches $s$. The POSIX value is maintained throught out the process.
+matches $s$. The POSIX value is maintained throughout the process.
 For this Sulzmann and Lu defined a function that reverses
 the ``chopping off'' of characters during the derivative phase. The
 corresponding function is called \emph{injection}, written
 $\textit{inj}$; it takes three arguments: the first one is a regular
 expression ${r_{i-1}}$, before the character is chopped off, the second
 \end{tabular}
 \end{center}
 \noindent This definition is by recursion on the ``shape'' of regular
 expressions and values.
-The clauses basically do one thing--identifying the ``holes'' on
+The clauses do one thing--identifying the ``hole'' on a
 value to inject the character back into.
 For instance, in the last clause for injecting back to a value
 that would turn into a new star value that corresponds to a star,
 we know it must be a sequence value. And we know that the first
 value of that sequence corresponds to the child regex of the star
 with the first character being chopped off--an iteration of the star
 that had just been unfolded. This value is followed by the already
 matched star iterations we collected before. So we inject the character
-back to the first value and form a new value with this new iteration
+back to the first value and form a new value with this latest iteration
 being added to the previous list of iterations, all under the $\Stars$
 top level.
 Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
 we have a lexer with the following recursive definition:
 of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
 left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
 to the characters $c_0$, $c_1$  until we exhaust the string and obtain
 the derivative $r_n$. We test whether this derivative is
 $\textit{nullable}$ or not. If not, we know the string does not match
-$r$ and no value needs to be generated. If yes, we start building the
+$r$, and no value needs to be generated. If yes, we start building the
 values incrementally by \emph{injecting} back the characters into the
 earlier values $v_n, \ldots, v_0$. This is the second phase of the
-algorithm from the right to left. For the first value $v_n$, we call the
+algorithm from right to left. For the first value $v_n$, we call the
 function $\textit{mkeps}$, which builds a POSIX lexical value
 for how the empty string has been matched by the (nullable) regular
 expression $r_n$. This function is defined as
 We have mentioned before that derivatives without simplification
 can get clumsy, and this is true for values as well--they reflect
-the regular expressions size by definition.
+the size of the regular expression by definition.
-One can introduce simplification on the regex and values, but have to
+One can introduce simplification on the regex and values but have to
-be careful in not breaking the correctness as the injection
+be careful not to break the correctness, as the injection
 function heavily relies on the structure of the regexes and values
-being correct and match each other.
+being correct and matching each other.
 It can be achieved by recording some extra rectification functions
 during the derivatives step, and applying these rectifications in
 each run during the injection phase.
 And we can prove that the POSIX value of how
-regular expressions match strings will not be affected---although is much harder
+regular expressions match strings will not be affected---although it is much harder
 to establish.
 Some initial results in this regard have been
 obtained in \cite{AusafDyckhoffUrban2016}.
 %Brzozowski, after giving the derivatives and simplification,
-%did not explore lexing with simplification or he may well be
+%did not explore lexing with simplification, or he may well be
-%stuck on an efficient simplificaiton with a proof.
+%stuck on an efficient simplification with proof.
-%He went on to explore the use of derivatives together with
+%He went on to examine the use of derivatives together with
-%automaton, and did not try lexing using derivatives.
+%automaton, and did not try lexing using products.
-We want to get rid of complex and fragile rectification of values.
+We want to get rid of the complex and fragile rectification of values.
 Can we not create those intermediate values $v_1,\ldots v_n$,
 and get the lexing information that should be already there while
-doing derivatives in one pass, without a second phase of injection?
+doing derivatives in one pass, without a second injection phase?
 In the meantime, can we make sure that simplifications
 are easily handled without breaking the correctness of the algorithm?
 Sulzmann and Lu solved this problem by
-introducing additional informtaion to the
+introducing additional information to the
 regular expressions called \emph{bitcodes}.

changeset 531	c334f0b3ef52
parent 530	823d9b19d21c