--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Tue Jun 27 16:26:48 2023 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Thu Jun 29 04:17:48 2023 +0100
@@ -107,6 +107,7 @@
\def\Stars{\textit{Stars}}
\def\Char{\textit{Char}}
\def\Seq{\textit{Seq}}
+\def\Alt{\textit{Alt}}
\def\Der{\textit{Der}}
\def\Ders{\textit{Ders}}
\def\nullable{\mathit{nullable}}
@@ -206,19 +207,30 @@
Regular expressions, since their inception in the 1940s,
have been subject to extensive study and implementation.
-Their primary application lies in lexing,
-a process whereby an unstructured input string is disassembled into a tree of tokens
-with their guidance.
-This is often used to match strings that comprises of numerous fields,
-where certain fields may recur or be omitted.
-For example, the regular expression for recognising email addresses is
+Their primary application lies in text processing: finding
+matches and identifying patterns in a string.
+%It is often used to match strings that comprises of numerous fields,
+%where certain fields may recur or be omitted.
+For example, a simple regular expression that tries
+to recognise email addresses is
+\marginpar{rephrased from "the regex for recognising" to "a simple regex that tries to match email"}
\begin{center}
-\verb|^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$|.
+$[a-z0-9.\_]^+@[a-z0-9.-]^+\backslash.[a-z]\{2,6\}$
+%$[a-z0-9._]^+@[a-z0-9.-]^+\.[a-z]{2,6}$.
\end{center}
-Using this, regular expression matchers and lexers are able to extract
-the domain names by the use of \verb|[a-zA-Z0-9.-]+|.
-Consequently, they are an indispensible components in text processing tools
-of software systems such as compilers, IDEs, and firewalls.
+\marginpar{Simplified example, but the distinction between . and escaped . is correct
+and therefore left unchanged.}
+
+%Using this, regular expression matchers and lexers are able to extract
+%the domain names by the use of \verb|[a-zA-Z0-9.-]+|.
+\marginpar{Rewrote explanation for the expression.}
+The bracketed sub-expressions are used to extract specific parts of an email address.
+The local part is recognised by the expression enclosed in
+the first pair of brackets, $[a-z0-9.\_]^+$, and the part after the ``@'' sign
+recognises the domain, where $[a-z]\{2,6\}$ specifically
+matches the top-level domain.
+%Consequently, they are an indispensible components in text processing tools
+%of software systems such as compilers, IDEs, and firewalls.
The study of regular expressions is ongoing due to an
issue known as catastrophic backtracking.
@@ -226,8 +238,8 @@
matching or lexing engine exhibits a disproportionately long
runtime despite the simplicity of the input and expression.
-The origin of catastrophic backtracking is rooted in the
-ambiguities of lexing.
+One cause of catastrophic backtracking lies within the
+ambiguities of lexing.\marginpar{rephrased "the origin of catastrophic ...}
In the process of matching a multi-character string with
a regular expression that encompasses several sub-expressions,
different positions can be designated to mark
@@ -239,7 +251,8 @@
Catastrophic backtracking facilitates a category of computationally inexpensive attacks known as Regular Expression Denial of Service (ReDoS) attacks. Here, an attacker merely dispatches a small attack string to a server, provoking high-complexity behaviours in the server's regular expression engine. Such attacks, whether intentional or accidental, have led to the failure of real-world systems (additional details can be found in the practical overview section in chapter \ref{Introduction}).
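+
+As a hypothetical illustration (not one of the real-world incidents
+alluded to above), the following Scala fragment exercises the JVM's
+backtracking matcher with the well-known pathological pattern
+\verb|(a*)*b|; on an input consisting only of $a$'s the number of ways
+the matcher can split the input typically grows exponentially with the
+input length.
+\begin{verbatim}
+// Illustrative sketch: (a*)*b cannot match a string of a's
+// without a final b, but a backtracking matcher only discovers
+// this after trying exponentially many splits of the a's.
+import java.util.regex.Pattern
+
+val evil  = Pattern.compile("(a*)*b")
+val input = "a" * 40                    // forty a's, no b
+println(evil.matcher(input).matches())  // expected to take a very long time
+\end{verbatim}
+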
-Various disambiguation strategies are employed to select sub-matches, notably including Greedy and POSIX strategies. POSIX, the strategy most widely adopted in practice, invariably opts for the longest possible sub-match. Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a descriptive definition of the POSIX rule in section 1, last paragraph:
+Various disambiguation strategies are
+employed to select sub-matches, notably including Greedy and POSIX strategies. POSIX, the strategy most widely adopted in practice, invariably opts for the longest possible sub-match. Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a descriptive definition of the POSIX rule in section 1, last paragraph:
%Regular expressions
@@ -283,6 +296,7 @@
%There have been prose definitions like the following
%by Kuklewicz \cite{KuklewiczHaskell}:
%described the POSIX rule as (section 1, last paragraph):
+\marginpar{\em Deleted some peripheral specifications.}
\begin{quote}
\begin{itemize}
\item
@@ -291,16 +305,17 @@
\item
higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
\item
-REs have right associative concatenation which can be changed with parenthesis\\
-\item
-parenthesized subexpressions return the match from their last usage\\
-\item
-text of component subexpressions must be contained in the text of the
-higher-level subexpressions\\
-\item
-if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
-\item
-if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
+ $\ldots$
+%REs have right associative concatenation which can be changed with parenthesis\\
+%\item
+%parenthesized subexpressions return the match from their last usage\\
+%\item
+%text of component subexpressions must be contained in the text of the
+%higher-level subexpressions\\
+%\item
+%if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
+%\item
+%if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
\end{itemize}
\end{quote}
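+
+As a small, hypothetical illustration of the leftmost-longest idea,
+consider matching the string $abc$ against the regular expression
+$(a + ab)\cdot(bc + c)$, writing $+$ for alternatives and $\cdot$ for
+sequencing: the string can be split as $a$ followed by $bc$ or as $ab$
+followed by $c$. POSIX gives the leftmost sub-expression the longest
+possible sub-match and therefore assigns $ab$ to the first
+sub-expression and $c$ to the second, whereas a Greedy strategy commits
+to the first alternative and assigns $a$ and $bc$ respectively.
+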
However, the author noted that various lexers that claim to be POSIX