ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 653 bc5571c38d1f
parent 652 a4d692a9a289
child 654 2ad20ba5b178
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Tue Jun 27 16:26:48 2023 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Thu Jun 29 04:17:48 2023 +0100
@@ -107,6 +107,7 @@
 \def\Stars{\textit{Stars}}
 \def\Char{\textit{Char}}
 \def\Seq{\textit{Seq}}
+\def\Alt{\textit{Alt}}
 \def\Der{\textit{Der}}
 \def\Ders{\textit{Ders}}
 \def\nullable{\mathit{nullable}}
@@ -206,19 +207,30 @@
 
 Regular expressions, since their inception in the 1940s, 
 have been subject to extensive study and implementation. 
-Their primary application lies in lexing, 
-a process whereby an unstructured input string is disassembled into a tree of tokens
-with their guidance.
-This is often used to match strings that comprises of numerous fields, 
-where certain fields may recur or be omitted. 
-For example, the regular expression for recognising email addresses is
+Their primary application lies in text processing:
+finding matches and identifying patterns in strings.
+%It is often used to match strings that comprises of numerous fields, 
+%where certain fields may recur or be omitted. 
+For example, a simple regular expression that tries 
+to recognise email addresses is
+\marginpar{rephrased from "the regex for recognising" to "a simple regex that tries to match email"}
 \begin{center}
-\verb|^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$|.
+\verb|[a-z0-9._]+@[a-z0-9.-]+\.[a-z]{2,6}|
+%$[a-z0-9._]^+@[a-z0-9.-]^+\.[a-z]{2,6}$.
 \end{center}
-Using this, regular expression matchers and lexers are able to extract 
-the domain names by the use of \verb|[a-zA-Z0-9.-]+|. 
-Consequently, they are an indispensible components in text processing tools 
-of software systems such as compilers, IDEs, and firewalls.
+\marginpar{Simplified example, but the distinction between . and escaped . is correct
+and therefore left unchanged.}
+
+%Using this, regular expression matchers and lexers are able to extract 
+%the domain names by the use of \verb|[a-zA-Z0-9.-]+|. 
+\marginpar{Rewrote explanation for the expression.}
+The bracketed sub-expressions are used to extract specific parts of an email address.
+The local part is recognised by the expression enclosed in
+the first pair of brackets, \verb|[a-z0-9._]+|; the part after the ``@'' sign
+recognises the domain, where \verb|[a-z]{2,6}| specifically
+matches the top-level domain.
+%Consequently, they are an indispensible components in text processing tools 
+%of software systems such as compilers, IDEs, and firewalls.
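As a concrete illustration (a sketch in Python, not part of the thesis formalisation), the bracketed character classes can be wrapped in capture groups so that a matcher returns the local part and the domain as sub-matches:

```python
import re

# Illustrative only: parentheses turn the bracketed character classes
# into capture groups, so the matcher reports the sub-matches.
email = re.compile(r"([a-z0-9._]+)@([a-z0-9.-]+\.[a-z]{2,6})")

m = email.fullmatch("chengsong.tan@kcl.ac.uk")  # hypothetical address
local, domain = m.group(1), m.group(2)
# local is "chengsong.tan", domain is "kcl.ac.uk"
```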
 
 The study of regular expressions is ongoing due to an
 issue known as catastrophic backtracking. 
@@ -226,8 +238,8 @@
 matching or lexing engine exhibits a disproportionately long 
 runtime despite the simplicity of the input and expression.
 
-The origin of catastrophic backtracking is rooted in the 
-ambiguities of lexing. 
+One cause of catastrophic backtracking lies in the
+ambiguities of lexing.\marginpar{rephrased ``the origin of catastrophic \ldots''}
 In the process of matching a multi-character string with 
 a regular expression that encompasses several sub-expressions, 
 different positions can be designated to mark 
@@ -239,7 +251,8 @@
 
 Catastrophic backtracking facilitates a category of computationally inexpensive attacks known as Regular Expression Denial of Service (ReDoS) attacks. Here, an attacker merely dispatches a small attack string to a server, provoking high-complexity behaviours in the server's regular expression engine. Such attacks, whether intentional or accidental, have led to the failure of real-world systems (additional details can be found in the practical overview section in chapter \ref{Introduction}).
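The effect is easy to reproduce with a backtracking engine. A minimal sketch in Python (the classic nested-star pattern, not an example taken from the thesis): the matcher tries exponentially many ways to split the run of \verb|a|s between the inner and outer star before concluding that no \verb|b| follows.

```python
import re

# Pathological pattern: the nested star forces a backtracking matcher
# to explore exponentially many splits of the 'a's before failing.
evil = re.compile(r"(a*)*b")

# 20 'a's and no 'b': the match must fail, and the time taken
# roughly doubles with each additional 'a' appended to the input.
result = evil.match("a" * 20)
# result is None, but only after an exponential number of steps
```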
 
-Various disambiguation strategies are employed to select sub-matches, notably including Greedy and POSIX strategies. POSIX, the strategy most widely adopted in practice, invariably opts for the longest possible sub-match. Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a descriptive definition of the POSIX rule in section 1, last paragraph:
+Various disambiguation strategies are 
+employed to select sub-matches, notably including Greedy and POSIX strategies. POSIX, the strategy most widely adopted in practice, invariably opts for the longest possible sub-match. Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a descriptive definition of the POSIX rule in section 1, last paragraph:
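The two strategies can disagree even on tiny examples. A hedged sketch in Python, whose \verb|re| engine follows the greedy (first-alternative) strategy rather than POSIX: matching \verb|a|$\mid$\verb|ab| against the string \verb|ab|, a greedy matcher commits to the first alternative, while a POSIX matcher must return the longest match.

```python
import re

# Python's re engine is greedy: alternatives are tried left to right
# and the first success is kept, so a|ab on "ab" yields just 'a'.
greedy = re.match(r"a|ab", "ab").group()

# A POSIX-compliant matcher (e.g. grep -E) would instead report the
# longest possible match, namely 'ab'.
```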
 
 
 %Regular expressions 
@@ -283,6 +296,7 @@
 %There have been prose definitions like the following
 %by Kuklewicz \cite{KuklewiczHaskell}: 
 %described the POSIX rule as (section 1, last paragraph):
+\marginpar{\em Deleted some peripheral specifications.}
 \begin{quote}
 	\begin{itemize}
 		\item
@@ -291,16 +305,17 @@
 \item
 higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
 \item
-REs have right associative concatenation which can be changed with parenthesis\\
-\item
-parenthesized subexpressions return the match from their last usage\\
-\item
-text of component subexpressions must be contained in the text of the 
-higher-level subexpressions\\
-\item
-if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
-\item
-if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
+	$\ldots$
+%REs have right associative concatenation which can be changed with parenthesis\\
+%\item
+%parenthesized subexpressions return the match from their last usage\\
+%\item
+%text of component subexpressions must be contained in the text of the 
+%higher-level subexpressions\\
+%\item
+%if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
+%\item
+%if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
 \end{itemize}
 \end{quote}
 However, the author noted that various lexers that claim to be POSIX