--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri May 26 08:10:17 2023 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri May 26 23:42:15 2023 +0100
@@ -204,40 +204,85 @@
%TODO: look up snort rules to use here--give readers idea of what regexes look like
+Regular expressions, since their inception in the 1940s,
+have been subject to extensive study and implementation.
+Their primary application lies in lexing,
+a process whereby, guided by regular expressions,
+an unstructured input string is disassembled into a tree of tokens.
+This is often used to match strings that comprise numerous fields,
+where certain fields may recur or be omitted.
+For example, a typical regular expression for recognising email addresses is
+\begin{center}
+\verb|^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$|.
+\end{center}
+Using this, regular expression matchers and lexers are able to extract,
+for example, the domain names by means of the sub-expression \verb|[a-zA-Z0-9.-]+|.
+Consequently, they are an indispensable component of the text processing tools
+in software systems such as compilers, IDEs, and firewalls.
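+As a small illustration (a sketch in Scala, not part of any formal development
+in this thesis; the object name and example address are invented), the expression
+above can be used to extract the domain of a matched address by wrapping the
+domain part in a capture group:
+\begin{verbatim}
+// Illustrative sketch: extract the domain of an e-mail address using
+// the regular expression from the text, with the domain part captured.
+object EmailDomain {
+  // same expression as above, with the domain wrapped in (...)
+  val email = raw"^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})$$".r
+
+  def main(args: Array[String]): Unit =
+    "alice@example.org" match {
+      case email(domain) => println(s"domain: $domain") // domain: example.org
+      case _             => println("not an e-mail address")
+    }
+}
+\end{verbatim}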
+The study of regular expressions is ongoing due to an
+issue known as catastrophic backtracking.
+This phenomenon refers to scenarios in which the regular expression
+matching or lexing engine exhibits a disproportionately long
+runtime despite the simplicity of the input and expression.
+
+The origin of catastrophic backtracking is rooted in the
+ambiguities of lexing.
+In the process of matching a multi-character string with
+a regular expression that encompasses several sub-expressions,
+different positions can be designated to mark the boundaries
+between the substrings corresponding to the sub-expressions.
+For instance, in matching the string $aaa$ with the
+regular expression $(a+aa)^*$, the divide between
+the initial match and the subsequent iteration could either be
+set between the first and second characters ($a | aa$) or between
+the second and third characters ($aa | a$).
+As both the length of the input string and the structural complexity
+of the regular expression increase, the number of potential delimiter
+combinations can grow exponentially, leading to a corresponding increase
+in complexity for algorithms that do not optimally navigate these possibilities.
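+To see how quickly these possibilities multiply, the following Scala sketch
+(illustrative only; the object name is invented) counts the ways in which the
+string $a^n$ can be split into blocks matching the two alternatives of
+$(a+aa)^*$; the count follows the Fibonacci numbers and therefore grows
+exponentially in $n$:
+\begin{verbatim}
+// Illustrative sketch: count the decompositions of a^n into blocks
+// "a" and "aa", i.e. the candidate splits a backtracking matcher may
+// explore for (a + aa)*.  The last block is either "a" or "aa",
+// giving the recurrence ways(n) = ways(n-1) + ways(n-2).
+object SplitCount {
+  def ways(n: Int): BigInt = {
+    var prev = BigInt(1)   // ways(0): the empty string, zero iterations
+    var curr = BigInt(1)   // ways(1): the single block "a"
+    for (_ <- 2 to n) { val next = prev + curr; prev = curr; curr = next }
+    curr
+  }
+
+  def main(args: Array[String]): Unit =
+    for (n <- List(3, 10, 20, 50))
+      println(s"a^$n can be split in ${ways(n)} ways")
+}
+\end{verbatim}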
+
+Catastrophic backtracking facilitates a category of computationally inexpensive
+attacks known as Regular Expression Denial of Service (ReDoS) attacks.
+Here, an attacker merely dispatches a small attack string to a server,
+provoking high-complexity behaviours in the server's regular expression engine.
+Such attacks, whether intentional or accidental, have led to the failure of
+real-world systems (additional details can be found in the practical overview
+section in chapter \ref{Introduction}).
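+The effect can be observed on a typical backtracking engine such as
+\verb|java.util.regex|: the sketch below (illustrative only; timings depend on
+the JVM and machine) matches the pattern \verb!(a|aa)*c! against attack
+strings $a^n$, which contain no \verb|c| and therefore force the engine to try
+every split of the $a$'s before failing:
+\begin{verbatim}
+// Illustrative sketch: time a backtracking engine on (a|aa)*c and
+// inputs a^n.  No input contains a 'c', so the engine backtracks
+// through every split before reporting failure; the time is expected
+// to grow roughly exponentially in n.
+object ReDoSDemo {
+  def main(args: Array[String]): Unit =
+    for (n <- 20 to 32 by 4) {
+      val attack = "a" * n
+      val start  = System.nanoTime()
+      val result = attack.matches("(a|aa)*c")        // always false
+      val ms     = (System.nanoTime() - start) / 1000000
+      println(s"n = $n: result = $result, time = $ms ms")
+    }
+}
+\end{verbatim}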
+
+Various disambiguation strategies are employed to select sub-matches,
+most notably Greedy and POSIX.
+POSIX, the strategy most widely adopted in practice,
+invariably opts for the longest possible sub-match.
+Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a descriptive
+definition of the POSIX rule (section 1, last paragraph):
-Regular expressions
-have been extensively studied and
-implemented since their invention in 1940s.
-It is primarily used in lexing, where an unstructured input string is broken
-down into a tree of tokens, where that tree's construction is guided by the shape of the regular expression.
-Being able to delimit different subsections of a string and potentially
-extract information from inputs, regular expression
-matchers and lexers serve as an indispensible component in modern software systems' text processing tasks
-such as compilers, IDEs, and firewalls.
-Research on them is far from over due to the well-known issue called catastrophic-backtracking.
-This stems from the ambiguities of lexing: when matching a multiple-character string with a regular
-exression that includes serveral sub-expressions, there might be different positions to set
-the border between sub-expressions' corresponding sub-strings.
-For example, matching the string $aaa$ against the regular expression
-$(a+aa)^*$, the border between the initial match and the second iteration
-could be between the first and second character ($a | aa$)
-or between the second and third character ($aa | a$).
-As the size of the input string and the structural complexity
-of the regular expression increase,
-the number of different combinations of setting delimiters can grow exponentially, and
-algorithms that explore these possibilities unwisely will also see an exponential complexity.
-
-Catastrophic backtracking allow a class of computationally inexpensive attacks called
-Regular expression Denial of Service attacks (ReDoS).
-These attacks, be it deliberate or not, have caused real-world systems to go down (see more
-details of this in the practical overview section in chapter \ref{Introduction}).
-There are different disambiguation strategies to select sub-matches, most notably Greedy and POSIX.
-The widely adopted strategy in practice is POSIX, which always go for the longest sub-match possible.
-There have been prose definitions like the following
-by Kuklewicz \cite{KuklewiczHaskell}:
-described the POSIX rule as (section 1, last paragraph):
\begin{quote}
\begin{itemize}
\item
@@ -258,19 +303,25 @@
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
\end{itemize}
\end{quote}
-But the author noted that lexers are rarely correct according to this standard.
+However, the author noted that lexers are rarely correct according to this standard.
+It would therefore be valuable to formally define the POSIX rules and the
+matching/lexing algorithms, and to prove the correctness of such algorithms
+against the POSIX semantics.
-Formal proofs, such as the one written in Isabelle/HOL, is a powerful means
+Formal proofs are machine-checked programs,
+such as the ones written in Isabelle/HOL. They are a powerful means
for computer scientists to be certain about correctness of their algorithms and correctness properties.
-This is done by automatically checking the end goal is derived solely from a list of axioms, definitions
+This is done by automatically checking that the end goal is derived solely from a list of axioms, definitions
and proof rules that are known to be correct.
-The problem of regular expressions lexing and catastrophic backtracking are a suitable problem for such
-methods as their implementation and complexity analysis tend to be error-prone.
+The problem of regular expression lexing and catastrophic backtracking is well suited to such
+methods, as implementations and complexity analyses in this area tend to be error-prone.
+There have already been many interesting steps taken in applying formal proofs
+to regular expression lexing and matching.
-
%Regular expressions are widely used in computer science:
%be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;