diff -r a4d692a9a289 -r bc5571c38d1f ChengsongTanPhdThesis/Chapters/Introduction.tex
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Tue Jun 27 16:26:48 2023 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Thu Jun 29 04:17:48 2023 +0100
@@ -107,6 +107,7 @@
 \def\Stars{\textit{Stars}}
 \def\Char{\textit{Char}}
 \def\Seq{\textit{Seq}}
+\def\Alt{\textit{Alt}}
 \def\Der{\textit{Der}}
 \def\Ders{\textit{Ders}}
 \def\nullable{\mathit{nullable}}
@@ -206,19 +207,30 @@
 Regular expressions, since their inception in the 1940s,
 have been subject to extensive study and implementation.
-Their primary application lies in lexing,
-a process whereby an unstructured input string is disassembled into a tree of tokens
-with their guidance.
-This is often used to match strings that comprises of numerous fields,
-where certain fields may recur or be omitted.
-For example, the regular expression for recognising email addresses is
+Their primary application lies in text processing: finding
+matches and identifying patterns in a string.
+%It is often used to match strings that comprise numerous fields,
+%where certain fields may recur or be omitted.
+For example, a simple regular expression that tries
+to recognise email addresses is
+\marginpar{rephrased from ``the regex for recognising'' to ``a simple regex that tries to match email''}
 \begin{center}
-\verb|^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$|.
+\verb|[a-z0-9._]+@[a-z0-9.-]+\.[a-z]{2,6}|
+%$[a-z0-9._]^+@[a-z0-9.-]^+\.[a-z]{2,6}$.
 \end{center}
-Using this, regular expression matchers and lexers are able to extract
-the domain names by the use of \verb|[a-zA-Z0-9.-]+|.
-Consequently, they are an indispensible components in text processing tools
-of software systems such as compilers, IDEs, and firewalls.
+\marginpar{Simplified example, but the distinction between . and escaped . is correct
+and therefore left unchanged.}
+
+%Using this, regular expression matchers and lexers are able to extract
+%the domain names by the use of \verb|[a-zA-Z0-9.-]+|.
+\marginpar{Rewrote explanation for the expression.}
+The bracketed sub-expressions are used to extract specific parts of an email address.
+The local part is recognised by the expression enclosed in
+the first pair of brackets, \verb|[a-z0-9._]+|, and after the ``@'' sign
+is the part that recognises the domain, where \verb|[a-z]{2,6}| specifically
+matches the top-level domain.
+%Consequently, they are indispensable components in text processing tools
+%of software systems such as compilers, IDEs, and firewalls.
 
 The study of regular expressions is ongoing
 due to an issue known as catastrophic backtracking.
 This phenomenon refers to scenarios in which the regular expression
@@ -226,8 +238,8 @@
 matching or lexing engine exhibits a disproportionately long runtime
 despite the simplicity of the input and expression.
-The origin of catastrophic backtracking is rooted in the
-ambiguities of lexing.
+One cause of catastrophic backtracking lies in the
+ambiguities of lexing.\marginpar{rephrased ``the origin of catastrophic ...''}
 In the process of matching a multi-character string with
 a regular expression that encompasses several sub-expressions,
 different positions can be designated to mark
@@ -239,7 +251,8 @@
 Catastrophic backtracking facilitates a category of computationally inexpensive attacks known as Regular Expression Denial of Service (ReDoS) attacks.
 Here, an attacker merely dispatches a small attack string to a server, provoking high-complexity behaviours in the server's regular expression engine.
 Such attacks, whether intentional or accidental, have led to the failure of real-world systems (additional details can be found in the practical overview section in chapter \ref{Introduction}).
-Various disambiguation strategies are employed to select sub-matches, notably including Greedy and POSIX strategies. POSIX, the strategy most widely adopted in practice, invariably opts for the longest possible sub-match. Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a descriptive definition of the POSIX rule in section 1, last paragraph:
+Various disambiguation strategies are
+employed to select sub-matches, notably including Greedy and POSIX strategies. POSIX, the strategy most widely adopted in practice, invariably opts for the longest possible sub-match. Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a descriptive definition of the POSIX rule in section 1, last paragraph:
 %Regular expressions
@@ -283,6 +296,7 @@
 %There have been prose definitions like the following
 %by Kuklewicz \cite{KuklewiczHaskell}:
 %described the POSIX rule as (section 1, last paragraph):
+\marginpar{\em Deleted some peripheral specifications.}
 \begin{quote}
 \begin{itemize}
 \item
@@ -291,16 +305,17 @@
 \item
 higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
 \item
-REs have right associative concatenation which can be changed with parenthesis\\
-\item
-parenthesized subexpressions return the match from their last usage\\
-\item
-text of component subexpressions must be contained in the text of the
-higher-level subexpressions\\
-\item
-if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
-\item
-if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
+ $\ldots$
+%REs have right associative concatenation which can be changed with parenthesis\\
+%\item
+%parenthesized subexpressions return the match from their last usage\\
+%\item
+%text of component subexpressions must be contained in the text of the
+%higher-level subexpressions\\
+%\item
+%if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
+%\item
+%if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
 \end{itemize}
 \end{quote}
 However, the author noted that various lexers that claim to be POSIX
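The contrast between the POSIX leftmost-longest rule and the first-match behaviour of backtracking engines, as well as the ambiguity that fuels catastrophic backtracking, can be illustrated concretely. The following is a small sketch in Python, whose `re` module is a backtracking engine and not POSIX-compliant; the helper `posix_alt` is a hypothetical illustration of the longest-match rule for a bare alternation, not part of any real engine:

```python
import re

# A backtracking engine resolves the ambiguity in "a|ab" by committing
# to the first alternative that matches, so against "ab" it consumes
# only "a" -- it does not look for a longer overall match.
print(re.match(r"a|ab", "ab").group(0))  # -> a

# A POSIX engine must instead return the leftmost-longest match.
# Hypothetical helper mimicking that rule for a bare alternation:
def posix_alt(alternatives, s):
    """Return the longest prefix of s matched by any single alternative."""
    best = ""
    for alt in alternatives:
        m = re.match(alt, s)
        if m and len(m.group(0)) > len(best):
            best = m.group(0)
    return best

print(posix_alt(["a", "ab"], "ab"))  # -> ab

# The same kind of ambiguity drives catastrophic backtracking: "(a*)*b"
# admits exponentially many ways to split a run of a's between the two
# stars, and on a string of a's with no "b" a backtracking engine
# explores all of them before failing (runtime grows roughly
# exponentially in the number of a's).
print(re.match(r"(a*)*b", "a" * 15))  # -> None, after many backtracking steps
```

The first two calls show that the choice of disambiguation strategy changes which sub-match is reported; the last shows that a match *failure* on an ambiguous pattern is where the disproportionate runtime appears.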