# HG changeset patch
# User Chengsong
# Date 1685140935 -3600
# Node ID d15a0b7d6d9029f901bcc4e9c20e25ec28365bbd
# Parent 70c10dc41606a7ee92794f2e462f3b5b5627881e
new amend
diff -r 70c10dc41606 -r d15a0b7d6d90 ChengsongTanPhdThesis/Chapters/Introduction.tex
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri May 26 08:10:17 2023 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri May 26 23:42:15 2023 +0100
@@ -204,40 +204,85 @@
 %TODO: look up snort rules to use here--give readers idea of what regexes look like
+Regular expressions, since their inception in the 1940s,
+have been subject to extensive study and implementation.
+Their primary application lies in lexing,
+a process whereby an unstructured input string is broken down into a tree of
+tokens, with the tree's construction guided by the shape of the regular expression.
+This is often used to match strings that consist of numerous fields,
+where certain fields may recur or be omitted.
+For example, a regular expression for recognising email addresses is
+\begin{center}
+\verb|^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$|.
+\end{center}
+Using this, regular expression matchers and lexers are able to extract,
+for example, the domain name of an address via the sub-expression
+\verb|[a-zA-Z0-9.-]+| (a short illustrative code sketch is given below).
+Consequently, they are an indispensable component in text processing tools
+of software systems such as compilers, IDEs, and firewalls.
+The study of regular expressions is still ongoing, in part due to an
+issue known as catastrophic backtracking.
+This phenomenon refers to scenarios in which the regular expression
+matching or lexing engine exhibits a disproportionately long
+runtime despite the simplicity of the input and expression.
+
+Catastrophic backtracking is rooted in the
+ambiguities of lexing.
+When matching a multi-character string with
+a regular expression that encompasses several sub-expressions,
+different positions can be designated to mark
+the boundaries of the substrings corresponding to the sub-expressions.
+For instance, in matching the string $aaa$ with the
+regular expression $(a+aa)^*$, the divide between
+the initial match and the subsequent iteration could either be
+set between the first and second characters ($a | aa$)
+or between the second and third characters ($aa | a$).
+As both the length of the input string and the structural complexity
+of the regular expression increase,
+the number of potential delimiter combinations can grow exponentially,
+and algorithms that explore these possibilities naively
+exhibit a correspondingly exponential runtime
+(a short demonstration of this behaviour is given below).
+
+Catastrophic backtracking enables a category of computationally inexpensive
+attacks known as Regular Expression Denial of Service (ReDoS) attacks.
+Here, an attacker merely dispatches a small attack string to a server,
+provoking high-complexity behaviours in the server's regular expression engine.
+Such attacks, whether intentional or accidental, have led to the failure of
+real-world systems (additional details can be found in the practical overview
+section in chapter \ref{Introduction}).
+
+Various disambiguation strategies are employed to select sub-matches,
+most notably the Greedy and POSIX strategies.
+POSIX, the strategy most widely adopted in practice, invariably opts for the
+longest possible sub-match (an example contrasting the two strategies is given below).
+Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a prose description of
+the POSIX rule (section 1, last paragraph):
-Regular expressions
-have been extensively studied and
-implemented since their invention in 1940s.
-It is primarily used in lexing, where an unstructured input string is broken
-down into a tree of tokens, where that tree's construction is guided by the shape of the regular expression.
-Being able to delimit different subsections of a string and potentially
-extract information from inputs, regular expression
-matchers and lexers serve as an indispensible component in modern software systems' text processing tasks
-such as compilers, IDEs, and firewalls.
-Research on them is far from over due to the well-known issue called catastrophic-backtracking.
-This stems from the ambiguities of lexing: when matching a multiple-character string with a regular
-exression that includes serveral sub-expressions, there might be different positions to set
-the border between sub-expressions' corresponding sub-strings.
-For example, matching the string $aaa$ against the regular expression
-$(a+aa)^*$, the border between the initial match and the second iteration
-could be between the first and second character ($a | aa$)
-or between the second and third character ($aa | a$).
-As the size of the input string and the structural complexity
-of the regular expression increase,
-the number of different combinations of setting delimiters can grow exponentially, and
-algorithms that explore these possibilities unwisely will also see an exponential complexity.
-
-Catastrophic backtracking allow a class of computationally inexpensive attacks called
-Regular expression Denial of Service attacks (ReDoS).
-These attacks, be it deliberate or not, have caused real-world systems to go down (see more
-details of this in the practical overview section in chapter \ref{Introduction}).
-There are different disambiguation strategies to select sub-matches, most notably Greedy and POSIX.
-The widely adopted strategy in practice is POSIX, which always go for the longest sub-match possible.
-There have been prose definitions like the following
-by Kuklewicz \cite{KuklewiczHaskell}:
-described the POSIX rule as (section 1, last paragraph):
+%Regular expressions
+%have been extensively studied and
+%implemented since their invention in 1940s.
+%It is primarily used in lexing, where an unstructured input string is broken
+%down into a tree of tokens.
+%That tree's construction is guided by the shape of the regular expression.
+%This is particularly useful in expressing the shape of a string
+%with many fields, where certain fields might repeat or be omitted.
+%Regular expression matchers and Lexers allow us to
+%identify and delimit different subsections of a string and potentially
+%extract information from inputs, making them
+%an indispensible component in modern software systems' text processing tasks
+%such as compilers, IDEs, and firewalls.
+%Research on them is far from over due to the well-known issue called catastrophic-backtracking,
+%which means the regular expression matching or lexing engine takes an unproportional time to run
+%despite the input and the expression being relatively simple.
+%
+%Catastrophic backtracking stems from the ambiguities of lexing:
+%when matching a multiple-character string with a regular
+%exression that includes serveral sub-expressions, there might be different positions to set
+%the border between sub-expressions' corresponding sub-strings.
+%For example, matching the string $aaa$ against the regular expression
+%$(a+aa)^*$, the border between the initial match and the second iteration
+%could be between the first and second character ($a | aa$)
+%or between the second and third character ($aa | a$).
+%As the size of the input string and the structural complexity
+%of the regular expression increase,
+%the number of different combinations of delimiters can grow exponentially, and
+%algorithms that explore these possibilities unwisely will also see an exponential complexity.
+%
+%Catastrophic backtracking allow a class of computationally inexpensive attacks called
+%Regular expression Denial of Service attacks (ReDoS), in which the hacker
+%simply sends out a small attack string to a server,
+%triggering high-complexity behaviours in its regular expression engine.
+%These attacks, be it deliberate or not, have caused real-world systems to go down (see more
+%details of this in the practical overview section in chapter \ref{Introduction}).
+%There are different disambiguation strategies to select sub-matches, most notably Greedy and POSIX.
+%The widely adopted strategy in practice is POSIX, which always go for the longest sub-match possible.
+%There have been prose definitions like the following
+%by Kuklewicz \cite{KuklewiczHaskell}:
+%described the POSIX rule as (section 1, last paragraph):
 \begin{quote}
 \begin{itemize}
 \item
@@ -258,19 +303,25 @@
 if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
 \end{itemize}
 \end{quote}
-But the author noted that lexers are rarely correct according to this standard.
+However, the author noted that lexers are rarely correct according to this standard.
+Being able to formally define the POSIX rules and the corresponding matching and lexing algorithms,
+and to prove the correctness of such algorithms against the POSIX semantics,
+would therefore be valuable.
 
-Formal proofs, such as the one written in Isabelle/HOL, is a powerful means
+Formal proofs are machine-checked mathematical arguments,
+such as the ones written in Isabelle/HOL, and are a powerful means
 for computer scientists to be certain about correctness of their algorithms and correctness properties.
-This is done by automatically checking the end goal is derived solely from a list of axioms, definitions
+This is done by automatically checking that the end goal is derived solely from a list of axioms, definitions
 and proof rules that are known to be correct.
 The problem of regular expressions lexing and catastrophic backtracking are a suitable problem for such methods
 as their implementation and complexity analysis tend to be error-prone.
+There have been many interesting steps taken in the direction of formally verified
+regular expression lexing and matching.
 
-
+(ends)
 %Regular expressions are widely used in computer science:
 %be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
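+
+To make the domain extraction mentioned above concrete, the following short
+Scala sketch is purely illustrative (Scala is used here only as an example
+language; the capturing parentheses around the domain sub-expression and the
+sample address are additions made for this illustration and are not part of
+the email regular expression shown earlier):
+\begin{verbatim}
+// Illustrative only: extract the domain of an email address by wrapping
+// the domain sub-expression in a capturing group.
+object EmailDomainExample {
+  def main(args: Array[String]): Unit = {
+    val email   = "alice@example.org"
+    val pattern = """^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})$""".r
+    email match {
+      case pattern(domain) => println(s"domain: $domain") // domain: example.org
+      case _               => println("not an email address")
+    }
+  }
+}
+\end{verbatim}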
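+
+The ambiguity of $(a+aa)^*$ can also be observed on a mainstream backtracking
+matcher.  The following sketch is again illustrative only: it uses the JDK's
+\verb|java.util.regex| engine from Scala, the pattern \verb!(a|aa)*b! is a
+direct transcription of $(a+aa)^*$ with a trailing $b$ that never occurs in
+the input, and the exact timings depend on the machine.  Because the $b$ can
+never be found, the engine ends up trying essentially every way of splitting
+$a^n$ into $a$s and $aa$s before giving up:
+\begin{verbatim}
+// Illustrative only: the runtime of matches() below grows roughly
+// exponentially in n on a backtracking engine, because every decomposition
+// of a^n into a's and aa's is tried and rejected.
+import java.util.regex.Pattern
+
+object BacktrackingExample {
+  def main(args: Array[String]): Unit = {
+    val pattern = Pattern.compile("(a|aa)*b")
+    for (n <- List(20, 25, 30, 35)) {
+      val input   = "a" * n
+      val start   = System.nanoTime()
+      val matched = pattern.matcher(input).matches() // always false: no 'b'
+      val millis  = (System.nanoTime() - start) / 1000000
+      println(s"n = $n, matched = $matched, time = $millis ms")
+    }
+  }
+}
+\end{verbatim}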
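+
+Finally, to illustrate how the choice of disambiguation strategy changes the
+sub-matches (but not the overall match), consider running the pattern
+\verb!(a|ab)(c|bcd)(d*)! over the string \verb|abcd|.  The sketch below is
+again illustrative only; it prints the greedy (Perl-style) result produced by
+the JDK engine, whereas a POSIX-compliant engine is required to report the
+longest sub-match for each group, assigned from left to right:
+\begin{verbatim}
+// Illustrative only: greedy vs POSIX sub-match disambiguation.
+import java.util.regex.Pattern
+
+object DisambiguationExample {
+  def main(args: Array[String]): Unit = {
+    val m = Pattern.compile("(a|ab)(c|bcd)(d*)").matcher("abcd")
+    if (m.matches()) {
+      // Greedy result: group1 = "a", group2 = "bcd", group3 = ""
+      println(s"group1 = ${m.group(1)}, group2 = ${m.group(2)}, group3 = ${m.group(3)}")
+    }
+    // Under POSIX disambiguation the same overall match "abcd" would be
+    // reported with group1 = "ab", group2 = "c", group3 = "d".
+  }
+}
+\end{verbatim}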