--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri May 26 08:10:17 2023 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri May 26 23:42:15 2023 +0100
@@ -204,40 +204,85 @@
%TODO: look up snort rules to use here--give readers idea of what regexes look like
+Regular expressions, since their inception in the 1940s,
+have been subject to extensive study and implementation.
+Their primary application lies in lexing,
+a process whereby, guided by regular expressions,
+an unstructured input string is disassembled into a tree of tokens.
+This is often used to match strings that comprise numerous fields,
+where certain fields may recur or be omitted.
+For example, a typical regular expression for recognising email addresses is
+\begin{center}
+\verb|^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$|.
+\end{center}
+Using this, regular expression matchers and lexers are able to extract,
+for example, the domain names by means of the sub-expression \verb|[a-zA-Z0-9.-]+|.
+Consequently, they are an indispensable component of the text processing tools
+in software systems such as compilers, IDEs, and firewalls.
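+As a small illustration (a sketch in Scala, not part of any formal development
+in this thesis; the object name and example address are invented), the expression
+above can be used to extract the domain of a matched address by wrapping the
+domain part in a capture group:
+\begin{verbatim}
+// Illustrative sketch: extract the domain of an e-mail address using
+// the regular expression from the text, with the domain part captured.
+object EmailDomain {
+  // same expression as above, with the domain wrapped in (...)
+  val email = raw"^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})$$".r
+
+  def main(args: Array[String]): Unit =
+    "alice@example.org" match {
+      case email(domain) => println(s"domain: $domain") // domain: example.org
+      case _             => println("not an e-mail address")
+    }
+}
+\end{verbatim}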
+The study of regular expressions is ongoing due to an
+issue known as catastrophic backtracking.
+This phenomenon refers to scenarios in which the regular expression
+matching or lexing engine exhibits a disproportionately long
+runtime despite the simplicity of the input and expression.
+
+The origin of catastrophic backtracking is rooted in the
+ambiguities of lexing.
+In the process of matching a multi-character string with
+a regular expression that encompasses several sub-expressions,
+different positions can be designated to mark the boundaries
+between the substrings corresponding to the sub-expressions.
+For instance, in matching the string $aaa$ with the
+regular expression $(a+aa)^*$, the divide between
+the initial match and the subsequent iteration could either be
+set between the first and second characters ($a | aa$) or between
+the second and third characters ($aa | a$).
+As both the length of the input string and the structural complexity
+of the regular expression increase, the number of potential delimiter
+combinations can grow exponentially, leading to a corresponding increase
+in complexity for algorithms that do not optimally navigate these possibilities.
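+To see how quickly these possibilities multiply, the following Scala sketch
+(illustrative only; the object name is invented) counts the ways in which the
+string $a^n$ can be split into blocks matching the two alternatives of
+$(a+aa)^*$; the count follows the Fibonacci numbers and therefore grows
+exponentially in $n$:
+\begin{verbatim}
+// Illustrative sketch: count the decompositions of a^n into blocks
+// "a" and "aa", i.e. the candidate splits a backtracking matcher may
+// explore for (a + aa)*.  The last block is either "a" or "aa",
+// giving the recurrence ways(n) = ways(n-1) + ways(n-2).
+object SplitCount {
+  def ways(n: Int): BigInt = {
+    var prev = BigInt(1)   // ways(0): the empty string, zero iterations
+    var curr = BigInt(1)   // ways(1): the single block "a"
+    for (_ <- 2 to n) { val next = prev + curr; prev = curr; curr = next }
+    curr
+  }
+
+  def main(args: Array[String]): Unit =
+    for (n <- List(3, 10, 20, 50))
+      println(s"a^$n can be split in ${ways(n)} ways")
+}
+\end{verbatim}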
+
+Catastrophic backtracking facilitates a category of computationally inexpensive
+attacks known as Regular Expression Denial of Service (ReDoS) attacks.
+Here, an attacker merely dispatches a small attack string to a server,
+provoking high-complexity behaviours in the server's regular expression engine.
+Such attacks, whether intentional or accidental, have led to the failure of
+real-world systems (additional details can be found in the practical overview
+section in chapter \ref{Introduction}).
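+The effect can be observed on a typical backtracking engine such as
+\verb|java.util.regex|: the sketch below (illustrative only; timings depend on
+the JVM and machine) matches the pattern \verb!(a|aa)*c! against attack
+strings $a^n$, which contain no \verb|c| and therefore force the engine to try
+every split of the $a$'s before failing:
+\begin{verbatim}
+// Illustrative sketch: time a backtracking engine on (a|aa)*c and
+// inputs a^n.  No input contains a 'c', so the engine backtracks
+// through every split before reporting failure; the time is expected
+// to grow roughly exponentially in n.
+object ReDoSDemo {
+  def main(args: Array[String]): Unit =
+    for (n <- 20 to 32 by 4) {
+      val attack = "a" * n
+      val start  = System.nanoTime()
+      val result = attack.matches("(a|aa)*c")        // always false
+      val ms     = (System.nanoTime() - start) / 1000000
+      println(s"n = $n: result = $result, time = $ms ms")
+    }
+}
+\end{verbatim}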
+
+Various disambiguation strategies are employed to select sub-matches,
+most notably Greedy and POSIX.
+POSIX, the strategy most widely adopted in practice,
+invariably opts for the longest possible sub-match.
+Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a descriptive
+definition of the POSIX rule (section 1, last paragraph):
-Regular expressions
-have been extensively studied and
-implemented since their invention in 1940s.
-It is primarily used in lexing, where an unstructured input string is broken
-down into a tree of tokens, where that tree's construction is guided by the shape of the regular expression.
-Being able to delimit different subsections of a string and potentially
-extract information from inputs, regular expression
-matchers and lexers serve as an indispensible component in modern software systems' text processing tasks
-such as compilers, IDEs, and firewalls.
-Research on them is far from over due to the well-known issue called catastrophic-backtracking.
-This stems from the ambiguities of lexing: when matching a multiple-character string with a regular
-exression that includes serveral sub-expressions, there might be different positions to set
-the border between sub-expressions' corresponding sub-strings.
-For example, matching the string $aaa$ against the regular expression
-$(a+aa)^*$, the border between the initial match and the second iteration
-could be between the first and second character ($a | aa$)
-or between the second and third character ($aa | a$).
-As the size of the input string and the structural complexity
-of the regular expression increase,
-the number of different combinations of setting delimiters can grow exponentially, and
-algorithms that explore these possibilities unwisely will also see an exponential complexity.
-
-Catastrophic backtracking allow a class of computationally inexpensive attacks called
-Regular expression Denial of Service attacks (ReDoS).
-These attacks, be it deliberate or not, have caused real-world systems to go down (see more
-details of this in the practical overview section in chapter \ref{Introduction}).
-There are different disambiguation strategies to select sub-matches, most notably Greedy and POSIX.
-The widely adopted strategy in practice is POSIX, which always go for the longest sub-match possible.
-There have been prose definitions like the following
-by Kuklewicz \cite{KuklewiczHaskell}:
-described the POSIX rule as (section 1, last paragraph):
\begin{quote}
\begin{itemize}
\item
@@ -258,19 +303,25 @@
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
\end{itemize}
\end{quote}
-But the author noted that lexers are rarely correct according to this standard.
+However, the author noted that lexers are rarely correct according to this standard.
+It would therefore be valuable to formally define the POSIX rules and the
+matching/lexing algorithms, and to prove the correctness of such algorithms
+against the POSIX semantics.
-Formal proofs, such as the one written in Isabelle/HOL, is a powerful means
+Formal proofs are machine-checked programs,
+such as the ones written in Isabelle/HOL. They are a powerful means
for computer scientists to be certain about correctness of their algorithms and correctness properties.
-This is done by automatically checking the end goal is derived solely from a list of axioms, definitions
+This is done by automatically checking that the end goal is derived solely from a list of axioms, definitions
and proof rules that are known to be correct.
-The problem of regular expressions lexing and catastrophic backtracking are a suitable problem for such
-methods as their implementation and complexity analysis tend to be error-prone.
+The problem of regular expression lexing and catastrophic backtracking is well suited to such
+methods, as implementations and complexity analyses in this area tend to be error-prone.
+There have already been many interesting steps taken in applying formal proofs
+to regular expression lexing and matching.
-
%Regular expressions are widely used in computer science:
%be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;