%and a bit-coded lexing algorithm with proven correctness and time bounds.

%TODO: look up snort rules to use here--give readers idea of what regexes look like

Regular expressions, since their inception in the 1940s,
have been subject to extensive study and implementation.
Their primary application lies in lexing,
a process whereby an unstructured input string is disassembled into a tree of tokens
under the guidance of a regular expression.
This is often used to match strings that comprise numerous fields,
where certain fields may recur or be omitted.
For example, the regular expression for recognising email addresses is
\begin{center}
\verb|^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$|.
\end{center}
Using this, regular expression matchers and lexers are able to extract,
for example, the domain names by the use of \verb|[a-zA-Z0-9.-]+|.
Consequently, they are an indispensable component in text processing tools
of software systems such as compilers, IDEs, and firewalls.
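
To make the extraction step concrete, the following sketch uses Scala's
standard \verb|scala.util.matching.Regex| API to pull the domain out of an
email address via a capture group; the object and method names are
illustrative only, not part of any formal development discussed later.
\begin{verbatim}
import scala.util.matching.Regex

object EmailDomain {
  // Same shape as the pattern above, with the local part and the
  // domain each wrapped in a capture group.
  val email: Regex =
    """^([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})$""".r

  def domain(s: String): Option[String] =
    email.findFirstMatchIn(s).map(_.group(2))

  def main(args: Array[String]): Unit =
    println(domain("alice@example.com"))   // Some(example.com)
}
\end{verbatim}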

The study of regular expressions is ongoing due to an
issue known as catastrophic backtracking.
This phenomenon refers to scenarios in which the regular expression
matching or lexing engine exhibits a disproportionately long
runtime despite the simplicity of the input and expression.

The origin of catastrophic backtracking is rooted in the
ambiguities of lexing.
In the process of matching a multi-character string with
a regular expression that encompasses several sub-expressions,
different positions can be designated to mark
the boundaries of corresponding substrings of the sub-expressions.
For instance, in matching the string $aaa$ with the
regular expression $(a+aa)^*$, the divide between
the initial match and the subsequent iteration could either be
set between the first and second characters ($a | aa$)
or between the second and third characters ($aa | a$).
As both the length of the input string and the structural complexity
of the regular expression increase,
the number of potential delimiter combinations can grow exponentially,
leading to a corresponding increase in complexity for algorithms
that do not optimally navigate these possibilities.
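
This blow-up can be observed directly.
The sketch below (plain Scala; all names are illustrative) enumerates every
way $(a+aa)^*$ can split the string $a^n$, one block of length one or two per
iteration of the star; the count obeys the Fibonacci recurrence and therefore
grows exponentially in $n$.
\begin{verbatim}
object Ambiguity {
  // All ways to split a^n into blocks "a" or "aa" -- one block per
  // iteration of (a+aa)*, hence one entry per possible parse.
  def splits(n: Int): List[List[String]] =
    if (n == 0) List(Nil)
    else if (n == 1) List(List("a"))
    else splits(n - 1).map("a" :: _) ++ splits(n - 2).map("aa" :: _)

  // Counting only: the Fibonacci recurrence in iterative form.
  def count(n: Int): BigInt =
    (2 to n).foldLeft((BigInt(1), BigInt(1))) {
      case ((a, b), _) => (b, a + b)
    }._2

  def main(args: Array[String]): Unit = {
    splits(3).foreach(s => println(s.mkString("|")))  // a|a|a, a|aa, aa|a
    println(count(50))   // 20365011074 parses for a string of fifty a's
  }
}
\end{verbatim}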

Catastrophic backtracking facilitates a category of computationally
inexpensive attacks known as Regular Expression Denial of Service (ReDoS)
attacks. Here, an attacker merely dispatches a small attack string to a
server, provoking high-complexity behaviours in the server's regular
expression engine. Such attacks, whether intentional or accidental, have led
to the failure of real-world systems (additional details can be found in the
practical overview section in chapter \ref{Introduction}).
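
The effect is easy to reproduce with a backtracking engine.
The sketch below drives Java's \verb|java.util.regex| from Scala on the
classic ReDoS expression \verb|(a+)+b| (the pattern and input sizes are
chosen purely for illustration); each added character roughly doubles the
running time.
\begin{verbatim}
import java.util.regex.Pattern

object ReDoS {
  def main(args: Array[String]): Unit = {
    // Nested repetition followed by a forced failure: the engine
    // backtracks through exponentially many ways to split the a's.
    val p = Pattern.compile("(a+)+b")
    for (n <- 18 to 28 by 2) {
      val input = "a" * n
      val t0 = System.nanoTime()
      p.matcher(input).matches()          // always false
      println(f"n = $n%2d: ${(System.nanoTime() - t0) / 1e6}%8.1f ms")
    }
  }
}
\end{verbatim}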

Various disambiguation strategies are employed to select sub-matches,
notably including Greedy and POSIX strategies. POSIX, the strategy most
widely adopted in practice, invariably opts for the longest possible
sub-match. Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a
descriptive definition of the POSIX rule in section 1, last paragraph:
\begin{quote}
\begin{itemize}
\item
regular expressions (REs) take the leftmost starting match, and the longest match starting there\\
\item
earlier subpatterns have leftmost-longest priority over later subpatterns\\
\item
if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
\item
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
\end{itemize}
\end{quote}
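
For the earlier example this means that, when matching $aaa$ against
$(a+aa)^*$, a POSIX lexer must let the first iteration consume as much as
possible, producing the split $aa \,|\, a$, whereas a Greedy strategy that
prefers the left alternative of $a+aa$ produces $a \,|\, a \,|\, a$.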
However, the author noted that lexers are rarely correct according to this standard.
Being able to formally define and capture the idea of POSIX rules and
matching/lexing algorithms, and to prove the correctness of such algorithms
against the POSIX semantics definitions, would therefore be valuable.

Formal proofs are machine-checked programs, such as the ones written in
Isabelle/HOL, and they are a powerful means for computer scientists to be
certain about the correctness of their algorithms and related properties.
This is done by automatically checking that the end goal is derived solely
from a list of axioms, definitions and proof rules that are known to be
correct.
The problems of regular expression lexing and catastrophic backtracking are
suitable targets for such methods, as their implementation and complexity
analysis tend to be error-prone.
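
One classic algorithm that lends itself to this style of verification is
matching via Brzozowski derivatives. The Scala sketch below is the standard
textbook construction, given here for illustration rather than as the
specific verified development discussed in this thesis.
\begin{verbatim}
sealed trait Rexp
case object ZERO extends Rexp                      // matches nothing
case object ONE extends Rexp                       // matches ""
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp    // alternative r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp    // sequence
case class STAR(r: Rexp) extends Rexp              // iteration

object Derivatives {
  // Can r match the empty string?
  def nullable(r: Rexp): Boolean = r match {
    case ZERO        => false
    case ONE         => true
    case CHAR(_)     => false
    case ALT(r1, r2) => nullable(r1) || nullable(r2)
    case SEQ(r1, r2) => nullable(r1) && nullable(r2)
    case STAR(_)     => true
  }

  // The derivative of r w.r.t. c: what is left to match after
  // consuming the character c.
  def der(c: Char, r: Rexp): Rexp = r match {
    case ZERO | ONE  => ZERO
    case CHAR(d)     => if (c == d) ONE else ZERO
    case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
    case SEQ(r1, r2) =>
      if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
      else SEQ(der(c, r1), r2)
    case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
  }

  def matches(r: Rexp, s: String): Boolean =
    nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
}
\end{verbatim}
With $(a+aa)^*$ encoded as
\verb|STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))|, the call
\verb|matches(r, "aaa")| returns true after taking one derivative per input
character, with no backtracking involved.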

There have been many interesting steps taken in the direction of formally
verified regular expression lexing and matching.

%Regular expressions are widely used in computer science:
%be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
%command-line tools like $\mathit{grep}$ that facilitate easy
%text-processing \cite{grep}; network intrusion