--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri Dec 30 23:41:44 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri May 26 08:09:30 2023 +0100
@@ -207,25 +207,89 @@
+Regular expressions
+have been extensively studied and
+implemented since their invention in the 1940s.
+They are primarily used in lexing, where an unstructured input string is broken
+down into a tree of tokens whose construction is guided by the shape of the regular expression.
+Being able to delimit different subsections of a string and potentially
+extract information from inputs, regular expression
+matchers and lexers serve as an indispensable component in the text-processing tasks of modern
+software systems such as compilers, IDEs, and firewalls.
+Research on them is far from over due to the well-known issue of catastrophic backtracking.
+This stems from the ambiguities of lexing: when matching a multi-character string with a regular
+expression that includes several sub-expressions, there might be different positions at which to set
+the borders between the sub-strings corresponding to the sub-expressions.
+For example, when matching the string $aaa$ against the regular expression
+$(a+aa)^*$, the border between the first and the second iteration
+could lie between the first and second character ($a | aa$)
+or between the second and third character ($aa | a$).
+As the size of the input string and the structural complexity
+of the regular expression increase,
+the number of possible ways of placing these borders can grow exponentially, and
+algorithms that explore these possibilities naively will also exhibit exponential running time.
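+
+To see how this exponential behaviour arises, consider the following naive backtracking
+matcher specialised to $(a+aa)^*$ (an illustrative Scala sketch, not one of the algorithms
+studied in this thesis). On inputs of the form $aa\ldots ab$, which cannot match, it explores
+every possible placement of the borders and therefore needs time exponential in the length of
+the input:
+\begin{verbatim}
+// Illustrative sketch only: a naive backtracking "matcher" for (a + aa)*.
+// At each step it tries to consume either one 'a' or two 'a's and then
+// recurses on the rest; when one choice fails, the other is explored too.
+def matchAs(s: String): Boolean =
+  if (s.isEmpty) true                                   // all input consumed
+  else (s.startsWith("a") && matchAs(s.drop(1))) ||     // border after one 'a'
+       (s.startsWith("aa") && matchAs(s.drop(2)))       // border after two 'a's
+
+// matchAs("a" * 30 + "b") fails only after millions of recursive calls,
+// because every placement of the borders is tried before giving up.
+\end{verbatim}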
-Regular expressions are widely used in computer science:
-be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
-command-line tools like $\mathit{grep}$ that facilitate easy
-text-processing \cite{grep}; network intrusion
-detection systems that inspect suspicious traffic; or compiler
-front ends.
-Given their usefulness and ubiquity, one would assume that
-modern regular expression matching implementations
-are mature and fully studied.
-Indeed, in many popular programming languages' regular expression engines,
-one can supply a regular expression and a string, and in most cases get
-get the matching information in a very short time.
-Those engines can sometimes be blindingly fast--some
-network intrusion detection systems
-use regular expression engines that are able to process
-hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
-However, those engines can sometimes also exhibit a surprising security vulnerability
-under a certain class of inputs.
+Catastrophic backtracking enables a class of computationally inexpensive attacks called
+Regular expression Denial of Service (ReDoS) attacks.
+These attacks, whether deliberate or not, have caused real-world systems to go down (see the
+practical overview section in Chapter~\ref{Introduction} for more details).
+There are different disambiguation strategies for selecting sub-matches, most notably Greedy and POSIX.
+The strategy widely adopted in practice is POSIX, which always goes for the longest sub-match possible
+(a concrete illustration follows the quoted rules below).
+Prose definitions of this strategy exist; for example, Kuklewicz \cite{KuklewiczHaskell}
+described the POSIX rule as follows (section 1, last paragraph):
+\begin{quote}
+ \begin{itemize}
+ \item
+regular expressions (REs) take the leftmost starting match, and the longest match starting there\\
+\item
+earlier subpatterns have leftmost-longest priority over later subpatterns\\
+\item
+higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
+\item
+REs have right associative concatenation which can be changed with parenthesis\\
+\item
+parenthesized subexpressions return the match from their last usage\\
+\item
+text of component subexpressions must be contained in the text of the
+higher-level subexpressions\\
+\item
+if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
+\item
+if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
+\end{itemize}
+\end{quote}
+However, Kuklewicz noted that lexer implementations are rarely correct according to this standard.
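+
+As an illustration of the leftmost-longest rule, consider again matching the string $aaa$
+against $(a+aa)^*$: POSIX chooses the longest possible sub-match in the first iteration,
+whereas the Greedy strategy always prefers the first alternative:
+\begin{center}
+  POSIX: $\underbrace{aa}_{\textrm{1st iteration}}\;\underbrace{a}_{\textrm{2nd iteration}}$
+  \qquad versus \qquad
+  Greedy: $\underbrace{a}_{\textrm{1st}}\;\underbrace{a}_{\textrm{2nd}}\;\underbrace{a}_{\textrm{3rd}}$
+\end{center}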
+
+Formal proofs, such as those written in Isabelle/HOL, are a powerful means
+for computer scientists to be certain about the correctness of their algorithms.
+They work by mechanically checking that the end goal is derived solely from a list of axioms, definitions
+and proof rules that are known to be correct.
+Regular expression lexing and catastrophic backtracking are suitable targets for such
+methods, as their implementations and complexity analyses tend to be error-prone.
+
+
+
+
+
+
%However, , this is not the case for $\mathbf{all}$ inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed
@@ -972,7 +1036,7 @@
the length of matched strings in the various subexpressions.''
\end{quote}
%\noindent
-We think the implementation complexity of POSIX rules also come from
+It has been recognised that the implementation complexity of POSIX rules also stems from
the specification being not very precise.
The developers of the regexp package of Go
\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
@@ -1010,6 +1074,12 @@
%The text above
%is trying to capture something very precise,
%and is crying out for formalising.
+Ribeiro and Du Bois \cite{RibeiroAgda2017} have
+formalised the notion of bit-coded regular expressions
+and proved how they relate to simple regular expressions in
+the dependently-typed proof assistant Agda.
+They also proved the soundness and completeness of a matching algorithm
+based on bit-coded regular expressions.
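+To give a flavour of this representation, the following Scala sketch shows one way such
+annotated regular expressions can be set up (illustrative names of our own choosing, not
+Ribeiro and Du Bois' Agda definitions): every node carries a sequence of bits that records,
+for instance, which branch of an alternative was taken, so that a lexical value can later be
+decoded from the bits.
+\begin{verbatim}
+// Illustrative sketch of bit-coded (annotated) regular expressions.
+sealed trait Bit
+case object Z extends Bit   // e.g. "left branch taken"
+case object S extends Bit   // e.g. "right branch taken" / "stop iterating"
+
+sealed trait ARexp          // every node carries a list of bits bs
+case object AZERO extends ARexp
+case class AONE(bs: List[Bit]) extends ARexp
+case class ACHAR(bs: List[Bit], c: Char) extends ARexp
+case class AALT(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
+case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
+case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp
+\end{verbatim}
+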
Ausaf et al. \cite{AusafDyckhoffUrban2016}
are the first to
give a quite simple formalised POSIX
@@ -1025,6 +1095,7 @@
Their original specification and proof were unfixable
according to Ausaf et al.
+
In the next section, we will briefly
introduce Brzozowski derivatives and Sulzmann
and Lu's algorithm, which the main point of this thesis builds on.
@@ -1262,6 +1333,9 @@
%----------------------------------------------------------------------------------------
\section{Contribution}
+{\color{red} \rule{\linewidth}{0.5mm}}
+\textbf{issue with this part: way too short, not enough details of what I have done.}
+{\color{red} \rule{\linewidth}{0.5mm}}
In this thesis,
we propose a solution to catastrophic
@@ -1269,6 +1343,46 @@
regular expression lexing algorithm
that is both fast
and correct.
+{\color{red} \rule{\linewidth}{0.5mm}}
+\HandRight Added content:
+%Package \verb`pifont`: \ding{43}
+%Package \verb`utfsym`: \usym{261E}
+%Package \verb`dingbat`: \leftpointright
+%Package \verb`dingbat`: \rightpointright
+
+In particular, the main problem we solved on top of previous work was
+coming up with a formally verified algorithm called $\blexersimp$ based
+on Brzozowski derivatives. It calculates a POSIX
+lexical value from a string and a regular expression. This algorithm was originally
+proposed by Sulzmann and Lu \cite{Sulzmann2014}, but we made the key observation that its $\textit{nub}$
+function does not really simplify intermediate results where it needs to, and we improved the
+algorithm accordingly.
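+
+As a point of reference, the plain, value-free Brzozowski matcher that $\blexersimp$ refines
+can be sketched in Scala as follows (a standard textbook definition given here for
+illustration, not the verified Isabelle/HOL code of this thesis):
+\begin{verbatim}
+// Standard Brzozowski-derivative matcher (illustrative sketch).
+sealed trait Rexp
+case object ZERO extends Rexp                    // matches nothing
+case object ONE  extends Rexp                    // matches the empty string
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 r2
+case class STAR(r: Rexp) extends Rexp            // r*
+
+// nullable(r): can r match the empty string?
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO        => false
+  case ONE         => true
+  case CHAR(_)     => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_)     => true
+}
+
+// der(c, r): the derivative of r with respect to the character c
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO        => ZERO
+  case ONE         => ZERO
+  case CHAR(d)     => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
+}
+
+// matching: take derivatives for every character, then test nullability
+def matches(r: Rexp, s: String): Boolean =
+  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
+\end{verbatim}
+Without simplification, the intermediate expressions produced by repeated applications of
+\texttt{der} can grow quickly; the simplifications in $\blexersimp$ are what keep them small.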
+We have proven that $\blexersimp$'s internal data structures do not grow beyond a constant $N_r$
+that depends only on the input regular expression $r$, thanks to the aggressive simplifications of $\blexersimp$:
+\begin{theorem}
+$|\blexersimp \; r \; s | \leq N_r$
+\end{theorem}
+The simplifications applied in each step of $\blexersimp$
+keep the derivatives small, but present a challenge when
+establishing a correctness theorem of the following form:
+%TODO: change this to "theorem to prove"
+\begin{theorem}
+ If the POSIX lexical value of matching regular expression $r$ with string $s$ is $v$,
+ then $\blexersimp\; r \; s = \Some \; v$. Otherwise
+ $\blexersimp \; r \; s = \None$.
+\end{theorem}
+
+
+
+
The result is %a regular expression lexing algorithm that comes with
\begin{itemize}
\item
@@ -1287,6 +1401,7 @@
maintained.
\end{itemize}
\noindent
+{\color{red} \rule{\linewidth}{0.5mm}}
With a formal finiteness bound in place,
we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
The Isabelle/HOL code for our formalisation can be