ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 646 56057198e4f5
parent 638 dd9dde2d902b
child 648 d15a0b7d6d90
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Fri Dec 30 23:41:44 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Fri May 26 08:09:30 2023 +0100
@@ -207,25 +207,89 @@
 
 
 
+Regular expressions
+have been extensively studied and
+implemented since their invention in the 1940s.
+They are primarily used in lexing, where an unstructured input string is broken
+down into a tree of tokens whose construction is guided by the shape of the regular expression.
+Being able to delimit different subsections of a string and potentially
+extract information from inputs, regular expression
+matchers and lexers serve as an indispensable component of text-processing tasks in modern
+software systems such as compilers, IDEs, and firewalls.
+Research on them is far from over due to the well-known issue of catastrophic backtracking.
+This stems from the ambiguities of lexing: when matching a multi-character string with a regular
+expression that contains several sub-expressions, there may be different positions at which to set
+the borders between the sub-strings corresponding to those sub-expressions.
+For example, when matching the string $aaa$ against the regular expression
+$(a+aa)^*$, the border between the initial match and the second iteration
+could lie between the first and second character ($a | aa$)
+or between the second and third character ($aa | a$).
+As the size of the input string and the structural complexity
+of the regular expression increase,
+the number of different ways of placing these borders can grow exponentially, and
+algorithms that explore these possibilities naively will also exhibit exponential running time.
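+To get a feel for this exponential behaviour, the following small Scala experiment (a sketch for
+illustration only, not part of the formal development in this thesis) times the JVM's backtracking
+\texttt{java.util.regex} engine on the pattern \texttt{(a|aa)*b} and on inputs consisting only of
+$a$'s, so that the engine has to explore every way of placing the borders before it can report failure:
+\begin{verbatim}
+import java.util.regex.Pattern
+
+object CatastrophicBacktracking {
+  def main(args: Array[String]): Unit = {
+    // ambiguous pattern: many ways to split a run of a's into the
+    // two alternatives a and aa
+    val pattern = Pattern.compile("(a|aa)*b")
+    for (n <- 20 to 40 by 4) {
+      val input = "a" * n        // no trailing 'b', so every attempt must fail
+      val start = System.nanoTime()
+      pattern.matcher(input).matches()
+      val ms = (System.nanoTime() - start) / 1e6
+      // on a backtracking engine this time grows exponentially with n
+      println(f"n = $n%2d   time = $ms%10.2f ms")
+    }
+  }
+}
+\end{verbatim}
+Even though the answer is always ``no match'', the measured times grow exponentially with the
+input length, which is precisely the behaviour exploited by the attacks discussed next.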
 
-Regular expressions are widely used in computer science: 
-be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
-command-line tools like $\mathit{grep}$ that facilitate easy 
-text-processing \cite{grep}; network intrusion
-detection systems that inspect suspicious traffic; or compiler
-front ends.
-Given their usefulness and ubiquity, one would assume that
-modern regular expression matching implementations
-are mature and fully studied.
-Indeed, in many popular programming languages' regular expression engines, 
-one can supply a regular expression and a string, and in most cases get
-get the matching information in a very short time.
-Those engines can sometimes be blindingly fast--some 
-network intrusion detection systems
-use regular expression engines that are able to process 
-hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
-However, those engines can sometimes also exhibit a surprising security vulnerability
-under a certain class of inputs.
+Catastrophic backtracking allows a class of computationally inexpensive attacks called
+Regular expression Denial of Service (ReDoS) attacks.
+These attacks, whether deliberate or not, have caused real-world systems to go down (see the
+practical overview section in Chapter~\ref{Introduction} for more details).
+There are different disambiguation strategies for selecting sub-matches, most notably Greedy and POSIX.
+The strategy widely adopted in practice is POSIX, which always goes for the longest possible sub-match.
+There have been prose definitions of this rule, for example the following
+by Kuklewicz \cite{KuklewiczHaskell} (section 1, last paragraph):
+\begin{quote}
+	\begin{itemize}
+		\item
+regular expressions (REs) take the leftmost starting match, and the longest match starting there\\
+\item
+earlier subpatterns have leftmost-longest priority over later subpatterns\\
+\item
+higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
+\item
+REs have right associative concatenation which can be changed with parenthesis\\
+\item
+parenthesized subexpressions return the match from their last usage\\
+\item
+text of component subexpressions must be contained in the text of the 
+higher-level subexpressions\\
+\item
+if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
+\item
+if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
+\end{itemize}
+\end{quote}
+But the author noted that lexers are rarely correct according to this standard.
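+As a concrete instance of the longest-match rule (this worked example is ours, not Kuklewicz's),
+consider once more matching $aaa$ against $(a+aa)^*$: a Greedy matcher, which always prefers the
+left alternative, splits the string as $a | a | a$ using three iterations of the star, whereas the
+POSIX rule requires each iteration to consume the longest possible sub-string and therefore splits
+it as $aa | a$ using only two.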
+
+Formal proofs, such as those written in Isabelle/HOL, are a powerful means
+for computer scientists to be certain about the correctness of their algorithms and correctness properties.
+This is done by mechanically checking that the end goal is derived solely from a list of axioms, definitions
+and proof rules that are known to be correct.
+Regular expression lexing and catastrophic backtracking are suitable problems for such
+methods, as their implementation and complexity analysis tend to be error-prone.
+
+
+
+
+
+
+%Regular expressions are widely used in computer science: 
+%be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
+%command-line tools like $\mathit{grep}$ that facilitate easy 
+%text-processing \cite{grep}; network intrusion
+%detection systems that inspect suspicious traffic; or compiler
+%front ends.
+%Given their usefulness and ubiquity, one would assume that
+%modern regular expression matching implementations
+%are mature and fully studied.
+%Indeed, in many popular programming languages' regular expression engines, 
+%one can supply a regular expression and a string, and in most cases get
+%get the matching information in a very short time.
+%Those engines can sometimes be blindingly fast--some 
+%network intrusion detection systems
+%use regular expression engines that are able to process 
+%hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
+%However, those engines can sometimes also exhibit a surprising security vulnerability
+%under a certain class of inputs.
 %However, , this is not the case for $\mathbf{all}$ inputs.
 %TODO: get source for SNORT/BRO's regex matching engine/speed
 
@@ -972,7 +1036,7 @@
 	the length of matched strings in the various subexpressions.''
 \end{quote}
 %\noindent
-We think the implementation complexity of POSIX rules also come from
+People have recognised that the implementation complexity of POSIX rules also comes from
 the specification being not very precise.
 The developers of the regexp package of Go 
 \footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
@@ -1010,6 +1074,12 @@
 %The text above 
 %is trying to capture something very precise,
 %and is crying out for formalising.
+Ribeiro and Du Bois \cite{RibeiroAgda2017} have
+formalised the notion of bit-coded regular expressions
+and proved how they relate to simple regular expressions in
+the dependently-typed proof assistant Agda.
+They also proved the soundness and completeness of a matching algorithm
+based on bit-coded regular expressions.
 Ausaf et al. \cite{AusafDyckhoffUrban2016}
 are the first to
 give a quite simple formalised POSIX
@@ -1025,6 +1095,7 @@
 Their original specification and proof were unfixable
 according to Ausaf et al.
 
+
 In the next section, we will briefly
 introduce Brzozowski derivatives and Sulzmann
 and Lu's algorithm, which the main point of this thesis builds on.
@@ -1262,6 +1333,9 @@
 
 %----------------------------------------------------------------------------------------
 \section{Contribution}
 
 In this thesis,
 we propose a solution to catastrophic
@@ -1269,6 +1343,46 @@
 regular expression lexing algorithm
 that is both fast
 and correct. 
+
+In particular, the main problem we solved on top of previous work is
+a formally verified algorithm, called $\blexersimp$, based
+on Brzozowski derivatives. It calculates a POSIX
+lexical value from a string and a regular expression. The algorithm was originally proposed
+by Sulzmann and Lu \cite{Sulzmann2014}, but we made the key observation that its $\textit{nub}$
+function does not simplify intermediate results where it needs to, and we improved the
+algorithm accordingly.
+Thanks to the aggressive simplifications of $\blexersimp$, we have proven that its internal data
+structures do not grow beyond a constant $N_r$ that depends only on the input regular expression $r$:
+\begin{theorem}
+$|\blexersimp \; r \; s | \leq N_r$
+\end{theorem}
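+To give a flavour of the underlying technique, the following Scala fragment is a minimal,
+unverified sketch of derivative-based matching with simplification after every step. The actual
+$\blexersimp$ additionally annotates regular expressions with bit-codes so that a POSIX lexical
+value can be reconstructed at the end, and its simplification rules are more elaborate than the
+ones shown here:
+\begin{verbatim}
+// basic regular expressions
+sealed trait Rexp
+case object ZERO extends Rexp                    // matches no string
+case object ONE extends Rexp                     // matches the empty string
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+// can r match the empty string?
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+// Brzozowski derivative of r with respect to the character c
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+// simplification that keeps intermediate regular expressions small
+def simp(r: Rexp): Rexp = r match {
+  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, s2) => s2
+    case (s1, ZERO) => s1
+    case (s1, s2) => if (s1 == s2) s1 else ALT(s1, s2)
+  }
+  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, _) => ZERO
+    case (_, ZERO) => ZERO
+    case (ONE, s2) => s2
+    case (s1, ONE) => s1
+    case (s1, s2) => SEQ(s1, s2)
+  }
+  case _ => r
+}
+
+// matcher: take a derivative per character and simplify after every step
+def matches(r: Rexp, s: String): Boolean =
+  nullable(s.foldLeft(r)((r1, c) => simp(der(c, r1))))
+\end{verbatim}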
+The simplifications applied in each step of $\blexersimp$
+keep the derivatives small, but they present a
+challenge when establishing a correctness theorem of the following form:
+\begin{theorem}
+	If the POSIX lexical value of matching regular expression $r$ with string $s$ is $v$, 
+	then $\blexersimp\; r \; s = \Some \; v$. Otherwise 
+	$\blexersimp \; r \; s = \None$.
+\end{theorem}
+
+
+
+
 The result is %a regular expression lexing algorithm that comes with 
 \begin{itemize}
 \item
@@ -1287,6 +1401,7 @@
 maintained.
 \end{itemize}
 \noindent
 With a formal finiteness bound in place,
 we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
 The Isabelle/HOL code for our formalisation can be