diff -r bd1354127574 -r 56057198e4f5 ChengsongTanPhdThesis/Chapters/Introduction.tex
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri Dec 30 23:41:44 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri May 26 08:09:30 2023 +0100
@@ -207,25 +207,89 @@
+Regular expressions
+have been extensively studied and
+implemented since their invention in the 1940s.
+They are primarily used in lexing, where an unstructured input string is broken
+down into a tree of tokens whose construction is guided by the shape of the regular expression.
+Being able to delimit different subsections of a string and potentially
+extract information from inputs, regular expression
+matchers and lexers serve as an indispensable component of text processing in modern
+software systems such as compilers, IDEs, and firewalls.
+Research on them is far from over due to the well-known problem of catastrophic backtracking.
+This stems from the ambiguities of lexing: when matching a multiple-character string with a regular
+expression that includes several sub-expressions, there might be different positions at which to set
+the borders between the sub-strings corresponding to those sub-expressions.
+For example, when matching the string $aaa$ against the regular expression
+$(a+aa)^*$, the border between the initial match and the second iteration
+could be between the first and second character ($a | aa$)
+or between the second and third character ($aa | a$).
+As the size of the input string and the structural complexity
+of the regular expression increase,
+the number of different ways of placing these delimiters can grow exponentially, and
+algorithms that explore these possibilities naively will exhibit exponential running time.
-Regular expressions are widely used in computer science:
-be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
-command-line tools like $\mathit{grep}$ that facilitate easy
-text-processing \cite{grep}; network intrusion
-detection systems that inspect suspicious traffic; or compiler
-front ends.
-Given their usefulness and ubiquity, one would assume that
-modern regular expression matching implementations
-are mature and fully studied.
-Indeed, in many popular programming languages' regular expression engines,
-one can supply a regular expression and a string, and in most cases get
-get the matching information in a very short time.
-Those engines can sometimes be blindingly fast--some
-network intrusion detection systems
-use regular expression engines that are able to process
-hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
-However, those engines can sometimes also exhibit a surprising security vulnerability
-under a certain class of inputs.
+Catastrophic backtracking enables a class of computationally inexpensive attacks called
+Regular expression Denial of Service (ReDoS) attacks.
+These attacks, whether deliberate or not, have caused real-world systems to go down (see more
+details in the practical overview section in chapter \ref{Introduction}).
+There are different disambiguation strategies for selecting sub-matches, most notably Greedy and POSIX.
+The strategy widely adopted in practice is POSIX, which always goes for the longest possible sub-match.
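+
+To make the exponential blow-up just described concrete, consider the following small
+Scala sketch (ours, purely for illustration; it assumes only the JVM's built-in
+backtracking engine \texttt{java.util.regex}, and the name \texttt{ReDoSDemo} is
+hypothetical). It times the pattern \texttt{(a|aa)*}, the same shape as $(a+aa)^*$
+above, against strings $a^n b$, where the trailing $b$ forces an overall failure and
+therefore an exhaustive exploration of all delimiter positions:
+\begin{verbatim}
+import java.util.regex.Pattern
+
+object ReDoSDemo {
+  def main(args: Array[String]): Unit = {
+    // (a|aa)* mirrors the ambiguous (a+aa)^* from the text; the
+    // trailing "b" makes the overall match fail, so a backtracking
+    // engine must try every way of delimiting the iterations.
+    val pattern = Pattern.compile("(a|aa)*")
+    for (n <- 20 to 40 by 5) {
+      val input = "a" * n + "b"
+      val start = System.nanoTime()
+      pattern.matcher(input).matches()
+      val ms = (System.nanoTime() - start) / 1000000
+      println(s"n = $n: $ms ms") // grows roughly exponentially in n
+    }
+  }
+}
+\end{verbatim}
+Each additional character multiplies the number of possible decompositions,
+which is why the measured times roughly double as $n$ grows.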
+There have been prose definitions of the POSIX rule, such as the following
+by Kuklewicz \cite{KuklewiczHaskell} (section 1, last paragraph):
+\begin{quote}
+	\begin{itemize}
+		\item
+regular expressions (REs) take the leftmost starting match, and the longest match starting there\\
+\item
+earlier subpatterns have leftmost-longest priority over later subpatterns\\
+\item
+higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
+\item
+REs have right associative concatenation which can be changed with parenthesis\\
+\item
+parenthesized subexpressions return the match from their last usage\\
+\item
+text of component subexpressions must be contained in the text of the
+higher-level subexpressions\\
+\item
+if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
+\item
+if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
+\end{itemize}
+\end{quote}
+But the author noted that lexers are rarely correct according to this standard.
+
+Formal proofs, such as those written in Isabelle/HOL, are a powerful means
+for computer scientists to be certain about the correctness of their algorithms.
+This is done by automatically checking that the end goal is derived solely from a list of axioms, definitions
+and proof rules that are known to be correct.
+Regular expression lexing and catastrophic backtracking are suitable problems for such
+methods, as their implementations and complexity analyses tend to be error-prone.
+
+%Regular expressions are widely used in computer science:
+%be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
+%command-line tools like $\mathit{grep}$ that facilitate easy
+%text-processing \cite{grep}; network intrusion
+%detection systems that inspect suspicious traffic; or compiler
+%front ends.
+%Given their usefulness and ubiquity, one would assume that
+%modern regular expression matching implementations
+%are mature and fully studied.
+%Indeed, in many popular programming languages' regular expression engines,
+%one can supply a regular expression and a string, and in most cases get
+%get the matching information in a very short time.
+%Those engines can sometimes be blindingly fast--some
+%network intrusion detection systems
+%use regular expression engines that are able to process
+%hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
+%However, those engines can sometimes also exhibit a surprising security vulnerability
+%under a certain class of inputs.
%However, , this is not the case for $\mathbf{all}$ inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed
@@ -972,7 +1036,7 @@
the length of matched strings in the various subexpressions.''
\end{quote}
%\noindent
-We think the implementation complexity of POSIX rules also come from
+People have recognised that the implementation complexity of POSIX rules also comes from
the specification being not very precise.
The developers of the regexp package of Go
\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
@@ -1010,6 +1074,12 @@
%The text above
%is trying to capture something very precise,
%and is crying out for formalising.
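+To make the leftmost-longest idea behind Kuklewicz's rules concrete, the following
+Scala sketch (ours, illustrative only; \texttt{PosixChoice} and \texttt{splits} are
+hypothetical names, and this is a toy model of the rule, not any engine's actual
+algorithm) enumerates every way the iterations of \texttt{(a|aa)*} can decompose a
+string and then selects the POSIX decomposition, i.e.\ the one in which earlier
+iterations match as much text as possible:
+\begin{verbatim}
+import scala.math.Ordering.Implicits._
+
+object PosixChoice {
+  // All ways the iterations of (a|aa)* can decompose s into tokens.
+  def splits(s: String): List[List[String]] =
+    if (s.isEmpty) List(Nil)
+    else List("aa", "a").flatMap { tok =>
+      if (s.startsWith(tok)) splits(s.drop(tok.length)).map(tok :: _)
+      else Nil
+    }
+
+  def main(args: Array[String]): Unit = {
+    val all = splits("aaa")
+    // all == List(List(aa, a), List(a, aa), List(a, a, a))
+    // POSIX gives earlier iterations longest-match priority: the
+    // lexicographically greatest sequence of token lengths wins.
+    val posix = all.max(Ordering.by((d: List[String]) => d.map(_.length)))
+    println(posix) // prints List(aa, a)
+  }
+}
+\end{verbatim}
+On $aaa$ this picks the $aa \,|\, a$ decomposition from the earlier example,
+giving the first iteration the longest possible sub-match.
+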
+Ribeiro and Du Bois \cite{RibeiroAgda2017} have
+formalised the notion of bit-coded regular expressions
+and proved their relationship with simple regular expressions in
+the dependently-typed proof assistant Agda.
+They also proved the soundness and completeness of a matching algorithm
+based on the bit-coded regular expressions.
Ausaf et al.
\cite{AusafDyckhoffUrban2016}
are the first to
give a quite simple formalised POSIX
@@ -1025,6 +1095,7 @@
Their original specification and proof were unfixable
according to Ausaf et al.
+
In the next section, we will briefly
introduce Brzozowski derivatives and Sulzmann
and Lu's algorithm, which the main point of this thesis builds on.
@@ -1262,6 +1333,9 @@
%----------------------------------------------------------------------------------------
\section{Contribution}
In this thesis, we propose a solution to catastrophic
@@ -1269,6 +1343,46 @@
regular expression lexing algorithm that is both fast and correct.
+
+In particular, the main problem we solved on top of previous work was
+coming up with a formally verified algorithm called $\blexersimp$ based
+on Brzozowski derivatives. It calculates a POSIX
+lexical value from a string and a regular expression. This algorithm is originally
+due to Sulzmann and Lu \cite{Sulzmann2014}, but we made the key observation that its $\textit{nub}$
+function does not simplify intermediate results where it needs to, and we improved the
+algorithm accordingly.
+Thanks to the aggressive simplifications of $\blexersimp$, we have proven that its internal
+data structures do not grow beyond a constant $N_r$ that depends only on the input regular expression $r$:
+\begin{theorem}
+$|\blexersimp \; r \; s | \leq N_r$
+\end{theorem}
+The simplifications applied in each step of $\blexersimp$
+keep the derivatives small, but present a
+challenge in establishing a correctness theorem of the following form:
+\begin{theorem}
+	If the POSIX lexical value of matching regular expression $r$ with string $s$ is $v$,
+	then $\blexersimp\; r \; s = \Some \; v$. Otherwise
+	$\blexersimp \; r \; s = \None$.
+\end{theorem}
+
The result is
%a regular expression lexing algorithm that comes with
\begin{itemize}
\item
@@ -1287,6 +1401,7 @@
maintained.
\end{itemize}
\noindent
With a formal finiteness bound in place,
we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
The Isabelle/HOL code for our formalisation can be