ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 646 56057198e4f5
parent 638 dd9dde2d902b
child 648 d15a0b7d6d90
equal deleted inserted replaced
640:bd1354127574 646:56057198e4f5
   205 
   205 
   206 
   206 
   207 
   207 
   208 
   208 
   209 
   209 
   210 
   210 Regular expressions 
   211 Regular expressions are widely used in computer science: 
   211 have been extensively studied and
   212 be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
   212 implemented since their invention in the 1940s.
   213 command-line tools like $\mathit{grep}$ that facilitate easy 
   213 They are primarily used in lexing, where an unstructured input string is broken
   214 text-processing \cite{grep}; network intrusion
   214 down into a tree of tokens whose construction is guided by the shape of the regular expression.
   215 detection systems that inspect suspicious traffic; or compiler
   215 Because they can delimit different subsections of a string and potentially 
   216 front ends.
   216 extract information from inputs, regular expression 
   217 Given their usefulness and ubiquity, one would assume that
   217 matchers and lexers serve as indispensable components in the text processing done by modern software systems
   218 modern regular expression matching implementations
   218 such as compilers, IDEs, and firewalls.
   219 are mature and fully studied.
   219 Research on them is far from over due to the well-known issue of catastrophic backtracking.
   220 Indeed, in many popular programming languages' regular expression engines, 
   220 This stems from the ambiguity of lexing: when matching a multi-character string against a regular 
   221 one can supply a regular expression and a string, and in most cases get
   221 expression that includes several sub-expressions, there might be different positions at which to set 
   222 get the matching information in a very short time.
   222 the border between sub-expressions' corresponding sub-strings.
   223 Those engines can sometimes be blindingly fast--some 
   223 For example, when matching the string $aaa$ against the regular expression
   224 network intrusion detection systems
   224 $(a+aa)^*$, the border between the initial match and the second iteration
   225 use regular expression engines that are able to process 
   225 could be between the first and second character ($a | aa$) 
   226 hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
   226 or between the second and third character ($aa | a$).
   227 However, those engines can sometimes also exhibit a surprising security vulnerability
   227 As the size of the input string and the structural complexity 
   228 under a certain class of inputs.
   228 of the regular expression increase,
       
   229 the number of different combinations of setting delimiters can grow exponentially, and
       
   230 algorithms that explore these possibilities naively will also exhibit exponential complexity.
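As a toy illustration of this growth (our own back-of-the-envelope sketch, not code from any matching engine), the number of distinct parses of the string $a^n$ under $(a+aa)^*$ obeys a Fibonacci-style recurrence, so a backtracking matcher that blindly enumerates them does exponential work:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def num_parses(n):
    # Count the ways to split a string of n 'a's into blocks "a" and "aa",
    # i.e. the number of distinct parses of a^n under (a + aa)*.
    if n <= 1:
        return 1
    # The first block is either "a" (n-1 characters left) or "aa" (n-2 left).
    return num_parses(n - 1) + num_parses(n - 2)
```

For $n = 3$ this gives the three parses discussed above; by $n = 30$ the count already exceeds one million.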
       
   231 
       
   232 Catastrophic backtracking allows a class of computationally inexpensive attacks called
       
   233 Regular expression Denial of Service attacks (ReDoS). 
       
   234 These attacks, whether deliberate or not, have caused real-world systems to go down (see more 
       
   235 details of this in the practical overview section in chapter \ref{Introduction}).
       
   236 There are different disambiguation strategies to select sub-matches, most notably Greedy and POSIX.
       
   237 The widely adopted strategy in practice is POSIX, which always goes for the longest sub-match possible.
       
   238 There have been prose definitions of this standard, such as the one
       
   239 by Kuklewicz \cite{KuklewiczHaskell}, who 
       
   240 described the POSIX rule as follows (section 1, last paragraph):
       
   241 \begin{quote}
       
   242 	\begin{itemize}
       
   243 		\item
       
   244 regular expressions (REs) take the leftmost starting match, and the longest match starting there
       
   245 earlier subpatterns have leftmost-longest priority over later subpatterns\\
       
   246 \item
       
   247 higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
       
   248 \item
       
   249 REs have right associative concatenation which can be changed with parenthesis\\
       
   250 \item
       
   251 parenthesized subexpressions return the match from their last usage\\
       
   252 \item
       
   253 text of component subexpressions must be contained in the text of the 
       
   254 higher-level subexpressions\\
       
   255 \item
       
   256 if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
       
   257 \item
       
   258 if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
       
   259 \end{itemize}
       
   260 \end{quote}
       
   261 But the author noted that lexers are rarely correct according to this standard.
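The contrast between Greedy and POSIX can be made concrete on the $(a+aa)^*$ example from earlier. The following sketch (our own illustration, not a production matcher) enumerates every parse of $aaa$ and then picks the greedy one, which prefers the first alternative at each step, and the POSIX one, which prefers the longest sub-match at each step:

```python
def parses(s):
    # All ways to split s into pieces "a" and "aa",
    # i.e. all parses of s under (a + aa)*.
    if s == "":
        return [[]]
    result = []
    for piece in ("a", "aa"):
        if s.startswith(piece):
            result += [[piece] + rest for rest in parses(s[len(piece):])]
    return result

all_parses = parses("aaa")   # three parses in total
# Greedy: lexicographically prefer the first alternative ("a") at each step.
greedy = min(all_parses, key=lambda p: [("a", "aa").index(x) for x in p])
# POSIX: prefer the longest piece at each step.
posix = max(all_parses, key=lambda p: [len(x) for x in p])
# greedy == ["a", "a", "a"], posix == ["aa", "a"]
```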
       
   262 
       
   263 Formal proofs, such as those written in Isabelle/HOL, are a powerful means 
       
   264 for computer scientists to be certain about the correctness properties of their algorithms.
       
   265 This is done by automatically checking that the end goal is derived solely from a list of axioms, definitions
       
   266 and proof rules that are known to be correct.
       
   267 The problems of regular expression lexing and catastrophic backtracking are suitable for such
       
   268 methods, as their implementations and complexity analyses tend to be error-prone.
       
   269 
       
   270 
       
   271 
       
   272 
       
   273 
       
   274 
       
   275 %Regular expressions are widely used in computer science: 
       
   276 %be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
       
   277 %command-line tools like $\mathit{grep}$ that facilitate easy 
       
   278 %text-processing \cite{grep}; network intrusion
       
   279 %detection systems that inspect suspicious traffic; or compiler
       
   280 %front ends.
       
   281 %Given their usefulness and ubiquity, one would assume that
       
   282 %modern regular expression matching implementations
       
   283 %are mature and fully studied.
       
   284 %Indeed, in many popular programming languages' regular expression engines, 
       
   285 %one can supply a regular expression and a string, and in most cases get
       
   286 %get the matching information in a very short time.
       
   287 %Those engines can sometimes be blindingly fast--some 
       
   288 %network intrusion detection systems
       
   289 %use regular expression engines that are able to process 
       
   290 %hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
       
   291 %However, those engines can sometimes also exhibit a surprising security vulnerability
       
   292 %under a certain class of inputs.
   229 %However, , this is not the case for $\mathbf{all}$ inputs.
   293 %However, , this is not the case for $\mathbf{all}$ inputs.
   230 %TODO: get source for SNORT/BRO's regex matching engine/speed
   294 %TODO: get source for SNORT/BRO's regex matching engine/speed
   231 
   295 
   232 
   296 
   233 Consider for example the regular expression $(a^*)^*\,b$ and 
   297 Consider for example the regular expression $(a^*)^*\,b$ and 
   970 	``The POSIX strategy is more complicated than the 
  1034 	``The POSIX strategy is more complicated than the 
   971 	greedy because of the dependence on information about 
  1035 	greedy because of the dependence on information about 
   972 	the length of matched strings in the various subexpressions.''
  1036 	the length of matched strings in the various subexpressions.''
   973 \end{quote}
  1037 \end{quote}
   974 %\noindent
  1038 %\noindent
   975 We think the implementation complexity of POSIX rules also come from
  1039 People have recognised that the implementation complexity of POSIX rules also comes from
   976 the specification being not very precise.
  1040 the specification being not very precise.
   977 The developers of the regexp package of Go 
  1041 The developers of the regexp package of Go 
   978 \footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
  1042 \footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
   979 commented that
  1043 commented that
   980 \begin{quote}\it
  1044 \begin{quote}\it
  1008 \end{itemize}
  1072 \end{itemize}
  1009 \end{quote}
  1073 \end{quote}
  1010 %The text above 
  1074 %The text above 
  1011 %is trying to capture something very precise,
  1075 %is trying to capture something very precise,
  1012 %and is crying out for formalising.
  1076 %and is crying out for formalising.
       
  1077 Ribeiro and Du Bois \cite{RibeiroAgda2017} have 
       
  1078 formalised the notion of bit-coded regular expressions
       
  1079 and proved their relations with simple regular expressions in
       
  1080 the dependently-typed proof assistant Agda.
       
  1081 They also proved the soundness and completeness of a matching algorithm
       
  1082 based on the bit-coded regular expressions.
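The idea behind such bit-codings can be sketched as follows (a simplified rendering with our own ad-hoc value representation; the exact bit conventions vary between presentations): a lexical value records how a string matched a regular expression, and its bit-code retains only the disambiguation choices, from which the full value can be reconstructed when the regular expression is known.

```python
def code(v):
    # Compress a lexical value (nested tuples) into its disambiguation bits.
    tag = v[0]
    if tag == "Empty":
        return []                       # match of the empty-string regex: no choice
    if tag == "Char":
        return []                       # a literal match involves no choice
    if tag == "Left":
        return [0] + code(v[1])         # took the left alternative
    if tag == "Right":
        return [1] + code(v[1])         # took the right alternative
    if tag == "Seq":
        return code(v[1]) + code(v[2])
    if tag == "Stars":                  # bit 0 per iteration, bit 1 to stop
        return [b for u in v[1] for b in [0] + code(u)] + [1]

# The POSIX parse "aa | a" of the string aaa under (a + aa)*,
# with the piece aa represented as a sequence of two literals:
v = ("Stars", [("Right", ("Seq", ("Char", "a"), ("Char", "a"))),
               ("Left", ("Char", "a"))])
# code(v) == [0, 1, 0, 0, 1]
```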
  1013 Ausaf et al. \cite{AusafDyckhoffUrban2016}
  1083 Ausaf et al. \cite{AusafDyckhoffUrban2016}
  1014 are the first to
  1084 are the first to
  1015 give a quite simple formalised POSIX
  1085 give a quite simple formalised POSIX
  1016 specification in Isabelle/HOL, and also prove 
  1086 specification in Isabelle/HOL, and also prove 
  1017 that their specification coincides with an earlier (unformalised) 
  1087 that their specification coincides with an earlier (unformalised) 
  1022 They also found that the informal POSIX
  1092 They also found that the informal POSIX
  1023 specification by Sulzmann and Lu needs to be substantially 
  1093 specification by Sulzmann and Lu needs to be substantially 
  1024 revised in order for the correctness proof to go through.
  1094 revised in order for the correctness proof to go through.
  1025 Their original specification and proof were unfixable
  1095 Their original specification and proof were unfixable
  1026 according to Ausaf et al.
  1096 according to Ausaf et al.
       
  1097 
  1027 
  1098 
  1028 In the next section, we will briefly
  1099 In the next section, we will briefly
  1029 introduce Brzozowski derivatives and Sulzmann
  1100 introduce Brzozowski derivatives and Sulzmann
  1030 and Lu's algorithm, which the main point of this thesis builds on.
  1101 and Lu's algorithm, on which the main results of this thesis build.
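As a small foretaste (a standard textbook rendering of Brzozowski derivatives and the plain matcher built from them, not the bit-coded lexer developed in this thesis), the whole construction fits in a few lines:

```python
# Regular expressions as nested tuples: 0, 1, characters, +, concatenation, star.
ZERO, ONE = ("0",), ("1",)
def CHR(c):    return ("c", c)
def ALT(r, s): return ("+", r, s)
def SEQ(r, s): return (".", r, s)
def STAR(r):   return ("*", r)

def nullable(r):
    # Does r match the empty string?
    tag = r[0]
    if tag == "0": return False
    if tag == "1": return True
    if tag == "c": return False
    if tag == "+": return nullable(r[1]) or nullable(r[2])
    if tag == ".": return nullable(r[1]) and nullable(r[2])
    if tag == "*": return True

def der(c, r):
    # The Brzozowski derivative of r with respect to the character c.
    tag = r[0]
    if tag in ("0", "1"):
        return ZERO
    if tag == "c":
        return ONE if r[1] == c else ZERO
    if tag == "+":
        return ALT(der(c, r[1]), der(c, r[2]))
    if tag == ".":
        d = SEQ(der(c, r[1]), r[2])
        return ALT(d, der(c, r[2])) if nullable(r[1]) else d
    if tag == "*":
        return SEQ(der(c, r[1]), r)

def matches(r, s):
    # r matches s iff the derivative of r by all of s is nullable.
    for c in s:
        r = der(c, r)
    return nullable(r)

# Example: (a*)* b matches "aaab" but not "aaa".
```

Without simplification the intermediate derivatives can grow quickly, which is exactly the issue addressed by the simplifying algorithm studied in this thesis.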
  1031 %We give a taste of what they 
  1102 %We give a taste of what they 
  1260 %as a DFA.
  1331 %as a DFA.
  1261 
  1332 
  1262 
  1333 
  1263 %----------------------------------------------------------------------------------------
  1334 %----------------------------------------------------------------------------------------
  1264 \section{Contribution}
  1335 \section{Contribution}
       
  1336 {\color{red} \rule{\linewidth}{0.5mm}}
       
  1337 \textbf{issue with this part: way too short, not enough details of what I have done.}
       
  1338 {\color{red} \rule{\linewidth}{0.5mm}}
  1265 
  1339 
  1266 In this thesis,
  1340 In this thesis,
  1267 we propose a solution to catastrophic
  1341 we propose a solution to catastrophic
  1268 backtracking and error-prone matchers: a formally verified
  1342 backtracking and error-prone matchers: a formally verified
  1269 regular expression lexing algorithm
  1343 regular expression lexing algorithm
  1270 that is both fast
  1344 that is both fast
  1271 and correct. 
  1345 and correct. 
       
  1346 {\color{red} \rule{\linewidth}{0.5mm}}
       
  1347 \HandRight Added content:
       
  1348 %Package \verb`pifont`: \ding{43}
       
  1349 %Package \verb`utfsym`: \usym{261E} 
       
  1350 %Package \verb`dingbat`: \leftpointright 
       
  1351 %Package \verb`dingbat`: \rightpointright 
       
  1352 
       
  1353 In particular, the main problem we solved on top of previous work was
       
  1354 coming up with a formally verified algorithm called $\blexersimp$ based
       
  1355 on Brzozowski derivatives. It calculates a POSIX
       
  1356 lexical value from a string and a regular expression. This algorithm was originally proposed
       
  1357 by Sulzmann and Lu \cite{Sulzmann2014}, but we made the key observation that its $\textit{nub}$
       
  1358 function does not simplify intermediate results where it needs to, and we improved the
       
  1359 algorithm accordingly.
       
  1360 We have proven that our $\blexersimp$'s internal data structure does not grow beyond a constant $N_r$
       
  1361 depending on the input regular expression $r$, thanks to the aggressive simplifications of $\blexersimp$:
       
  1362 \begin{theorem}
       
  1363 $|\blexersimp \; r \; s | \leq N_r$
       
  1364 \end{theorem}
       
  1365 The simplifications applied in each step of $\blexersimp$ 
        
  1371 keep the derivatives small, but present a 
        
  1372 challenge in
        
  1375 establishing a correctness theorem of the following form:
       
  1376 %TODO: change this to "theorem to prove"
       
  1377 \begin{theorem}
       
  1378 	If the POSIX lexical value of matching regular expression $r$ with string $s$ is $v$, 
       
  1379 	then $\blexersimp\; r \; s = \Some \; v$. Otherwise 
       
  1380 	$\blexersimp \; r \; s = \None$.
       
  1381 \end{theorem}
       
  1382 
       
  1383 
       
  1384 
       
  1385 
  1272 The result is %a regular expression lexing algorithm that comes with 
  1386 The result is %a regular expression lexing algorithm that comes with 
  1273 \begin{itemize}
  1387 \begin{itemize}
  1274 \item
  1388 \item
  1275 an improved version of  Sulzmann and Lu's bit-coded algorithm using 
  1389 an improved version of  Sulzmann and Lu's bit-coded algorithm using 
  1276 derivatives with simplifications, 
  1390 derivatives with simplifications, 
  1285 and an extension to
  1399 and an extension to
  1286 the bounded repetition constructs with the correctness and finiteness property
  1400 the bounded repetition constructs with the correctness and finiteness property
  1287 maintained.
  1401 maintained.
  1288 \end{itemize}
  1402 \end{itemize}
  1289 \noindent
  1403 \noindent
       
  1404 {\color{red} \rule{\linewidth}{0.5mm}}
  1290 With a formal finiteness bound in place,
  1405 With a formal finiteness bound in place,
  1291 we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
  1406 we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
  1292 The Isabelle/HOL code for our formalisation can be 
  1407 The Isabelle/HOL code for our formalisation can be 
  1293 found at
  1408 found at
  1294 \begin{center}
  1409 \begin{center}