--- a/ChengsongTanPhdThesis/Chapters/Inj.tex Fri Dec 30 23:41:44 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex Fri May 26 08:09:30 2023 +0100
@@ -769,6 +769,7 @@
value $v$ given a string $s$ and
regular expression $r$.
}
+\label{fig:POSIXDef}
\end{figure}\afterpage{\clearpage}
\noindent
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri Dec 30 23:41:44 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri May 26 08:09:30 2023 +0100
@@ -207,25 +207,89 @@
+Regular expressions
+have been studied and
+implemented extensively since their invention in the 1940s.
+They are primarily used in lexing, where an unstructured input string is broken
+down into a tree of tokens whose construction is guided by the shape of the regular expression.
+Being able to delimit different subsections of a string and
+extract information from inputs, regular expression
+matchers and lexers serve as an indispensable component of text processing in modern software systems
+such as compilers, IDEs, and firewalls.
+Research on them is far from over due to the well-known problem of catastrophic backtracking.
+This problem stems from the ambiguities of lexing: when matching a multi-character string with a regular
+expression that contains several sub-expressions, there may be different positions at which to place
+the borders between the sub-strings corresponding to those sub-expressions.
+For example, when matching the string $aaa$ against the regular expression
+$(a+aa)^*$, the border between the initial match and the second iteration
+could lie between the first and second character ($a | aa$)
+or between the second and third character ($aa | a$).
+As the size of the input string and the structural complexity
+of the regular expression increase,
+the number of ways of placing these borders can grow exponentially, and
+algorithms that explore these possibilities naively exhibit exponential running time.
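+
+To give a concrete feel for this blow-up, the following Scala sketch (our own
+illustration, not part of the formal development in this thesis) times Java's
+backtracking regex engine on the pattern $(a+aa)^*c$, written
+\texttt{(a|aa)*c} in Java's syntax, against strings consisting only of $a$'s.
+Since the trailing $c$ never matches, the engine has to explore every way of
+splitting the $a$'s before it can report failure:
+\begin{lstlisting}
+import java.util.regex.Pattern
+
+object ReDoSDemo {
+  def main(args: Array[String]): Unit = {
+    val pattern = Pattern.compile("(a|aa)*c")
+    for (n <- 20 to 30) {
+      val input  = "a" * n
+      val start  = System.nanoTime()
+      pattern.matcher(input).matches()        // always returns false
+      val millis = (System.nanoTime() - start) / 1e6
+      println(f"n = $n%2d   time = $millis%8.1f ms")
+    }
+  }
+}
+\end{lstlisting}
+On a typical machine the measured time grows roughly exponentially with $n$,
+which is exactly the behaviour described above.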
-Regular expressions are widely used in computer science:
-be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
-command-line tools like $\mathit{grep}$ that facilitate easy
-text-processing \cite{grep}; network intrusion
-detection systems that inspect suspicious traffic; or compiler
-front ends.
-Given their usefulness and ubiquity, one would assume that
-modern regular expression matching implementations
-are mature and fully studied.
-Indeed, in many popular programming languages' regular expression engines,
-one can supply a regular expression and a string, and in most cases get
-get the matching information in a very short time.
-Those engines can sometimes be blindingly fast--some
-network intrusion detection systems
-use regular expression engines that are able to process
-hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
-However, those engines can sometimes also exhibit a surprising security vulnerability
-under a certain class of inputs.
+Catastrophic backtracking enables a class of computationally inexpensive attacks called
+Regular expression Denial of Service (ReDoS) attacks.
+These attacks, whether deliberate or not, have brought down real-world systems (see more
+details in the practical overview section in chapter \ref{Introduction}).
+There are different disambiguation strategies for selecting sub-matches, most notably Greedy and POSIX.
+The strategy widely adopted in practice is POSIX, which always goes for the longest sub-match possible.
+There have been prose definitions of the POSIX rule, for example the following
+one by Kuklewicz \cite{KuklewiczHaskell} (section 1, last paragraph):
+\begin{quote}
+ \begin{itemize}
+ \item
+regular expressions (REs) take the leftmost starting match, and the longest match starting there\\
+\item
+earlier subpatterns have leftmost-longest priority over later subpatterns\\
+\item
+higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
+\item
+REs have right associative concatenation which can be changed with parenthesis\\
+\item
+parenthesized subexpressions return the match from their last usage\\
+\item
+text of component subexpressions must be contained in the text of the
+higher-level subexpressions\\
+\item
+if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
+\item
+if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
+\end{itemize}
+\end{quote}
+But the author noted that lexers are rarely correct according to this standard.
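+
+To illustrate the leftmost-longest rule on the earlier $(a+aa)^*$ example, the
+following Scala sketch (our own illustration, not the algorithm studied in this
+thesis) enumerates all ways of splitting the string $aaa$ into pieces matching
+$a+aa$, trying the longer alternative first at every position. For this
+particular example the first decomposition produced coincides with the POSIX
+one, in which each iteration matches as much of the remaining input as possible:
+\begin{lstlisting}
+object PosixChoiceDemo {
+  // all decompositions of a string of a's into pieces drawn from {aa, a},
+  // generated with the longer piece tried first at each position
+  def splits(s: String): List[List[String]] =
+    if (s.isEmpty) List(Nil)
+    else List("aa", "a").filter(s.startsWith(_))
+                        .flatMap(p => splits(s.drop(p.length)).map(p :: _))
+
+  def main(args: Array[String]): Unit = {
+    val all = splits("aaa")
+    println(all)        // List(List(aa, a), List(a, aa), List(a, a, a))
+    println(all.head)   // the POSIX choice: List(aa, a)
+  }
+}
+\end{lstlisting}
+Such an enumeration is of course exponential in general; the point of the
+algorithms discussed in this thesis is to compute the POSIX choice directly,
+without listing all alternatives.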
+
+Formal proofs, such as those written in Isabelle/HOL, are a powerful means
+for computer scientists to be certain about the correctness of their algorithms.
+This is done by mechanically checking that the end goal is derived solely from a list of axioms, definitions
+and proof rules that are known to be correct.
+Regular expression lexing and catastrophic backtracking are suitable problems for such
+methods, as their implementations and complexity analyses tend to be error-prone.
+
+
+
+
+
+
+%Regular expressions are widely used in computer science:
+%be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
+%command-line tools like $\mathit{grep}$ that facilitate easy
+%text-processing \cite{grep}; network intrusion
+%detection systems that inspect suspicious traffic; or compiler
+%front ends.
+%Given their usefulness and ubiquity, one would assume that
+%modern regular expression matching implementations
+%are mature and fully studied.
+%Indeed, in many popular programming languages' regular expression engines,
+%one can supply a regular expression and a string, and in most cases get
+%get the matching information in a very short time.
+%Those engines can sometimes be blindingly fast--some
+%network intrusion detection systems
+%use regular expression engines that are able to process
+%hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
+%However, those engines can sometimes also exhibit a surprising security vulnerability
+%under a certain class of inputs.
%However, , this is not the case for $\mathbf{all}$ inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed
@@ -972,7 +1036,7 @@
the length of matched strings in the various subexpressions.''
\end{quote}
%\noindent
-We think the implementation complexity of POSIX rules also come from
+People have recognised that the implementation complexity of the POSIX rules also comes from
the specification being not very precise.
The developers of the regexp package of Go
\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
@@ -1010,6 +1074,12 @@
%The text above
%is trying to capture something very precise,
%and is crying out for formalising.
+Ribeiro and Du Bois \cite{RibeiroAgda2017} have
+formalised the notion of bit-coded regular expressions
+and proved their relation to simple regular expressions in
+the dependently-typed proof assistant Agda.
+They also proved the soundness and completeness of a matching algorithm
+based on bit-coded regular expressions.
Ausaf et al. \cite{AusafDyckhoffUrban2016}
are the first to
give a quite simple formalised POSIX
@@ -1025,6 +1095,7 @@
Their original specification and proof were unfixable
according to Ausaf et al.
+
In the next section, we will briefly
introduce Brzozowski derivatives and Sulzmann
and Lu's algorithm, which the main point of this thesis builds on.
@@ -1262,6 +1333,9 @@
%----------------------------------------------------------------------------------------
\section{Contribution}
+{\color{red} \rule{\linewidth}{0.5mm}}
+\textbf{issue with this part: way too short, not enough details of what I have done.}
+{\color{red} \rule{\linewidth}{0.5mm}}
In this thesis,
we propose a solution to catastrophic
@@ -1269,6 +1343,46 @@
regular expression lexing algorithm
that is both fast
and correct.
+{\color{red} \rule{\linewidth}{0.5mm}}
+\HandRight Added content:
+%Package \verb`pifont`: \ding{43}
+%Package \verb`utfsym`: \usym{261E}
+%Package \verb`dingbat`: \leftpointright
+%Package \verb`dingbat`: \rightpointright
+
+In particular, the main problem we solved on top of previous work was
+devising a formally verified algorithm called $\blexersimp$ based
+on Brzozowski derivatives. It calculates a POSIX
+lexical value from a string and a regular expression. This algorithm was originally
+proposed by Sulzmann and Lu \cite{Sulzmann2014}, but we made the key observation that its $\textit{nub}$
+function does not simplify intermediate results where it needs to, and we improved the
+algorithm accordingly.
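+
+The essence of the problem with $\textit{nub}$ can be seen in the following
+Scala sketch (a hypothetical, much simplified datatype, not the definitions
+used in this thesis): duplicate removal based on structural equality, which is
+what $\textit{nub}$ performs, keeps alternatives that are the same regular
+expression but carry different bit-annotations, so such duplicates can
+accumulate in the intermediate results:
+\begin{lstlisting}
+// hypothetical, much simplified bit-annotated regular expressions
+sealed trait ARexp
+case class ACHAR(bs: List[Int], c: Char)          extends ARexp
+case class AALTS(bs: List[Int], rs: List[ARexp])  extends ARexp
+
+object NubDemo {
+  def main(args: Array[String]): Unit = {
+    val r1 = ACHAR(List(0), 'a')   // 'a' annotated with bit 0
+    val r2 = ACHAR(List(1), 'a')   // the same 'a', but with bit 1
+    val alts = List(r1, r2, r1)
+    // distinct (Scala's analogue of nub) removes only the structurally
+    // equal second copy of r1; the semantic duplicate r2 survives
+    println(alts.distinct)         // List(ACHAR(List(0),a), ACHAR(List(1),a))
+  }
+}
+\end{lstlisting}
+The improved simplification in $\blexersimp$ removes duplicates of this kind as
+well, which is what makes the size bound stated next possible.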
+We have proven that the internal data structures of $\blexersimp$ do not grow beyond a constant bound $N_r$
+that depends only on the input regular expression $r$, thanks to the aggressive simplifications of $\blexersimp$:
+\begin{theorem}
+$|\blexersimp \; r \; s | \leq N_r$
+\end{theorem}
+The simplifications applied in each step of $\blexersimp$
+keep the derivatives small, but they present a
+challenge when establishing a correctness theorem of the following form:
+%TODO: change this to "theorem to prove"
+\begin{theorem}
+ If the POSIX lexical value of matching regular expression $r$ with string $s$ is $v$,
+ then $\blexersimp\; r \; s = \Some \; v$. Otherwise
+ $\blexersimp \; r \; s = \None$.
+\end{theorem}
+
+
+
+
The result is %a regular expression lexing algorithm that comes with
\begin{itemize}
\item
@@ -1287,6 +1401,7 @@
maintained.
\end{itemize}
\noindent
+{\color{red} \rule{\linewidth}{0.5mm}}
With a formal finiteness bound in place,
we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
The Isabelle/HOL code for our formalisation can be
--- a/ChengsongTanPhdThesis/main.tex Fri Dec 30 23:41:44 2022 +0000
+++ b/ChengsongTanPhdThesis/main.tex Fri May 26 08:09:30 2023 +0100
@@ -41,8 +41,9 @@
\usepackage[utf8]{inputenc} % Required for inputting international characters
\usepackage[T1]{fontenc} % Output font encoding for international characters
%\usepackage{fdsymbol} % Loads unicode-math
-
\usepackage{cancel}
+\usepackage{fontawesome5}
+\usepackage{bbding,pifont,dingbat}
\usepackage{listings}
\usepackage{xcolor}
@@ -290,6 +291,31 @@
\begin{abstract}
\addchaptertocentry{\abstractname} % Add the abstract to the table of contents
%\addchap{Abstract}
+\textbf{Problem: not like an abstract, more like a summary}
+\textbf{Goal for new abstract: more high-level and abstract, tell the problem and solution in a concise way.}
+
+New abstract:\\
+%Modern computer systems rely on lexing for essential applications such as
+%compilers, IDEs, file systems, and network intrusion detection systems.
+%These applications require correctness with respect to
+%the POSIX standard and high performance.
+%%While existing implementations of lexers often achieve high performance,
+%Existing implementations had drawbacks such as bugs and catastrophic backtracking,
+%preventing them from solving the problem once
+%and for all.
+%To address these drawbacks,
+%this thesis offers an algorithm with formally proven correctness and internal data structures' size bound.
+%These mechanised proofs ensure that our algorithm is fast and correct in \textbf{all} cases.
+%Our proofs use term-rewriting relations to establish invariants during derivatives and simplifications,
+%which is extensible and friendly to theorem provers.
+
+POSIX is the most widely used disambiguation strategy for regular expression matching. There are, however, some difficulties associated with the POSIX strategy: according to tests conducted by Kuklewicz, many regular expression matchers implementing this strategy produce incorrect results. This thesis is concerned with a POSIX regular expression matching algorithm introduced by Sulzmann and Lu. This algorithm uses bitcoded regular expressions and is based on the idea of Brzozowski derivatives. The algorithm generates POSIX values, which encode how a regular expression matches a string, that is, which part of the string is matched by which part of the regular expression. This information is needed in the context of lexing in order to extract and classify tokens.
+
+While a formalised correctness proof for Sulzmann and Lu's algorithm already exists, this proof does not include any of the crucial simplification rules. These simplification rules are, however, necessary in order to achieve an acceptable runtime for this algorithm. Our version of the simplification rules includes a number of fixes and improvements: one problem we fix has to do with their use of the nub function, which does not remove non-trivial duplicates. We improve the simplification rules by formulating them as a simple recursive function and also by simplifying more instances of regular expressions. As a result we can establish a bound on the size of the derivatives. Our proofs are formalised in Isabelle/HOL.
+
+
+
+Old abstract:
This thesis is about regular expressions and derivatives. It combines functional algorithms and their formal verification in the Isabelle/HOL theorem prover.
Classic results say that regular expression matching should be
linear with respect to the input. The size of the regular expressions