ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 538 8016a2480704
parent 537 50e590823220
child 542 a7344c9afbaf
\newcommand\hflataux[1]{\llparenthesis #1 \rrparenthesis_*'}
\newcommand\createdByStar[1]{\textit{createdByStar}(#1)}

\newcommand\myequiv{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny equiv}}}{=}}}

\def\Some{\textit{Some}}
\def\None{\textit{None}}
\def\code{\textit{code}}
\def\decode{\textit{decode}}
\def\internalise{\textit{internalise}}
\def\lexer{\mathit{lexer}}
\def\mkeps{\textit{mkeps}}

\def\AZERO{\textit{AZERO}}
\def\AONE{\textit{AONE}}
\def\ACHAR{\textit{ACHAR}}

\def\fuse{\textit{fuse}}
\def\bder{\textit{bder}}
\def\POSIX{\textit{POSIX}}
\def\ALTS{\textit{ALTS}}
\def\ASTAR{\textit{ASTAR}}
\def\DFA{\textit{DFA}}
\def\NFA{\textit{NFA}}
\def\bmkeps{\textit{bmkeps}}
\def\retrieve{\textit{retrieve}}
\def\blexer{\textit{blexer}}
\def\flex{\textit{flex}}
\def\inj{\mathit{inj}}
\def\rrexp{\textit{rrexp}}
\newcommand\rnullable[1]{\textit{rnullable}(#1)}
\newcommand\rsize[1]{\llbracket #1 \rrbracket_r}
\newcommand\asize[1]{\llbracket #1 \rrbracket}
\newcommand\rerase[1]{ (#1)\downarrow_r}

\newcommand\ChristianComment[1]{\textcolor{blue}{#1}\\}

\def\erase{\textit{erase}}
\def\STAR{\textit{STAR}}
\def\flts{\textit{flts}}

\pgfplotsset{
    myplotstyle/.style={
    legend style={draw=none, font=\small},
    legend cell align=left,
    legend pos=north east,
    ylabel style={align=center, font=\bfseries\boldmath},
    xlabel style={align=center, font=\bfseries\boldmath},
    x tick label style={font=\bfseries\boldmath},
    y tick label style={font=\bfseries\boldmath},
    scaled ticks=true,
    every axis plot/.append style={thick},
    },
}
%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.

%TODO: look up snort rules to use here--give readers idea of what regexes look like
       
\begin{figure}
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
%% plot bodies elided in this changeset view
\end{tikzpicture}
\end{tabular}
%% caption reconstructed from the surrounding text; the original plot code is elided here
\caption{Matching times for the regular expression $(a^*)^*\,b$ on strings $aa..a$ of length $n$.}\label{fig:aStarStarb}
\end{figure}
Regular expressions are widely used in computer science: 
be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
command-line tools like $\mathit{grep}$ that facilitate easy 
text-processing; network intrusion
detection systems that reject suspicious traffic; or compiler
front ends. The majority of the solutions to these tasks 
involve lexing with regular 
expressions.
Given their usefulness and ubiquity, one would imagine that
modern regular expression matching implementations
are mature and fully studied.
Indeed, supplying a popular programming language's regex engine 
with regular expressions and strings, one can
get rich matching information in a very short time.
Some network intrusion detection systems
use regex engines that are able to process 
megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
Unfortunately, this is not the case for \emph{all} inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed


Take $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8, even with strings of a small
length, say around 30 $a$'s, one discovers that 
this decision takes an absurdly long time given the simplicity of the problem.
This is clearly exponential behaviour, and 
is triggered by some relatively simple regex patterns, as the graphs
in Figure~\ref{fig:aStarStarb} show.
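The blow-up is easy to reproduce. The following is a minimal sketch using Python's built-in \texttt{re} module (a backtracking engine); the timings only illustrate the exponential growth and are not precise measurements:

```python
import re
import time

# Matching (a*)*b against strings of only a's: a backtracking engine
# tries every way of splitting the a's between the inner and outer star
# before it can conclude that the required b is missing.
evil = re.compile(r'(a*)*b')

for n in [6, 10, 14, 18]:
    s = 'a' * n
    t0 = time.perf_counter()
    m = evil.match(s)
    elapsed = time.perf_counter() - t0
    assert m is None  # there is no b, so the match must fail
    print(f'n = {n:2d}: {elapsed:.4f}s')  # time roughly doubles per extra a
```

Already at around $n = 30$ the matching time becomes unbearable on such engines.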
       

\ChristianComment{Superlinear: I just leave out the explanation, 
which I find would distract the flow if used. Plus, if I just said ``exponential''
here it would be inaccurate: the 2016 Stack Exchange event was not exponential, 
but just quadratic.}

This superlinear blowup in regular expression engines
has repeatedly caused grief in real life.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} (Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing web servers to grind to a
halt. This happened when a post with 20,000 white spaces was submitted,
but importantly the white spaces were neither at the beginning nor at
the end. As a result, the regular expression matching engine needed to
backtrack over many choices. In this example, the time needed to process
the string was $O(n^2)$ with respect to the string length. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
attack and therefore stopped the servers from responding to any
requests. This made the whole site become unavailable. 

A more recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
behaviour and exhausted the CPUs that serve HTTP traffic. Although the outage
had several causes, at its heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/} (Last accessed in 2022)}
%TODO: data points for some new versions of languages
These problems with regular expressions 
are not isolated events that happen
only occasionally, but are actually widespread.
They occur so often that they have been given a 
name: Regular-expression Denial-of-Service (ReDoS)
attacks.
\citeauthor{Davis18} \parencite{Davis18} detected more
than 1000 super-linear (SL) regular expressions
in Node.js, Python core libraries, and npm and pypi. 
They therefore concluded that evil regular expressions
are ``more than a parlour trick'', but a problem that
requires more research attention.

But the problems are not limited to slowness on certain 
cases. 
Another problem with these libraries is that there
is no correctness guarantee.
In some cases, they either fail to generate a lexing result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
A concrete example would be 
the regex
\begin{verbatim}
(aba|ab|a)*
\end{verbatim}
and the string
\begin{verbatim}
ababa
\end{verbatim}
The correct $\POSIX$ match for the above is 
the entire string $ababa$, 
split into two Kleene star iterations, $[ab]$ and $[aba]$, at positions
$[0, 2)$ and $[2, 5)$
respectively.
But trying this out in regex101 \parencite{regex101}
with different language engines would yield 
the same two fragmented matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
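This behaviour can be observed concretely. Below is a small sketch using Python's backtracking \texttt{re} engine, which exhibits exactly the fragmented first match described above:

```python
import re

# POSIX longest-match would match the whole of "ababa" against
# (aba|ab|a)*, split as [ab][aba].  A backtracking engine instead tries
# the alternatives in priority order: the first star iteration greedily
# picks "aba", the next iteration fails on the remaining "ba", and the
# star simply stops, leaving a shorter overall match.
m = re.match(r'(aba|ab|a)*', 'ababa')
print(m.group(0))  # "aba" at span (0, 3), not the POSIX answer "ababa"
assert m.group(0) == 'aba'
assert m.span() == (0, 3)
```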
       

Kuklewicz \parencite{KuklewiczHaskell} commented that most regex libraries are not
correctly implementing the POSIX (maximum-munch)
rule of regular expression matching.

As Grathwohl \parencite{grathwohl2014crash} commented,
\begin{center}
	``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.''
\end{center}


To summarise the above: regular expressions are important.
They are popular, and programming languages' library functions
for them are very fast on non-catastrophic cases.
But there are problems with current practical implementations.
The first is that the running time might blow up.
The second is that they can be error-prone on
cases whose correct matches are easily spotted by a human.
In the next part of the chapter, we will look into the reasons why 
certain regex engines run horribly slowly on the ``catastrophic''
cases, and propose a solution that addresses both of these problems,
based on the work of Brzozowski and of Sulzmann and Lu.


\section{Why are current regex engines slow?}

%find literature/find out for yourself that REGEX->DFA on basic regexes
%does not blow up the size
Shouldn't regular expression matching be linear?
How can one explain the super-linear behaviour of the 
regex matching engines we have?
The time cost of regex matching algorithms in general
involves two different phases, and different things can go wrong in 
each of them:
$\DFA$s usually have problems in the first (construction) phase,
whereas $\NFA$s usually run into trouble
in the second phase.

\subsection{Different Phases of a Matching/Lexing Algorithm}

Most lexing algorithms can be roughly divided into 
two phases during their run.
The first phase is the ``construction'' phase,
in which the algorithm builds some  
suitable data structure from the input regex $r$, so that
it can be easily operated on later.
We denote
the time cost of such a phase by $P_1(r)$.
The second phase is the lexing phase, in which the input string 
$s$ is read and the data structure
representing the regex $r$ is operated on. 
We represent the time
it takes by $P_2(r, s)$.

For a $\mathit{DFA}$,
we have $P_2(r, s) = O( |s| )$,
because we take at most $|s|$ steps, 
and each step takes
at most one transition---a
deterministic finite automaton
by definition has at most one state active and makes at most one
transition upon receiving an input symbol.
But unfortunately in the worst case
$P_1(r) = O(2^{|r|})$. An example will be given later. 
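The $P_2(r, s) = O(|s|)$ bound can be illustrated with a small table-driven sketch; the $\DFA$ below, hand-written for the regex $ab^*$, is purely illustrative:

```python
# Hand-written DFA for the regex ab*: state 0 is the start state,
# state 1 is accepting, and state 2 is a dead (sink) state.
TRANSITIONS = {
    (0, 'a'): 1, (0, 'b'): 2,
    (1, 'a'): 2, (1, 'b'): 1,
    (2, 'a'): 2, (2, 'b'): 2,
}
ACCEPTING = {1}

def dfa_match(s):
    state, steps = 0, 0
    for c in s:
        state = TRANSITIONS[(state, c)]  # exactly one transition per symbol
        steps += 1
    return state in ACCEPTING, steps

ok, steps = dfa_match('abbb')
assert ok is True and steps == 4  # P2 is exactly |s| table lookups
assert dfa_match('ba') == (False, 2)
```

Each input symbol causes exactly one table lookup, which is what makes the lexing phase of a $\DFA$ linear in $|s|$.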
       

For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold 
expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$.
The cost $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound bloats
to $|r| \cdot 2^{|s|}$.
%on the input
%And when calculating the time complexity of the matching algorithm,
%we are assuming that each input reading step requires constant time.
%which translates to that the number of 
%states active and transitions taken each time is bounded by a
%such as negation, intersection, 
%bounded repetitions and back-references.
%And de-sugaring these "extended" regular expressions 
%into basic ones might bloat the size exponentially.
%TODO: more reference for exponential size blowup on desugaring. 

\subsection{Why $\mathit{DFA}$s can be slow in the first phase}

The good thing about $\mathit{DFA}$s is that once
generated, they are fast and stable, unlike
backtracking algorithms. 
However, they do not scale well with bounded repetitions.

\subsubsection{Problems with Bounded Repetitions}
Bounded repetitions, usually written in the form
$r^{\{c\}}$ (where $c$ is a constant natural number),
denote a regular expression accepting strings
that can be divided into $c$ substrings, where each 
substring is in $r$. 
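For instance, in the concrete syntax of Python's \texttt{re} module (used here only as an illustrative sketch), $(ab)^{\{3\}}$ accepts exactly three copies of $ab$:

```python
import re

# (ab){3} accepts strings that split into exactly 3 substrings,
# each of which is in the language of ab.
assert re.fullmatch(r'(ab){3}', 'ababab') is not None
assert re.fullmatch(r'(ab){3}', 'abab') is None      # too few repetitions
assert re.fullmatch(r'(ab){3}', 'abababab') is None  # too many repetitions
```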
regexes shaped like $r^*ar^{\{n\}}$, when converted to $\mathit{DFA}$s,
would require at least $2^{n+1}$ states if $r$ contains
more than one string.
This is to represent all the different 
scenarios in which ``countdown'' states are active.
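This lower bound can be checked experimentally with a small subset-construction sketch. The code below hand-codes the $\NFA$ for $(a|b)^*\,a\,(a|b)^{\{n\}}$, an instance of $r^*ar^{\{n\}}$ with $r = a|b$, and counts the reachable $\DFA$ states; all names are illustrative:

```python
from collections import deque

def count_dfa_states(n):
    # NFA for (a|b)*a(a|b){n}: state 0 loops on a/b; on an 'a' it may also
    # start the "countdown" chain 1 -> 2 -> ... -> n+1 (accepting).
    def step(states, c):
        nxt = set()
        for q in states:
            if q == 0:
                nxt.add(0)                  # stay in the loop
                if c == 'a':
                    nxt.add(1)              # guess: this 'a' starts the countdown
            elif q <= n:
                nxt.add(q + 1)              # count down on any symbol
        return frozenset(nxt)

    start = frozenset({0})
    seen, todo = {start}, deque([start])
    while todo:                             # subset construction, by BFS
        s = todo.popleft()
        for c in 'ab':
            t = step(s, c)
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return len(seen)

# the DFA must remember, for each of the last n+1 positions, whether an
# 'a' occurred there -- hence 2^(n+1) reachable states
for n in range(1, 7):
    assert count_dfa_states(n) == 2 ** (n + 1)
```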
For those regexes, tools that use $\DFA$s will get
out-of-memory errors.

\subsubsection{Tools that use $\mathit{DFA}$s}
%TODO:more tools that use DFAs?
$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
in $C$ and $\mathit{JAVA}$ that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations to such lexer generators, and then 
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run. 
When given the above countdown regular expression,
even a small number $n$ would result in a determinised automaton
with millions of states.

For this reason, regex libraries that support 
bounded repetitions often choose to use the $\mathit{NFA}$ 
approach.

\subsection{Why $\mathit{NFA}$s can be slow in the second phase}
When one constructs an $\NFA$ out of a regular expression
there is often very little to be done in the first phase: one simply 
constructs the $\NFA$ states based on the structure of the input regular expression.

In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
one is by keeping track of all active states after consuming 
a character, and updating that set of states iteratively.
This can be viewed as a breadth-first search of the $\mathit{NFA}$
for a path terminating
at an accepting state.
Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
type of $\mathit{NFA}$ simulation and guarantee a linear runtime
in terms of input string length.
%TODO:try out these lexers
The other way to use an $\mathit{NFA}$ for matching is to choose  
a single transition each time, keeping all the other options in 
a queue or stack, and backtracking if that choice eventually 
fails. This method
is efficient in a lot of cases, but could end up
with exponential run time.\\
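The first, set-of-states style of simulation can be sketched as follows. The $\varepsilon$-NFA below is hand-written for the evil regex $(a^*)^*\,b$ (state numbering and names are illustrative); one pass over the input updates the whole set of active states, so the running time is linear in the string length:

```python
# Hand-coded epsilon-NFA recognising (a*)*b:
#   0 -eps-> 1, 0 -eps-> 3, 1 -a-> 2, 2 -eps-> 1, 2 -eps-> 3, 3 -b-> 4
EPS = {0: {1, 3}, 2: {1, 3}}
DELTA = {(1, 'a'): {2}, (3, 'b'): {4}}
ACCEPT = {4}

def eps_closure(states):
    stack, seen = list(states), set(states)
    while stack:                       # follow epsilon transitions
        q = stack.pop()
        for r in EPS.get(q, ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def nfa_match(s):
    active = eps_closure({0})
    for c in s:                        # one set-update per input character
        step = set()
        for q in active:
            step |= DELTA.get((q, c), set())
        active = eps_closure(step)
    return bool(active & ACCEPT)

assert nfa_match('aaab') is True
assert nfa_match('a' * 10000) is False   # linear time, no exponential blowup
```

In contrast to the backtracking simulation, the failing input of $10000$ $a$'s is rejected after a single linear pass.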
%TODO:COMPARE java python lexer speed with Rust and Go
The reason behind the use of backtracking algorithms in languages like
Java and Python is that they support back-references.
\subsubsection{Back References}
If we have a regular expression like this (the sequence
operator is omitted for brevity):
\begin{center}
	$r_1(r_2(r_3r_4))$
\end{center}
of formal languages. 
Solving the back-reference expressions matching problem
is NP-complete \parencite{alfred2014algorithms} and a non-backtracking,
efficient solution is not known to exist.
%TODO:read a bit more about back reference algorithms
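Back-references push matching beyond the regular languages. A small sketch with Python's \texttt{re} engine (where \verb|\1| must repeat whatever group 1 matched) recognises the non-regular language $\{a^n b a^n \mid n \geq 0\}$:

```python
import re

# (a*)b\1 matches a^n b a^n: the back-reference \1 must repeat exactly
# the string captured by group 1 -- something no finite automaton can do.
pat = re.compile(r'(a*)b\1')

assert pat.fullmatch('aaabaaa') is not None  # n = 3
assert pat.fullmatch('aabaaa') is None       # unequal numbers of a's
assert pat.fullmatch('b') is not None        # n = 0
```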
       

It seems that languages like Java and Python made the trade-off
to support back-references at the expense of having to backtrack,
even in the case of regexes not involving back-references.\\
Summing these up, we can categorise existing 
practical regex libraries into ones with linear
time guarantees, like Go and Rust, which impose restrictions
on the user input (not allowing back-references, 
bounded repetitions cannot exceed 1000, etc.), and ones  
that allow the programmer much freedom, but grind to a halt
on some non-negligible portion of cases.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
% For example, the Rust regex engine claims to be linear, 
% but does not support lookarounds and back-references.
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps
\ChristianComment{Comment required: Java 17 updated graphs? Is it ok to still use Java 8 graphs?}

So we have practical implementations 
of regular expression matching/lexing which are fast
but do not come with any guarantees that they will not grind to a halt
or give wrong answers.
Our goal is to have a regex lexing algorithm that comes with 
\begin{itemize}
\item
proven correctness, 
\item 
proven non-catastrophic properties, and
\item
easy extensions to
constructs like 
bounded repetitions, negation, lookarounds, and even back-references.
\end{itemize}

\section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
We propose Brzozowski derivatives on regular expressions as
a solution to these problems.
In the last fifteen or so years, Brzozowski's derivatives of regular
expressions have sparked quite a bit of interest in the functional
programming and theorem prover communities.

\subsection{Motivation}

Derivatives give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).  
       
The beauty of
Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
expressible in any functional language, and easily definable and
reasoned about in theorem provers---the definitions just consist of
inductive datatypes and simple recursive functions. 
An algorithm based on them by 
Sulzmann and Lu \parencite{Sulzmann2014} allows easy extension
to include extended regular expressions and 
simplification of internal data structures, 
eliminating the exponential behaviours.
However, two difficulties with derivative-based matchers exist:
\subsubsection{Problems with Current Brzozowski Matchers}
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not.  This is too
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
or identifiers).  Sulzmann and Lu~\cite{Sulzmann2014} overcome this
difficulty by cleverly extending Brzozowski's matching
algorithm. Their extended version generates additional information on
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
function.  In our own earlier work, we provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of Sulzmann and Lu's extended algorithm accordingly
\cite{AusafDyckhoffUrban2016}.

above: the growth is slowed, but the derivatives can still grow rather
quickly beyond any finite bound.

Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
\cite{Sulzmann2014} where they introduce bit-coded
regular expressions. In this version, POSIX values are
represented as bit sequences and such sequences are incrementally generated
when derivatives are calculated. The compact representation
of bit sequences and regular expressions allows them to define a more
``aggressive'' simplification method that keeps the size of the
derivatives finite no matter what the length of the string is.
They make some informal claims about the correctness and linear behaviour
of this version, but do not provide any supporting proof arguments, not
even ``pencil-and-paper'' arguments. They write about their bit-coded
\emph{incremental parsing method} (that is the algorithm to be formalised
in this dissertation):


\begin{quote}\it
``Correctness Claim: We further claim that the incremental parsing
This thesis implements the aggressive simplifications envisioned
by Ausaf and Urban,
and gives a formal proof of the correctness of the algorithm with those simplifications.

%----------------------------------------------------------------------------------------

\section{Contribution}


This work addresses the vulnerability of super-linear and
   676 We give an 
   722 We give an 
   677 improved version of  Sulzmann and Lu's bit-coded algorithm using 
   723 improved version of  Sulzmann and Lu's bit-coded algorithm using 
   678 derivatives, which come with a formal guarantee in terms of correctness and 
   724 derivatives, which come with a formal guarantee in terms of correctness and 
   679 running time as an Isabelle/HOL proof.
   725 running time as an Isabelle/HOL proof.
   680 Then we improve the algorithm with an even stronger version of 
   726 Further improvements to the algorithm with an even stronger version of 
   681 simplification, and prove a time bound linear to input and
   727 simplification is made.
       
   728 We have not yet come up with one, but believe that it leads to a 
       
   729 formalised proof with a time bound linear to input and
   682 cubic to regular expression size using a technique by
   730 cubic to regular expression size using a technique by
   683 Antimirov.
   731 Antimirov\cite{Antimirov}.
   684 
   732 
   685  
   733  
   686 The main contribution of this thesis is a proven correct lexing algorithm
   734 The main contribution of this thesis is 
   687 with formalized time bounds.
   735 \begin{itemize}
       
   736 \item
       
   737 a proven correct lexing algorithm
       
   738 \item
       
   739 with formalized finite bounds on internal data structures' sizes.
       
   740 \end{itemize}
       
   741 
   688 To our best knowledge, no lexing libraries using Brzozowski derivatives
   742 To our best knowledge, no lexing libraries using Brzozowski derivatives
   689 have a provable time guarantee, 
   743 have a provable time guarantee, 
   690 and claims about running time are usually speculative and backed by thin empirical
   744 and claims about running time are usually speculative and backed by thin empirical
   691 evidence.
   745 evidence.
   692 %TODO: give references
   746 %TODO: give references
   702 of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
   756 of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
   703 and concluded that the algorithm is quadratic in terms of input length.
   757 and concluded that the algorithm is quadratic in terms of input length.
   704 When we tried out their extracted OCaml code with our example $(a+aa)^*$,
   758 When we tried out their extracted OCaml code with our example $(a+aa)^*$,
   705 the time it took to lex only 40 $a$'s was 5 minutes.
   759 the time it took to lex only 40 $a$'s was 5 minutes.
   706 
   760 
   707 We  believe our results of a proof of performance on general
       
   708 inputs rather than specific examples a novel contribution.\\
       
   709 
   761 
   710 
   762 
   711 \subsection{Related Work}
   763 \subsection{Related Work}
   712 We are aware
   764 We are aware
   713 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
   765 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
   714 Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
   766 Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
   715 of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
   767 of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
   716 in Coq is given by Coquand and Siles \parencite{Coquand2012}.
   768 in Coq is given by Coquand and Siles \parencite{Coquand2012}.
   717 Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
   769 Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
   718  
   770  
   719  %We propose Brzozowski's derivatives as a solution to this problem.
   771  
   720 % about Lexing Using Brzozowski derivatives
   772  When a regular expression does not behave as intended,
       
   773 people usually try to rewrite the regex to some equivalent form
       
   774 or they try to avoid the possibly problematic patterns completely,
       
   775 for which many false positives exist\parencite{Davis18}.
       
   776 Animated tools to "debug" regular expressions such as
       
   777  \parencite{regexploit2021} \parencite{regex101} are also popular.
       
   778 We are also aware of static analysis work on regular expressions that
       
   779 aims to detect potentially expoential regex patterns. Rathnayake and Thielecke 
       
   780 \parencite{Rathnayake2014StaticAF} proposed an algorithm
       
   781 that detects regular expressions triggering exponential
       
   782 behavious on backtracking matchers.
       
   783 Weideman \parencite{Weideman2017Static} came up with 
       
   784 non-linear polynomial worst-time estimates
       
   785 for regexes, attack string that exploit the worst-time 
       
   786 scenario, and "attack automata" that generates
       
   787 attack strings.
       
   788 
       
   789 
   721 
   790 
   722 
   791 
\section{Structure of the thesis}
In Chapter~\ref{Inj} we will introduce the concepts
and notations we 
use for describing the lexing algorithm by Sulzmann and Lu,
and then give the lexing algorithm.
We will give its variant in Chapter~\ref{Bitcoded1}.
Then we illustrate in Chapter~\ref{Bitcoded2}
how the algorithm without bitcodes falls short for such aggressive 
simplifications and therefore introduce our version of the
bit-coded algorithm and 
its correctness proof.  
In Chapter~\ref{Finite} we give the second guarantee
of our bit-coded algorithm, that is a finite bound on the size of any 
regex's derivatives.
In Chapter~\ref{Cubic} we discuss stronger simplifications to improve the finite bound
in Chapter~\ref{Finite} to a polynomial one, and demonstrate how one can extend the
algorithm to include constructs such as bounded repetitions and negations.
