lexing: comparison ChengsongTanPhdThesis/Chapters/Inj.tex

equal deleted inserted replaced

-:3bedbdce3a3b
+:6da4516ea87d
 % Chapter Template
-\chapter{Regular Expressions and POSIX Lexing} % Main chapter title
+\chapter{Regular Expressions and POSIX Lexing-Preliminaries} % Main chapter title
 \label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
 %and notations we
 % used for describing the lexing algorithm by Sulzmann and Lu,
 %and then give the algorithm and its variant and discuss
 %why more aggressive simplifications are needed.
+We start with a technical overview of the catastrophic backtracking problem,
-In this chapter, we define the basic notions
+motivating rigorous approaches to regular expression matching and lexing.
-for regular languages and regular expressions.
+Then we introduce basic notations and definitions of our problem.
-This is essentially a description in ``English''
-the functions and datatypes of our formalisation in Isabelle/HOL.
-We also define what $\POSIX$ lexing means,
-followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014}
-that produces the output conforming
-to the $\POSIX$ standard\footnote{In what follows
-we choose to use the Isabelle-style notation
-for function applications, where
-the parameters of a function are not enclosed
-inside a pair of parentheses (e.g. $f \;x \;y$
-instead of $f(x,\;y)$). This is mainly
-to make the text visually more concise.}.
 \section{Technical Overview}
 Consider for example the regular expression $(a^*)^*\,b$ and
 strings of the form $aa..a$. These strings cannot be matched by this regular
-expression: Obviously the expected $b$ in the last
+expression: obviously the expected $b$ in the last
 position is missing. One would assume that modern regular expression
 matching engines can find this out very quickly. Surprisingly, if one tries
 this example in JavaScript, Python or Java 8, even with small strings,
 say of lenght of around 30 $a$'s,
-the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
+the decision takes large amounts of time to finish.
+This is inproportional
+to the simplicity of the input (see graphs in figure \ref{fig:aStarStarb}).
 The algorithms clearly show exponential behaviour, and as can be seen
 is triggered by some relatively simple regular expressions.
 Java 9 and newer
-versions improve this behaviour somewhat, but are still slow compared
+versions mitigates this behaviour by several magnitudes,
+but are still slow compared
 with the approach we are going to use in this thesis.
 This superlinear blowup in regular expression engines
 legend pos=north west,
 legend cell align=left]
 \addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
 \end{axis}
 \end{tikzpicture}\\
+\begin{tikzpicture}
+\begin{axis}[
+xlabel={$n$},
+x label style={at={(1.05,-0.05)}},
+ylabel={time in secs},
+enlargelimits=true,
+%xtick={0,5,...,30},
+%xmax=33,
+%ymax=35,
+restrict x to domain*=0:60000,
+restrict y to domain*=0:35,
+%ytick={0,5,...,30},
+%scaled ticks=false,
+axis lines=left,
+width=5cm,
+height=4cm,
+legend entries={Scala},
+legend pos=north west,
+legend cell align=left]
+\addplot[magenta,mark=*, mark options={fill=white}] table {re-blexersimp.data};
+\end{axis}
+\end{tikzpicture}
+&
+\begin{tikzpicture}
+\begin{axis}[
+xlabel={$n$},
+x label style={at={(1.05,-0.05)}},
+%ylabel={time in secs},
+enlargelimits=true,
+%xtick={0,5000,...,40000},
+%xmax=40000,
+%ymax=35,
+restrict x to domain*=0:60000,
+restrict y to domain*=0:45,
+%ytick={0,5,...,30},
+%scaled ticks=false,
+axis lines=left,
+width=5cm,
+height=4cm,
+legend style={cells={align=left}},
+legend entries={Isabelle \\ Extracted},
+legend pos=north west,
+legend cell align=left]
+\addplot[magenta,mark=*, mark options={fill=white}] table {re-fromIsabelle.data};
+\end{axis}
+\end{tikzpicture}\\
 \multicolumn{2}{c}{Graphs}
 \end{tabular}
 \end{center}
 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
 of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
-The reason for their superlinear behaviour is that they do a depth-first-search
+The reason for their fast growth %superlinear behaviour
+is that they do a depth-first-search
 using NFAs.
 If the string does not match, the regular expression matching
 engine starts to explore all possibilities.
+The last two graphs are for our implementation in Scala, one manual
+and one extracted from the verified lexer in Isabelle by $\textit{codegen}$.
+Our lexer
+performs better in this case, and
+is formally verified.
+Despite being almost identical, the codegen-generated lexer
+% is generated directly from Isabelle using $\textit{codegen}$,
+is slower than the manually written version since the code synthesised by
+$\textit{codegen}$ does not use native integer or string
+types of the target language.
+%Note that we are comparing our \emph{lexer} against other languages' matchers.
 }\label{fig:aStarStarb}
 \end{figure}\afterpage{\clearpage}
 A more recent example is a global outage of all Cloudflare servers on 2 July
 2019. A poorly written regular expression exhibited catastrophic backtracking
 can occur in traditional regular expression engines
 and (ii) why we choose our
 approach based on Brzozowski derivatives and formal proofs.
+In this chapter, we define the basic notions
+for regular languages and regular expressions.
+This is essentially a description in ``English''
+the functions and datatypes of our formalisation in Isabelle/HOL.
+We also define what $\POSIX$ lexing means,
+followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014}
+that produces the output conforming
+to the $\POSIX$ standard.
+This is a preliminary chapter which describes the results of
+Sulzmann and Lu and Ausaf et al \ref{AusafDyckhoffUrban2016},
+but the proof details and function definitions are needed to motivate our work
+as we show in chapter \ref{Bitcoded2} how the proofs break down when
+simplifications are applied.
+%TODO: Actually show this in chapter 4.
+In what follows
+we choose to use the Isabelle-style notation
+for function and datatype constructor applications, where
+the parameters of a function are not enclosed
+inside a pair of parentheses (e.g. $f \;x \;y$
+instead of $f(x,\;y)$). This is mainly
+to make the text visually more concise.
 \section{Preliminaries}%Regex, and the Problems with Regex Matchers}
 Regular expressions and regular expression matchers
 have clearly been studied for many, many years.
 Theoretical results in automata theory state
 that basic regular expression matching should be linear
 automata need to expand $r^{\{n\}}$ into $n$ connected
 copies of the automaton for $r$. This leads to very inefficient matching
 algorithms  or algorithms that consume large amounts of memory.
 Implementations using $\DFA$s will
 in such situations
-either become excruciatingly slow
+either become very slow
 (for example Verbatim++ \cite{Verbatimpp}) or run
 out of memory (for example $\mathit{LEX}$ and
 $\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
 in C and JAVA that generate $\mathit{DFA}$-based
 lexers. The user provides a set of regular expressions
 or identifiers).
 Second, derivative-based matchers need to be more efficient in terms
 of the sizes of derivatives.
 Elegant and beautiful
 as many implementations are,
-they can be excruciatingly slow.
+they can be still quite slow.
 For example, Sulzmann and Lu
 claim a linear running time of their proposed algorithm,
 but that was falsified by our experiments. The running time
 is actually $\Omega(2^n)$ in the worst case.
 A similar claim about a theoretical runtime of $O(n^2)$
 $ using Brzozowski's original algorithm}\label{NaiveMatcher}
 \end{center}
 \end{figure}
 \noindent
 If we implement the above algorithm naively, however,
-the algorithm can be excruciatingly slow, as shown in
+the algorithm can be as slow as backtracking lexers, as shown in
 \ref{NaiveMatcher}.
 Note that both axes are in logarithmic scale.
 Around two dozen characters
 this algorithm already ``explodes'' with the regular expression
 $(a^*)^*b$.
 $\dn$  & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
 $\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$  & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
 $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$         & $\dn$  & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
 \end{tabular}
 \end{center}
 \noindent
 The function recurses on
 the shape of regular
 expressions and values.
 Intuitively, each clause analyses
 to inject the character back into.
 Once the character is
 injected back to that sub-value;
 $\inj$ assembles all parts
 to form a new value.
+\subsection{An Example of how Injection Works}
+We will provide a few examples on why
+\begin{center}
+	\begin{tabular}{lcl}
+		$\inj \; (a\cdot b)\cdot c \; \Seq \; \ONE \; b$ & $=$ & $(a+e)$\\
+		$A$ & $B$ & $C$
+	\end{tabular}
+\end{center}
 For instance, the last clause is an
 injection into a sequence value $v_{i+1}$
 whose second child
 value is a star and the shape of the
 regular expression $r_i$ before injection

changeset 666	6da4516ea87d
parent 657	00171b627b8d
child 667	660cf698eb26