\documentclass{article}
\usepackage{url}
\usepackage{color}\definecolor{darkblue}{rgb}{0,0,0.5}
\usepackage[a4paper,colorlinks=true,linkcolor=darkblue,citecolor=darkblue,filecolor=darkblue,pagecolor=darkblue,urlcolor=darkblue]{hyperref}
\usepackage{style}
\usepackage{langs}
\usepackage{tikz}
\usepackage{pgf}
\usepackage{pgfplots}
\usepackage{stackengine}

\addtolength{\oddsidemargin}{-6mm}
\addtolength{\evensidemargin}{-6mm}
\addtolength{\textwidth}{12mm}


%% \usepackage{accents}
\newcommand\barbelow[1]{\stackunder[1.2pt]{#1}{\raisebox{-4mm}{\boldmath$\uparrow$}}}

\begin{filecontents}{re-python2.data}
1 0.033
5 0.036
10 0.034
15 0.036
18 0.059
19 0.084
20 0.141
21 0.248
22 0.485
23 0.878
24 1.71
25 3.40
26 7.08
27 14.12
28 26.69
\end{filecontents}

\begin{filecontents}{re-java.data}
5  0.00298
10  0.00418
15  0.00996
16  0.01710
17  0.03492
18  0.03303
19  0.05084
20  0.10177
21  0.19960
22  0.41159
23  0.82234
24  1.70251
25  3.36112
26  6.63998
27  13.35120
28  29.81185
\end{filecontents}

\begin{filecontents}{re-java9.data}
1000  0.01410
2000  0.04882
3000  0.10609
4000  0.17456
5000  0.27530
6000  0.41116
7000  0.53741
8000  0.70261
9000  0.93981
10000 0.97419
11000 1.28697
12000 1.51387
14000 2.07079
16000 2.69846
20000 4.41823
24000 6.46077
26000 7.64373
30000 9.99446
34000 12.966885
38000 16.281621
42000 19.180228
46000 21.984721
50000 26.950203
60000 43.0327746
\end{filecontents}


\begin{document}
       
\section*{Proposal:\\
  Fast Regular Expression Matching\\
  with Brzozowski Derivatives}

\subsubsection*{Summary}

If you want to connect a computer directly to the Internet, it must be
immediately \mbox{hardened} against outside attacks. The current
technology for this is to use regular expressions in order to
automatically scan all incoming network traffic for signs that a
computer is under attack and, if such signs are found, to take
appropriate counter-actions. One possible action could be to slow down
the traffic from sites where repeated password-guesses originate.
Well-known network intrusion prevention systems that use regular
expressions for traffic analysis are Snort and Bro.
       
Given the large volume of Internet traffic that even the smallest
servers have to handle nowadays, the regular expressions used for
traffic analysis have become a real bottleneck. This is not just a
nuisance, but actually a security vulnerability in itself: it can lead
to denial-of-service attacks. The proposed project aims to remove this
bottleneck by using a little-known technique of building derivatives
of regular expressions. These derivatives have not yet been used in
the area of network traffic analysis, but have the potential to solve
some tenacious problems with existing approaches. The project will
require theoretical work, but also a practical implementation (a
proof-of-concept at least). The work will build on earlier work by
Ausaf and Urban from King's College London
\cite{AusafDyckhoffUrban2016}.
       
   119 
       
   120 \subsubsection*{Background}
       
   121 
       
   122 If every 10 minutes a thief turned up at your car and rattled
       
   123 violently at the doorhandles to see if he could get in, you would move
       
   124 your car to a safer neighbourhood. Computers, however, need to stay
       
   125 connected to the Internet all the time and there they have to withstand such
       
   126 attacks. A rather typical instance of an attack can be seen in the
       
   127 following log entries from a server at King's:
       
   128 
       
   129 {\small
       
   130 \begin{verbatim}
       
   131  Jan 2 00:53:19 talisker sshd: Received disconnect from 110.53.183.227: [preauth]
       
   132  Jan 2 00:58:31 talisker sshd: Received disconnect from 110.53.183.252: [preauth]
       
   133  Jan 2 01:01:28 talisker sshd: Received disconnect from 221.194.47.236: [preauth]
       
   134  Jan 2 01:03:59 talisker sshd: Received disconnect from 110.53.183.228: [preauth]
       
   135  Jan 2 01:06:53 talisker sshd: Received disconnect from 221.194.47.245: [preauth]
       
   136  ...
       
   137 \end{verbatim}}
       
   138   
       
   139 
       
   140 \noindent
       
   141 This is a record of the network activities from the server
       
   142 \href{http://talisker.inf.kcl.ac.uk/cgi-bin/repos.cgi}{talisker.inf.kcl.ac.uk},
       
   143 which hosts lecture material for students. The log indicates several
       
   144 unsuccessful login attempts via the \texttt{ssh} program from
       
   145 computers with the addresses \texttt{110.53.183.227} and so on. This
       
   146 is a clear sign of a \emph{brute-force attack} where hackers
       
   147 systematically try out password candidates.  Such attacks are blunt,
       
   148 but unfortunately very powerful.  They can originate from anywhere in
       
   149 the World; they are automated and often conducted from computers that
       
   150 have been previously hijacked.  Once the attacker ``hits'' on the
       
   151 correct password, then he or she gets access to the server. The only
       
   152 way to prevent this methodical password-guessing is to spot the
       
   153 corresponding traces in the log and then slow down the traffic from
       
   154 such sites in a way that perhaps only one password per hour can be
       
   155 tried out. This does not affect ``normal'' operations of the server,
       
   156 but drastically decreases the chances of hitting the correct password
       
   157 by a brute-force attack.
       
Server administrators use \emph{regular expressions} for scanning log
files. The purpose of these expressions is to specify patterns of
interest (for example \texttt{Received disconnect from
110.53.183.227}, where the address needs to be variable). A program
will then continuously scan for such patterns and trigger an action if
a pattern is found. Popular examples of such programs are Snort and
Bro \cite{snort,bro}. Clearly, there are also other kinds of
vulnerabilities hackers try to take advantage of---mainly programming
mistakes that can be abused to gain access to a computer. This means
server administrators have a suite of sometimes thousands of regular
expressions, each prescribing some suspicious pattern that indicates a
computer is under attack.
       
   173 
       
   174 The underlying algorithmic problem is to decide whether a string in
       
   175 the logs matches the patterns determined by the regular
       
   176 expressions. Such decisions need to be done as fast as possible,
       
   177 otherwise the scanning programs would not be useful as a hardening
       
   178 technique for servers. The quest for speed with these decisions is
       
   179 presently a rather big challenge for the amount of traffic typical
       
   180 serves have to cope with and the large number of patterns that might
       
   181 be of interest.  The problem is that when thousands of regular
       
   182 expressions are involved, the process of detecting whether a string
       
   183 matches a regular expression can be excruciating slow. This might not
       
   184 happen in most instances, but in some small number of
       
   185 instances. However in these small number of instances the algorithm
       
   186 for regular expression matching can grind to a halt.  This phenomenon
       
   187 is called \emph{catastrophic backtracking} \cite{catast}.
       
   188 Unfortunately, catastrophic backtracking is not just a nuisance, but a
       
   189 security vulnerability in itself. The reason is that attackers look
       
   190 for these instances where the scan is very slow and then trigger a
       
   191 \emph{denial-of-service attack} against the server\ldots meaning the
       
   192 server will not be reachable anymore to normal users.
       
Existing approaches try to mitigate the speed problem with regular
expressions by implementing the matching algorithms in hardware,
rather than in software \cite{hardware}. Although this solution offers
speed, it is often unappealing because attack patterns can change or
need to be augmented. A software solution is clearly more desirable in
these circumstances. Also, given that regular expressions were
introduced in 1951 by Kleene, one might think they have since been
studied and implemented to death. But this would definitely be a
mistake: in fact they are still an active research area and
off-the-shelf implementations are still extremely prone to the problem
of catastrophic backtracking. This can be seen in the following two
graphs:
       
\begin{center}
\begin{tabular}{@{}cc@{}}
%\multicolumn{2}{c}{Graph: $(a^*)^*\cdot b$ and strings 
%           $\underbrace{a\ldots a}_{n}$}\smallskip\\  
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={time in secs},
    y label style={at={(0.06,0.5)}},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=45,
    ytick={0,10,...,40},
    scaled ticks=false,
    axis lines=left,
    width=6cm,
    height=4.5cm, 
    legend entries={Python, Java 8},  
    legend pos=north west]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
  & 
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={time in secs},
    y label style={at={(0.06,0.5)}},
    %enlargelimits=false,
    %xtick={0,5000,...,30000},
    xmax=65000,
    ymax=45,
    ytick={0,10,...,40},
    scaled ticks=false,
    axis lines=left,
    width=6cm,
    height=4.5cm, 
    legend entries={Java 9},  
    legend pos=north west]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java9.data};
\end{axis}
\end{tikzpicture}
\end{tabular}  
\end{center}
       
\noindent
These graphs show the running times for matching strings of the form
$\underbrace{a\ldots a}_{n}$ against the rather simple regular
expression $(a^*)^* \,b$, using the built-in regular expression
matchers of the popular programming languages Python and Java (Version
8 and Version 9). There are many more regular expressions that show
the same behaviour. On the left-hand side the graphs show that for a
string consisting of just 28 $a\,$'s, Python and the older Java (which
was the latest version of Java until September 2017) already need 30
seconds to decide whether this string is matched or not. This is an
abysmal result given that Python and Java are very popular programming
languages. We already know that this problem can be decided much, much
faster. If these slow regular expression matchers were employed in
network traffic analysis, then this example is already an opportunity
for mounting an easy denial-of-service attack: one just has to send to
the server a large enough string of $a$'s. The newer version of Java
is better---it can match strings of around 50,000 $a\,$'s in 30
seconds---however this would still not be good enough for being a
useful tool for network traffic analysis. What is interesting is that
even a relatively simple-minded regular expression matcher based on
Brzozowski derivatives can out-compete the regular expression matchers
in Java and Python on such catastrophic backtracking examples. This
gain in speed is the motivation and starting point for the proposed
project. \smallskip
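
The Python curve in the left-hand graph can be reproduced with a few
lines of code (a minimal sketch of mine; absolute timings naturally
vary between machines):

```python
import re
import time

# The pathological pattern from the graphs: (a*)* b
pattern = re.compile(r'(a*)*b')

# Keep n small: the running time roughly doubles with every further
# 'a', so values much beyond 25 already take minutes.
for n in [10, 15, 18, 20]:
    start = time.time()
    result = pattern.match('a' * n)   # fails: the string contains no 'b'
    print(n, result, round(time.time() - start, 3))
```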
       
\subsection*{Research Plan}

Regular expressions are already an old and well-studied topic in
Computer Science. They were originally introduced by Kleene in 1951
and are utilised in text search algorithms for ``find'' or ``find and
replace'' operations \cite{Brian,Kleene51}. Because of their wide
range of applications, many programming languages provide capabilities
for regular expression matching via libraries. The classic theory of
regular expressions involves just six different kinds of regular
expressions, often defined in terms of the following grammar:

\begin{center}\small
\begin{tabular}{lcll}
  $r$ & $::=$ & $\ZERO$     & cannot match anything\\
      &   $|$ & $\ONE$      & can only match the empty string\\
      &   $|$ & $c$         & can match a single character (in this case $c$)\\
      &   $|$ & $r_1 + r_2$ & can match a string either with $r_1$ or with $r_2$\\
  &   $|$ & $r_1\cdot r_2$ & can match the first part of a string with $r_1$ and\\
          &  & & then the second part with $r_2$\\
      &   $|$ & $r^*$       & can match $r$ zero or more times\\
\end{tabular}
\end{center}
       
\noindent
For practical purposes, however, regular expressions have been
extended in various ways in order to make them even more powerful and
even more useful for applications. Some of the problems to do with
catastrophic backtracking stem, unfortunately, from these extensions.

The crux of this project is not to use automata for regular expression
matching (the traditional technique), but to use derivatives of
regular expressions instead. These derivatives were introduced by
Brzozowski in 1964 \cite{Brzozowski1964}, but somehow had been lost
``in the sands of time'' \cite{Owens2009}. However, they have recently
become a ``hot'' research topic again, with numerous research papers
appearing in the last five years. One reason for this interest is that
Brzozowski derivatives straightforwardly extend to more general
regular expression constructors, which cannot be easily treated with
standard techniques. They can also be implemented with ease in any
functional programming language: the definition of derivatives
consists of just the two simple recursive functions shown in
Figure~\ref{f}. Moreover, it can be easily shown (by a mathematical
proof) that the resulting regular expression matcher always produces a
correct answer. This is an area where Urban can provide a lot of
expertise for this project: he is one of the main developers of the
Isabelle theorem prover, a program designed for such proofs
\cite{Isabelle}.

       
\begin{figure}[t]
\begin{center}\small
\begin{tabular}{lcl}
$\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\
$\textit{nullable}(\ONE)$  & $\dn$ & $\textit{true}$\\
$\textit{nullable}(c)$     & $\dn$ & $\textit{false}$\\
$\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\
$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\
$\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\medskip\\

$\textit{der}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\
$\textit{der}\;c\;(\ONE)$  & $\dn$ & $\ZERO$\\
$\textit{der}\;c\;(d)$     & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\
$\textit{der}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{der}\;c\;r_1) + (\textit{der}\;c\;r_2)$\\
$\textit{der}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\
      & & $\textit{then}\;((\textit{der}\;c\;r_1)\cdot r_2) + (\textit{der}\;c\;r_2)$\\
      & & $\textit{else}\;(\textit{der}\;c\;r_1)\cdot r_2$\\
$\textit{der}\;c\;(r^*)$ & $\dn$ & $(\textit{der}\;c\;r)\cdot (r^*)$\\[-5mm]
\end{tabular}
\end{center}
\caption{The complete definition of derivatives of regular expressions
  \cite{Brzozowski1964}. This definition can be implemented with ease
  in functional programming languages and can be easily extended to
  regular expressions used in practice. Derivatives are more flexible
  for applications, and easier to manipulate in mathematical proofs,
  than traditional techniques for regular expression
  matching.\medskip\label{f}}
\end{figure}
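
To illustrate how little code such a matcher needs, the two functions
of Figure~\ref{f} can be transcribed almost line by line into Python
(a proof-of-concept sketch only: the tuple encoding of regular
expressions is mine, and a serious implementation would additionally
simplify the intermediate regular expressions, which this sketch
omits):

```python
# Regular expressions encoded as tagged tuples.
ZERO = ('zero',)                            # matches nothing
ONE  = ('one',)                             # matches only the empty string

def char(c):     return ('char', c)         # matches the character c
def alt(r1, r2): return ('alt', r1, r2)     # r1 + r2
def seq(r1, r2): return ('seq', r1, r2)     # r1 . r2
def star(r):     return ('star', r)         # r*

def nullable(r):
    """Can r match the empty string?"""
    tag = r[0]
    if tag == 'zero': return False
    if tag == 'one':  return True
    if tag == 'char': return False
    if tag == 'alt':  return nullable(r[1]) or nullable(r[2])
    if tag == 'seq':  return nullable(r[1]) and nullable(r[2])
    if tag == 'star': return True

def der(c, r):
    """The Brzozowski derivative of r with respect to the character c."""
    tag = r[0]
    if tag == 'zero': return ZERO
    if tag == 'one':  return ZERO
    if tag == 'char': return ONE if r[1] == c else ZERO
    if tag == 'alt':  return alt(der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        if nullable(r[1]):
            return alt(seq(der(c, r[1]), r[2]), der(c, r[2]))
        return seq(der(c, r[1]), r[2])
    if tag == 'star': return seq(der(c, r[1]), r)

def matches(r, s):
    """Build derivatives character by character; the final regular
    expression matches s iff it is nullable."""
    for c in s:
        r = der(c, r)
    return nullable(r)

# (a*)* . b, the pathological example from the graphs above
r = seq(star(star(char('a'))), char('b'))
print(matches(r, 'aaab'))   # True
print(matches(r, 'aaaa'))   # False
```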
       
There are a number of avenues for research on Brzozowski derivatives.
One problem I would like to tackle immediately in this project is how
to handle \emph{back-references} in regular expressions. It is known
that the inclusion of back-references causes the underlying
algorithmic problem to become NP-hard \cite{Aho}; back-references are
also the main reason why regular expression matchers suffer from the
catastrophic backtracking problem. However, the full generality of
back-references is not needed in practical applications. The goal
therefore is to restrict the problem sufficiently so that an efficient
algorithm can be designed. There seem to be good indications that
Brzozowski derivatives are useful for this.
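
For illustration (an example of mine in the syntax of Python's
built-in \texttt{re} library): a back-reference such as \verb|\1|
matches a second copy of whatever an earlier bracketed group matched,
something the six classic regular expression constructors cannot
express:

```python
import re

# \1 refers back to the exact substring matched by the group ((?:a|b)+),
# so the pattern recognises "doubled" words of the form w w.
pattern = re.compile(r'((?:a|b)+)\1')

print(bool(pattern.fullmatch('abab')))   # True:  'ab' followed by 'ab' again
print(bool(pattern.fullmatch('abba')))   # False: no prefix is repeated
```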
       
Another problem is \emph{how} a regular expression matches a string
\cite{Sulzmann2014}. In this case the algorithm does not just give a
yes/no answer about whether the regular expression matches the string,
but also generates a value that encodes which part of the string is
matched by which part of the regular expression. This is important in
applications like network analysis, where a specific computer address
needs to be extracted from a matching log entry. Also compilers make
extensive use of such an extension when they lex their source
programs, i.e.~break up code into word-like components. Again, Ausaf
and Urban from King's made some initial progress towards proving the
correctness of such an extension, but their algorithm is not yet fast
enough for practical purposes \cite{AusafDyckhoffUrban2016}. It would
be desirable to study optimisations that make the algorithm faster,
but preserve the correctness guarantees obtained by Ausaf and Urban.
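
How such a value is used in practice can be glimpsed with capture
groups, which provide a limited form of this information (a small
sketch of mine using Python's \texttt{re} library on a log line of the
kind shown earlier):

```python
import re

line = ('Jan 2 00:53:19 talisker sshd: '
        'Received disconnect from 110.53.183.227: [preauth]')

# The bracketed group marks which part of the pattern -- and hence
# which part of the string -- we want back: the variable address field.
m = re.search(r'Received disconnect from (\d+\.\d+\.\d+\.\d+):', line)
if m:
    print(m.group(1))   # 110.53.183.227
```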
       
\mbox{}\\[-11mm]

\subsubsection*{Conclusion}

Much research has already gone into regular expressions, and one might
think they have been studied and implemented to death. But this is far
from the truth. In fact regular expressions have become a ``hot''
research topic in recent years, because of problems with the
traditional methods (in applications such as network traffic
analysis), but also because the technique of derivatives of regular
expressions has been re-discovered. These derivatives provide an
alternative approach to regular expression matching. Their potential
lies in the fact that they are more flexible in implementations and
much easier to manipulate in mathematical proofs. Therefore I would
like to research them in this project.\medskip

\noindent {\bf Impact:} Regular expression matching is a core
technology in Computer Science with numerous applications. I will
focus on the areas of network traffic analysis and lexical analysis.
In these areas this project can have quite a large impact: for example
the program Bro has been downloaded at least 10,000 times and Snort
even has 5 million downloads and over 600,000 registered users
\cite{snort,bro}. Lexical analysis is a core component in every
compiler, and compilers are the bedrock on which nearly all
programming is built.
       
% {\small
% \begin{itemize}
% \item[$\bullet$] \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
% \item[$\bullet$] \url{https://vimeo.com/112065252}
% \item[$\bullet$] \url{http://davidvgalbraith.com/how-i-fixed-atom/}  
% \end{itemize}}
\mbox{}\\[-11mm]

\small
\bibliographystyle{plain}
\renewcommand{\refname}{References\\[-7mm]\mbox{}}
\bibliography{proposal}

\end{document}


%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: