text/proposal.tex
author Chengsong
Mon, 04 Feb 2019 12:56:56 +0000
changeset 303 877834bfe53b
parent 290 ed3169a567ea
child 305 6e2cef17a9b3
permissions -rw-r--r--
test

\documentclass{article}
\usepackage{url}
\usepackage{color}\definecolor{darkblue}{rgb}{0,0,0.5}
\usepackage[a4paper,colorlinks=true,linkcolor=darkblue,citecolor=darkblue,filecolor=darkblue,pagecolor=darkblue,urlcolor=darkblue]{hyperref}
\usepackage{style}
\usepackage{langs}
\usepackage{tikz}
\usepackage{pgf}
\usepackage{pgfplots}
\usepackage{stackengine}

\addtolength{\oddsidemargin}{-6mm}
\addtolength{\evensidemargin}{-6mm}
\addtolength{\textwidth}{12mm}



%% \usepackage{accents}
\newcommand\barbelow[1]{\stackunder[1.2pt]{#1}{\raisebox{-4mm}{\boldmath$\uparrow$}}}

\begin{filecontents}{re-python2.data}
1 0.033
5 0.036
10 0.034
15 0.036
18 0.059
19 0.084 
20 0.141
21 0.248
22 0.485
23 0.878
24 1.71
25 3.40
26 7.08
27 14.12
28 26.69
\end{filecontents}

\begin{filecontents}{re-java.data}
5  0.00298
10  0.00418
15  0.00996
16  0.01710
17  0.03492
18  0.03303
19  0.05084
20  0.10177
21  0.19960
22  0.41159
23  0.82234
24  1.70251
25  3.36112
26  6.63998
27  13.35120
28  29.81185
\end{filecontents} 

\begin{filecontents}{re-java9.data}
1000  0.01410
2000  0.04882
3000  0.10609
4000  0.17456
5000  0.27530
6000  0.41116
7000  0.53741
8000  0.70261
9000  0.93981
10000 0.97419
11000 1.28697
12000 1.51387
14000 2.07079
16000 2.69846
20000 4.41823
24000 6.46077
26000 7.64373
30000 9.99446
34000 12.966885
38000 16.281621
42000 19.180228
46000 21.984721
50000 26.950203
60000 43.0327746
\end{filecontents}


\begin{document}

  
\section*{PPProposal: \\
  Fast Regular Expression Matching\\
  with Brzozowski Derivatives}

\subsubsection*{Summary}

If you want to connect a computer directly to the Internet, it must be
immediately \mbox{hardened} against outside attacks. The current
technology for this is to use regular expressions in order to
automatically scan all incoming network traffic for signs when a
computer is under attack and if found, to take appropriate
counter-actions. One possible action could be to slow down the traffic
from sites where repeated password-guesses originate. Well-known
network intrusion prevention systems that use regular expressions for
traffic analysis are Snort and Bro.

Given the large
volume of Internet traffic even the smallest servers can handle
nowadays, the regular expressions for traffic analysis have become a
real bottleneck.  This is not just a nuisance, but actually a security
vulnerability in itself: it can lead to denial-of-service attacks.
The proposed project aims to remove this bottleneck by using a little-known
technique of building derivatives of regular expressions. These
derivatives have not yet been used in the area of network traffic
analysis, but have the potential to solve some tenacious problems with
existing approaches. The project will require theoretical work, but
also a practical implementation (a proof-of-concept at least).  The
work will build on earlier work by Ausaf and Urban from King's College
London \cite{AusafDyckhoffUrban2016}.


\subsubsection*{Background}

If every 10 minutes a thief turned up at your car and rattled
violently at the doorhandles to see if he could get in, you would move
your car to a safer neighbourhood. Computers, however, need to stay
connected to the Internet all the time and there they have to withstand such
attacks. A rather typical instance of an attack can be seen in the
following log entries from a server at King's:

{\small
\begin{verbatim}
 Jan 2 00:53:19 talisker sshd: Received disconnect from 110.53.183.227: [preauth]
 Jan 2 00:58:31 talisker sshd: Received disconnect from 110.53.183.252: [preauth]
 Jan 2 01:01:28 talisker sshd: Received disconnect from 221.194.47.236: [preauth]
 Jan 2 01:03:59 talisker sshd: Received disconnect from 110.53.183.228: [preauth]
 Jan 2 01:06:53 talisker sshd: Received disconnect from 221.194.47.245: [preauth]
 ...
\end{verbatim}}
  

\noindent
This is a record of the network activities from the server
\href{http://talisker.inf.kcl.ac.uk/cgi-bin/repos.cgi}{talisker.inf.kcl.ac.uk},
which hosts lecture material for students. The log indicates several
unsuccessful login attempts via the \texttt{ssh} program from
computers with the addresses \texttt{110.53.183.227} and so on. This
is a clear sign of a \emph{brute-force attack} where hackers
systematically try out password candidates.  Such attacks are blunt,
but unfortunately very powerful.  They can originate from anywhere in
the World; they are automated and often conducted from computers that
have been previously hijacked.  Once the attacker ``hits'' on the
correct password, then he or she gets access to the server. The only
way to prevent this methodical password-guessing is to spot the
corresponding traces in the log and then slow down the traffic from
such sites in a way that perhaps only one password per hour can be
tried out. This does not affect ``normal'' operations of the server,
but drastically decreases the chances of hitting the correct password
by a brute-force attack.


Server administrators use \emph{regular expressions} for scanning log
files. The purpose of these expressions is to specify patterns of
interest (for example \texttt{Received disconnect from 110.53.183.227}
where the address needs to be variable). A program will
then continuously scan for such patterns and trigger an action if a
pattern is found. Popular examples of such programs are Snort and Bro
\cite{snort,bro}.  Clearly, there are also other kinds of
vulnerabilities hackers try to take advantage of---mainly programming
mistakes that can be abused to gain access to a computer. This means
server administrators have a suite of sometimes thousands of regular
expressions, all prescribing some suspicious pattern for when a
computer is under attack.


The underlying algorithmic problem is to decide whether a string in
the logs matches the patterns determined by the regular
expressions. Such decisions need to be done as fast as possible,
otherwise the scanning programs would not be useful as a hardening
technique for servers. The quest for speed with these decisions is
presently a rather big challenge for the amount of traffic typical
serves have to cope with and the large number of patterns that might
be of interest.  The problem is that when thousands of regular
expressions are involved, the process of detecting whether a string
matches a regular expression can be excruciating slow. This might not
happen in most instances, but in some small number of
instances. However in these small number of instances the algorithm
for regular expression matching can grind to a halt.  This phenomenon
is called \emph{catastrophic backtracking} \cite{catast}.
Unfortunately, catastrophic backtracking is not just a nuisance, but a
security vulnerability in itself. The reason is that attackers look
for these instances where the scan is very slow and then trigger a
\emph{denial-of-service attack} against the server\ldots meaning the
server will not be reachable anymore to normal users.

Existing approaches try to mitigate the speed problem with regular
expressions by implementing the matching algorithms in hardware,
rather than in software \cite{hardware}. Although this solution offers
speed, it is often unappealing because attack patterns can change or
need to be augmented. A software solution is clearly more desirable in
these circumstances.  Also given that regular expressions were
introduced in 1950 by Kleene, one might think regular expressions have
since been studied and implemented to death. But this would definitely
be a mistake: in fact they are still an active research area and
off-the-shelf implementations are still extremely prone to the
problem of catastrophic backtracking. This can be seen in the
following two graphs:

\begin{center}
\begin{tabular}{@{}cc@{}}
%\multicolumn{2}{c}{Graph: $(a^*)^*\cdot b$ and strings 
%           $\underbrace{a\ldots a}_{n}$}\smallskip\\  
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={time in secs},
    y label style={at={(0.06,0.5)}},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=45,
    ytick={0,10,...,40},
    scaled ticks=false,
    axis lines=left,
    width=6cm,
    height=4.5cm, 
    legend entries={Python, Java 8},  
    legend pos=north west]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
  & 
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={time in secs},
    y label style={at={(0.06,0.5)}},
    %enlargelimits=false,
    %xtick={0,5000,...,30000},
    xmax=65000,
    ymax=45,
    ytick={0,10,...,40},
    scaled ticks=false,
    axis lines=left,
    width=6cm,
    height=4.5cm, 
    legend entries={Java 9},  
    legend pos=north west]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java9.data};
\end{axis}
\end{tikzpicture}
\end{tabular}  
\end{center}

\noindent
These graphs show how long strings can be when using the rather simple
regular expression $(a^*)^* \,b$ and the built-in regular expression
matchers in the popular programming languages Python and Java (Version
8 and Version 9). There are many more regular expressions that show
the same behaviour.  On the left-hand side the graphs show that for a
string consisting of just 28 $a\,$'s, Python and the older Java (which
was the latest version of Java until September 2017) already need 30
seconds to decide whether this string is matched or not. This is an
abysmal result given that Python and Java are very popular programming
languages. We already know that this problem can be decided much, much
faster.  If these slow regular expression matchers would be employed
in network traffic analysis, then this example is already an
opportunity for mounting an easy denial-of-service attack: one just
has to send to the server a large enough string of a's. The newer
version of Java is better---it can match strings of around 50,000
$a\,$'s in 30 seconds---however this would still not be good enough
for being a useful tool for network traffic analysis. What is
interesting is that even a relatively simple-minded regular expression
matcher based on Brzozowski derivatives can out-compete the regular
expressions matchers in Java and Python on such catastrophic
backtracking examples. This gain in speed is the motivation and
starting point for the proposed project. \smallskip

\subsection*{Research Plan}

Regular expressions are already an old and well-studied topic in
Computer Science. They were originally introduced by Kleene in 1951
and are utilised in text search algorithms for ``find'' or ``find and
replace'' operations \cite{Brian,Kleene51}. Because of their wide
range of applications, many programming languages provide capabilities
for regular expression matching via libraries.  The classic theory
of regular expressions involves just 6 different kinds of regular
expressions, often defined in terms of the following grammar:

\begin{center}\small
\begin{tabular}{lcll}
  $r$ & $::=$ & $\ZERO$     & cannot match anything\\
      &   $|$ & $\ONE$      & can only match the empty string\\
      &   $|$ & $c$         & can match a single character (in this case $c$)\\
      &   $|$ & $r_1 + r_2$ & can match a string either with $r_1$ or with $r_2$\\
  &   $|$ & $r_1\cdot r_2$ & can match the first part of a string with $r_1$ and\\
          &  & & then the second part with $r_2$\\
      &   $|$ & $r^*$       & can match zero or more times $r$\\
\end{tabular}
\end{center}

\noindent
For practical purposes, however, regular expressions have been
extended in various ways in order to make them even more powerful and
even more useful for applications. Some of the problems to do with
catastrophic backtracking stem, unfortunately, from these extensions.

The crux in this project is to not use automata for regular expression
matching (the traditional technique), but to use derivatives of
regular expressions instead.  These derivatives were introduced by
Brzozowski in 1964 \cite{Brzozowski1964}, but somehow had been lost
``in the sands of time'' \cite{Owens2009}. However, they recently have
become again a ``hot'' research topic with numerous research papers
appearing in the last five years. One reason for this interest is that
Brzozowski derivatives straightforwardly extend to more general
regular expression constructors, which cannot be easily treated with
standard techniques. They can also be implemented with ease in any
functional programming language: the definition for derivatives
consists of just two simple recursive functions shown in
Figure~\ref{f}. Moreover, it can be easily shown (by a mathematical
proof) that the resulting regular matcher is in fact always producing
a correct answer. This is an area where Urban can provide a lot of
expertise for this project: he is one of the main developers of the
Isabelle theorem prover, which is a program designed for such proofs
\cite{Isabelle}.


\begin{figure}[t]
\begin{center}\small
\begin{tabular}{lcl}
$\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\
$\textit{nullable}(\ONE)$  & $\dn$ & $\textit{true}$\\
$\textit{nullable}(c)$     & $\dn$ & $\textit{false}$\\
$\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\
$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\
$\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\medskip\\

$\textit{der}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\
$\textit{der}\;c\;(\ONE)$  & $\dn$ & $\ZERO$\\
$\textit{der}\;c\;(d)$     & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\
$\textit{der}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{der}\;c\;r_1) + (\textit{der}\;c\;r_2)$\\
$\textit{der}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\
      & & $\textit{then}\;((\textit{der}\;c\;r_1)\cdot r_2) + (\textit{der}\;c\;r_2)$\\
      & & $\textit{else}\;(\textit{der}\;c\;r_1)\cdot r_2$\\
$\textit{der}\;c\;(r^*)$ & $\dn$ & $(\textit{der}\;c\;r)\cdot (r^*)$\\[-5mm]
\end{tabular}
\end{center}
\caption{The complete definition of derivatives of regular expressions
  \cite{Brzozowski1964}.
  This definition can be implemented with ease in functional programming languages and can be easily extended to regular expressions used in practice.
  The are more flexible for applications and easier to manipulate in
  mathematical proofs, than traditional techniques for regular expression
  matching.\medskip\label{f}}
\end{figure}

There are a number of avenues for research on Brzozowski derivatives.
One problem I like to immediately tackle in this project is how to
handle \emph{back-references} in regular expressions. It is known that
the inclusion of back-references causes the underlying algorithmic
problem to become NP-hard \cite{Aho}, and is the main reason why
regular expression matchers suffer from the catastrophic backtracking
problem. However, the full generality of back-references are not
needed in practical applications. The goal therefore is to sufficiently
restrict the problem so that an efficient algorithm can be
designed. There seem to be good indications that Brzozowski
derivatives are useful for that.

Another problem is \emph{how} regular expressions match a string
\cite{Sulzmann2014}. In this case the algorithm does not just give a
yes/no answer about whether the regular expression matches the string,
but also generates a value that encodes which part of the string is
matched by which part of the regular expression. This is important in
applications, like network analysis where from a matching log entry a
specific computer address needs to be extracted.  Also compilers make
extensive use of such an extension when they lex their source
programs, i.e.~break up code into word-like components.  Again Ausaf
and Urban from King's made some initial progress about proving the
correctness of such an extension, but their algorithm is not yet fast
enough for practical purposes \cite{AusafDyckhoffUrban2016}. It would
be desirable to study optimisations that make the algorithm faster,
but preserve the correctness guaranties obtained by Ausaf and Urban.

\mbox{}\\[-11mm]

\subsubsection*{Conclusion}

Much research has already gone into regular expressions, and one might
think they have been studied and implemented to death. But this is far
from the truth. In fact regular expressions have become a ``hot''
research topic in recent years because of problems with the
traditional methods (in applications such as network traffic
analysis), but also because the technique of derivatives of regular
expression has been re-discovered. These derivatives provide an
alternative approach to regular expression matching. Their potential
lies in the fact that they are more flexible in implementations and
much easier to manipulate in mathematical proofs. Therefore I like to
research them in this project.\medskip

\noindent {\bf Impact:} Regular expression matching is a core
technology in Computer Science with numerous applications. I will
focus on the area of network traffic analysis and also lexical
analysis. In these areas this project can have a quite large impact:
for example the program Bro has been downloaded at least 10,000 times and Snort
even has 5 million downloads and over 600,000 registered users
\cite{snort,bro}. Lexical analysis is a core component in every
compiler, which are the bedrock on which nearly all programming is
built on.

% {\small
% \begin{itemize}
% \item[$\bullet$] \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
% \item[$\bullet$] \url{https://vimeo.com/112065252}
% \item[$\bullet$] \url{http://davidvgalbraith.com/how-i-fixed-atom/}  
% \end{itemize}}
\mbox{}\\[-11mm]

\small
\bibliographystyle{plain}
\renewcommand{\refname}{References\\[-7mm]\mbox{}}
\bibliography{proposal}

\end{document}


%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: